Preprocess
- class ingredient_parser.en.preprocess.PreProcessor(input_sentence: str, custom_units: dict[str, str] | None = None)
Recipe ingredient sentence PreProcessor class.
Performs the necessary preprocessing on a sentence to generate the features required for the ingredient parser model.
Each input sentence goes through a cleaning process to tidy up the input into a standardised form.
- Parameters:
- Attributes:
Methods
sentence_features(): Return a dict of features for each token in the sentence.
Notes
The cleaning steps are as follows:
- Replace all en-dashes and em-dashes with hyphens.
- Replace numbers given as words with the numeric equivalent. e.g. one >> 1
- Replace fractions given in HTML markup with the unicode representation. e.g. &frac12; >> ½
- Replace unicode fractions with the equivalent decimal form. Decimals are rounded to a maximum of 3 decimal places. e.g. ½ >> 0.5
- Identify fractions represented by 1/2, 2/3 etc. by replacing the slash with $ and prepending # to the fraction, e.g. #1$2. e.g. 1/2 >> 0.5
- A space is enforced between quantities and units
- Remove trailing periods from units. e.g. tsp. >> tsp
- Numeric ranges indicated in words using “to” or “or” are replaced with a standard numeric form. e.g. 1 or 2 >> 1-2; 10 to 12 >> 10-12
- Units are made singular. This step uses a predefined list of plural units and their singular form.
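Several of the cleaning steps above can be sketched with the standard library alone. The following is an illustrative approximation, not the library's implementation: the function name `clean`, the lookup tables and every regular expression are assumptions made for this example, and the HTML-markup and slash-fraction steps are left out.

```python
import re
import unicodedata

# Hypothetical lookup tables for illustration; the library uses its own lists.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3"}
PLURAL_UNITS = {"cups": "cup", "tablespoons": "tablespoon", "teaspoons": "teaspoon"}
ABBREVIATED_UNITS = ("tsp", "tbsp", "oz", "lb")
FRACTION_CHARS = "\u00bc-\u00be\u2150-\u215e"  # ¼ to ¾ and ⅐ to ⅞


def _fraction_to_decimal(match):
    """Convert an optional whole number plus a unicode fraction to a decimal."""
    whole = int(match.group(1) or 0)
    value = whole + unicodedata.numeric(match.group(2))
    return f"{round(value, 3):g}"  # at most 3 decimal places


def clean(sentence: str) -> str:
    """Apply a subset of the cleaning steps, in the order listed above."""
    # En-dashes and em-dashes become hyphens.
    sentence = re.sub("[\u2013\u2014]", "-", sentence)
    # Numbers given as words become digits.
    sentence = re.sub(
        r"\b(" + "|".join(NUMBER_WORDS) + r")\b",
        lambda m: NUMBER_WORDS[m.group(1).lower()],
        sentence,
        flags=re.IGNORECASE,
    )
    # Unicode fractions become decimals, absorbing a leading whole number.
    sentence = re.sub(
        rf"(?:(\d+)\s?)?([{FRACTION_CHARS}])", _fraction_to_decimal, sentence
    )
    # Enforce a space between a quantity and the unit that follows it.
    sentence = re.sub(r"(\d)([a-zA-Z])", r"\1 \2", sentence)
    # Remove trailing periods from abbreviated units.
    sentence = re.sub(
        r"\b(" + "|".join(ABBREVIATED_UNITS) + r")\.", r"\1", sentence
    )
    # Ranges written with "to" or "or" become hyphenated ranges.
    sentence = re.sub(
        r"(\d+(?:\.\d+)?)\s+(?:to|or)\s+(\d+(?:\.\d+)?)", r"\1-\2", sentence
    )
    # Plural units become singular.
    sentence = re.sub(
        r"\b(" + "|".join(PLURAL_UNITS) + r")\b",
        lambda m: PLURAL_UNITS[m.group(1)],
        sentence,
    )
    return sentence


print(clean("1 or 2 tablespoons sugar"))  # 1-2 tablespoon sugar
print(clean("1½ tsp. salt"))              # 1.5 tsp salt
```

The order matters in this sketch: number words are converted before ranges are detected, so that "two to three cups" can become "2-3 cup".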
Following the cleaning of the input sentence, it is tokenized into a list of tokens.
Each token is one of the following:
- A word, including most punctuation marks
- Opening or closing parentheses, braces, brackets; comma; speech marks
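The two token types described above can be approximated with a single regular expression. This is a minimal sketch under stated assumptions: the names `STANDALONE`, `TOKEN_PATTERN` and `tokenize` are hypothetical, and the library's actual tokenizer may treat some punctuation differently.

```python
import re

# Characters assumed to always form a token on their own: parentheses,
# brackets, braces, comma, and straight or curly double speech marks.
STANDALONE = r'()\[\]{},"\u201c\u201d'

# Either one standalone character, or a run of anything else that is not
# whitespace (so hyphens, periods and slashes stay inside the word).
TOKEN_PATTERN = re.compile(rf"[{STANDALONE}]|[^\s{STANDALONE}]+")


def tokenize(sentence: str) -> list[str]:
    """Split a cleaned sentence into a list of tokens."""
    return TOKEN_PATTERN.findall(sentence)


print(tokenize("1 cup flour (sifted), plus extra"))
# ['1', 'cup', 'flour', '(', 'sifted', ')', ',', 'plus', 'extra']
```

Keeping hyphens inside words means a cleaned range such as 2-3 stays a single token.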
The features for each token are computed on demand using the sentence_features method, which returns a list of dictionaries. Each dictionary is the feature set for one token. The sentence features can then be passed to the CRF model, which generates the parsed output.
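A short usage sketch, based only on the class signature and the sentence_features method documented on this page. The helper name `token_features` is hypothetical, and the import is guarded so the snippet still runs when the package is not installed. The feature names in each dictionary are not listed here, so the example only returns them for inspection.

```python
try:
    from ingredient_parser.en.preprocess import PreProcessor
except ImportError:
    PreProcessor = None  # the ingredient_parser package is not installed


def token_features(sentence: str):
    """Return one feature dictionary per token, or None without the library."""
    if PreProcessor is None:
        return None
    processor = PreProcessor(sentence)
    # Features are computed on demand when sentence_features() is called.
    return processor.sentence_features()


features = token_features("2 cups flour, sifted")
if features is not None:
    for feature_set in features:
        print(feature_set)
```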