PreProcess
- class ingredient_parser.preprocess.PreProcessor(input_sentence: str, defer_pos_tagging: bool = False, show_debug_output: bool = False)
Recipe ingredient sentence PreProcessor class.
Performs the necessary preprocessing on a sentence to generate the features required for the ingredient parser model.
Each input sentence goes through a cleaning process to tidy up the input into a standardised form.
Notes
The cleaning steps are as follows:
- Replace all en-dashes and em-dashes with hyphens.
- Replace numbers given as words with the numeric equivalent. e.g. one >> 1
- Replace fractions given in html markup with the unicode representation. e.g. &frac12; >> ½
- Replace unicode fractions with the equivalent decimal form. Decimals are rounded to a maximum of 3 decimal places. e.g. ½ >> 0.5
- Replace “fake” fractions represented by 1/2, 2/3 etc. with the equivalent decimal form. e.g. 1/2 >> 0.5
- Enforce a space between quantities and units.
- Remove trailing periods from units. e.g. tsp. >> tsp
- Replace numeric ranges indicated in words using “to” or “or” with a standard numeric form. e.g. 1 or 2 >> 1-2; 10 to 12 >> 10-12
- Make units singular. This step uses a predefined list of plural units and their singular form.
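A few of the cleaning steps above can be sketched with regular expressions. This is a minimal illustration, not the library's implementation: the function name sketch_clean, the helper _fraction_to_decimal, the UNITS_WITH_PERIOD tuple and every pattern are assumptions made for this example, and only five of the listed steps are covered. Mixed numbers such as "1 1/2" are not handled.

```python
import re

# Illustrative subset of unit abbreviations (an assumption, not the library's list).
UNITS_WITH_PERIOD = ("tsp", "tbsp", "oz", "lb")


def _fraction_to_decimal(match: re.Match) -> str:
    """Convert a matched 'a/b' fraction to a decimal rounded to 3 places."""
    numerator, denominator = int(match.group(1)), int(match.group(2))
    if denominator == 0:
        # Leave nonsensical fractions untouched rather than raising.
        return match.group(0)
    return str(round(numerator / denominator, 3))


def sketch_clean(sentence: str) -> str:
    """Hypothetical cleaner covering a subset of the documented steps."""
    # En-dashes and em-dashes become hyphens.
    sentence = re.sub(r"[\u2013\u2014]", "-", sentence)
    # "Fake" fractions such as 1/2 become decimals.
    sentence = re.sub(r"\b(\d+)/(\d+)\b", _fraction_to_decimal, sentence)
    # Enforce a space between a quantity and a unit, e.g. 100g >> 100 g.
    sentence = re.sub(r"(\d)([a-zA-Z])", r"\1 \2", sentence)
    # Remove trailing periods from the known unit abbreviations.
    units = "|".join(UNITS_WITH_PERIOD)
    sentence = re.sub(r"\b(" + units + r")\.", r"\1", sentence)
    # Ranges written with "to" or "or" become hyphenated ranges.
    sentence = re.sub(r"\b(\d+)\s+(?:to|or)\s+(\d+)\b", r"\1-\2", sentence)
    return sentence


print(sketch_clean("1/2 tsp. salt"))   # 0.5 tsp salt
print(sketch_clean("10 to 12 eggs"))   # 10-12 eggs
```

The order of the substitutions matters: fractions are converted before ranges are collapsed, so each later pattern only ever sees already-normalised numbers.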
Following the cleaning of the input sentence, it is tokenized into a list of tokens.
Each token is one of the following:
- A word, including most punctuation marks
- Opening or closing parentheses, braces, brackets; comma; speech marks
The features for each token are computed on demand using the sentence_features method, which returns a list of dictionaries. Each dictionary is the feature set for one token. The sentence features can then be passed to the CRF model, which generates the parsed output.
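The shape of the data described above, one feature dictionary per token, can be illustrated with a small stand-alone sketch. The function sketch_sentence_features and all of the feature names below are hypothetical and chosen only for illustration; the features the library actually computes are not listed in this section.

```python
def sketch_sentence_features(tokens: list[str]) -> list[dict]:
    """Return one feature dictionary per token (illustrative features only)."""
    features = []
    for i, token in enumerate(tokens):
        # Treat plain integers, decimals and simple ranges such as 1-2 as numeric.
        stripped = token.replace(".", "", 1).replace("-", "", 1)
        features.append(
            {
                "token": token,
                "is_numeric": stripped.isdigit(),
                "is_first": i == 0,
                # Neighbouring tokens give the model some context.
                "prev_token": tokens[i - 1] if i > 0 else "",
                "next_token": tokens[i + 1] if i < len(tokens) - 1 else "",
            }
        )
    return features


for feature_set in sketch_sentence_features(["2", "cup", "milk"]):
    print(feature_set)
```

A list of dictionaries in this form is the kind of per-token input that CRF toolkits commonly accept, which is why the output is structured this way.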
- Parameters:
input_sentence (str) – Input ingredient sentence.
defer_pos_tagging (bool, optional) – Defer part of speech tagging until feature generation. Part of speech tagging is an expensive operation and it’s not always needed when using this class. Defaults to False.
show_debug_output (bool, optional) – If True, print out each stage of the sentence normalisation. Defaults to False.
- singularised_indices
Indices of tokens in the tokenised sentence that have been converted from plural to singular.
- ingredient_parser.preprocess.stem(token: str) → str
Stem function with a cache to improve performance. The stem of a word output by the PorterStemmer is always the same, so the result can be cached on the first call and returned on subsequent calls without repeating the processing.
- Parameters:
token (str) – Token to stem
- Returns:
str – Stem of token
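The caching idea can be sketched with functools.lru_cache. To keep the example self-contained, toy_stem below is a deliberately crude stand-in and is not the Porter algorithm; in practice a real stemmer (for example NLTK's PorterStemmer) would take its place. The names toy_stem and cached_stem are assumptions for this sketch, and the library may implement its cache differently.

```python
from functools import lru_cache


def toy_stem(token: str) -> str:
    """Crude stand-in for a real stemmer: lowercase and drop one trailing 's'."""
    token = token.lower()
    if len(token) > 1 and token.endswith("s"):
        return token[:-1]
    return token


@lru_cache(maxsize=None)
def cached_stem(token: str) -> str:
    """Stem a token, reusing the stored result on repeated calls."""
    return toy_stem(token)


cached_stem("Cups")  # computed and stored
cached_stem("Cups")  # served from the cache
print(cached_stem.cache_info())
```

Caching is safe here because the stem is a pure function of the token: the same input always gives the same output, so a stored result never goes stale.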
- ingredient_parser.preprocess.tokenize(sentence: str) → list[str]
Tokenise an ingredient sentence. The sentence is split on whitespace characters into a list of tokens. If any of these tokens contains any of the punctuation marks captured by PUNCTUATION_TOKENISER, these are then split out and isolated as separate tokens.
The returned list of tokens has any empty tokens removed.
- Parameters:
sentence (str) – Ingredient sentence to tokenize
- Returns:
list[str] – List of tokens from sentence.
Examples
>>> tokenize("2 cups (500 ml) milk")
["2", "cups", "(", "500", "ml", ")", "milk"]
>>> tokenize("1-2 mashed bananas: as ripe as possible")
["1-2", "mashed", "bananas", ":", "as", "ripe", "as", "possible"]
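The two-stage behaviour described above (split on whitespace, then isolate punctuation) can be approximated as follows. The pattern assigned to SKETCH_PUNCTUATION is an assumption made for this sketch; the actual contents of PUNCTUATION_TOKENISER are not shown in this section, and sketch_tokenize is a hypothetical name, not the library function.

```python
import re

# Assumed set of punctuation marks to isolate; the real pattern may differ.
# The capturing group makes re.split keep the punctuation in its output.
SKETCH_PUNCTUATION = re.compile(r"([()\[\]{},\":;])")


def sketch_tokenize(sentence: str) -> list[str]:
    """Split on whitespace, then isolate punctuation marks as separate tokens."""
    tokens = []
    for chunk in sentence.split():
        # Splitting can yield empty strings at the edges, which are dropped.
        tokens.extend(part for part in SKETCH_PUNCTUATION.split(chunk) if part)
    return tokens


print(sketch_tokenize("2 cups (500 ml) milk"))
print(sketch_tokenize("1-2 mashed bananas: as ripe as possible"))
```

Hyphens are left out of the assumed pattern so that ranges such as 1-2 stay as a single token, matching the second documented example.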