PreProcess#

class ingredient_parser.preprocess.PreProcessor(input_sentence: str, defer_pos_tagging: bool = False, show_debug_output: bool = False)[source]#

Recipe ingredient sentence PreProcessor class.

Performs the necessary preprocessing on a sentence to generate the features required for the ingredient parser model.

Each input sentence goes through a cleaning process to tidy up the input into a standardised form.

Notes

The cleaning steps are as follows:

  1. Replace all en-dashes and em-dashes with hyphens.
  2. Replace numbers given as words with the numeric equivalent.
    e.g. one >> 1
  3. Replace fractions given in html markup with the unicode representation.
    e.g. &frac12; >> ½
  4. Replace unicode fractions with the equivalent decimal form. Decimals are
    rounded to a maximum of 3 decimal places.
    e.g. ½ >> 0.5
  5. Replace “fake” fractions represented by 1/2, 2/3 etc. with the equivalent
    decimal form.
    e.g. 1/2 >> 0.5
  6. A space is enforced between quantities and units.
  7. Remove trailing periods from units.
    e.g. tsp. >> tsp
  8. Numeric ranges indicated in words using “to” or “or” are replaced with a
    standard numeric form.
    e.g. 1 or 2 >> 1-2; 10 to 12 >> 10-12
  9. Units are made singular. This step uses a predefined list of plural units and
    their singular form.
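Step 5 can be sketched with a simple regex substitution. This is an illustrative stand-in, not the library's actual implementation; the `FAKE_FRACTION_RE` pattern and `replace_fake_fractions` helper below are hypothetical:

```python
import re
from fractions import Fraction

# Hypothetical pattern; the library's actual regex may differ.
FAKE_FRACTION_RE = re.compile(r"\b(\d+)\s*/\s*(\d+)\b")

def replace_fake_fractions(sentence: str) -> str:
    """Replace "fake" fractions like 1/2 with their decimal form,
    rounded to a maximum of 3 decimal places."""
    def to_decimal(match: re.Match) -> str:
        value = Fraction(int(match.group(1)), int(match.group(2)))
        # Format to 3 decimal places, then trim trailing zeros and any
        # trailing decimal point.
        return f"{float(value):.3f}".rstrip("0").rstrip(".")

    return FAKE_FRACTION_RE.sub(to_decimal, sentence)

print(replace_fake_fractions("1/2 cup sugar"))   # 0.5 cup sugar
print(replace_fake_fractions("2/3 cup flour"))   # 0.667 cup flour
```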

Following the cleaning of the input sentence, it is tokenized into a list of tokens.

Each token is one of the following

  • A word, including most punctuation marks

  • Opening or closing parentheses, braces, brackets; comma; speech marks

The features for each token are computed on demand using the sentence_features method, which returns a list of dictionaries, one feature set per token.

The sentence features can then be passed to the CRF model which will generate the parsed output.

Parameters:
  • input_sentence (str) – Input ingredient sentence.

  • defer_pos_tagging (bool) – Defer part of speech tagging until feature generation. Part of speech tagging is an expensive operation and it’s not always needed when using this class.

  • show_debug_output (bool, optional) – If True, print out each stage of the sentence normalisation

defer_pos_tagging#

Defer part of speech tagging until feature generation

Type:

bool

show_debug_output#

If True, print out each stage of the sentence normalisation

Type:

bool

input#

Input ingredient sentence.

Type:

str

pos_tags#

Part of speech tag for each token in the tokenized sentence.

Type:

list[str]

sentence#

Input ingredient sentence, cleaned to standardised form.

Type:

str

singularised_indices#

Indices of tokens in tokenised sentence that have been converted from plural to singular

Type:

list[int]

tokenized_sentence#

Tokenised ingredient sentence.

Type:

list[str]

sentence_features() list[dict[str, str | bool]][source]#

Return features for all tokens in sentence

Returns:

list[dict[str, str | bool]] – List of features for each token in sentence

ingredient_parser.preprocess.stem(token: str) str[source]#

Stem function with cache to improve performance. The stem of a word output by the PorterStemmer is always the same, so we can cache the result the first time it is computed and return it on subsequent calls without repeating the processing.

Parameters:

token (str) – Token to stem

Returns:

str – Stem of token
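The caching described here can be achieved with `functools.lru_cache`. The sketch below uses a trivial suffix-stripping function as a stand-in for the PorterStemmer (which requires NLTK); only the caching pattern, not the stemming logic, reflects the documented behaviour:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stem(token: str) -> str:
    """Cached stem: computed once per distinct token, then served from cache."""
    # Stand-in for PorterStemmer().stem(token); real stemming is more involved.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

stem("cups")                    # computed on first call
stem("cups")                    # served from the cache
print(stem.cache_info().hits)   # 1
```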

ingredient_parser.preprocess.tokenize(sentence: str) list[str][source]#

Tokenise an ingredient sentence. The sentence is split on whitespace characters into a list of tokens. If any of these tokens contains any of the punctuation marks captured by PUNCTUATION_TOKENISER, these are split and isolated as separate tokens.

The returned list of tokens has any empty tokens removed.

Parameters:

sentence (str) – Ingredient sentence to tokenize

Returns:

list[str] – List of tokens from sentence.

Examples

>>> tokenize("2 cups (500 ml) milk")
["2", "cups", "(", "500", "ml", ")", "milk"]
>>> tokenize("1-2 mashed bananas: as ripe as possible")
["1-2", "mashed", "bananas", ":", "as", "ripe", "as", "possible"]