Preprocess

class ingredient_parser.en.preprocess.PreProcessor(input_sentence: str, custom_units: dict[str, str] | None = None)

Recipe ingredient sentence PreProcessor class.

Performs the necessary preprocessing on a sentence to generate the features required for the ingredient parser model.

Each input sentence goes through a cleaning process to tidy up the input into a standardised form.

Parameters:
input_sentence : str

Input ingredient sentence.

custom_units : dict[str, str] | None, optional

Dict of plural-singular pairs of custom units.

Attributes:
input : str

Input ingredient sentence.

sentence : str

Input ingredient sentence, cleaned to standardised form.

singularised_indices : list[int]

Indices of tokens in the tokenized sentence that have been converted from plural to singular.

tokenized_sentence : list[Token]

Tokenised ingredient sentence.

Methods

sentence_features()

Return a list of feature dicts, one for each token in the sentence.

Notes

The cleaning steps are as follows:

  1. Replace all en-dashes and em-dashes with hyphens.
  2. Replace numbers given as words with the numeric equivalent.
    e.g. one >> 1
  3. Replace fractions given in html markup with the unicode representation.
    e.g. &frac12; >> ½
  4. Replace unicode fractions with the equivalent decimal form. Decimals are
    rounded to a maximum of 3 decimal places.
    e.g. ½ >> 0.5
  5. Identify fractions represented by 1/2, 2/3 etc. by replacing the slash with $
    and prepending # to the fraction.
    e.g. 1/2 >> #1$2
  6. Enforce a space between quantities and units.
  7. Remove trailing periods from units.
    e.g. tsp. >> tsp
  8. Replace numeric ranges indicated in words using “to” or “or” with a
    standard numeric form.
    e.g. 1 or 2 >> 1-2; 10 to 12 >> 10-12
  9. Units are made singular. This step uses a predefined list of plural units and
    their singular form.
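
The cleaning steps above can be sketched with a handful of regular expressions. This is a minimal illustration, not the library's implementation: the function name `clean` and the lookup tables are hypothetical, and the tables hold only a few entries where the real lists would be much longer.

```python
import re

# Tiny illustrative lookup tables (assumptions, not the library's own data).
WORD_NUMBERS = {"one": "1", "two": "2", "three": "3"}
HTML_FRACTIONS = {"&frac12;": "½", "&frac14;": "¼", "&frac34;": "¾"}
UNICODE_FRACTIONS = {"½": "0.5", "¼": "0.25", "¾": "0.75"}
UNITS_WITH_PERIOD = ("tsp", "tbsp", "oz", "lb")
PLURAL_UNITS = {"cups": "cup", "teaspoons": "teaspoon", "tablespoons": "tablespoon"}


def clean(sentence: str) -> str:
    """Apply simplified versions of cleaning steps 1 to 9, in order."""
    # 1. En-dashes and em-dashes become hyphens.
    sentence = re.sub(r"[\u2013\u2014]", "-", sentence)
    # 2. Numbers written as words become digits.
    for word, digit in WORD_NUMBERS.items():
        sentence = re.sub(rf"\b{word}\b", digit, sentence, flags=re.IGNORECASE)
    # 3. HTML fraction entities become unicode fractions.
    for entity, char in HTML_FRACTIONS.items():
        sentence = sentence.replace(entity, char)
    # 4. Unicode fractions become decimals.
    for char, decimal in UNICODE_FRACTIONS.items():
        sentence = sentence.replace(char, decimal)
    # 5. Slash fractions are marked with # and $, as described in step 5.
    sentence = re.sub(r"(\d+)/(\d+)", r"#\1$\2", sentence)
    # 6. Insert a space between a quantity and a unit that directly follows it.
    sentence = re.sub(r"(\d)([a-zA-Z])", r"\1 \2", sentence)
    # 7. Drop the trailing period from abbreviated units.
    units = "|".join(UNITS_WITH_PERIOD)
    sentence = re.sub(rf"\b({units})\.", r"\1", sentence)
    # 8. Ranges written with "to" or "or" become hyphenated ranges.
    sentence = re.sub(r"(\d+)\s+(?:to|or)\s+(\d+)", r"\1-\2", sentence)
    # 9. Plural units become singular.
    for plural, singular in PLURAL_UNITS.items():
        sentence = re.sub(rf"\b{plural}\b", singular, sentence)
    return sentence


print(clean("&frac12; cup sugar"))
print(clean("10 to 12 olives"))
```

The order matters in this sketch: HTML entities must be decoded before unicode fractions are converted, and the quantity-unit space is inserted before units are singularised.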

Following the cleaning of the input sentence, it is tokenized into a list of tokens.

Each token is one of the following:

  • A word, including most punctuation marks

  • Opening or closing parentheses, braces, brackets; comma; speech marks
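
The two token types can be illustrated with a single regular expression. This is a rough sketch under assumptions: `TOKEN_PATTERN` and `tokenize` are hypothetical names, only the straight double quote is treated as a speech mark, and the library's own tokenizer may differ in its details.

```python
import re

# Either one of the punctuation marks that always stand alone
# (parentheses, brackets, braces, comma, double quote),
# or a run of any other non-whitespace characters (a "word").
TOKEN_PATTERN = re.compile(r'[()\[\]{},"]|[^\s()\[\]{},"]+')


def tokenize(sentence: str) -> list[str]:
    """Split a cleaned sentence into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)


print(tokenize("2 cup flour, sifted (optional)"))
```

Punctuation outside the stand-alone set, such as the hyphen and period in `0.5-1`, stays inside the word token.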

The features for each token are computed on demand using the sentence_features method, which returns a list of dictionaries. Each dictionary is the feature set for one token, in the same order as the tokens.

The sentence features can then be passed to the CRF model which will generate the parsed output.
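
To show the shape of data a CRF model typically consumes, the sketch below builds one feature dictionary per token. The feature names (`token`, `is_numeric`, `is_capitalised`, `prev_token`, `next_token`) and both function names are hypothetical, chosen for illustration only; they are not the features the library computes.

```python
from __future__ import annotations


def token_features(tokens: list[str], index: int) -> dict[str, str | bool]:
    """Build an illustrative feature dict for the token at the given index."""
    token = tokens[index]
    return {
        "token": token.lower(),
        # True for integers and simple decimals such as "2" or "0.5".
        "is_numeric": token.replace(".", "", 1).isdigit(),
        "is_capitalised": token[:1].isupper(),
        # Neighbouring tokens give the model some context.
        "prev_token": tokens[index - 1].lower() if index > 0 else "<start>",
        "next_token": tokens[index + 1].lower() if index < len(tokens) - 1 else "<end>",
    }


def build_sentence_features(tokens: list[str]) -> list[dict[str, str | bool]]:
    """Return one feature dict per token, in token order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


for features in build_sentence_features(["2", "cup", "Flour"]):
    print(features)
```

Each dictionary holds only string or boolean values, matching the `list[dict[str, str | bool]]` return type documented for sentence_features.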

sentence_features() → list[dict[str, str | bool]]

Return a list of feature dicts, one for each token in the sentence.

Returns:
list[FeatureDict]

List of feature dicts for each token in sentence.
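
A short usage sketch, based only on the signature, attributes and method documented on this page. The helper name `inspect_sentence` is hypothetical, and the example assumes the package is installed and that the class is importable from the module path shown in its signature.

```python
from __future__ import annotations


def inspect_sentence(sentence: str, custom_units: dict[str, str] | None = None):
    """Print the documented attributes and features for one ingredient sentence."""
    # Imported lazily so that defining this helper does not require the package.
    from ingredient_parser.en.preprocess import PreProcessor

    pre = PreProcessor(sentence, custom_units=custom_units)
    print("input:   ", pre.input)
    print("cleaned: ", pre.sentence)
    print("tokens:  ", pre.tokenized_sentence)
    # One feature dict per token, in token order.
    for features in pre.sentence_features():
        print(features)
    return pre


# Example call (requires the ingredient_parser package):
# inspect_sentence("2 cups flour, sifted")
```

No expected output is shown because the cleaned sentence and the feature names depend on the installed version of the library.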