Preprocess


class ingredient_parser.en.preprocess.PreProcessor(input_sentence: str, custom_units: dict[str, str] | None = None)

Recipe ingredient sentence PreProcessor class.

Performs the necessary preprocessing on a sentence to generate the features required for the ingredient parser model.

Each input sentence goes through a cleaning process to tidy up the input into a standardised form.

Parameters:

input_sentence (str): Input ingredient sentence.
custom_units (dict[str, str] | None, optional): Dict of plural-singular pairs of custom units.

Attributes:

input (str): Input ingredient sentence.
sentence (str): Input ingredient sentence, cleaned to standardised form.
singularised_indices (list[int]): Indices of tokens in the tokenized sentence that have been converted from plural to singular.
tokenized_sentence (list[Token]): Tokenized ingredient sentence.
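
The relationship between custom_units and singularised_indices can be illustrated with a minimal sketch. It assumes that custom_units maps plural forms to singular forms and that custom entries are merged over a built-in table. The names DEFAULT_UNITS and singularise are hypothetical and are not part of the library.

```python
# Hypothetical, abbreviated plural -> singular table (an assumption for
# illustration; the library's own list is not reproduced here).
DEFAULT_UNITS = {"cups": "cup", "teaspoons": "teaspoon", "cloves": "clove"}


def singularise(tokens, custom_units=None):
    """Return (tokens with known plural units made singular, changed indices)."""
    # Custom plural-singular pairs extend (and may override) the defaults.
    units = {**DEFAULT_UNITS, **(custom_units or {})}

    singular_tokens = []
    singularised_indices = []
    for idx, token in enumerate(tokens):
        singular = units.get(token.lower())
        if singular is None:
            singular_tokens.append(token)
        else:
            singular_tokens.append(singular)
            singularised_indices.append(idx)  # remember which tokens changed
    return singular_tokens, singularised_indices
```

Recording the indices makes it possible to restore the original plural form of a token later, which is presumably why the class exposes them as an attribute.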

Methods

sentence_features()

Return a list of feature dicts, one for each token in the sentence.

Notes

The cleaning steps are as follows:

1. Replace all en-dashes and em-dashes with hyphens.
2. Replace numbers given as words with the numeric equivalent.
   e.g. one >> 1
3. Replace fractions given in HTML markup with the unicode representation.
   e.g. &frac12; >> ½
4. Replace unicode fractions with the equivalent decimal form. Decimals are rounded to a maximum of 3 decimal places.
   e.g. ½ >> 0.5
5. Identify fractions represented by 1/2, 2/3 etc. by replacing the slash with $ and prepending # to the fraction.
   e.g. 1/2 >> #1$2
6. Enforce a space between quantities and units.
7. Remove trailing periods from units.
   e.g. tsp. >> tsp
8. Replace numeric ranges indicated in words using “to” or “or” with a standard numeric form.
   e.g. 1 or 2 >> 1-2; 10 to 12 >> 10-12
9. Make units singular. This step uses a predefined list of plural units and their singular form.
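
Most of the cleaning steps above are simple text substitutions. The following is a rough sketch of how they could be chained with regular expressions. It is not the library's implementation: the function name clean and the lookup tables NUMBER_WORDS, UNITS and UNICODE_FRACTIONS are hypothetical and heavily abbreviated, mixed numbers such as 1½ are not handled, and the singularisation step is omitted.

```python
import html
import re
import unicodedata

# Hypothetical, abbreviated lookup tables for illustration only.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3"}
UNITS = ["tsp", "tbsp", "cup", "cups", "g", "oz"]
UNICODE_FRACTIONS = "¼½¾⅓⅔⅛⅜⅝⅞"


def clean(sentence: str) -> str:
    # Replace en-dashes and em-dashes with hyphens.
    sentence = re.sub("[\u2013\u2014]", "-", sentence)
    # Replace number words with digits.
    sentence = re.sub(
        r"\b(" + "|".join(NUMBER_WORDS) + r")\b",
        lambda m: NUMBER_WORDS[m.group(1).lower()],
        sentence,
        flags=re.IGNORECASE,
    )
    # Decode HTML entities such as &frac12; into unicode characters.
    sentence = html.unescape(sentence)
    # Replace unicode fractions with decimals, at most 3 decimal places.
    sentence = re.sub(
        "[" + UNICODE_FRACTIONS + "]",
        lambda m: str(round(unicodedata.numeric(m.group(0)), 3)),
        sentence,
    )
    # Mark slash fractions: 1/2 becomes #1$2.
    sentence = re.sub(r"\b(\d+)/(\d+)\b", r"#\1$\2", sentence)
    # Enforce a space between a quantity and a unit written together.
    sentence = re.sub(r"(\d)([a-zA-Z])", r"\1 \2", sentence)
    # Remove trailing periods from known units.
    sentence = re.sub(r"\b(" + "|".join(UNITS) + r")\.", r"\1", sentence)
    # Collapse "1 or 2" and "10 to 12" into "1-2" and "10-12".
    sentence = re.sub(r"(\d+)\s+(?:to|or)\s+(\d+)", r"\1-\2", sentence)
    return sentence
```

The order of the substitutions matters in this sketch: number words are converted before ranges are collapsed, so that "one or 2" is recognised as a numeric range.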

Following the cleaning of the input sentence, it is tokenized into a list of tokens.

Each token is one of the following:

- A word, including most punctuation marks
- Opening or closing parentheses, braces, brackets; comma; speech marks
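
A tokenizer with these two token types can be sketched with a single regular expression. This is an assumption about the behaviour described above, not the library's tokenizer, and the names PUNCT, TOKEN_RE and tokenize are hypothetical.

```python
import re

# Characters that always form a token of their own: parentheses, brackets,
# braces, comma and speech marks (straight and curly double quotes).
PUNCT = '()[]{},"\u201c\u201d'

# Either one isolated punctuation character, or a run of characters that are
# neither whitespace nor isolated punctuation (a "word").
TOKEN_RE = re.compile(
    "[" + re.escape(PUNCT) + "]|[^\\s" + re.escape(PUNCT) + "]+"
)


def tokenize(sentence: str) -> list[str]:
    """Split a cleaned sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(sentence)
```

In this sketch other punctuation, such as a hyphen or a trailing period, stays attached to its word, in line with "a word, including most punctuation marks".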

The features for each token are computed on demand using the sentence_features method, which returns a list of dictionaries, one per token. Each dictionary holds the feature set for its token.

The sentence features can then be passed to the CRF model which will generate the parsed output.
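
As a usage sketch, the documented attributes and method can be inspected for a single sentence as follows. The helper show_features is hypothetical, and the example assumes the ingredient_parser package is installed and importable under the module path shown in the class signature.

```python
def show_features(sentence: str) -> list[dict]:
    """Print the cleaned sentence, tokens and feature names for one sentence."""
    # Imported lazily so this sketch loads even where the package is absent.
    from ingredient_parser.en.preprocess import PreProcessor

    processor = PreProcessor(sentence)
    print("input:  ", processor.input)
    print("cleaned:", processor.sentence)
    print("tokens: ", processor.tokenized_sentence)

    # One feature dict per token, computed on demand.
    features = processor.sentence_features()
    for token, token_features in zip(processor.tokenized_sentence, features):
        print(token, sorted(token_features))
    return features
```

Calling show_features("2 cups flour") prints the intermediate results; the returned list is what would be handed to the CRF model.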

sentence_features() → list[dict[str, str | bool]]

Return a list of feature dicts, one for each token in the sentence.

Returns:

list[FeatureDict]: List of feature dicts for each token in sentence.
