Sentence Normalisation
Normalisation is the process of transforming sentences so that particular features of each sentence have a standardised form. This pre-processing step removes as much of the reasonably foreseeable variation in the data as possible, so that the model is presented with tidy, consistent data and has an easier time assigning the correct labels.
The PreProcessor class handles the sentence normalisation.
>>> from Preprocess import PreProcessor
>>> p = PreProcessor("1/2 cup orange juice, freshly squeezed")
>>> p.sentence
'#1$2 cup orange juice, freshly squeezed'
The normalisation of the input sentence is done on initialisation of a PreProcessor object. The _normalise() method of the PreProcessor class is called, which executes a number of steps to clean up the input sentence.
def _normalise(self, sentence: str) -> str:
    """Normalise sentence prior to feature extraction.

    Parameters
    ----------
    sentence : str
        Ingredient sentence.

    Returns
    -------
    str
        Normalised ingredient sentence.
    """
    # List of functions to apply to sentence
    # Note that the order matters
    funcs = [
        self._remove_price_annotations,
        self._replace_en_em_dash,
        self._replace_html_fractions,
        self._replace_unicode_fractions,
        combine_quantities_split_by_and,
        self._identify_fractions,
        self._split_quantity_and_units,
        self._remove_unit_trailing_period,
        replace_string_range,
        self._replace_dupe_units_ranges,
        self._merge_quantity_x,
        self._collapse_ranges,
    ]

    for func in funcs:
        sentence = func(sentence)
        logger.debug(f"{func.__name__}: {sentence}")

    return sentence.strip()
Tip
By setting show_debug_output=True when instantiating a PreProcessor object, the sentence will be printed out at each step of the normalisation process.
Each of the normalisation steps is described below.
_remove_price_annotations
Price annotations such as ($1.99), which typically appear at the end of ingredient sentences, are removed.
_replace_en_em_dash
En-dashes (–) and em-dashes (—) are replaced with hyphens (-). This makes identification of ranges of quantities easier.
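The two steps above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function names mirror the method names, and PRICE_PATTERN is an assumed regular expression that only covers parenthesised dollar amounts.

```python
import re

# Assumed pattern: a parenthesised dollar amount such as ($1.99) or ($ 2).
PRICE_PATTERN = re.compile(r"\(\$\s?\d+(?:\.\d+)?\)")


def remove_price_annotations(sentence: str) -> str:
    """Delete parenthesised price annotations from the sentence."""
    return PRICE_PATTERN.sub("", sentence)


def replace_en_em_dash(sentence: str) -> str:
    """Replace en-dashes (U+2013) and em-dashes (U+2014) with hyphens."""
    return sentence.replace("\u2013", "-").replace("\u2014", "-")


print(remove_price_annotations("2 cups flour ($1.99)").strip())  # 2 cups flour
print(replace_en_em_dash("2\u20133 tbsp sugar"))  # 2-3 tbsp sugar
```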
_replace_html_fractions
Fractions written as HTML entities (e.g. &frac12; for 0.5) are replaced with their Unicode equivalents (e.g. ½). This is done using the standard library’s html.unescape() function.
_replace_unicode_fractions
Unicode fraction characters are replaced with a textual format (e.g. ½ becomes 1/2), as defined by the dictionary in this function. Because the HTML fractions were converted to Unicode characters in the previous step, they are converted here too.
There are two cases to consider: the character before the Unicode fraction is either a hyphen or it is not.
If it is not a hyphen, we insert a space before the replacement so we don’t accidentally merge it with the preceding character. For example, we want 1½ to become 1 1/2 and not 11/2.
If it is a hyphen, we don’t insert the space, because doing so could split a range. For example, we want ½-¾ to become 1/2-3/4 and not 1/2- 3/4 (note the space before the 3).
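A minimal sketch of both fraction steps, including the hyphen special case, is given here. UNICODE_FRACTIONS is an illustrative three-entry stand-in for the dictionary mentioned above, and the string-replacement approach is an assumption rather than the library's actual code.

```python
import html

# Illustrative subset only; the real dictionary is assumed to cover more characters.
UNICODE_FRACTIONS = {
    "\u00bd": "1/2",  # one half
    "\u00bc": "1/4",  # one quarter
    "\u00be": "3/4",  # three quarters
}


def replace_html_fractions(sentence: str) -> str:
    """Convert HTML entities such as &frac12; to their Unicode characters."""
    return html.unescape(sentence)


def replace_unicode_fractions(sentence: str) -> str:
    """Convert Unicode fraction characters to text such as 1/2."""
    for char, text in UNICODE_FRACTIONS.items():
        # Directly after a hyphen: no space, so a range is not split up.
        sentence = sentence.replace("-" + char, "-" + text)
        # Everywhere else: insert a space so 1 and 1/2 do not merge into 11/2.
        sentence = sentence.replace(char, " " + text)
    return sentence


print(replace_unicode_fractions("1\u00bd cups"))  # 1 1/2 cups
print(replace_unicode_fractions("\u00bd-\u00be cup").strip())  # 1/2-3/4 cup
```

A fraction at the start of the sentence picks up a leading space in this sketch, which is harmless because _normalise() strips the sentence before returning it.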
combine_quantities_split_by_and
Fractional quantities split by ‘and’ (e.g. 1 and 1/2) are converted to the format described in the next step. We do this now instead of later to avoid treating the 1/2 on its own.
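As a rough illustration of this step, the sketch below rewrites "1 and 1/2" directly into the 1#1$2 form. AND_FRACTION is an assumed pattern written for this example; the library's own expression may differ.

```python
import re

# Assumed pattern: an integer, the word "and", then a simple fraction.
AND_FRACTION = re.compile(r"\b(\d+)\s+and\s+(\d+)/(\d+)\b")


def combine_quantities_split_by_and(sentence: str) -> str:
    """Rewrite quantities like '1 and 1/2' in the marked form '1#1$2'."""
    return AND_FRACTION.sub(r"\1#\2$\3", sentence)


print(combine_quantities_split_by_and("1 and 1/2 cups sugar"))  # 1#1$2 cups sugar
```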
_identify_fractions
All remaining fractions are modified so that they survive tokenisation as a single token. This is necessary so that we can convert them to fractions.Fraction objects later.
For fractions less than 1, the forward slash is replaced by $ and a # is prepended, e.g. 1/2 becomes #1$2.
For fractions greater than 1, the forward slash is replaced by $ and a # is inserted between the integer and the fraction, e.g. 2 3/4 becomes 2#3$4.
_split_quantity_and_units
A space is enforced between quantities and units to make sure they are tokenised as separate tokens. If a quantity and unit are joined by a hyphen, the hyphen is also replaced by a space. This step also covers certain strings that aren’t technically units but that we want to treat in the same way, for example x in the context 1x or 2x.
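The # and $ marking described for _identify_fractions can be sketched as below, together with one way the marked token might later be turned into a fractions.Fraction. The patterns and the token_to_fraction helper are hypothetical and only illustrate the idea.

```python
import re
from fractions import Fraction

# Assumed patterns for this sketch.
MIXED_FRACTION = re.compile(r"\b(\d+)\s+(\d+)/(\d+)\b")  # e.g. 2 3/4
BARE_FRACTION = re.compile(r"\b(\d+)/(\d+)\b")  # e.g. 1/2


def identify_fractions(sentence: str) -> str:
    """Mark fractions so that each survives tokenisation as one token."""
    # Mixed numbers first, so the fractional part is not marked separately.
    sentence = MIXED_FRACTION.sub(r"\1#\2$\3", sentence)
    return BARE_FRACTION.sub(r"#\1$\2", sentence)


def token_to_fraction(token: str) -> Fraction:
    """Hypothetical helper: convert a marked token back to a Fraction."""
    whole, _, fraction = token.partition("#")
    numerator, denominator = fraction.split("$")
    return Fraction(int(whole or 0)) + Fraction(int(numerator), int(denominator))


print(identify_fractions("1/2 cup"))  # #1$2 cup
print(identify_fractions("2 3/4 cups"))  # 2#3$4 cups
print(token_to_fraction("2#3$4"))  # 11/4
```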
_remove_unit_trailing_period
Units with a trailing period have the period removed. This is only done for a subset of units where this has been observed in the model training data.
replace_string_range
Ranges are replaced with a standardised form of X-Y. A regular expression searches for ranges in the sentence that match anything in the following forms:
1 to 2
1- to 2-
1 or 2
1- or 2-
where the numbers 1 and 2 represent any decimal value or fraction as modified above.
The purpose of this is to ensure the range is kept as a single token.
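A minimal sketch of such a regular expression follows. QUANTITY and STRING_RANGE are assumptions made for this example, and the choice to leave the hyphen after the second number untouched is a simplification, not necessarily what the library does.

```python
import re

# A quantity is either a fraction already marked with # and $, or a decimal number.
QUANTITY = r"\d*#\d+\$\d+|\d+(?:\.\d+)?"

# Matches "1 to 2", "1- to 2", "1 or 2" and "1- or 2".
# Any hyphen after the second quantity is not consumed by this sketch.
STRING_RANGE = re.compile(rf"({QUANTITY})-?\s+(?:to|or)\s+({QUANTITY})")


def replace_string_range(sentence: str) -> str:
    """Rewrite 'X to Y' and 'X or Y' as the single-token range 'X-Y'."""
    return STRING_RANGE.sub(r"\1-\2", sentence)


print(replace_string_range("1 to 2 cups"))  # 1-2 cups
print(replace_string_range("#1$2 or #3$4 cup"))  # #1$2-#3$4 cup
```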
_replace_dupe_units_ranges
Ranges where the unit is given for both quantities are replaced with the standardised range format, e.g. 5 oz - 8 oz is replaced by 5-8 oz. Cases where the same unit is used but in different forms (e.g. 5 oz - 8 ounce) are also handled, using the unit synonyms defined in the UNIT_SYNONYMS constant.
_merge_quantity_x
Quantities followed by an “x” are merged together so they form a single token, for example:
1 x -> 1x
0.5 x -> 0.5x
_collapse_ranges
Any white space surrounding the hyphen in a range is removed.
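The last three steps (_replace_dupe_units_ranges, _merge_quantity_x and _collapse_ranges) might look something like the sketch below. The UNIT_SYNONYMS value here is a small hypothetical stand-in for the real constant, the patterns are assumptions, and keeping the first unit of a duplicated pair is a choice made for this example only.

```python
import re

# Hypothetical stand-in: each set groups different forms of the same unit.
UNIT_SYNONYMS = [{"oz", "ounce", "ounces"}, {"g", "gram", "grams"}]

NUMBER = r"\d+(?:\.\d+)?"
DUPE_UNIT_RANGE = re.compile(
    rf"({NUMBER})\s*([a-zA-Z]+)\s*-\s*({NUMBER})\s*([a-zA-Z]+)"
)


def same_unit(first: str, second: str) -> bool:
    """Return True if both strings are the same unit or listed synonyms."""
    if first == second:
        return True
    return any(first in group and second in group for group in UNIT_SYNONYMS)


def replace_dupe_units_ranges(sentence: str) -> str:
    """Rewrite '5 oz - 8 oz' as '5-8 oz' when both units agree."""

    def repl(match: re.Match) -> str:
        qty1, unit1, qty2, unit2 = match.groups()
        if same_unit(unit1.lower(), unit2.lower()):
            return f"{qty1}-{qty2} {unit1}"
        return match.group(0)  # different units: leave the text unchanged

    return DUPE_UNIT_RANGE.sub(repl, sentence)


def merge_quantity_x(sentence: str) -> str:
    """Join a quantity and a following standalone 'x' into one token."""
    return re.sub(rf"\b({NUMBER})\s+[xX]\b", r"\1x", sentence)


def collapse_ranges(sentence: str) -> str:
    """Remove white space around a hyphen that sits between two digits."""
    return re.sub(r"(\d)\s*-\s*(\d)", r"\1-\2", sentence)


print(replace_dupe_units_ranges("5 oz - 8 ounce steak"))  # 5-8 oz steak
print(merge_quantity_x("1 x 400 g tin"))  # 1x 400 g tin
print(collapse_ranges("1 - 2 cups"))  # 1-2 cups
```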
Singularising units
Units are converted to their singular form, using a predefined list of plural units and their singular forms. This step is performed after tokenisation so that we can keep track of the index of each token that has been modified. Once the model has labelled the tokens, this lets us automatically re-pluralise only those tokens that were singularised.
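The index tracking can be illustrated with the sketch below. PLURAL_TO_SINGULAR is a tiny hypothetical stand-in for the predefined list, and the function names are invented for this example.

```python
# Hypothetical stand-in for the predefined list of plural units.
PLURAL_TO_SINGULAR = {"cups": "cup", "ounces": "ounce", "tablespoons": "tablespoon"}
SINGULAR_TO_PLURAL = {singular: plural for plural, singular in PLURAL_TO_SINGULAR.items()}


def singularise(tokens: list[str]) -> tuple[list[str], list[int]]:
    """Singularise unit tokens and record the index of each changed token."""
    result = []
    changed = []
    for index, token in enumerate(tokens):
        singular = PLURAL_TO_SINGULAR.get(token)
        if singular is None:
            result.append(token)
        else:
            result.append(singular)
            changed.append(index)
    return result, changed


def repluralise(tokens: list[str], changed: list[int]) -> list[str]:
    """Restore the plural form only at the recorded indices."""
    restored = list(tokens)
    for index in changed:
        restored[index] = SINGULAR_TO_PLURAL.get(restored[index], restored[index])
    return restored


tokens, changed = singularise(["1", "cup", "or", "2", "cups"])
print(tokens, changed)  # ['1', 'cup', 'or', '2', 'cup'] [4]
print(repluralise(tokens, changed))  # ['1', 'cup', 'or', '2', 'cups']
```

Recording indices rather than re-pluralising every unit token is what keeps the first "cup" in the example singular, since it was singular in the original sentence.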