Sentence Normalisation

Sentence Normalisation#

Normalisation is the process of transforming the sentences to ensure that particular features of the sentence have a standardised form. This pre-processing step is there to remove as much of the variation in the data that can be reasonably foreseen, so that the model is presented with tidy and consistent data and therefore has an easier time assigning the correct labels.

The PreProcessor class handles the sentence normalisation.

>>> from Preprocess import PreProcessor
>>> p = PreProcessor("1/2 cup orange juice, freshly squeezed")
>>> p.sentence
'#1$2 cup orange juice, freshly squeezed'

The normalisation of the input sentence is done on initialisation of a PreProcessor object. The _normalise() method of the PreProcessor class is called, which executes a number of steps to clean up the input sentence.

def _normalise(self, sentence: str) -> str:
    """Normalise sentence prior to feature extraction.

    Parameters
    ----------
    sentence : str
        Ingredient sentence.

    Returns
    -------
    str
        Normalised ingredient sentence.
    """
    # List of functions to apply to sentence
    # Note that the order matters
    funcs = [
        self._remove_price_annotations,
        self._replace_en_em_dash,
        self._replace_html_fractions,
        self._replace_unicode_fractions,
        combine_quantities_split_by_and,
        self._identify_fractions,
        self._split_quantity_and_units,
        self._remove_unit_trailing_period,
        replace_string_range,
        self._replace_dupe_units_ranges,
        self._merge_quantity_x,
        self._collapse_ranges,
    ]

    for func in funcs:
        sentence = func(sentence)
        logger.debug(f"{func.__name__}: {sentence}")

    return sentence.strip()

Tip

By setting show_debug_output=True when instantiating a PreProcessor object, the sentence will be printed out at each step of the normalisation process.

Each of the normalisation steps is described below.

_remove_price_annotations

Price annotations, typically at the end of ingredient sentences such as ($1.99), are removed.
_replace_en_em_dash

En-dashes (–) and em-dashes (—) are replaced with hyphens (-). This makes identification of ranges of quantities easier.
_replace_html_fractions

Fractions written as html entities (e.g. ½ for 0.5) are replaced with Unicode equivalents (e.g. ½). This is done using the standard library’s html.unescape() function.
_replace_unicode_fractions

Fractions represented by Unicode fractions are replaced a textual format (.e.g ½ as 1/2), as defined by the dictionary in this function. Because we replaced the html fractions in the previous step, these are also converted here too.

There are two cases to consider: where the character before the unicode fraction is a hyphen and where it is not.

In the second case, we insert a space before the replacement so we don’t accidentally merge with the character before. For example we want 1½ to become 1 1/2 and not 11/2.

However, if the character before is a hyphen, we don’t want to do this because we could end up splitting a range up. For example, we want ½-¾ to become 1/2-3/4 and not 1/2- 3/4 (note the space before the 3).
combine_quantities_split_by_and

Fractional quantities split by ‘and’ e.g. 1 and 1/2 are converted to the format described in the next step. We do this now instead of later to avoid treating the 1/2 on it’s own.
_identify_fractions

All remaining fractions are modified so that they survive tokenisation as a single token. This is necessary so that we can convert them to fractions.Fraction objects later.

For fractions less than 1, the forward slash is replaced by $ and a # is prepended e.g. 1/2 becomes #1$2.

For fractions greater than 1, the forward slash is replaced by $ and a # is inserted between the integer and the fraction e.g. 2 3/4 becomes 2#3$4.
_split_quantity_and_units

A space is enforced between quantities and units to make sure they are tokenized to separate tokens. If a quantity and unit are joined by a hyphen, this is also replaced by a space. This takes into account certain strings that aren’t technically units, but we want to treat in the same way here, for example x in the context 1x or 2x.
_remove_unit_trailing_period

Units with a trailing period have the period removed. This is only done for a subset of units where this has been observed in the model training data.
replace_string_range

Ranges are replaced with a standardised form of X-Y. A regular expression searches for ranges in the sentence that match anything in the following forms:
- 1 to 2
- 1- to 2-
- 1 or 2
- 1- or 2-
where the numbers 1 and 2 represent any decimal value or fraction as modified above.

The purpose of this is to ensure the range is kept as a single token.
_replace_dupe_units_ranges

Ranges where the unit is given for both quantities are replaced with the standardised range format, e.g. 5 oz - 8 oz is replaced by 5-8 oz. Cases where the same unit is used but in different forms (e.g. 5 oz - 8 ounce) are also considered for the unit synonyms defined in the UNIT_SYNONYMS constant.
_merge_quantity_x

Quantities followed by an “x” are merged together so they form a single token, for example:
- 1 x -> 1x
- 0.5 x -> 0.5x
_collapse_ranges

Remove any white space surrounding the hyphen in a range

Singularising units#

Units are converted to their singular form, using a predefined list of plural units and their singular form. This step is actually performed after tokenisation so that we can keep track of the index of each token that has been modified. This is so we can automatically re-pluralise only the tokens that were singularised after the labelling by the model.

Sentence Normalisation

Contents

Sentence Normalisation#

Singularising units#