Post-processing#

The output from the model is a list of labels and a list of scores, one of each for every token in the input sentence. This needs to be turned into a more useful data structure so that the output can be used by the users of this library.

The ParsedIngredient class defines the structure of the returned information from the parse_ingredient function:

@dataclass
class ParsedIngredient:
    """Dataclass for holding the parsed values for an input sentence.

    Attributes
    ----------
    name : list[IngredientText]
        List of IngredientText objects, each representing an ingredient name parsed from
        input sentence.
        If no ingredient names are found, this is an empty list.
    size : IngredientText | None
        Size modifier of ingredients, such as small or large.
        If no size modifier, this is None.
    amount : list[IngredientAmount | CompositeIngredientAmount]
        List of IngredientAmount objects, each representing a matching quantity and
        unit pair parsed from the sentence.
        If no ingredient amounts are found, this is an empty list.
    preparation : IngredientText | None
        Ingredient preparation instructions parsed from sentence.
        If no ingredient preparation instruction was found, this is None.
    comment : IngredientText | None
        Ingredient comment parsed from input sentence.
        If no ingredient comment was found, this is None.
    purpose : IngredientText | None
        The purpose of the ingredient parsed from the sentence.
        If no purpose was found, this is None.
    foundation_foods : list[FoundationFood]
        List of foundation foods from the parsed sentence.
    sentence : str
        Normalised input sentence
    """

    name: list[IngredientText]
    size: IngredientText | None
    amount: list[IngredientAmount | CompositeIngredientAmount]
    preparation: IngredientText | None
    comment: IngredientText | None
    purpose: IngredientText | None
    foundation_foods: list[FoundationFood]
    sentence: str

Each of the fields in the dataclass has to be determined from the output of the model. The PostProcessor class handles this for us.

Size, Preparation, Purpose, Comment#

For each of the labels SIZE, PREP, PURPOSE and COMMENT, the associated tokens are combined into an IngredientText object.

@dataclass
class IngredientText:
    """Dataclass for holding a parsed ingredient string.

    Attributes
    ----------
    text : str
        Parsed text from ingredient.
        This is comprised of all tokens with the same label.
    confidence : float
        Confidence of parsed ingredient text, between 0 and 1.
        This is the average confidence of all tokens that contribute to this object.
    starting_index : int
        Index of token in sentence that starts this text
    """

    text: str
    confidence: float
    starting_index: int

The post-processing steps are as follows:

  1. Find the indices for the label under consideration, plus the PUNC label.

  2. Group these indices into lists of consecutive indices.

  3. Join the tokens corresponding to each group of consecutive indices with a space.

  4. If discard_isolated_stop_words is True, discard any groups that just comprise a word from the list of stop words.

  5. Average the confidence scores for each of the tokens in each group of consecutive indices.

  6. Remove any isolated or invalid punctuation and any consecutive tokens that are identical.

  7. Join all the groups together with a comma and fix any weird punctuation this causes.

  8. Re-pluralise units that were made singular during pre-processing.

  9. Average the confidence scores across all groups.

The output of this processing is an IngredientText object for each label, which contains the text string, the confidence score, and the starting index of the string in the ingredient sentence.
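The core of steps 1 to 3, 5, 7 and 9 can be sketched as follows. This is a minimal illustration rather than the library's implementation: `group_consecutive` and `join_label` are hypothetical names, and the stop word handling, punctuation clean-up and re-pluralisation steps are omitted.

```python
from itertools import groupby
from statistics import mean


def group_consecutive(indices):
    """Split a sorted list of indices into runs of consecutive values."""
    groups = []
    # Within a run of consecutive values, (value - position) is constant.
    for _, run in groupby(enumerate(indices), key=lambda pair: pair[1] - pair[0]):
        groups.append([idx for _, idx in run])
    return groups


def join_label(tokens, labels, scores, label):
    """Hypothetical helper: returns (text, confidence, starting_index) or None."""
    # Step 1: indices for the label under consideration, plus PUNC.
    indices = [i for i, lab in enumerate(labels) if lab in (label, "PUNC")]
    if not indices:
        return None
    parts, confidences = [], []
    # Steps 2, 3 and 5: group, join with a space, average per group.
    for group in group_consecutive(indices):
        parts.append(" ".join(tokens[i] for i in group))
        confidences.append(mean(scores[i] for i in group))
    # Steps 7 and 9: join groups with a comma, average across groups.
    return ", ".join(parts), mean(confidences), indices[0]


# Example tokens, labels and scores made up for illustration.
tokens = ["1", "onion", "finely", "chopped"]
labels = ["QTY", "B_NAME_TOK", "PREP", "PREP"]
scores = [0.99, 0.98, 0.9, 0.8]
print(join_label(tokens, labels, scores, "PREP"))
```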

Name#

Note

If separate_names is set to False, then all the NAME_* label types are treated as a single NAME label and the post-processing is the same as for the SIZE, PREP, PURPOSE and COMMENT labels. This will return a list containing a single IngredientText object.

Post-processing of the ingredient names is similar to the above, but a few extra steps are performed first to identify the different ingredient names.

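Figure: Ingredient name post-processing steps.

The post-processing consists of four steps: extract the NAME labels, group tokens by NAME label, construct names from the NAME groups, and create the IngredientText objects. The first three steps are unique to the post-processing of ingredient names and are described in more detail below. The fourth step is the same as for the non-name labels described above, using the groups of indices output from step 3.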

We will use the sentence 8 ounces whole yellow or red bell pepper as an example to show how the ingredient name post-processing works.

This sentence has the following tokens and labels:

Index   0     1        2          3          4          5          6            7
Token   8     ounces   whole      yellow     or         red        bell         pepper
Label   QTY   UNIT     NAME_MOD   NAME_VAR   NAME_SEP   NAME_VAR   B_NAME_TOK   I_NAME_TOK

Extract NAME labels#

This is a straightforward step that finds the indices of all tokens that have been given one of the following labels: B_NAME_TOK, I_NAME_TOK, NAME_VAR, NAME_MOD, NAME_SEP, PUNC.

This results in the following indices:

[2, 3, 4, 5, 6, 7]
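As a minimal sketch of this step (the variable names are illustrative and not taken from the library), the extraction amounts to a filter over the label list:

```python
# Labels that take part in ingredient name post-processing.
NAME_LABELS = {"B_NAME_TOK", "I_NAME_TOK", "NAME_VAR", "NAME_MOD", "NAME_SEP", "PUNC"}

# Labels for "8 ounces whole yellow or red bell pepper".
labels = [
    "QTY", "UNIT", "NAME_MOD", "NAME_VAR",
    "NAME_SEP", "NAME_VAR", "B_NAME_TOK", "I_NAME_TOK",
]

name_indices = [i for i, label in enumerate(labels) if label in NAME_LABELS]
print(name_indices)  # [2, 3, 4, 5, 6, 7]
```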

Group tokens by NAME label#

Iterate over the extracted NAME labels and group consecutive labels of the same type together.

  • Consecutive NAME_MOD labels are grouped together.

  • Consecutive NAME_VAR labels are grouped together.

  • Consecutive B_NAME_TOK, I_NAME_TOK and PUNC labels are grouped together.

  • NAME_SEP labels are used to force the start of a new grouping.

When grouping the tokens together, we store the index and label of each token.

Note

The indices here are the indices of elements from the extracted NAME labels i.e. an index of 0 here is the first element of the extracted NAME labels which refers to the token at index 2 of the whole sentence.

For the example sentence, we get the following groups:

[
  [(0, 'NAME_MOD')],
  [(1, 'NAME_VAR')],
  [(3, 'NAME_VAR')],
  [(4, 'B_NAME_TOK'), (5, 'I_NAME_TOK')]
]
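The grouping rules can be sketched as below. `group_name_labels` is a hypothetical function written to reproduce the example output; it assumes that NAME_SEP tokens are not stored in any group, and it does not handle cases such as two names placed directly next to each other.

```python
TOKEN_LABELS = ("B_NAME_TOK", "I_NAME_TOK", "PUNC")


def group_name_labels(name_labels):
    """Group consecutive labels of the same kind, splitting at NAME_SEP."""
    groups, current, prev_kind = [], [], None
    for idx, label in enumerate(name_labels):
        if label == "NAME_SEP":
            # A separator closes the current group and is not stored itself.
            if current:
                groups.append(current)
            current, prev_kind = [], None
            continue
        # B_NAME_TOK, I_NAME_TOK and PUNC all count as the same kind.
        kind = "TOK" if label in TOKEN_LABELS else label
        if current and kind != prev_kind:
            groups.append(current)
            current = []
        current.append((idx, label))
        prev_kind = kind
    if current:
        groups.append(current)
    return groups


extracted = ["NAME_MOD", "NAME_VAR", "NAME_SEP", "NAME_VAR", "B_NAME_TOK", "I_NAME_TOK"]
name_groups = group_name_labels(extracted)
print(name_groups)
```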

Construct names from NAME groups#

From the name groups, we construct the ingredient names. This is most easily done by iterating in reverse over the groups and applying the following logic:

  • Each group starting with B_NAME_TOK forms a new name.

  • Each NAME_VAR group is prepended to the beginning of the most recent name.

  • Each NAME_MOD group is prepended to all previous names.

The output from this construction step is a list of groups of indices, where each group represents an ingredient name.

For the example sentence, the constructed groups of indices are:

[
  (0, 1, 4, 5), # whole yellow bell pepper
  (0, 3, 4, 5)  # whole red bell pepper
]
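One way to read the rules above that reproduces the example output is sketched here. `construct_names` is a hypothetical function, not the library's code, and it assumes that every NAME_VAR group is followed later in the sentence by a B_NAME_TOK group; when several NAME_VAR groups share that group, each one produces its own name.

```python
def construct_names(groups):
    """Build index tuples for each name by walking the groups in reverse."""
    names = []         # names found so far, in reverse sentence order
    base = None        # indices of the most recent B_NAME_TOK group
    base_used = False  # True once a NAME_VAR has been attached to base
    for group in reversed(groups):
        indices = [idx for idx, _ in group]
        kind = group[0][1]
        if kind == "NAME_MOD":
            # Modifiers apply to every name seen so far.
            names = [indices + name for name in names]
        elif kind == "NAME_VAR" and base is not None:
            if base_used:
                # A further variant of the same base forms another name.
                names.append(indices + base)
            else:
                names[-1] = indices + names[-1]
                base_used = True
        else:
            # A group starting with B_NAME_TOK starts a new name.
            base, base_used = indices, False
            names.append(indices)
    return [tuple(name) for name in reversed(names)]


groups = [
    [(0, "NAME_MOD")],
    [(1, "NAME_VAR")],
    [(3, "NAME_VAR")],
    [(4, "B_NAME_TOK"), (5, "I_NAME_TOK")],
]
print(construct_names(groups))  # [(0, 1, 4, 5), (0, 3, 4, 5)]
```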

Create IngredientText objects#

With the groups of indices obtained from the previous step, we can convert to token indices and then follow the steps used to post-process the SIZE, PREP, PURPOSE, COMMENT labels described above.

Once the IngredientText objects have been obtained, we perform one final post-processing step. If there are multiple names, we check the part of speech tag for the last token in each name. If the part of speech tag is IN, DT or JJ, we merge the name with the next name. This merging of ingredient names is necessary to mitigate mislabelling of tokens by the model, which can happen if a name is split by a token with another label.
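The merging rule can be illustrated with plain strings. This is a simplified sketch: `merge_names` is a hypothetical helper, the part of speech tags are supplied by hand, and the handling of confidence and starting index on the real IngredientText objects is not shown.

```python
# A name ending in one of these tags is treated as incomplete.
MERGE_TAGS = {"IN", "DT", "JJ"}


def merge_names(names, last_token_tags):
    """Merge each name whose last token has a tag in MERGE_TAGS with the next."""
    merged, pending = [], []
    for name, tag in zip(names, last_token_tags):
        pending.append(name)
        if tag in MERGE_TAGS:
            continue  # keep collecting until a name ends on another tag
        merged.append(" ".join(pending))
        pending = []
    if pending:
        # The final name has nothing after it to merge with.
        merged.append(" ".join(pending))
    return merged


print(merge_names(["cream of", "tartar"], ["IN", "NN"]))  # ['cream of tartar']
```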

Amount#

The QTY and UNIT labels are combined into an IngredientAmount object.

@dataclass
class IngredientAmount:
    """Dataclass for holding a parsed ingredient amount.

    On instantiation, the unit is made plural if necessary.

    Attributes
    ----------
    quantity : Fraction | str
        Parsed ingredient quantity, as a Fraction where possible, otherwise a string.
        If the amount is a range, this is the lower limit of the range.
    quantity_max : Fraction | str
        If the amount is a range, this is the upper limit of the range.
        Otherwise, this is the same as the quantity field.
        This is set automatically depending on the type of quantity.
    unit : str | pint.Unit
        Unit of parsed ingredient quantity.
        If the unit is recognised in the pint unit registry, a pint.Unit
        object is used.
    text : str
        String describing the amount e.g. "1 cup", "8 oz"
    confidence : float
        Confidence of parsed ingredient amount, between 0 and 1.
        This is the average confidence of all tokens that contribute to this object.
    starting_index : int
        Index of token in sentence that starts this amount
    unit_system : UnitSystem
        Unit system (e.g. metric) that the unit of the amount belongs to.
    APPROXIMATE : bool, optional
        When True, indicates that the amount is approximate.
        Default is False.
    SINGULAR : bool, optional
        When True, indicates if the amount refers to a singular item of the ingredient.
        Default is False.
    RANGE : bool, optional
        When True, indicates the amount is a range e.g. 1-2.
        Default is False.
    MULTIPLIER : bool, optional
        When True, indicates the amount is a multiplier e.g. 1x, 2x.
        Default is False.
    PREPARED_INGREDIENT : bool, optional
        When True, indicates the amount applies to the prepared ingredient.
        When False, indicates the amount applies to the ingredient before preparation.
        Default is False.
    """

    quantity: Fraction | str
    quantity_max: Fraction | str
    unit: str | pint.Unit
    text: str
    confidence: float
    starting_index: int
    unit_system: UnitSystem = field(init=False)
    APPROXIMATE: bool = False
    SINGULAR: bool = False
    RANGE: bool = False
    MULTIPLIER: bool = False
    PREPARED_INGREDIENT: bool = False

For most cases, the amounts are determined by combining a QTY label with the following UNIT labels, up to the next QTY which becomes a new amount. For example:

>>> p = PreProcessor("3/4 cup (170g) heavy cream")
>>> [t.text for t in p.tokenized_sentence]
['#3$4', 'cup', '(', '170', 'g', ')', 'heavy', 'cream']
...
>>> parsed = PostProcessor(sentence, labelled_tokens).parsed()
>>> parsed.amount
[
    IngredientAmount(quantity=Fraction(3, 4),
              quantity_max=Fraction(3, 4),
              unit=<Unit('cup')>,
              text='0.75 cups',
              confidence=0.999881,
              starting_index=0,
              unit_system=<UnitSystem.US_CUSTOMARY: 'us_customary'>,
              APPROXIMATE=False,
              SINGULAR=False,
              RANGE=False,
              MULTIPLIER=False,
              PREPARED_INGREDIENT=False),
    IngredientAmount(quantity=Fraction(170, 1),
              quantity_max=Fraction(170, 1),
              unit=<Unit('gram')>,
              text='170 g',
              confidence=0.995941,
              starting_index=3,
              unit_system=<UnitSystem.METRIC: 'metric'>,
              APPROXIMATE=False,
              SINGULAR=False,
              RANGE=False,
              MULTIPLIER=False,
              PREPARED_INGREDIENT=False)
]
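The basic pairing rule can be sketched as follows. `pair_amounts` is a hypothetical helper, and the labels given for the example tokens are assumed for illustration; the real post-processing also builds IngredientAmount objects and handles the special cases described later.

```python
def pair_amounts(tokens, labels):
    """Attach each UNIT token to the most recent QTY token."""
    amounts = []
    for token, label in zip(tokens, labels):
        if label == "QTY":
            # Every QTY starts a new amount.
            amounts.append({"quantity": token, "unit": []})
        elif label == "UNIT" and amounts:
            amounts[-1]["unit"].append(token)
    return [(a["quantity"], " ".join(a["unit"])) for a in amounts]


tokens = ["#3$4", "cup", "(", "170", "g", ")", "heavy", "cream"]
# Assumed labels for the tokens above.
labels = ["QTY", "UNIT", "PUNC", "QTY", "UNIT", "PUNC", "B_NAME_TOK", "I_NAME_TOK"]
print(pair_amounts(tokens, labels))  # [('#3$4', 'cup'), ('170', 'g')]
```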

Quantities#

Quantities are returned as fractions.Fraction objects, or str for non-numeric quantities (e.g. dozen).

>>> parsed = parse_ingredient("1/3 cup oil", quantity_fractions=True)
>>> parsed.amount[0].quantity
Fraction(1, 3)

Note

Conversion of quantities to float or int is left to the end users of this library.
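A small helper along these lines may be convenient for that conversion. `to_number` is a hypothetical name and is not part of the library.

```python
from fractions import Fraction


def to_number(quantity):
    """Convert a Fraction to int or float; leave string quantities unchanged."""
    if isinstance(quantity, str):
        return quantity  # non-numeric quantity, e.g. "dozen"
    if quantity.denominator == 1:
        return int(quantity)
    return float(quantity)


print(to_number(Fraction(3, 4)))    # 0.75
print(to_number(Fraction(170, 1)))  # 170
print(to_number("dozen"))           # dozen
```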

Tokens with the QTY label that are numbers represented in textual form e.g. “one”, “two” are replaced with numeric forms. The replacements are predefined in the STRING_NUMBERS constant. For performance reasons, the regular expressions used to substitute the text with the number are pre-compiled and provided in the STRING_NUMBERS_REGEXES constant, which is a dictionary where the value is a tuple of (pre-compiled regular expression, substitute value).

import re

# Strings and their numeric representation
STRING_NUMBERS = {
    "one-quarter": "1/4",
    "one-half": "1/2",
    "three-quarter": "3/4",
    "three-quarters": "3/4",
    "one": "1",
    "two": "2",
    "three": "3",
    "four": "4",
    "five": "5",
    "six": "6",
    "seven": "7",
    "eight": "8",
    "nine": "9",
    "ten": "10",
    "eleven": "11",
    "twelve": "12",
    "thirteen": "13",
    "fourteen": "14",
    "fifteen": "15",
    "sixteen": "16",
    "seventeen": "17",
    "eighteen": "18",
    "nineteen": "19",
}
# Precompile the regular expressions for matching the string numbers
STRING_NUMBERS_REGEXES = {}
for s, n in STRING_NUMBERS.items():
    # This is case insensitive so it replaces e.g. "one" and "One".
    # The \b anchors mean only whole words are matched, so "one" is not
    # replaced inside a longer word such as "stone".
    STRING_NUMBERS_REGEXES[s] = (re.compile(rf"\b({s})\b", flags=re.IGNORECASE), n)
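Applying the pre-compiled expressions could look like the sketch below. It uses a shortened copy of the dictionary, and `replace_string_numbers` is a hypothetical function name. Because dictionaries preserve insertion order, compound forms such as "one-half" are substituted before "one" gets a chance to match part of them.

```python
import re

# Shortened copy of STRING_NUMBERS, for illustration only.
STRING_NUMBERS = {"one-half": "1/2", "one": "1", "two": "2"}
STRING_NUMBERS_REGEXES = {
    s: (re.compile(rf"\b({s})\b", flags=re.IGNORECASE), n)
    for s, n in STRING_NUMBERS.items()
}


def replace_string_numbers(text):
    """Replace textual numbers with their numeric form."""
    for regex, substitute in STRING_NUMBERS_REGEXES.values():
        text = regex.sub(substitute, text)
    return text


print(replace_string_numbers("One"))       # 1
print(replace_string_numbers("one-half"))  # 1/2
```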

Implicit quantities#

In some sentences, the quantity of the unit is not explicitly stated but is implied by the units in the sentence. For example, in the sentence 15 oz can black beans, there is implicitly 1 can of beans (that contains 15 oz). Another commonly seen pattern is Rosemary sprig (optional), where there is an implicit quantity of 1 due to “sprig” being singular.

In these cases, the quantity is set explicitly to 1 in the IngredientAmount object. To guard against incorrectly assigning the quantity, the unit is checked to make sure it is not plural and the sentence prior to the unit is checked to make sure that it does not include any indefinite quantifiers (e.g. few, some).
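The two guards can be sketched as below. This is an illustration under stated assumptions: `implied_quantity` is a hypothetical function, and the two word sets are small made-up examples rather than the lists the library uses.

```python
from fractions import Fraction

# Illustrative word lists only; assumptions, not the library's data.
INDEFINITE_QUANTIFIERS = {"few", "some"}
PLURAL_UNITS = {"cans", "sprigs", "cloves"}


def implied_quantity(preceding_tokens, unit):
    """Return Fraction(1) if a quantity of 1 can be implied, otherwise None."""
    if unit.lower() in PLURAL_UNITS:
        return None  # a plural unit does not imply exactly one
    if any(tok.lower() in INDEFINITE_QUANTIFIERS for tok in preceding_tokens):
        return None  # "a few sprig ..." is not a quantity of one
    return Fraction(1)


print(implied_quantity(["15", "oz"], "can"))  # 1
print(implied_quantity(["Rosemary"], "sprig"))  # 1
print(implied_quantity(["a", "few"], "sprig"))  # None
```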

Units#

Note

The use of pint.Unit objects can be disabled by setting string_units=True in the parse_ingredient function. When this is True, units will be returned as strings, correctly pluralised for the quantity.

The Pint library is used to standardise the units where possible. If the unit in a parsed IngredientAmount can be matched to a unit in the Pint Unit Registry, then a pint.Unit object is used in place of the unit string.

This has the benefit of standardising units that can be represented in different formats, for example a gram could be represented in the sentence as g, gram, grams. These will all be represented using the same <Unit('gram')> object in the parsed information.

By default, US customary units are used for volumetric measurements that have multiple definitions (e.g. cup, tablespoon etc.). This can be changed to use other unit systems using the volumetric_units_system keyword argument in the parse_ingredient function call. See Options for the available options.

>>> parse_ingredient("3/4 cup heavy cream", volumetric_units_system="us_customary")  # Default
ParsedIngredient(
    name=IngredientText(text='heavy cream', confidence=0.997513),
    size=None,
    amount=[IngredientAmount(quantity=Fraction(3, 4),
                             quantity_max=Fraction(3, 4),
                             unit=<Unit('cup')>,
                             text='0.75 cups',
                             confidence=0.999926,
                             starting_index=0,
                             unit_system=<UnitSystem.US_CUSTOMARY: 'us_customary'>,
                             APPROXIMATE=False,
                             SINGULAR=False,
                             RANGE=False,
                             MULTIPLIER=False,
                             PREPARED_INGREDIENT=False)],
    preparation=None,
    comment=None,
    sentence='3/4 cup heavy cream'
)

>>> parse_ingredient("3/4 cup heavy cream", volumetric_units_system="imperial")
ParsedIngredient(
    name=IngredientText(text='heavy cream', confidence=0.997513),
    size=None,
    amount=[IngredientAmount(quantity=Fraction(3, 4),
                             quantity_max=Fraction(3, 4),
                             unit=<Unit('imperial_cup')>,
                             text='0.75 cups',
                             confidence=0.999926,
                             starting_index=0,
                             unit_system=<UnitSystem.IMPERIAL: 'imperial'>,
                             APPROXIMATE=False,
                             SINGULAR=False,
                             RANGE=False,
                             MULTIPLIER=False,
                             PREPARED_INGREDIENT=False)],
    preparation=None,
    comment=None,
    sentence='3/4 cup heavy cream'
)

Tip

The use of pint.Unit objects means that the ingredient amounts can easily be converted to different units. See the Convert between units how-to guide.

IngredientAmount flags#

IngredientAmount objects have a number of flags that are set to provide additional information about the amount.

Flag

Description

APPROXIMATE

This is set to True when the QTY is preceded by a word such as about or approximately, and indicates that the amount is approximate.

SINGULAR

This is set to True when the amount is followed by a word such as each and indicates that the amount refers to a singular item of the ingredient.

There is also a special case (below), where an inner amount that is inside a QTY-UNIT pair will be marked as SINGULAR.

RANGE

This is set to True when the amount is a range of values, e.g. 1-2, 300-400. In these cases, the quantity field of the IngredientAmount object is set to the lower value in the range and quantity_max is the upper end of the range.

MULTIPLIER

This is set to True when the amount is represented as a multiple such as 1x. The quantity field is set to the value of the multiplier (e.g. for 1x the quantity is 1).

PREPARED_INGREDIENT

This is set to True when the amount refers to the ingredient after any preparation instructions in the ingredient sentence have been followed.

For example in the sentence 1 tbsp chopped nuts, we would want 1 tablespoon of nuts, measured after they have been chopped. If the sentence was 1 tbsp nuts, chopped, we would want to chop the nuts after we have measured 1 tablespoon.
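As an illustration of how the RANGE flag relates to the quantity and quantity_max fields, the sketch below splits a hyphenated range. `split_range` is a hypothetical function that only handles the simple "low-high" form; it is not the library's parsing code.

```python
from fractions import Fraction


def split_range(text):
    """Return (quantity, quantity_max, is_range) for text such as "1-2"."""
    low, sep, high = text.partition("-")
    if sep and low and high:
        return Fraction(low), Fraction(high), True
    # Not a range: quantity_max is the same as quantity.
    value = Fraction(text)
    return value, value, False


print(split_range("1-2"))  # (Fraction(1, 1), Fraction(2, 1), True)
print(split_range("300"))  # (Fraction(300, 1), Fraction(300, 1), False)
```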

Special cases for amounts#

There are some particular cases where the combination of QTY and UNIT labels that make up an amount are not straightforward. For example, consider the sentence 2 14 ounce cans coconut milk. In this case there are two amounts: 2 cans and 14 ounce, where the latter is marked as SINGULAR because it applies to each of the 2 cans.

>>> parsed = parse_ingredient("2 14 ounce cans coconut milk")
>>> parsed.amount
[IngredientAmount(quantity=Fraction(2, 1),
                  quantity_max=Fraction(2, 1),
                  unit='cans',
                  text='2 cans',
                  confidence=0.999897,
                  starting_index=0,
                  unit_system=<UnitSystem.OTHER: 'other'>,
                  APPROXIMATE=False,
                  SINGULAR=False,
                  RANGE=False,
                  MULTIPLIER=False,
                  PREPARED_INGREDIENT=False),
 IngredientAmount(quantity=Fraction(14, 1),
                  quantity_max=Fraction(14, 1),
                  unit=<Unit('ounce')>,
                  text='14 ounces',
                  confidence=0.998793,
                  starting_index=1,
                  unit_system=<UnitSystem.US_CUSTOMARY: 'us_customary'>,
                  APPROXIMATE=False,
                  SINGULAR=True,
                  RANGE=False,
                  MULTIPLIER=False,
                  PREPARED_INGREDIENT=False)]

Identifying and handling this pattern of QTY and UNIT labels is done by the PostProcessor._sizable_unit_pattern() function.
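The idea behind this pattern can be sketched as a search for the label sequence QTY, QTY, UNIT, UNIT. `split_nested_amounts` is a hypothetical function written for this example, and it returns plain tuples; the library's function may recognise more variations of the pattern.

```python
PATTERN = ["QTY", "QTY", "UNIT", "UNIT"]


def split_nested_amounts(tokens, labels):
    """Return (quantity, unit, singular) tuples for each QTY QTY UNIT UNIT run."""
    amounts = []
    for i in range(len(labels) - len(PATTERN) + 1):
        if labels[i : i + len(PATTERN)] == PATTERN:
            # Outer amount: first quantity with the last unit, e.g. 2 cans.
            amounts.append((tokens[i], tokens[i + 3], False))
            # Inner amount: applies to each outer item, so it is singular.
            amounts.append((tokens[i + 1], tokens[i + 2], True))
    return amounts


tokens = ["2", "14", "ounce", "cans", "coconut", "milk"]
labels = ["QTY", "QTY", "UNIT", "UNIT", "B_NAME_TOK", "I_NAME_TOK"]
print(split_nested_amounts(tokens, labels))
# [('2', 'cans', False), ('14', 'ounce', True)]
```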

A second case is where the full amount is made up of more than one adjacent quantity-unit pair. This is particularly common with US customary units such as pounds and ounces, or pints and fluid ounces. In these cases, a CompositeIngredientAmount is returned. For example:

>>> parsed = parse_ingredient("1lb 2oz pecorino romano cheese")
>>> parsed.amount
[CompositeIngredientAmount(
    amounts=[
        IngredientAmount(quantity=Fraction(1, 1),
                         quantity_max=Fraction(1, 1),
                         unit=<Unit('pound')>,
                         text='1 lb',
                         confidence=0.999923,
                         starting_index=0,
                         unit_system=<UnitSystem.US_CUSTOMARY: 'us_customary'>,
                         APPROXIMATE=False,
                         SINGULAR=False,
                         RANGE=False,
                         MULTIPLIER=False,
                         PREPARED_INGREDIENT=False),
        IngredientAmount(quantity=Fraction(2, 1),
                         quantity_max=Fraction(2, 1),
                         unit=<Unit('ounce')>,
                         text='2 oz',
                         confidence=0.998968,
                         starting_index=2,
                         unit_system=<UnitSystem.US_CUSTOMARY: 'us_customary'>,
                         APPROXIMATE=False,
                         SINGULAR=False,
                         RANGE=False,
                         MULTIPLIER=False,
                         PREPARED_INGREDIENT=False)],
    join='',
    subtractive=False,
    text='1 lb 2 oz',
    confidence=0.9994455,
    starting_index=0,
    unit_system=<UnitSystem.US_CUSTOMARY: 'us_customary'>
)]
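To show what a composite amount represents, the sketch below adds the parts of 1 lb 2 oz together using only the standard library. `combine_in_ounces` and the conversion table are illustrative assumptions, and the unit names are plain strings rather than pint.Unit objects.

```python
from fractions import Fraction

# Ounces per unit (avoirdupois); only the units needed for this example.
TO_OUNCES = {"pound": Fraction(16), "ounce": Fraction(1)}


def combine_in_ounces(amounts):
    """Sum (quantity, unit) pairs into a single total in ounces."""
    return sum((qty * TO_OUNCES[unit] for qty, unit in amounts), Fraction(0))


parts = [(Fraction(1), "pound"), (Fraction(2), "ounce")]
print(combine_in_ounces(parts))  # 18
```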