Post-processing the model output

The output from the model is a list of labels and scores, one for each token in the input sentence. This needs to be turned into a more useful data structure so that the output can be used by the users of this library.

The parse_ingredient function returns the following dataclass:

@dataclass
class ParsedIngredient:
    """Dataclass for holding the parsed values for an input sentence.

    Attributes
    ----------

    name : IngredientText | None
        Ingredient name parsed from input sentence.
        If no ingredient name was found, this is None.
    size : IngredientText | None
        Size modifier of the ingredient, such as small or large.
        If no size modifier, this is None.
    amount : list[IngredientAmount]
        List of IngredientAmount objects, each representing a matching quantity and
        unit pair parsed from the sentence.
    preparation : IngredientText | None
        Ingredient preparation instructions parsed from sentence.
        If no ingredient preparation instruction was found, this is None.
    comment : IngredientText | None
        Ingredient comment parsed from input sentence.
        If no ingredient comment was found, this is None.
    sentence : str
        Normalised input sentence
    """

    name: IngredientText | None
    size: IngredientText | None
    amount: list[IngredientAmount]
    preparation: IngredientText | None
    comment: IngredientText | None
    sentence: str

Each of the fields in the dataclass has to be determined from the output of the model. The PostProcessor class handles this for us.

Name, Size, Preparation, Comment

For each of the labels NAME, SIZE, PREP, and COMMENT, the process of combining the tokens with that label is the same.

The general steps are as follows:

  1. Find the indices of the labels under consideration.

  2. Group these indices into lists of consecutive indices.

  3. Join the tokens corresponding to each group of consecutive indices with a space.

  4. If discard_isolated_stop_words is True, discard any group that consists solely of a word from the list of stop words.

  5. Average the confidence scores of the tokens in each group of consecutive indices.

  6. Remove any group that is an isolated punctuation mark or that is identical to the group before it.

  7. Join all the groups together with a comma and fix any punctuation problems this causes.

  8. Average the confidence scores across all groups.

    def _postprocess(self, selected: str) -> IngredientText | None:
        """Process tokens, labels and scores with selected label into an
        IngredientText object.

        Parameters
        ----------
        selected : str
            Label of tokens to postprocess

        Returns
        -------
        IngredientText | None
            Object containing the combined text and confidence for the selected
            label, or None if no tokens remain after processing.
        """
        # Select indices of tokens, labels and scores for selected label
        # Do not include tokens, labels and scores in self.consumed
        idx = [
            i
            for i, label in enumerate(self.labels)
            if label in [selected, "PUNC"] and i not in self.consumed
        ]

        # Join consecutive tokens together and average their score
        parts = []
        confidence_parts = []
        for group in self._group_consecutive_idx(idx):
            idx = list(group)
            joined = " ".join([self.tokens[i] for i in idx])
            confidence = mean([self.scores[i] for i in idx])

            if self.discard_isolated_stop_words and joined in STOP_WORDS:
                # Discard part if it's a stop word
                continue

            parts.append(joined)
            confidence_parts.append(confidence)

        # Find the indices of the elements of the joined tokens list to keep,
        # i.e. those that are not a single punctuation mark and are not the
        # same as the previous element in the list
        keep_idx = self._remove_isolated_punctuation_and_duplicate_indices(parts)
        parts = [parts[i] for i in keep_idx]
        confidence_parts = [confidence_parts[i] for i in keep_idx]

        # Join all the parts together into a single string and fix any
        # punctuation weirdness as a result.
        text = ", ".join(parts)
        text = self._fix_punctuation(text)

        if len(parts) == 0:
            return None

        return IngredientText(
            text=text,
            confidence=round(mean(confidence_parts), 6),
        )

The output of this function is an IngredientText object:

@dataclass
class IngredientText:
    """Dataclass for holding a parsed ingredient string, comprising the following
    attributes.

    Attributes
    ----------
    text : str
        Parsed text from ingredient.
        This comprises all tokens with the same label.
    confidence : float
        Confidence of parsed ingredient text, between 0 and 1.
        This is the average confidence of all tokens that contribute to this object.
    """

    text: str
    confidence: float
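
The steps listed above can also be tried out without the library. The following is a minimal, self-contained sketch under stated assumptions, not the library's implementation: group_consecutive and combine are hypothetical names, the STOP_WORDS and PUNCTUATION sets are small stand-ins for the real lists, PUNC tokens are ignored rather than selected, and the punctuation clean-up of step 7 is omitted.

```python
from itertools import groupby
from statistics import mean

# Hypothetical stand-ins for illustration; the library's own lists differ.
STOP_WORDS = {"and", "or", "the", "of"}
PUNCTUATION = {",", ".", ";", ":", "(", ")"}


def group_consecutive(indices):
    """Yield lists of consecutive integers, e.g. [0, 1, 4] -> [0, 1], [4]."""
    # Consecutive values share the same (value - position) difference.
    for _, group in groupby(enumerate(indices), key=lambda pair: pair[1] - pair[0]):
        yield [value for _, value in group]


def combine(tokens, labels, scores, selected, discard_isolated_stop_words=True):
    """Return (text, confidence) for the selected label, or None if nothing remains."""
    # Step 1: indices of the tokens with the selected label.
    idx = [i for i, label in enumerate(labels) if label == selected]

    parts, confidences = [], []
    # Steps 2 and 3: group consecutive indices and join each group's tokens.
    for group in group_consecutive(idx):
        joined = " ".join([tokens[i] for i in group])
        # Step 4: drop groups that are only a stop word.
        if discard_isolated_stop_words and joined.lower() in STOP_WORDS:
            continue
        # Step 6: drop isolated punctuation and repeats of the previous part.
        if joined in PUNCTUATION or (parts and parts[-1] == joined):
            continue
        parts.append(joined)
        # Step 5: average the token scores within the group.
        confidences.append(mean([scores[i] for i in group]))

    if not parts:
        return None
    # Steps 7 and 8: join the groups and average their confidences.
    return ", ".join(parts), round(mean(confidences), 6)


tokens = ["peeled", "carrots", ",", "finely", "diced"]
labels = ["PREP", "NAME", "PUNC", "PREP", "PREP"]
scores = [1.0, 0.8, 1.0, 0.5, 1.0]
print(combine(tokens, labels, scores, "PREP"))  # ('peeled, finely diced', 0.875)
```

The PREP tokens form two groups, "peeled" and "finely diced", because the NAME and PUNC tokens between them break the run of consecutive indices.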

Amount

The QTY and UNIT labels are combined into an IngredientAmount object:

@dataclass
class IngredientAmount:
    """Dataclass for holding a parsed ingredient amount.

    On instantiation, the unit is made plural if necessary.

    Attributes
    ----------
    quantity : float | str
        Parsed ingredient quantity, as a float where possible, otherwise a string.
        If the amount is a range, this is the lower limit of the range.
    quantity_max : float | str
        If the amount is a range, this is the upper limit of the range.
        Otherwise, this is the same as the quantity field.
        This is set automatically depending on the type of quantity.
    unit : str | pint.Unit
        Unit of parsed ingredient quantity.
        If the unit is recognised in the pint unit registry, a pint.Unit
        object is used.
    text : str
        String describing the amount e.g. "1 cup"
    confidence : float
        Confidence of parsed ingredient amount, between 0 and 1.
        This is the average confidence of all tokens that contribute to this object.
    APPROXIMATE : bool, optional
        When True, indicates that the amount is approximate.
        Default is False.
    SINGULAR : bool, optional
        When True, indicates that the amount refers to a singular item of the ingredient.
        Default is False.
    RANGE : bool, optional
        When True, indicates the amount is a range e.g. 1-2.
        Default is False.
    MULTIPLIER : bool, optional
        When True, indicates the amount is a multiplier e.g. 1x, 2x.
        Default is False.
    """

    quantity: float | str
    quantity_max: float | str = field(init=False)
    unit: str | pint.Unit
    text: str
    confidence: float
    starting_index: InitVar[int]
    APPROXIMATE: bool = False
    SINGULAR: bool = False
    RANGE: bool = False
    MULTIPLIER: bool = False

    def __post_init__(self, starting_index):
        """
        If required, make the unit plural.
        Set the value for quantity_max and set the RANGE and MULTIPLIER flags
        as required by the type of quantity.
        """
        if is_float(self.quantity):
            # If float, set quantity_max = quantity
            self.quantity = float(self.quantity)
            self.quantity_max = self.quantity
        elif is_range(self.quantity):
            # If range, set quantity to min of range, set quantity_max to max
            # of range, set RANGE flag to True
            range_parts = [float(x) for x in self.quantity.split("-")]
            self.quantity = min(range_parts)
            self.quantity_max = max(range_parts)
            self.RANGE = True
        elif self.quantity.endswith("x"):
            # If multiplier, set quantity and quantity_max to value without 'x', and
            # set MULTIPLIER flag.
            self.quantity = float(self.quantity[:-1])
            self.quantity_max = self.quantity
            self.MULTIPLIER = True
        else:
            # Fallback to setting quantity_max to quantity
            self.quantity_max = self.quantity

        # Pluralise unit as necessary
        if self.quantity != 1 and self.quantity != "":
            self.text = pluralise_units(self.text)
            if isinstance(self.unit, str):
                self.unit = pluralise_units(self.unit)

        # Assign starting_index to _starting_index
        self._starting_index = starting_index

In most cases, the amounts are determined by combining a QTY label with the UNIT labels that follow it, up to the next QTY, which starts a new amount. For example:

>>> p = PreProcessor("3/4 cup (170g) heavy cream")
>>> p.tokenized_sentence
['0.75', 'cup', '(', '170', 'g', ')', 'heavy', 'cream']
...
>>> parsed = PostProcessor(sentence, tokens, labels, scores).parsed()
>>> parsed.amount
[IngredientAmount(quantity='0.75', unit=<Unit('cup')>, text='0.75 cups', confidence=0.999921, APPROXIMATE=False, SINGULAR=False),
IngredientAmount(quantity='170', unit=<Unit('gram')>, text='170 g', confidence=0.996724, APPROXIMATE=False, SINGULAR=False)]

There are two amounts identified: 0.75 cups and 170 g.
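
The pairing rule described above can be sketched in a few lines. This is an illustrative sketch only: group_amounts is a hypothetical helper, and the labels are assumed for the example sentence rather than taken from real model output.

```python
def group_amounts(tokens, labels):
    """Pair each QTY token with the UNIT tokens that follow it."""
    amounts = []
    for token, label in zip(tokens, labels):
        if label == "QTY":
            # A new quantity always starts a new amount.
            amounts.append({"quantity": token, "unit": []})
        elif label == "UNIT" and amounts:
            # Units attach to the most recent quantity.
            amounts[-1]["unit"].append(token)
    return [(a["quantity"], " ".join(a["unit"])) for a in amounts]


tokens = ["0.75", "cup", "(", "170", "g", ")", "heavy", "cream"]
# Assumed labels for illustration.
labels = ["QTY", "UNIT", "PUNC", "QTY", "UNIT", "PUNC", "NAME", "NAME"]
print(group_amounts(tokens, labels))  # [('0.75', 'cup'), ('170', 'g')]
```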

Units

Note

The use of pint.Unit objects can be disabled by setting string_units=True in the parse_ingredient function. When this is True, units will be returned as strings, correctly pluralised for the quantity.

The pint library is used to standardise the units where possible. If the unit in a parsed IngredientAmount can be matched to a unit in the pint Unit Registry, then a pint.Unit object is used in place of the unit string.

This standardises units that can be written in different ways. For example, a gram could appear in the sentence as g, gram, or grams; all of these are represented by the same <Unit('gram')> object in the parsed information.

It also makes it straightforward to convert the parsed amounts between units. For example:

>>> p = parse_ingredient("3/4 cup heavy cream")
>>> q = float(p.amount[0].quantity) * p.amount[0].unit
>>> q
0.75 <Unit('cup')>
>>> q.to("ml")
177.44117737499994 <Unit('milliliter')>

By default, the US customary definition is used where a unit has more than one definition. This can be changed to the Imperial definition by setting imperial_units=True in the parse_ingredient function call.

>>> parse_ingredient("3/4 cup heavy cream", imperial_units=False)  # Default
ParsedIngredient(
    name=IngredientText(text='heavy cream', confidence=0.998078),
    amount=[IngredientAmount(quantity=0.75,
                             unit=<Unit('cup')>,
                             text='0.75 cups',
                             confidence=0.99993,
                             APPROXIMATE=False,
                             SINGULAR=False)],
    preparation=None,
    comment=None,
    sentence='3/4 cup heavy cream'
)
>>> parse_ingredient("3/4 cup heavy cream", imperial_units=True)
ParsedIngredient(
    name=IngredientText(text='heavy cream', confidence=0.998078),
    amount=[IngredientAmount(quantity=0.75,
                             unit=<Unit('imperial_cup')>,
                             text='0.75 cups',
                             confidence=0.99993,
                             APPROXIMATE=False,
                             SINGULAR=False)],
    preparation=None,
    comment=None,
    sentence='3/4 cup heavy cream'
)
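
The two results look the same apart from the unit, but they describe different volumes. The plain-Python sketch below illustrates the difference without using pint. It assumes the standard conversion factors, with the imperial cup taken as half an imperial pint; cups_to_ml is a hypothetical helper, not part of the library.

```python
# Millilitres per cup under each definition.
US_CUP_ML = 236.5882365       # US customary cup (1/16 of a US gallon)
IMPERIAL_CUP_ML = 284.130625  # assumed: half an imperial pint


def cups_to_ml(cups, imperial=False):
    """Convert a number of cups to millilitres under the chosen definition."""
    return cups * (IMPERIAL_CUP_ML if imperial else US_CUP_ML)


print(round(cups_to_ml(0.75), 2))                 # 177.44
print(round(cups_to_ml(0.75, imperial=True), 2))  # 213.1
```

The same 3/4 cup is roughly 36 ml larger under the Imperial definition, which is why the choice matters when converting parsed amounts.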

Tip

The use of pint.Unit objects means that the ingredient amounts can easily be converted to different units.

>>> parsed = parse_ingredient("3 pounds beef brisket")
>>> # Create a pint.Quantity object from the quantity and unit
>>> q = parsed.amount[0].quantity * parsed.amount[0].unit
>>> q
3.0 <Unit('pound')>

>>> # Convert to kg
>>> q.to("kg")
1.3607771100000003 <Unit('kilogram')>

IngredientAmount flags

IngredientAmount objects have a number of flags that can be set.

APPROXIMATE

This is set to True when the QTY is preceded by a word such as about or approximately, and indicates that the amount is approximate.

SINGULAR

This is set to True when the amount is followed by a word such as each and indicates that the amount refers to a singular item of the ingredient.

There is also a special case (below), where an inner amount inside a QTY-UNIT pair is marked as SINGULAR.

RANGE

This is set to True when the amount is a range of values, e.g. 1-2 or 300-400. In these cases, the quantity field of the IngredientAmount object is set to the lower end of the range and quantity_max to the upper end.

MULTIPLIER

This is set to True when the amount is represented as a multiple, such as 1x. The quantity field is set to the value of the multiplier (e.g. 1x becomes 1).
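
How a quantity string maps onto quantity, quantity_max and the RANGE and MULTIPLIER flags can be sketched as a standalone function. This is an approximation of the behaviour described above, not the library's code: classify_quantity is a hypothetical name, and the parsing rules here stand in for the is_float and is_range helpers, whose definitions are not shown in this document.

```python
def classify_quantity(quantity):
    """Return (quantity, quantity_max, flags) for a quantity string."""
    flags = {"RANGE": False, "MULTIPLIER": False}

    # Plain number: quantity_max equals quantity.
    try:
        value = float(quantity)
        return value, value, flags
    except ValueError:
        pass

    # Range such as "1-2": lower end is quantity, upper end is quantity_max.
    parts = quantity.split("-")
    if len(parts) == 2:
        try:
            low, high = sorted(float(p) for p in parts)
        except ValueError:
            pass
        else:
            flags["RANGE"] = True
            return low, high, flags

    # Multiplier such as "2x": strip the trailing "x".
    if quantity.endswith("x"):
        try:
            value = float(quantity[:-1])
        except ValueError:
            pass
        else:
            flags["MULTIPLIER"] = True
            return value, value, flags

    # Fallback: keep the string unchanged.
    return quantity, quantity, flags


print(classify_quantity("300-400"))  # (300.0, 400.0, {'RANGE': True, 'MULTIPLIER': False})
print(classify_quantity("2x"))       # (2.0, 2.0, {'RANGE': False, 'MULTIPLIER': True})
```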

Special cases for amounts

There are some particular cases where the combination of QTY and UNIT labels that make up an amount are not straightforward. For example, consider the sentence 2 14 ounce cans coconut milk. In this case there are two amounts: 2 cans and 14 ounce, where the latter is marked as SINGULAR because it applies to each of the 2 cans.

>>> parsed = parse_ingredient("2 14 ounce cans coconut milk")
>>> parsed.amount
[IngredientAmount(quantity=2.0, unit='cans', text='2 cans', confidence=0.999835, APPROXIMATE=False, SINGULAR=False),
IngredientAmount(quantity=14.0, unit=<Unit('ounce')>, text='14 ounces', confidence=0.998503, APPROXIMATE=False, SINGULAR=True)]

Identifying and handling this pattern of QTY and UNIT labels is done by the PostProcessor._sizable_unit_pattern() function.
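
To give a feel for what such pattern matching involves, here is a rough sketch. It is not the library's implementation: the function name, the QTY QTY UNIT UNIT pattern, and the SIZABLE_UNITS set are all assumptions made for illustration, and the labels are assumed rather than taken from the model.

```python
# Hypothetical list of container-like units, for illustration only.
SIZABLE_UNITS = {"can", "cans", "tin", "tins", "jar", "jars", "bag", "bags"}


def find_sizable_unit_pattern(tokens, labels):
    """Return (outer, inner) amounts for the first QTY QTY UNIT UNIT match, else None."""
    pattern = ["QTY", "QTY", "UNIT", "UNIT"]
    for start in range(len(labels) - len(pattern) + 1):
        if labels[start:start + len(pattern)] != pattern:
            continue
        outer_qty, inner_qty, inner_unit, outer_unit = tokens[start:start + 4]
        # Only treat it as nested if the last unit is a container.
        if outer_unit.lower() not in SIZABLE_UNITS:
            continue
        outer = {"quantity": outer_qty, "unit": outer_unit, "SINGULAR": False}
        # The inner amount applies to each container, so it is SINGULAR.
        inner = {"quantity": inner_qty, "unit": inner_unit, "SINGULAR": True}
        return outer, inner
    return None


tokens = ["2", "14", "ounce", "cans", "coconut", "milk"]
labels = ["QTY", "QTY", "UNIT", "UNIT", "NAME", "NAME"]  # assumed labels
outer, inner = find_sizable_unit_pattern(tokens, labels)
print(outer)  # {'quantity': '2', 'unit': 'cans', 'SINGULAR': False}
print(inner)  # {'quantity': '14', 'unit': 'ounce', 'SINGULAR': True}
```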