Post-processing the model output#
The output from the model is a list of labels and scores, one for each token in the input sentence. This list needs to be turned into a more useful data structure before it is returned to users of the library.
The parse_ingredient function returns the following dataclass:
@dataclass
class ParsedIngredient:
    """Dataclass for holding the parsed values for an input sentence.

    Attributes
    ----------
    name : IngredientText | None
        Ingredient name parsed from input sentence.
        If no ingredient name was found, this is None.
    size : IngredientText | None
        Size modifier of ingredients, such as small or large.
        If no size modifier, this is None.
    amount : list[IngredientAmount]
        List of IngredientAmount objects, each representing a matching quantity and
        unit pair parsed from the sentence.
    preparation : IngredientText | None
        Ingredient preparation instructions parsed from sentence.
        If no ingredient preparation instruction was found, this is None.
    comment : IngredientText | None
        Ingredient comment parsed from input sentence.
        If no ingredient comment was found, this is None.
    sentence : str
        Normalised input sentence
    """

    name: IngredientText | None
    size: IngredientText | None
    amount: list[IngredientAmount]
    preparation: IngredientText | None
    comment: IngredientText | None
    sentence: str
Each of the fields in the dataclass has to be determined from the output of the model. The PostProcessor class handles this.
Name, Size, Preparation, Comment#
For each of the labels NAME, SIZE, PREP, and COMMENT, the process of combining the tokens for each label is the same.
The general steps are as follows:
Find the indices of the labels under consideration.
Group these indices into lists of consecutive indices.
Join the tokens corresponding to each group of consecutive indices with a space.
If discard_isolated_stop_words is True, discard any groups that comprise only a word from the list of stop words.
Average the confidence scores of the tokens in each group of consecutive indices.
Remove any isolated punctuation or any consecutive tokens that are identical.
Join all the groups together with a comma and fix any weird punctuation this causes.
Average the confidence scores across all groups.
def _postprocess(self, selected: str) -> IngredientText | None:
    """Process tokens, labels and scores with selected label into an
    IngredientText object.

    Parameters
    ----------
    selected : str
        Label of tokens to postprocess

    Returns
    -------
    IngredientText | None
        Object containing the text and confidence for the selected label,
        or None if no tokens remain after processing.
    """
    # Select indices of tokens, labels and scores for selected label
    # Do not include tokens, labels and scores in self.consumed
    idx = [
        i
        for i, label in enumerate(self.labels)
        if label in [selected, "PUNC"] and i not in self.consumed
    ]

    # Join consecutive tokens together and average their score
    parts = []
    confidence_parts = []
    for group in self._group_consecutive_idx(idx):
        idx = list(group)
        joined = " ".join([self.tokens[i] for i in idx])
        confidence = mean([self.scores[i] for i in idx])

        if self.discard_isolated_stop_words and joined in STOP_WORDS:
            # Discard part if it's a stop word
            continue

        parts.append(joined)
        confidence_parts.append(confidence)

    # Find the indices of the joined tokens list where the element
    # is a single punctuation mark or is the same as the previous element
    # in the list, and keep only the other indices
    keep_idx = self._remove_isolated_punctuation_and_duplicate_indices(parts)
    parts = [parts[i] for i in keep_idx]
    confidence_parts = [confidence_parts[i] for i in keep_idx]

    # Join all the parts together into a single string and fix any
    # punctuation weirdness as a result.
    text = ", ".join(parts)
    text = self._fix_punctuation(text)

    if len(parts) == 0:
        return None

    return IngredientText(
        text=text,
        confidence=round(mean(confidence_parts), 6),
    )
The output of this function is an IngredientText object:
@dataclass
class IngredientText:
    """Dataclass for holding a parsed ingredient string, comprising the following
    attributes.

    Attributes
    ----------
    text : str
        Parsed text from ingredient.
        This is comprised of all tokens with the same label.
    confidence : float
        Confidence of parsed ingredient text, between 0 and 1.
        This is the average confidence of all tokens that contribute to this object.
    """

    text: str
    confidence: float
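The grouping, joining, and averaging steps listed at the start of this section can be tried in isolation. The following is a minimal standalone sketch and not the library's implementation: group_consecutive stands in for the _group_consecutive_idx helper (whose source is not shown here), join_label is a hypothetical simplification of _postprocess that ignores PUNC tokens and punctuation fixing, and the STOP_WORDS set is an assumed placeholder.

```python
from itertools import groupby
from statistics import mean

# Assumed placeholder list; the library's real stop word list is not shown here.
STOP_WORDS = {"and", "or", "the", "of"}


def group_consecutive(indices):
    """Yield lists of consecutive integers from a sorted list of indices."""
    # Consecutive integers share the same (value - position) difference.
    for _, run in groupby(enumerate(indices), key=lambda pair: pair[1] - pair[0]):
        yield [value for _, value in run]


def join_label(tokens, labels, scores, selected, discard_stop_words=True):
    """Join the tokens carrying the selected label and average their scores."""
    idx = [i for i, label in enumerate(labels) if label == selected]
    parts, confidences = [], []
    for group in group_consecutive(idx):
        joined = " ".join(tokens[i] for i in group)
        if discard_stop_words and joined in STOP_WORDS:
            continue  # Discard a group that is only a stop word
        if parts and parts[-1] == joined:
            continue  # Discard a group identical to the previous one
        parts.append(joined)
        confidences.append(mean([scores[i] for i in group]))
    if not parts:
        return None
    return ", ".join(parts), round(mean(confidences), 6)


tokens = ["1", "cup", "red", "onion", "or", "white", "onion"]
labels = ["QTY", "UNIT", "NAME", "NAME", "COMMENT", "NAME", "NAME"]
scores = [0.99, 0.98, 0.9, 0.8, 0.7, 0.6, 0.5]

print(list(group_consecutive([2, 3, 5, 6])))  # [[2, 3], [5, 6]]
print(join_label(tokens, labels, scores, "NAME"))
print(join_label(tokens, labels, scores, "COMMENT"))  # None, "or" is a stop word
```

The two NAME groups are joined with a comma, and the reported confidence is the mean of the per-group means, so a long group and a short group contribute equally.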
Amount#
The QTY and UNIT labels are combined into an IngredientAmount object:
@dataclass
class IngredientAmount:
    """Dataclass for holding a parsed ingredient amount.

    On instantiation, the unit is made plural if necessary.

    Attributes
    ----------
    quantity : float | str
        Parsed ingredient quantity, as a float where possible, otherwise a string.
        If the amount is a range, this is the lower limit of the range.
    quantity_max : float | str
        If the amount is a range, this is the upper limit of the range.
        Otherwise, this is the same as the quantity field.
        This is set automatically depending on the type of quantity.
    unit : str | pint.Unit
        Unit of parsed ingredient quantity.
        If the unit is recognised in the pint unit registry, a pint.Unit
        object is used.
    text : str
        String describing the amount e.g. "1 cup"
    confidence : float
        Confidence of parsed ingredient amount, between 0 and 1.
        This is the average confidence of all tokens that contribute to this object.
    APPROXIMATE : bool, optional
        When True, indicates that the amount is approximate.
        Default is False.
    SINGULAR : bool, optional
        When True, indicates that the amount refers to a singular item of the
        ingredient.
        Default is False.
    RANGE : bool, optional
        When True, indicates the amount is a range e.g. 1-2.
        Default is False.
    MULTIPLIER : bool, optional
        When True, indicates the amount is a multiplier e.g. 1x, 2x.
        Default is False.
    """

    quantity: float | str
    quantity_max: float | str = field(init=False)
    unit: str | pint.Unit
    text: str
    confidence: float
    starting_index: InitVar[int]
    APPROXIMATE: bool = False
    SINGULAR: bool = False
    RANGE: bool = False
    MULTIPLIER: bool = False

    def __post_init__(self, starting_index):
        """
        If required, make the unit plural.

        Set the value for quantity_max and set the RANGE and MULTIPLIER flags
        as required by the type of quantity.
        """
        if is_float(self.quantity):
            # If float, set quantity_max = quantity
            self.quantity = float(self.quantity)
            self.quantity_max = self.quantity
        elif is_range(self.quantity):
            # If range, set quantity to min of range, set quantity_max to max
            # of range, set RANGE flag to True
            range_parts = [float(x) for x in self.quantity.split("-")]
            self.quantity = min(range_parts)
            self.quantity_max = max(range_parts)
            self.RANGE = True
        elif self.quantity.endswith("x"):
            # If multiplier, set quantity and quantity_max to value without 'x', and
            # set MULTIPLIER flag.
            self.quantity = float(self.quantity[:-1])
            self.quantity_max = self.quantity
            self.MULTIPLIER = True
        else:
            # Fallback to setting quantity_max to quantity
            self.quantity_max = self.quantity

        # Pluralise unit as necessary
        if self.quantity != 1 and self.quantity != "":
            self.text = pluralise_units(self.text)
            if isinstance(self.unit, str):
                self.unit = pluralise_units(self.unit)

        # Assign starting_index to _starting_index
        self._starting_index = starting_index
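The quantity handling in __post_init__ can be explored without constructing an IngredientAmount. The sketch below mirrors the four branches as a plain function. The helpers is_float and is_range are simplified stand-ins written for this example (the library's own versions are not shown here), and normalise_quantity is a hypothetical name. Unlike the code above, the multiplier branch also checks that the text before the "x" is numeric, so a word ending in "x" falls through to the fallback.

```python
def is_float(value: str) -> bool:
    """Return True if the string can be converted to a float."""
    try:
        float(value)
    except ValueError:
        return False
    return True


def is_range(value: str) -> bool:
    """Return True if the string looks like two numbers joined by a hyphen."""
    parts = value.split("-")
    return len(parts) == 2 and all(is_float(part) for part in parts)


def normalise_quantity(quantity: str):
    """Return (quantity, quantity_max, RANGE, MULTIPLIER) for a quantity string."""
    if is_float(quantity):
        value = float(quantity)
        return value, value, False, False
    if is_range(quantity):
        low, high = sorted(float(part) for part in quantity.split("-"))
        return low, high, True, False
    if quantity.endswith("x") and is_float(quantity[:-1]):
        value = float(quantity[:-1])
        return value, value, False, True
    # Fallback: keep the original string for both fields
    return quantity, quantity, False, False


for text in ["0.75", "1-2", "2x", "dozen"]:
    print(text, "->", normalise_quantity(text))
```

A plain number and a multiplier both end up with quantity equal to quantity_max; only a range produces two different values.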
In most cases, an amount is determined by combining a QTY label with the UNIT labels that follow it, up to the next QTY, which starts a new amount. For example:
>>> p = PreProcessor("3/4 cup (170g) heavy cream")
>>> p.tokenized_sentence
['0.75', 'cup', '(', '170', 'g', ')', 'heavy', 'cream']
...
>>> parsed = PostProcessor(sentence, tokens, labels, scores).parsed()
>>> parsed.amount
[IngredientAmount(quantity='0.75', unit=<Unit('cup')>, text='0.75 cups', confidence=0.999921, APPROXIMATE=False, SINGULAR=False),
 IngredientAmount(quantity='170', unit=<Unit('gram')>, text='170 g', confidence=0.996724, APPROXIMATE=False, SINGULAR=False)]
There are two amounts identified: 0.75 cups and 170 g.
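The basic pairing rule can be sketched in a few lines. This is an illustration under stated assumptions, not the PostProcessor code: group_amounts is a hypothetical function, and the label sequence used for the example sentence is assumed rather than taken from real model output.

```python
def group_amounts(tokens, labels):
    """Pair each QTY token with the UNIT tokens that follow it.

    A new amount starts at every QTY. UNIT tokens are attached to the most
    recent QTY, and UNIT tokens that appear before any QTY are ignored.
    """
    amounts = []
    current = None
    for token, label in zip(tokens, labels):
        if label == "QTY":
            current = {"quantity": token, "units": []}
            amounts.append(current)
        elif label == "UNIT" and current is not None:
            current["units"].append(token)
    return [(a["quantity"], " ".join(a["units"])) for a in amounts]


tokens = ["0.75", "cup", "(", "170", "g", ")", "heavy", "cream"]
# Assumed labels for illustration only
labels = ["QTY", "UNIT", "PUNC", "QTY", "UNIT", "PUNC", "NAME", "NAME"]

print(group_amounts(tokens, labels))  # [('0.75', 'cup'), ('170', 'g')]
```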
Units#
Note
The use of pint.Unit objects can be disabled by setting string_units=True in the parse_ingredient function. When this is True, units will be returned as strings, correctly pluralised for the quantity.
The pint library is used to standardise the units where possible. If the unit in a parsed IngredientAmount can be matched to a unit in the pint Unit Registry, then a pint.Unit object is used in place of the unit string.
This standardises units that can be written in different ways. For example, a gram could appear in the sentence as g, gram, or grams. These will all be represented using the same <Unit('gram')> object in the parsed information.
This has benefits if you wish to use the parsed information to convert between different units. For example:
>>> p = parse_ingredient("3/4 cup heavy cream")
>>> q = float(p.amount[0].quantity) * p.amount[0].unit
>>> q
0.75 <Unit('cup')>
>>> q.to("ml")
177.44117737499994 <Unit('milliliter')>
By default, the US customary versions of units are used where a unit has more than one definition. This can be changed to use the Imperial definitions by setting imperial_units=True in the parse_ingredient function call.
>>> parse_ingredient("3/4 cup heavy cream", imperial_units=False)  # Default
ParsedIngredient(
    name=IngredientText(text='heavy cream', confidence=0.998078),
    amount=[IngredientAmount(quantity=0.75,
                             unit=<Unit('cup')>,
                             text='0.75 cups',
                             confidence=0.99993,
                             APPROXIMATE=False,
                             SINGULAR=False)],
    preparation=None,
    comment=None,
    sentence='3/4 cup heavy cream'
)
>>> parse_ingredient("3/4 cup heavy cream", imperial_units=True)
ParsedIngredient(
    name=IngredientText(text='heavy cream', confidence=0.998078),
    amount=[IngredientAmount(quantity=0.75,
                             unit=<Unit('imperial_cup')>,
                             text='0.75 cups',
                             confidence=0.99993,
                             APPROXIMATE=False,
                             SINGULAR=False)],
    preparation=None,
    comment=None,
    sentence='3/4 cup heavy cream'
)
Tip
The use of pint.Unit objects means that the ingredient amounts can easily be converted to different units.
>>> parsed = parse_ingredient("3 pounds beef brisket")
>>> # Create a pint.Quantity object from the quantity and unit
>>> q = parsed.amount[0].quantity * parsed.amount[0].unit
>>> q
3.0 <Unit('pound')>
>>> # Convert to kg
>>> q.to("kg")
1.3607771100000003 <Unit('kilogram')>
IngredientAmount flags#
IngredientAmount objects have a number of flags that can be set.
APPROXIMATE
This is set to True when the QTY is preceded by a word such as about or approximately, and indicates that the amount is approximate.
SINGULAR
This is set to True when the amount is followed by a word such as each, and indicates that the amount refers to a singular item of the ingredient.
There is also a special case (below), where an inner amount that sits inside a QTY-UNIT pair is marked as SINGULAR.
RANGE
This is set to True when the amount is a range of values, e.g. 1-2, 300-400. In these cases, the quantity field of the IngredientAmount object is set to the lower value in the range and quantity_max is set to the upper value.
MULTIPLIER
This is set to True when the amount is represented as a multiple such as 1x. The quantity field is set to the value of the multiplier (for example, 1x becomes 1).
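The APPROXIMATE and SINGULAR rules depend on the words next to an amount, which can be sketched as follows. This is a hypothetical illustration, not the library's logic: amount_flags is an invented name, the two word sets contain only the example words mentioned above (the library's full lists are not shown here), and the labels in the example are assumed.

```python
# Only the example words named in the text; the real lists may be longer.
APPROXIMATE_WORDS = {"about", "approximately"}
SINGULAR_WORDS = {"each"}


def amount_flags(tokens, labels):
    """Return one (quantity, approximate, singular) tuple per QTY token."""
    results = []
    for i, label in enumerate(labels):
        if label != "QTY":
            continue
        # APPROXIMATE: the token just before the quantity is an approximate word
        approximate = i > 0 and tokens[i - 1].lower() in APPROXIMATE_WORDS
        # SINGULAR: the token just after the amount's units is a singular word
        j = i + 1
        while j < len(labels) and labels[j] == "UNIT":
            j += 1
        singular = j < len(tokens) and tokens[j].lower() in SINGULAR_WORDS
        results.append((tokens[i], approximate, singular))
    return results


print(amount_flags(["about", "2", "cups", "flour"],
                   ["COMMENT", "QTY", "UNIT", "NAME"]))
print(amount_flags(["2", "chicken", "breasts", ",", "150", "g", "each"],
                   ["QTY", "NAME", "NAME", "PUNC", "QTY", "UNIT", "COMMENT"]))
```

In the second sentence only the 150 g amount is flagged as singular, because each follows its unit directly.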
Special cases for amounts#
There are some particular cases where the combination of QTY and UNIT labels that make up an amount are not straightforward. For example, consider the sentence 2 14 ounce cans coconut milk. In this case there are two amounts: 2 cans and 14 ounce, where the latter is marked as SINGULAR because it applies to each of the 2 cans.
>>> parsed = parse_ingredient("2 14 ounce cans coconut milk")
>>> parsed.amount
[IngredientAmount(quantity=2.0, unit='cans', text='2 cans', confidence=0.999835, APPROXIMATE=False, SINGULAR=False),
IngredientAmount(quantity=14.0, unit=<Unit('ounce')>, text='14 ounces', confidence=0.998503, APPROXIMATE=False, SINGULAR=True)]
Identifying and handling this pattern of QTY and UNIT labels is done by the PostProcessor._sizable_unit_pattern() function.
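To make the pattern concrete, the sketch below scans a label sequence for QTY, QTY, UNIT, UNIT and splits it into an outer and an inner amount. It is an assumption-laden illustration only: find_sizable_unit_amounts is a hypothetical function, the real _sizable_unit_pattern() may recognise more patterns and works differently, and the labels for the example sentence are assumed.

```python
PATTERN = ["QTY", "QTY", "UNIT", "UNIT"]


def find_sizable_unit_amounts(tokens, labels):
    """Split each QTY QTY UNIT UNIT run into an outer and an inner amount.

    The first QTY pairs with the last UNIT (the outer amount). The second QTY
    pairs with the first UNIT (the inner amount), which is marked as SINGULAR.
    """
    amounts = []
    i = 0
    while i + len(PATTERN) <= len(labels):
        if list(labels[i : i + len(PATTERN)]) == PATTERN:
            outer_qty, inner_qty, inner_unit, outer_unit = tokens[i : i + len(PATTERN)]
            amounts.append({"text": f"{outer_qty} {outer_unit}", "SINGULAR": False})
            amounts.append({"text": f"{inner_qty} {inner_unit}", "SINGULAR": True})
            i += len(PATTERN)
        else:
            i += 1
    return amounts


tokens = ["2", "14", "ounce", "cans", "coconut", "milk"]
# Assumed labels for illustration only
labels = ["QTY", "QTY", "UNIT", "UNIT", "NAME", "NAME"]

for amount in find_sizable_unit_amounts(tokens, labels):
    print(amount)
```

Unlike the library output shown above, this sketch does not pluralise the inner unit, so the inner amount reads 14 ounce.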