Post-processing#
The output from the model is a list of labels and a list of scores, one of each for every token in the input sentence. This needs to be turned into a more useful data structure so that the output can be used by the users of this library.
The ParsedIngredient class defines the structure of the returned information from the parse_ingredient function:
@dataclass
class ParsedIngredient:
    """Dataclass for holding the parsed values for an input sentence.

    Attributes
    ----------
    name : list[IngredientText]
        List of IngredientText objects, each representing an ingredient name parsed
        from the input sentence.
        If no ingredient names are found, this is an empty list.
    size : IngredientText | None
        Size modifier of ingredients, such as small or large.
        If no size modifier, this is None.
    amount : list[IngredientAmount | CompositeIngredientAmount]
        List of IngredientAmount objects, each representing a matching quantity and
        unit pair parsed from the sentence.
        If no ingredient amounts are found, this is an empty list.
    preparation : IngredientText | None
        Ingredient preparation instructions parsed from sentence.
        If no ingredient preparation instruction was found, this is None.
    comment : IngredientText | None
        Ingredient comment parsed from input sentence.
        If no ingredient comment was found, this is None.
    purpose : IngredientText | None
        The purpose of the ingredient parsed from the sentence.
        If no purpose was found, this is None.
    foundation_foods : list[FoundationFood]
        List of foundation foods from the parsed sentence.
    sentence : str
        Normalised input sentence
    """

    name: list[IngredientText]
    size: IngredientText | None
    amount: list[IngredientAmount | CompositeIngredientAmount]
    preparation: IngredientText | None
    comment: IngredientText | None
    purpose: IngredientText | None
    foundation_foods: list[FoundationFood]
    sentence: str
Each of the fields in the dataclass has to be determined from the output of the model. The PostProcessor class handles this for us.
Size, Preparation, Purpose, Comment#
For each of the labels SIZE, PREP, PURPOSE and COMMENT, the associated tokens are combined into an IngredientText object.
@dataclass
class IngredientText:
    """Dataclass for holding a parsed ingredient string.

    Attributes
    ----------
    text : str
        Parsed text from ingredient.
        This is comprised of all tokens with the same label.
    confidence : float
        Confidence of parsed ingredient text, between 0 and 1.
        This is the average confidence of all tokens that contribute to this object.
    starting_index : int
        Index of token in sentence that starts this text
    """

    text: str
    confidence: float
    starting_index: int
The post-processing steps are as follows:
Find the indices for the label under consideration, plus the PUNC label.
Group these indices into lists of consecutive indices.
Join the tokens corresponding to each group of consecutive indices with a space.
If discard_isolated_stop_words is True, discard any groups that just comprise a word from the list of stop words.
Average the confidence scores for each of the tokens in each group of consecutive indices.
Remove any isolated or invalid punctuation and any consecutive tokens that are identical.
Join all the groups together with a comma and fix any weird punctuation this causes.
Re-pluralise units that were made singular during pre-processing.
Average the confidence scores across all groups.
The output of this processing is an IngredientText object for each label, which contains the text string, the confidence score, and the starting index of the string in the ingredient sentence.
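The first few steps above can be sketched as follows. This is a minimal illustration and not the library's implementation: the helper names group_consecutive and build_text are hypothetical, the example labels and scores are made up, and the stop-word, punctuation clean-up and re-pluralisation steps are omitted.

```python
from statistics import mean


def group_consecutive(indices):
    """Split a sorted list of indices into runs of consecutive values."""
    groups = []
    for idx in indices:
        if groups and idx == groups[-1][-1] + 1:
            groups[-1].append(idx)
        else:
            groups.append([idx])
    return groups


def build_text(tokens, labels, scores, label):
    """Hypothetical helper: join the tokens for one label and average their scores."""
    # Indices of tokens with the label under consideration, plus PUNC
    indices = [i for i, lab in enumerate(labels) if lab in (label, "PUNC")]
    if not indices:
        return None
    groups = group_consecutive(indices)
    # Join tokens within a group with a space, then join groups with a comma
    parts = [" ".join(tokens[i] for i in group) for group in groups]
    # Average within each group first, then across all groups
    group_scores = [mean(scores[i] for i in group) for group in groups]
    return ", ".join(parts), mean(group_scores), indices[0]


# Made-up tokens, labels and scores for illustration only
tokens = ["2", "peeled", "carrots", "finely", "diced"]
labels = ["QTY", "PREP", "B_NAME_TOK", "PREP", "PREP"]
scores = [1.0, 1.0, 1.0, 0.75, 0.25]
text, confidence, starting_index = build_text(tokens, labels, scores, "PREP")
print(text, confidence, starting_index)
```

Here the PREP tokens fall into two groups of consecutive indices, [1] and [3, 4], which are joined with a comma to give a single text string.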
Name#
Note
If separate_names is set to False, then all the NAME_* label types are treated as a single NAME label and the post-processing is the same as for the SIZE, PREP, PURPOSE and COMMENT labels.
This will return a list containing a single IngredientText object.
The post-processing to obtain the ingredient names is similar to the above, but a few extra steps are needed first to identify the different ingredient names.
The first three steps (extracting the NAME labels, grouping tokens by NAME label and constructing names from the NAME groups) are unique to the post-processing of ingredient names and are described in more detail below. The fourth step, creating the IngredientText objects, is the same as for the non-name labels described above, using the groups of indices output from step 3.
We will use the sentence 8 ounces whole yellow or red bell pepper as an example to show how the ingredient name post-processing works.
This sentence has the following tokens and labels:
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Token | 8 | ounces | whole | yellow | or | red | bell | pepper |
| Label | QTY | UNIT | NAME_MOD | NAME_VAR | NAME_SEP | NAME_VAR | B_NAME_TOK | I_NAME_TOK |
Extract NAME labels#
This is a straightforward step that finds the indices of all tokens that have been given one of the following labels: B_NAME_TOK, I_NAME_TOK, NAME_VAR, NAME_MOD, NAME_SEP, PUNC.
This results in the following indices:
[2, 3, 4, 5, 6, 7]
Group tokens by NAME label#
Iterate over the extracted NAME labels and group consecutive labels of the same type together.
Consecutive NAME_MOD labels are grouped together.
Consecutive NAME_VAR labels are grouped together.
Consecutive B_NAME_TOK, I_NAME_TOK and PUNC labels are grouped together.
NAME_SEP labels are used to force the start of a new grouping.
When grouping the tokens together, we store the index and label of each token.
Note
The indices here are the indices of elements from the extracted NAME labels i.e. an index of 0 here is the first element of the extracted NAME labels which refers to the token at index 2 of the whole sentence.
For the example sentence, we get the following groups:
[
[(0, 'NAME_MOD')],
[(1, 'NAME_VAR')],
[(3, 'NAME_VAR')],
[(4, 'B_NAME_TOK'), (5, 'I_NAME_TOK')]
]
Construct names from NAME groups#
From the name groups, we construct the ingredient names. This is most easily done by iterating in reverse over the groups and applying the following logic:
Each group starting with B_NAME_TOK forms a new name.
Each NAME_VAR group is prepended to the beginning of the most recent name.
Each NAME_MOD group is prepended to all previous names.
The output from this construction step is a list of groups of indices, where each group represents an ingredient name.
For the example sentence, the constructed groups of indices are:
[
(0, 1, 4, 5), # whole yellow bell pepper
(0, 3, 4, 5) # whole red bell pepper
]
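The grouping and construction steps can be sketched as below. This is one reading of the rules above that reproduces the example output, not the library's code: the function names are hypothetical, and it assumes that each NAME_VAR group is combined with the most recent B_NAME_TOK group, so that two variants separated by NAME_SEP yield two names. Groups that start with I_NAME_TOK or PUNC are ignored in this sketch.

```python
TOK_LABELS = {"B_NAME_TOK", "I_NAME_TOK", "PUNC"}


def group_name_labels(labels):
    """Group consecutive labels of the same kind; NAME_SEP forces a new group."""
    groups = []
    prev_kind = None
    for idx, label in enumerate(labels):
        if label == "NAME_SEP":
            prev_kind = None  # whatever follows starts a new group
            continue
        kind = "TOK" if label in TOK_LABELS else label
        if kind == prev_kind:
            groups[-1].append((idx, label))
        else:
            groups.append([(idx, label)])
        prev_kind = kind
    return groups


def construct_names(groups):
    """Build one tuple of indices per ingredient name, iterating in reverse."""
    names = []
    base = None           # indices of the most recent B_NAME_TOK group
    base_is_bare = False  # True while names[-1] is the base with no variant yet
    for group in reversed(groups):
        indices = [idx for idx, _ in group]
        first_label = group[0][1]
        if first_label == "B_NAME_TOK":
            base = indices
            names.append(indices)
            base_is_bare = True
        elif first_label == "NAME_VAR" and base is not None:
            if base_is_bare:
                names[-1] = indices + base
                base_is_bare = False
            else:
                names.append(indices + base)
        elif first_label == "NAME_MOD":
            names = [indices + name for name in names]
    return [tuple(name) for name in reversed(names)]


# Labels of the extracted NAME tokens from the example sentence
labels = ["NAME_MOD", "NAME_VAR", "NAME_SEP", "NAME_VAR", "B_NAME_TOK", "I_NAME_TOK"]
groups = group_name_labels(labels)
names = construct_names(groups)
print(names)  # [(0, 1, 4, 5), (0, 3, 4, 5)]
```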
Create IngredientText objects#
With the groups of indices obtained from the previous step, we can convert to token indices and then follow the steps used to post-process the SIZE, PREP, PURPOSE, COMMENT labels described above.
Once the IngredientText objects have been obtained, we perform one final post-processing step. If there are multiple names, we check the part of speech tag for the last token in each name. If the part of speech tag is IN, DT or JJ, we merge the name with the next name. This merging of ingredient names is necessary to mitigate mislabelling of tokens by the model, which can happen if a name is split by a token with another label.
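The merging rule can be illustrated with a small sketch. The function name merge_names and the (tokens, pos_tags) input shape are hypothetical, the part of speech tags are supplied by hand, and confidence and starting index handling is left out.

```python
MERGE_TAGS = {"IN", "DT", "JJ"}


def merge_names(names):
    """names is a list of (tokens, pos_tags) pairs, one pair per ingredient name."""
    merged = []
    pending = []  # tokens carried over from a name that ended in IN, DT or JJ
    for position, (tokens, pos_tags) in enumerate(names):
        tokens = pending + tokens
        is_last = position == len(names) - 1
        if pos_tags[-1] in MERGE_TAGS and not is_last:
            # Name ends in a preposition, determiner or adjective:
            # hold it back and join it to the next name
            pending = tokens
        else:
            merged.append(" ".join(tokens))
            pending = []
    return merged


# A name that was wrongly split in two after "of"
names = [(["cream", "of"], ["NN", "IN"]), (["tartar"], ["NN"])]
merged = merge_names(names)
print(merged)  # ['cream of tartar']
```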
Amount#
The QTY and UNIT labels are combined into an IngredientAmount object.
@dataclass
class IngredientAmount:
    """Dataclass for holding a parsed ingredient amount.

    On instantiation, the unit is made plural if necessary.

    Attributes
    ----------
    quantity : Fraction | str
        Parsed ingredient quantity, as a Fraction where possible, otherwise a string.
        If the amount is a range, this is the lower limit of the range.
    quantity_max : Fraction | str
        If the amount is a range, this is the upper limit of the range.
        Otherwise, this is the same as the quantity field.
        This is set automatically depending on the type of quantity.
    unit : str | pint.Unit
        Unit of parsed ingredient quantity.
        If the unit is recognised in the pint unit registry, a pint.Unit
        object is used.
    text : str
        String describing the amount e.g. "1 cup", "8 oz"
    confidence : float
        Confidence of parsed ingredient amount, between 0 and 1.
        This is the average confidence of all tokens that contribute to this object.
    starting_index : int
        Index of token in sentence that starts this amount
    unit_system : UnitSystem
        Unit system (e.g. metric) that the unit of the amount belongs to.
    APPROXIMATE : bool, optional
        When True, indicates that the amount is approximate.
        Default is False.
    SINGULAR : bool, optional
        When True, indicates that the amount refers to a singular item of the ingredient.
        Default is False.
    RANGE : bool, optional
        When True, indicates the amount is a range e.g. 1-2.
        Default is False.
    MULTIPLIER : bool, optional
        When True, indicates the amount is a multiplier e.g. 1x, 2x.
        Default is False.
    PREPARED_INGREDIENT : bool, optional
        When True, indicates the amount applies to the prepared ingredient.
        When False, indicates the amount applies to the ingredient before preparation.
        Default is False.
    """

    quantity: Fraction | str
    quantity_max: Fraction | str
    unit: str | pint.Unit
    text: str
    confidence: float
    starting_index: int
    unit_system: UnitSystem = field(init=False)
    APPROXIMATE: bool = False
    SINGULAR: bool = False
    RANGE: bool = False
    MULTIPLIER: bool = False
    PREPARED_INGREDIENT: bool = False
In most cases, the amounts are determined by combining a QTY label with the following UNIT labels, up to the next QTY, which starts a new amount. For example:
>>> p = PreProcessor("3/4 cup (170g) heavy cream")
>>> [t.text for t in p.tokenized_sentence]
['#3$4', 'cup', '(', '170', 'g', ')', 'heavy', 'cream']
...
>>> parsed = PostProcessor(sentence, labelled_tokens).parsed()
>>> parsed.amount
[
IngredientAmount(quantity=Fraction(3, 4),
quantity_max=Fraction(3, 4),
unit=<Unit('cup')>,
text='0.75 cups',
confidence=0.999881,
starting_index=0,
unit_system=<UnitSystem.US_CUSTOMARY: 'us_customary'>,
APPROXIMATE=False,
SINGULAR=False,
RANGE=False,
MULTIPLIER=False,
PREPARED_INGREDIENT=False),
IngredientAmount(quantity=Fraction(170, 1),
quantity_max=Fraction(170, 1),
unit=<Unit('gram')>,
text='170 g',
confidence=0.995941,
starting_index=3,
unit_system=<UnitSystem.METRIC: 'metric'>,
APPROXIMATE=False,
SINGULAR=False,
RANGE=False,
MULTIPLIER=False,
PREPARED_INGREDIENT=False)
]
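The basic pairing rule can be sketched as follows. This is a simplified illustration rather than the library's code: group_amounts is a hypothetical helper, and the labels assigned to the tokens are assumed for the purpose of the example.

```python
def group_amounts(tokens, labels):
    """Pair each QTY token with the UNIT tokens that follow it."""
    amounts = []
    for token, label in zip(tokens, labels):
        if label == "QTY":
            # Every QTY token starts a new amount
            amounts.append({"quantity": token, "units": []})
        elif label == "UNIT" and amounts:
            # UNIT tokens attach to the most recent QTY
            amounts[-1]["units"].append(token)
    return amounts


tokens = ["#3$4", "cup", "(", "170", "g", ")", "heavy", "cream"]
# Assumed labels for illustration
labels = ["QTY", "UNIT", "PUNC", "QTY", "UNIT", "PUNC", "B_NAME_TOK", "I_NAME_TOK"]
amounts = group_amounts(tokens, labels)
print(amounts)
```

Each resulting dictionary holds the raw tokens only; converting the quantity to a Fraction and the unit to a pint.Unit is not shown.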
Quantities#
Quantities are returned as fractions.Fraction objects, or str for non-numeric quantities (e.g. dozen).
>>> parsed = parse_ingredient("1/3 cup oil", quantity_fractions=True)
>>> parsed.amount[0].quantity
Fraction(1, 3)
Note
Conversion of quantities to float or int is left to the end users of this library.
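As a brief illustration of that conversion, using only the standard library fractions module (the variable names are arbitrary):

```python
from fractions import Fraction

quantity = Fraction(1, 3)

# Convert to a float, optionally rounding for display
as_float = float(quantity)
rounded = round(as_float, 2)

# Convert to an int only when the fraction is a whole number
whole = Fraction(170, 1)
as_int = int(whole) if whole.denominator == 1 else None

print(rounded, as_int)
```

String quantities such as dozen cannot be converted this way, so it is worth checking the type of the quantity first.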
Tokens with the QTY label that are numbers represented in textual form e.g. “one”, “two” are replaced with numeric forms.
The replacements are predefined in the STRING_NUMBERS constant.
For performance reasons, the regular expressions used to substitute the text with the number are pre-compiled and provided in the STRING_NUMBERS_REGEXES constant, which is a dictionary where the value is a tuple of (pre-compiled regular expression, substitute value).
import re

# Strings and their numeric representation
STRING_NUMBERS = {
"one-quarter": "1/4",
"one-half": "1/2",
"three-quarter": "3/4",
"three-quarters": "3/4",
"one": "1",
"two": "2",
"three": "3",
"four": "4",
"five": "5",
"six": "6",
"seven": "7",
"eight": "8",
"nine": "9",
"ten": "10",
"eleven": "11",
"twelve": "12",
"thirteen": "13",
"fourteen": "14",
"fifteen": "15",
"sixteen": "16",
"seventeen": "17",
"eighteen": "18",
"nineteen": "19",
}
# Precompile the regular expressions for matching the string numbers
STRING_NUMBERS_REGEXES = {}
for s, n in STRING_NUMBERS.items():
    # This is case insensitive so it replaces e.g. "one" and "One"
    # Only match if the string sits between word boundaries, so that e.g. the
    # "one" inside "bone" is not replaced
    STRING_NUMBERS_REGEXES[s] = (re.compile(rf"\b({s})\b", flags=re.IGNORECASE), n)
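A sketch of how the pre-compiled expressions might be applied is shown below. It uses a small subset of the dictionary so that it is self-contained, and replace_string_numbers is a hypothetical helper rather than a function from the library.

```python
import re

# Subset of the full dictionary, for illustration
STRING_NUMBERS = {"one-half": "1/2", "one": "1", "two": "2"}
STRING_NUMBERS_REGEXES = {
    s: (re.compile(rf"\b({s})\b", flags=re.IGNORECASE), n)
    for s, n in STRING_NUMBERS.items()
}


def replace_string_numbers(text):
    """Replace each textual number in the text with its numeric form."""
    # Dictionaries preserve insertion order, so "one-half" is tried before "one"
    for regex, substitute in STRING_NUMBERS_REGEXES.values():
        text = regex.sub(substitute, text)
    return text


print(replace_string_numbers("One-half cup"))  # 1/2 cup
print(replace_string_numbers("Two onions"))    # 2 onions
```

The order of the entries matters: because a hyphen counts as a word boundary, trying "one" before "one-half" would turn "one-half" into "1-half".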
Implicit quantities#
In some sentences, the quantity of the unit is not explicitly stated but is implied by the units in the sentence. For example, in the sentence 15 oz can black beans, there is implicitly 1 can of beans (that contains 15 oz). Another commonly seen pattern is Rosemary sprig (optional), where there is an implicit quantity of 1 because “sprig” is singular.
In these cases, the quantity is set explicitly to 1 in the IngredientAmount object.
To guard against incorrectly assigning the quantity, the unit is checked to make sure it is not plural and the sentence prior to the unit is checked to make sure that it does not include any indefinite quantifiers (e.g. few, some).
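The two guards could look something like the sketch below. This is an assumption-laden illustration: implicit_quantity is a hypothetical helper, the plural check is deliberately crude (a trailing "s"), and the quantifier set contains only the two examples given above.

```python
INDEFINITE_QUANTIFIERS = {"few", "some"}


def implicit_quantity(tokens, unit_index):
    """Return 1 if an implicit quantity can be assumed for the unit, else None."""
    unit = tokens[unit_index]
    # Guard 1: a plural unit does not imply a quantity of 1.
    # A trailing "s" is a crude stand-in for a real plural check.
    if unit.lower().endswith("s"):
        return None
    # Guard 2: an indefinite quantifier before the unit rules it out too
    preceding = {token.lower() for token in tokens[:unit_index]}
    if preceding & INDEFINITE_QUANTIFIERS:
        return None
    return 1


print(implicit_quantity(["Rosemary", "sprig", "(", "optional", ")"], 1))  # 1
print(implicit_quantity(["a", "few", "sprigs", "rosemary"], 2))           # None
```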
Units#
Note
The use of pint.Unit objects can be disabled by setting string_units=True in the parse_ingredient function. When this is True, units will be returned as strings, correctly pluralised for the quantity.
The Pint library is used to standardise the units where possible.
If the unit in a parsed IngredientAmount can be matched to a unit in the Pint Unit Registry, then a pint.Unit object is used in place of the unit string.
This has the benefit of standardising units that can be represented in different formats, for example a gram could be represented in the sentence as g, gram, grams.
These will all be represented using the same <Unit('gram')> object in the parsed information.
By default, US customary units are used for volumetric measurements that have multiple definitions (e.g. cup, tablespoon etc.).
This can be changed to use other unit systems using the volumetric_units_system keyword argument in the parse_ingredient function call.
See Options for the available options.
>>> parse_ingredient("3/4 cup heavy cream", volumetric_units_system="us_customary") # Default
ParsedIngredient(
name=IngredientText(text='heavy cream', confidence=0.997513),
size=None,
amount=[IngredientAmount(quantity=Fraction(3, 4),
quantity_max=Fraction(3, 4),
unit=<Unit('cup')>,
text='0.75 cups',
confidence=0.999926,
starting_index=0,
unit_system=<UnitSystem.US_CUSTOMARY: 'us_customary'>,
APPROXIMATE=False,
SINGULAR=False,
RANGE=False,
MULTIPLIER=False,
PREPARED_INGREDIENT=False)],
preparation=None,
comment=None,
sentence='3/4 cup heavy cream'
)
>>> parse_ingredient("3/4 cup heavy cream", volumetric_units_system="imperial")
ParsedIngredient(
name=IngredientText(text='heavy cream', confidence=0.997513),
size=None,
amount=[IngredientAmount(quantity=Fraction(3, 4),
quantity_max=Fraction(3, 4),
unit=<Unit('imperial_cup')>,
text='0.75 cups',
confidence=0.999926,
starting_index=0,
unit_system=<UnitSystem.IMPERIAL: 'imperial'>,
APPROXIMATE=False,
SINGULAR=False,
RANGE=False,
MULTIPLIER=False,
PREPARED_INGREDIENT=False)],
preparation=None,
comment=None,
sentence='3/4 cup heavy cream'
)
Tip
The use of pint.Unit objects means that the ingredient amounts can easily be converted to different units.
See the Convert between units how-to guide.
IngredientAmount flags#
IngredientAmount objects have a number of flags that are set to provide additional information about the amount.
| Flag | Description |
|---|---|
| APPROXIMATE | This is set to True when the QTY is preceded by a word such as about or approximately, and indicates that the amount is approximate. |
| SINGULAR | This is set to True when the amount is followed by a word such as each, and indicates that the amount refers to a singular item of the ingredient. There is also a special case (below), where an inner amount that is inside a QTY-UNIT pair will be marked as SINGULAR. |
| RANGE | This is set to True when the amount is a range of values, e.g. 1-2, 300-400. In these cases, the quantity field is the lower limit of the range and the quantity_max field is the upper limit. |
| MULTIPLIER | This is set to True when the amount is represented as a multiple such as 1x. The |
| PREPARED_INGREDIENT | This is set to True when the amount refers to the ingredient after any preparation instructions in the ingredient sentence have been followed. For example in the sentence 1 tbsp chopped nuts, we would want 1 tablespoon of nuts, measured after they have been chopped. If the sentence was 1 tbsp nuts, chopped, we would want to chop the nuts after we have measured 1 tablespoon. |
Special cases for amounts#
There are some particular cases where the combination of QTY and UNIT labels that make up an amount are not straightforward. For example, consider the sentence 2 14 ounce cans coconut milk. In this case there are two amounts: 2 cans and 14 ounce, where the latter is marked as SINGULAR because it applies to each of the 2 cans.
>>> parsed = parse_ingredient("2 14 ounce cans coconut milk")
>>> parsed.amount
[IngredientAmount(quantity=Fraction(2, 1),
quantity_max=Fraction(2, 1),
unit='cans',
text='2 cans',
confidence=0.999897,
starting_index=0,
unit_system=<UnitSystem.OTHER: 'other'>,
APPROXIMATE=False,
SINGULAR=False,
RANGE=False,
MULTIPLIER=False,
PREPARED_INGREDIENT=False),
IngredientAmount(quantity=Fraction(14, 1),
quantity_max=Fraction(14, 1),
unit=<Unit('ounce')>,
text='14 ounces',
confidence=0.998793,
starting_index=1,
unit_system=<UnitSystem.US_CUSTOMARY: 'us_customary'>,
APPROXIMATE=False,
SINGULAR=True,
RANGE=False,
MULTIPLIER=False,
PREPARED_INGREDIENT=False)]
Identifying and handling this pattern of QTY and UNIT labels is done by the PostProcessor._sizable_unit_pattern() function.
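To make the pattern concrete, here is a simplified sketch of the idea. It is not the implementation of PostProcessor._sizable_unit_pattern(): the function name, the SIZABLE_UNITS set and the dictionary output are all assumptions made for illustration, and the labels are assigned by hand.

```python
# Hypothetical set of units that can themselves have a size
SIZABLE_UNITS = {"can", "cans", "jar", "jars", "tin", "tins"}
PATTERN = ["QTY", "QTY", "UNIT", "UNIT"]


def match_sizable_pattern(tokens, labels):
    """Find a QTY QTY UNIT UNIT run that ends with a sizable unit."""
    for i in range(len(labels) - 3):
        if labels[i : i + 4] == PATTERN and tokens[i + 3] in SIZABLE_UNITS:
            # The first QTY pairs with the last UNIT (2 cans)
            outer = {"quantity": tokens[i], "unit": tokens[i + 3], "SINGULAR": False}
            # The inner QTY-UNIT pair applies to each item (14 ounce)
            inner = {"quantity": tokens[i + 1], "unit": tokens[i + 2], "SINGULAR": True}
            return [outer, inner]
    return []


tokens = ["2", "14", "ounce", "cans", "coconut", "milk"]
labels = ["QTY", "QTY", "UNIT", "UNIT", "B_NAME_TOK", "I_NAME_TOK"]
amounts = match_sizable_pattern(tokens, labels)
print(amounts)
```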
A second case is where the full amount is made up of more than one adjacent quantity-unit pair.
This is particularly common with US customary units such as pounds and ounces, or pints and fluid ounces.
In these cases, a CompositeIngredientAmount is returned.
For example:
>>> parsed = parse_ingredient("1lb 2oz pecorino romano cheese")
>>> parsed.amount
[CompositeIngredientAmount(
amounts=[
IngredientAmount(quantity=Fraction(1, 1),
quantity_max=Fraction(1, 1),
unit=<Unit('pound')>,
text='1 lb',
confidence=0.999923,
starting_index=0,
unit_system=<UnitSystem.US_CUSTOMARY: 'us_customary'>,
APPROXIMATE=False,
SINGULAR=False,
RANGE=False,
MULTIPLIER=False,
PREPARED_INGREDIENT=False),
IngredientAmount(quantity=Fraction(2, 1),
quantity_max=Fraction(2, 1),
unit=<Unit('ounce')>,
text='2 oz',
confidence=0.998968,
starting_index=2,
unit_system=<UnitSystem.US_CUSTOMARY: 'us_customary'>,
APPROXIMATE=False,
SINGULAR=False,
RANGE=False,
MULTIPLIER=False,
PREPARED_INGREDIENT=False)],
join='',
subtractive=False,
text='1 lb 2 oz',
confidence=0.9994455,
starting_index=0,
unit_system=<UnitSystem.US_CUSTOMARY: 'us_customary'>,
)]
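As an illustration of what a composite amount represents, the parts can be added together in a common unit. The sketch below uses plain Fraction arithmetic with a hand-written conversion table rather than the library or Pint, and assumes the parts are additive (subtractive=False).

```python
from fractions import Fraction

# Hand-written conversion factors to ounces, for illustration only
TO_OUNCES = {"pound": Fraction(16), "ounce": Fraction(1)}

# The two parts of "1lb 2oz" as (quantity, unit) pairs
parts = [(Fraction(1), "pound"), (Fraction(2), "ounce")]

total_ounces = sum(quantity * TO_OUNCES[unit] for quantity, unit in parts)
print(total_ounces)  # 18
```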