Using the model#

With the model trained, it can be used to label the tokens of an ingredient sentence with one of the following labels:

QTY
UNIT
NAME
SIZE
PREP
COMMENT
PUNC

The general process is like so

p = PreProcessor(input_sentence)

tagger = pycrfsuite.Tagger()
tagger.open(model_file)

labels_pred = tagger.tag(p.sentence_features())

The tagger returns a list of labels the same length as the list of sentence tokens. For example, consider the sentence 3/4 cup (170g) heavy cream:

>>> p = PreProcessor("3/4 cup (170g) heavy cream")
>>> p.tokenized_sentence
['0.75', 'cup', '(', '170', 'g', ')', 'heavy', 'cream']

>>> tagger.tag(p.sentence_features())
['QTY', 'UNIT', 'COMMENT', 'QTY', 'UNIT', 'COMMENT', 'NAME', 'NAME']

A confidence score can be calculated for each label too

   >>>[tagger.marginal(label, i) for i, label in enumerate(labels)]
   [0.99969..., 0.9991524..., 0.997019..., 0.907705..., 0.910985..., 0.962122..., 0.998440...,
0.996780...]

The confidence score is a value between 0 and 1 which represents the model’s belief that a given label is correct. The sum of the scores for all labels for a given token is equal to 1.