Model Guide#

This user guide provides more in-depth information about the pipelines for training the model and for parsing a sentence.

Training Pipeline#

The training pipeline is shown below.

Training pipeline

Load data#

The data is loaded from an SQLite database of labelled sentences.

See Training data for more information.
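A minimal sketch of this loading step is shown below. The table name, column names and label storage format are assumptions for illustration; the real database schema may differ.

```python
import sqlite3


def load_training_data(db_path: str) -> list[tuple[str, list[str]]]:
    """Load labelled sentences from an SQLite database.

    Assumes a `training_data` table with `sentence` and `labels` columns,
    where labels are stored as a space-separated string (hypothetical schema).
    """
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT sentence, labels FROM training_data").fetchall()
    conn.close()
    # Split the label string into one label per token.
    return [(sentence, labels.split()) for sentence, labels in rows]
```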


Normalise#

The input sentences are normalised to convert particular sentence features into a standard format. The sentence is then tokenised.

See Normalisation for more information.
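The two steps can be sketched as follows. The specific normalisation rules here (unicode fractions, whitespace, case) and the tokenisation pattern are simplified illustrations, not the library's actual rules.

```python
import re


def normalise(sentence: str) -> str:
    """Simplified normalisation: replace unicode fractions with decimals,
    collapse whitespace and lower-case the text (illustrative rules only)."""
    fractions = {"½": "0.5", "¼": "0.25", "¾": "0.75"}
    for fraction, decimal in fractions.items():
        sentence = sentence.replace(fraction, decimal)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", sentence).strip().lower()


def tokenise(sentence: str) -> list[str]:
    """Split a normalised sentence into tokens, keeping punctuation
    as separate tokens."""
    return re.findall(r"[\w.]+|[^\w\s]", sentence)
```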

Extract features#

The features for each token are extracted. These features are used to train the model or, once the model has been trained, label each token.

See Extracting features for more information.
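The idea is sketched below: each token is mapped to a dictionary of features describing the token and its context. The feature names here are illustrative; the real model uses a richer feature set.

```python
def token_features(tokens: list[str], i: int) -> dict:
    """Illustrative features for the token at position i: the token itself,
    whether it is numeric, and its immediate neighbours."""
    token = tokens[i]
    return {
        "token": token,
        "is_numeric": token.replace(".", "").isdigit(),
        "is_first": i == 0,
        "prev_token": tokens[i - 1] if i > 0 else "<START>",
        "next_token": tokens[i + 1] if i < len(tokens) - 1 else "<END>",
    }


def sentence_features(tokens: list[str]) -> list[dict]:
    """Extract features for every token in a tokenised sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```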


Train model#

The Conditional Random Fields model is trained on 80% of the training data.


The remaining 20% of the training data is used to evaluate the performance of the model on data the model has not encountered before.

See Training the model for more information.
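The 80/20 split and a token-level accuracy check can be sketched as below. These helpers are illustrative stand-ins; the library's actual training and evaluation code may use different utilities and metrics.

```python
import random


def split_data(data: list, train_fraction: float = 0.8, seed: int = 42):
    """Shuffle the labelled sentences and split them into training and
    evaluation sets (80/20 by default)."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def token_accuracy(true_labels, predicted_labels) -> float:
    """Fraction of tokens whose predicted label matches the true label,
    across all sentences."""
    pairs = [
        (t, p)
        for true_seq, pred_seq in zip(true_labels, predicted_labels)
        for t, p in zip(true_seq, pred_seq)
    ]
    return sum(t == p for t, p in pairs) / len(pairs)
```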

Parsing Pipeline#

The parsing pipeline is shown below.

Parsing pipeline

The Normalise and Extract features steps are the same as above.


The features for each token in the sentence are fed into the CRF model, which returns a label and a confidence score for each token in the sentence.

See Using the model for more information.
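One common way a CRF exposes confidences is as per-token marginal probabilities over the label set; the label and its confidence are then the most probable entry, as in the sketch below (an assumption about the model's output format, shown for illustration).

```python
def best_labels(marginals: list[dict[str, float]]) -> list[tuple[str, float]]:
    """Given per-token marginal probabilities over labels, return the most
    likely label and its confidence for each token."""
    return [max(token_probs.items(), key=lambda kv: kv[1]) for token_probs in marginals]
```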


The token labels go through a post-processing step to build the object that is output from the parse_ingredient function.

See Post-processing for more information.
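In its simplest form, post-processing groups consecutive tokens that share a label into the fields of the output object, as sketched below. The field names and the flat dictionary output are illustrative; the real `parse_ingredient` return object is more structured.

```python
def build_parsed_ingredient(tokens: list[str], labels: list[str]) -> dict[str, str]:
    """Simplified post-processing: collect tokens under their labels and
    join each group back into a string field."""
    fields: dict[str, list[str]] = {}
    for token, label in zip(tokens, labels):
        fields.setdefault(label, []).append(token)
    return {label: " ".join(toks) for label, toks in fields.items()}
```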