Feature selection


Feature calculation is done for each token in the sentence, so first the normalised sentence must be tokenised.


Once the input sentence has been normalised, it can be split into tokens. Each token represents a single unit of the sentence. A token is not necessarily the same as a word, because we might want to handle punctuation and compound words in particular ways.

The tokenizer is created using regular expressions, and splits a string input according to those expressions.

The defined tokenizer splits the sentence according to the following rules:

import re
from itertools import chain

# Define regular expressions used by tokenizer.
# Matches one or more non-whitespace characters
WHITESPACE_TOKENISER = re.compile(r"\S+")
# Matches and captures one of the following: ( ) [ ] { } , / : ;
PUNCTUATION_TOKENISER = re.compile(r"([\(\)\[\]\{\}\,/:;])")


def tokenize(sentence: str) -> list[str]:
    """Tokenise an ingredient sentence.

    The sentence is split on whitespace characters into a list of tokens.
    If any of these tokens contains any of the punctuation marks captured by
    PUNCTUATION_TOKENISER, these are then split and isolated as a separate
    token.

    The returned list of tokens has any empty tokens removed.

    Parameters
    ----------
    sentence : str
        Ingredient sentence to tokenize

    Returns
    -------
    list[str]
        List of tokens from sentence.

    Examples
    --------
    >>> tokenize("2 cups (500 ml) milk")
    ['2', 'cups', '(', '500', 'ml', ')', 'milk']

    >>> tokenize("1-2 mashed bananas: as ripe as possible")
    ['1-2', 'mashed', 'bananas', ':', 'as', 'ripe', 'as', 'possible']
    """
    # Split on whitespace first, then split each chunk around punctuation.
    # The capturing group keeps the punctuation marks as tokens.
    tokens = [
        PUNCTUATION_TOKENISER.split(tok)
        for tok in WHITESPACE_TOKENISER.findall(sentence)
    ]
    return [tok for tok in chain.from_iterable(tokens) if tok]

This splits the sentence apart wherever there is white space or a punctuation mark matched by PUNCTUATION_TOKENISER.

>>> from Preprocess import PreProcessor
>>> p = PreProcessor("1/2 cup orange juice, freshly squeezed")
>>> p.tokenized_sentence
['0.5', 'cup', 'orange', 'juice', ',', 'freshly', 'squeezed']

Note that 1/2 appears as 0.5 because the sentence is normalised before it is tokenised.
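
To make the two splitting steps more concrete, here is a minimal sketch that runs them one at a time on one of the docstring examples, using only Python's re module. The intermediate variable names (chunks, pieces) are illustrative and not part of the library. It shows why the punctuation pattern needs a capturing group, and why empty tokens have to be filtered out at the end.

```python
import re
from itertools import chain

# Same patterns as in the tokenizer definition.
WHITESPACE_TOKENISER = re.compile(r"\S+")
PUNCTUATION_TOKENISER = re.compile(r"([\(\)\[\]\{\}\,/:;])")

sentence = "2 cups (500 ml) milk"

# Step 1: findall with \S+ returns each run of non-whitespace characters.
chunks = WHITESPACE_TOKENISER.findall(sentence)
print(chunks)  # ['2', 'cups', '(500', 'ml)', 'milk']

# Step 2: because the pattern has a capturing group, re.split keeps the
# matched punctuation in its output. A match at the start or end of a
# chunk leaves an empty string on that side.
pieces = [PUNCTUATION_TOKENISER.split(chunk) for chunk in chunks]
print(pieces[2])  # ['', '(', '500']
print(pieces[3])  # ['ml', ')', '']

# Step 3: flatten the nested lists and drop the empty strings.
tokens = [tok for tok in chain.from_iterable(pieces) if tok]
print(tokens)  # ['2', 'cups', '(', '500', 'ml', ')', 'milk']
```

Without the capturing group, re.split would discard the punctuation marks instead of returning them as tokens.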

Feature calculation

The features for each token in each sentence need to be selected and extracted.

There is quite a wide range of features that can be extracted for each token, and it can be difficult to tell whether a particular feature is useful.

The Ingredient Phrase Tagger approach to features was to use the following:

  • The token itself

  • The position of the token in the sentence, as an index

  • The number of tokens in the sentence, but rounded down to the nearest group in [4, 8, 12, 16, 20]

  • Whether the token starts with a capital letter

  • Whether the token is inside parentheses in the sentence

The features used for this model are a little different:

  • The stem of the token

  • The part of speech (POS) tag

  • Whether the token is capitalised

  • Whether the token is numeric

  • Whether the token is a unit (determined from the list of units)

  • Whether the token is a punctuation mark

  • Whether the token is an ambiguous unit

  • Whether the token is inside parentheses

  • Whether the token is after a comma

  • Whether the token follows a + symbol

  • Whether the sentence is a short sentence (having fewer than 3 tokens)

If possible, based on the position of the token in the sentence, the following features are also added:
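
As a rough illustration of how the boolean features in this list could be computed, here is a minimal sketch. It is not the library's implementation: the UNITS set, the helper names and the exact rules (for example, treating "after a comma" as "anywhere after a comma", and counting a parenthesis itself as inside parentheses) are assumptions made for the example. The stem, POS tag and ambiguous unit features are left out because they depend on external resources.

```python
import re

# Hypothetical stand-in for the library's list of units.
UNITS = {"cup", "cups", "g", "ml", "tbsp", "tsp"}
PUNCTUATION = set("()[]{},/:;")


def is_capitalised(token: str) -> bool:
    """True if the first character is an upper-case letter."""
    return token[:1].isupper()


def is_numeric(token: str) -> bool:
    """True for integers, decimals and simple ranges such as '1-2'."""
    return re.fullmatch(r"\d+(\.\d+)?(-\d+(\.\d+)?)?", token) is not None


def is_inside_parentheses(tokens: list[str], index: int) -> bool:
    """True if the token is a parenthesis or follows an unclosed '('."""
    if tokens[index] in ("(", ")"):
        return True
    depth = 0
    for tok in tokens[:index]:
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth = max(depth - 1, 0)
    return depth > 0


def follows(tokens: list[str], index: int, symbol: str) -> bool:
    """True if symbol occurs anywhere before the token at index."""
    return symbol in tokens[:index]


def basic_features(tokens: list[str], index: int) -> dict[str, bool]:
    """Collect the boolean features for the token at index."""
    token = tokens[index]
    return {
        "is_capitalised": is_capitalised(token),
        "is_numeric": is_numeric(token),
        "is_unit": token.lower() in UNITS,
        "is_punc": token in PUNCTUATION,
        "is_in_parens": is_inside_parentheses(tokens, index),
        "is_after_comma": follows(tokens, index, ","),
        "is_after_plus": follows(tokens, index, "+"),
        "is_short_phrase": len(tokens) < 3,
    }


tokens = ["2", "cups", "(", "500", "ml", ")", "milk", ",", "warmed"]
print(basic_features(tokens, 3))  # features for the token "500"
```

For the token "500" in this sentence, is_numeric and is_in_parens are True and every other feature is False.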

  • The stem of the previous token

  • The POS tag for the previous token combined with the POS tag for the current token

  • The stem of the token before the previous token

  • The POS tag for the token before the previous token combined with the POS tags for the previous and current tokens

  • The stem of the next token

  • The POS tag for the next token combined with the POS tag for the current token

  • The stem of the token after the next token

  • The POS tag for the token after the next token combined with the POS tags for the current and next tokens

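The _token_features() method of PreProcessor returns all these features as a dictionary. In addition to the features listed above, the dictionary contains a constant bias entry, the token itself when it differs from its stem, and the boolean features of the neighbouring tokens.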

def _token_features(self, index: int) -> dict[str, str | bool]:
    """Return the features for the token at the given index in the sentence

    Parameters
    ----------
    index : int
        Index of token to get features for.

    Returns
    -------
    dict[str, str | bool]
        Dictionary of features for token at index
    """
    token = self.tokenized_sentence[index]
    features = {
        "bias": "",
        "stem": stem(token),
        "pos": self.pos_tags[index],
        "is_capitalised": self._is_capitalised(token),
        "is_numeric": self._is_numeric(token),
        "is_unit": self._is_unit(token),
        "is_punc": self._is_punc(token),
        "is_ambiguous": self._is_ambiguous_unit(token),
        "is_in_parens": self._is_inside_parentheses(index),
        "is_after_comma": self._follows_comma(index),
        "is_after_plus": self._follows_plus(index),
        "is_short_phrase": len(self.tokenized_sentence) < 3,
    }

    if token != stem(token):
        features["token"] = token

    if index > 0:
        prev_token = self.tokenized_sentence[index - 1]
        features["prev_pos"] = "+".join(
            (self.pos_tags[index - 1], self.pos_tags[index])
        )
        features["prev_stem"] = stem(prev_token)
        features["prev_is_capitalised"] = self._is_capitalised(prev_token)
        features["prev_is_numeric"] = self._is_numeric(prev_token)
        features["prev_is_unit"] = self._is_unit(prev_token)
        features["prev_is_punc"] = self._is_punc(prev_token)
        features["prev_is_ambiguous"] = self._is_ambiguous_unit(prev_token)
        features["prev_is_in_parens"] = self._is_inside_parentheses(index - 1)
        features["prev_is_after_comma"] = self._follows_comma(index - 1)
        features["prev_is_after_plus"] = self._follows_plus(index - 1)

    if index > 1:
        prev_token2 = self.tokenized_sentence[index - 2]
        features["prev_pos2"] = "+".join(
            (
                self.pos_tags[index - 2],
                self.pos_tags[index - 1],
                self.pos_tags[index],
            )
        )
        features["prev_stem2"] = stem(prev_token2)
        features["prev_is_capitalised2"] = self._is_capitalised(prev_token2)
        features["prev_is_numeric2"] = self._is_numeric(prev_token2)
        features["prev_is_unit2"] = self._is_unit(prev_token2)
        features["prev_is_punc2"] = self._is_punc(prev_token2)
        features["prev_is_ambiguous2"] = self._is_ambiguous_unit(prev_token2)
        features["prev_is_in_parens2"] = self._is_inside_parentheses(index - 2)
        features["prev_is_after_comma2"] = self._follows_comma(index - 2)
        features["prev_is_after_plus2"] = self._follows_plus(index - 2)

    if index < len(self.tokenized_sentence) - 1:
        next_token = self.tokenized_sentence[index + 1]
        features["next_pos"] = "+".join(
            (self.pos_tags[index], self.pos_tags[index + 1])
        )
        features["next_stem"] = stem(next_token)
        features["next_is_capitalised"] = self._is_capitalised(next_token)
        features["next_is_numeric"] = self._is_numeric(next_token)
        features["next_is_unit"] = self._is_unit(next_token)
        features["next_is_punc"] = self._is_punc(next_token)
        features["next_is_ambiguous"] = self._is_ambiguous_unit(next_token)
        features["next_is_in_parens"] = self._is_inside_parentheses(index + 1)
        features["next_is_after_comma"] = self._follows_comma(index + 1)
        features["next_is_after_plus"] = self._follows_plus(index + 1)

    if index < len(self.tokenized_sentence) - 2:
        next_token2 = self.tokenized_sentence[index + 2]
        features["next_pos2"] = "+".join(
            (
                self.pos_tags[index + 2],
                self.pos_tags[index + 1],
                self.pos_tags[index],
            )
        )
        features["next_stem2"] = stem(next_token2)
        features["next_is_capitalised2"] = self._is_capitalised(next_token2)
        features["next_is_numeric2"] = self._is_numeric(next_token2)
        features["next_is_unit2"] = self._is_unit(next_token2)
        features["next_is_punc2"] = self._is_punc(next_token2)
        features["next_is_ambiguous2"] = self._is_ambiguous_unit(next_token2)
        features["next_is_in_parens2"] = self._is_inside_parentheses(index + 2)
        features["next_is_after_comma2"] = self._follows_comma(index + 2)
        features["next_is_after_plus2"] = self._follows_plus(index + 2)

    return features

The sentence_features() method of PreProcessor returns the features for all tokens in the sentence as a list.
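
The implementation of sentence_features() is not shown here. A plausible minimal sketch, assuming it simply calls _token_features() once per token index, is given below. ToyPreProcessor is a hypothetical, much-reduced stand-in for PreProcessor, and its feature dictionary only contains the token and its immediate neighbours so that the example stays self-contained.

```python
class ToyPreProcessor:
    """Hypothetical, much-reduced stand-in for PreProcessor."""

    def __init__(self, tokens: list[str]):
        self.tokenized_sentence = tokens

    def _token_features(self, index: int) -> dict[str, str]:
        """Return a small feature dictionary for the token at index."""
        features = {"token": self.tokenized_sentence[index]}
        # Neighbouring features are only added when the neighbour exists.
        if index > 0:
            features["prev_token"] = self.tokenized_sentence[index - 1]
        if index < len(self.tokenized_sentence) - 1:
            features["next_token"] = self.tokenized_sentence[index + 1]
        return features

    def sentence_features(self) -> list[dict[str, str]]:
        """Return one feature dictionary per token, in sentence order."""
        return [
            self._token_features(index)
            for index in range(len(self.tokenized_sentence))
        ]


features = ToyPreProcessor(["1", "cup", "milk"]).sentence_features()
print(len(features))  # 3
print(features[0])    # {'token': '1', 'next_token': 'cup'}
```

The returned list has the same length and order as the tokenised sentence, so each feature dictionary can be matched to its token by index.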


It is likely that some of these features aren’t necessary. There is a chunk of work for the future to determine the most useful features.