Feature selection#

Features are calculated for each token in the sentence, so the normalised sentence must first be tokenised.

Tokenization#

Once the input sentence has been normalised, it can be split into tokens. Each token represents a single unit of the sentence. Tokens are not necessarily the same as words, because we might want to handle punctuation and compound words in particular ways.

The tokenizer is built from regular expressions (similar in approach to NLTK's Regular Expression tokenizer), splitting a string input according to the patterns below.

The tokenizer splits the sentence according to the following rules:

import re
from itertools import chain

# Define regular expressions used by tokenizer.
# Matches runs of one or more non-whitespace characters, i.e. splits on whitespace
WHITESPACE_TOKENISER = re.compile(r"\S+")
# Matches and captures one of the following: ( ) [ ] { } , / : ;
PUNCTUATION_TOKENISER = re.compile(r"([\(\)\[\]\{\}\,/:;])")


def tokenize(sentence: str) -> list[str]:
    """Tokenise an ingredient sentence.
    The sentence is split on whitespace characters into a list of tokens.
    If any of these tokens contains of the punctuation marks captured by
    PUNCTUATION_TOKENISER, these are then split and isolated as a seperate
    token.

    The returned list of tokens has any empty tokens removed.

    Parameters
    ----------
    sentence : str
        Ingredient sentence to tokenize

    Returns
    -------
    list[str]
        List of tokens from sentence.

    Examples
    --------
    >>> tokenize("2 cups (500 ml) milk")
    ["2", "cups", "(", "500", "ml", ")", "milk"]

    >>> tokenize("1-2 mashed bananas: as ripe as possible")
    ["1-2", "mashed", "bananas", ":", "as", "ripe", "as", "possible"]
    """
    tokens = [
        PUNCTUATION_TOKENISER.split(tok)
        for tok in WHITESPACE_TOKENISER.findall(sentence)
    ]
    return [tok for tok in chain.from_iterable(tokens) if tok]

This splits the sentence apart wherever there is whitespace or one of the punctuation marks captured by PUNCTUATION_TOKENISER.

>>> from Preprocess import PreProcessor
>>> p = PreProcessor("1/2 cup orange juice, freshly squeezed")
>>> p.tokenized_sentence
['0.5', 'cup', 'orange', 'juice', ',', 'freshly', 'squeezed']
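
Under the hood, the two regular expressions work in sequence: the sentence is first split on whitespace, then any captured punctuation marks are split out and empty strings are discarded. The intermediate stages for one of the docstring examples look like this:

>>> WHITESPACE_TOKENISER.findall("2 cups (500 ml) milk")
['2', 'cups', '(500', 'ml)', 'milk']
>>> [PUNCTUATION_TOKENISER.split(tok) for tok in ['(500', 'ml)']]
[['', '(', '500'], ['ml', ')', '']]

The empty strings produced when punctuation sits at the start or end of a token are what the final if tok filter in tokenize() removes.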

Feature calculation#

The features for each token in each sentence need to be selected and extracted.

There is quite a wide range of features that can be extracted for each token, and it can be difficult to tell whether a particular feature is useful or not.

The Ingredient Phrase Tagger approach to features was to use the following:

  • The token itself

  • The position of the token in the sentence, as an index

  • The number of tokens in the sentence, rounded down to the nearest group in [4, 8, 12, 16, 20]

  • Whether the token starts with a capital letter

  • Whether the token is inside parentheses in the sentence

The features used for this model are a little different (an illustrative sketch of a few of these checks follows the list):

  • The stem of the token

  • The part of speech (POS) tag

  • Whether the token is capitalised

  • Whether the token is numeric

  • Whether the token is a unit (determined from the list of units)

  • Whether the token is a punctuation mark

  • Whether the token is an ambiguous unit

  • Whether the token is inside parentheses

  • Whether the token is after a comma

  • Whether the token follows a + symbol

  • Whether the sentence is a short sentence (fewer than 3 tokens)
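
As a rough illustration, a few of these checks might look like the sketch below. This is illustrative only: the function names mirror the PreProcessor helper methods used later (_is_capitalised, _is_numeric, _is_unit), but the actual implementations and the real list of units may differ.

import re

CAPITALISED = re.compile(r"^[A-Z]")
# Hypothetical, abbreviated list of units, for illustration only.
UNITS = {"cup", "cups", "tbsp", "tablespoon", "tsp", "teaspoon", "g", "kg", "ml", "l"}


def is_capitalised(token: str) -> bool:
    # True if the token starts with a capital letter.
    return CAPITALISED.match(token) is not None


def is_numeric(token: str) -> bool:
    # True if the token parses as a number, e.g. "2" or "0.5".
    # The real check may handle additional cases, such as ranges like "1-2".
    try:
        float(token)
        return True
    except ValueError:
        return False


def is_unit(token: str) -> bool:
    # True if the token appears in the list of known units.
    return token.lower() in UNITS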

If possible, based on the position of the token in the sentence, the following features are also added (along with the other token features listed above, calculated for each neighbouring token):

  • The stem of the previous token

  • The POS tag for the previous token combined with the POS tag for the current token

  • The stem of the token before the previous token

  • The POS tag for the token before the previous token combined with the POS tags for the previous and current tokens

  • The stem of the next token

  • The POS tag for the next token combined with the POS tag for the current token

  • The stem of the token after the next token

  • The POS tag for the token after the next token combined with the POS tags for the current and next tokens

The _token_features() function of PreProcessor returns all these features as a dictionary.

def _token_features(self, index: int) -> dict[str, str | bool]:
    """Return the features for each token in the sentence

    Parameters
    ----------
    index : int
        Index of token to get features for.

    Returns
    -------
    dict[str, str | bool]
        Dictionary of features for token at index
    """
    token = self.tokenized_sentence[index]
    features = {
        "bias": "",
        "stem": stem(token),
        "pos": self.pos_tags[index],
        "is_capitalised": self._is_capitalised(token),
        "is_numeric": self._is_numeric(token),
        "is_unit": self._is_unit(token),
        "is_punc": self._is_punc(token),
        "is_ambiguous": self._is_ambiguous_unit(token),
        "is_in_parens": self._is_inside_parentheses(index),
        "is_after_comma": self._follows_comma(index),
        "is_after_plus": self._follows_plus(index),
        "is_short_phrase": len(self.tokenized_sentence) < 3,
    }

    if token != stem(token):
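        # Keep the raw token as an extra feature only when stemming has changed it.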
        features["token"] = token

    if index > 0:
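        # Features derived from the previous token.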
        prev_token = self.tokenized_sentence[index - 1]
        features["prev_pos"] = "+".join(
            (self.pos_tags[index - 1], self.pos_tags[index])
        )
        features["prev_stem"] = stem(prev_token)
        features["prev_is_capitalised"] = self._is_capitalised(prev_token)
        features["prev_is_numeric"] = self._is_numeric(prev_token)
        features["prev_is_unit"] = self._is_unit(prev_token)
        features["prev_is_punc"] = self._is_punc(prev_token)
        features["prev_is_ambiguous"] = self._is_ambiguous_unit(prev_token)
        features["prev_is_in_parens"] = self._is_inside_parentheses(index - 1)
        features["prev_is_after_comma"] = self._follows_comma(index - 1)
        features["prev_is_after_plus"] = self._follows_plus(index - 1)

    if index > 1:
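        # Features derived from the token two positions before the current one.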
        prev_token2 = self.tokenized_sentence[index - 2]
        features["prev_pos2"] = "+".join(
            (
                self.pos_tags[index - 2],
                self.pos_tags[index - 1],
                self.pos_tags[index],
            )
        )
        features["prev_stem2"] = stem(prev_token2)
        features["prev_is_capitalised2"] = self._is_capitalised(prev_token2)
        features["prev_is_numeric2"] = self._is_numeric(prev_token2)
        features["prev_is_unit2"] = self._is_unit(prev_token2)
        features["prev_is_punc2"] = self._is_punc(prev_token2)
        features["prev_is_ambiguous2"] = self._is_ambiguous_unit(prev_token2)
        features["prev_is_in_parens2"] = self._is_inside_parentheses(index - 2)
        features["prev_is_after_comma2"] = self._follows_comma(index - 2)
        features["prev_is_after_plus2"] = self._follows_plus(index - 2)

    if index < len(self.tokenized_sentence) - 1:
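        # Features derived from the next token.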
        next_token = self.tokenized_sentence[index + 1]
        features["next_pos"] = "+".join(
            (self.pos_tags[index], self.pos_tags[index + 1])
        )
        features["next_stem"] = stem(next_token)
        features["next_is_capitalised"] = self._is_capitalised(next_token)
        features["next_is_numeric"] = self._is_numeric(next_token)
        features["next_is_unit"] = self._is_unit(next_token)
        features["next_is_punc"] = self._is_punc(next_token)
        features["next_is_ambiguous"] = self._is_ambiguous_unit(next_token)
        features["next_is_in_parens"] = self._is_inside_parentheses(index + 1)
        features["next_is_after_comma"] = self._follows_comma(index + 1)
        features["next_is_after_plus"] = self._follows_plus(index + 1)

    if index < len(self.tokenized_sentence) - 2:
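        # Features derived from the token two positions after the current one.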
        next_token2 = self.tokenized_sentence[index + 2]
        features["next_pos2"] = "+".join(
            (
                self.pos_tags[index + 2],
                self.pos_tags[index + 1],
                self.pos_tags[index],
            )
        )
        features["next_stem2"] = stem(next_token2)
        features["next_is_capitalised2"] = self._is_capitalised(next_token2)
        features["next_is_numeric2"] = self._is_numeric(next_token2)
        features["next_is_unit2"] = self._is_unit(next_token2)
        features["next_is_punc2"] = self._is_punc(next_token2)
        features["next_is_ambiguous2"] = self._is_ambiguous_unit(next_token2)
        features["next_is_in_parens2"] = self._is_inside_parentheses(index + 2)
        features["next_is_after_comma2"] = self._follows_comma(index + 2)
        features["next_is_after_plus2"] = self._follows_plus(index + 2)

    return features

The sentence_features() function of PreProcessor returns the features for all tokens in the sentence as a list.
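
A minimal sketch of that method, assuming it simply applies _token_features() to every token index (the actual implementation may differ in detail):

def sentence_features(self) -> list[dict[str, str | bool]]:
    """Return the features for every token in the sentence."""
    # Collect the feature dictionary for each token, in sentence order.
    return [
        self._token_features(idx)
        for idx in range(len(self.tokenized_sentence))
    ]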

Attention

It is likely that some of these features aren't necessary. Determining the most useful set of features is left as future work.