Feature selection#
Feature calculation is done for each token in the sentence, so first the normalised sentence must be tokenised.
Tokenization#
Once the input sentence has been normalised, it can be split into tokens. Each token represents a single unit of the sentence. Tokens are not necessarily the same as words, because we might want to handle punctuation and compound words in particular ways.
The tokenizer is built using regular expressions that split an input string into tokens. It splits the sentence according to the following rules:
import re
from itertools import chain

# Define regular expressions used by tokenizer.
# Matches runs of one or more non-whitespace characters
WHITESPACE_TOKENISER = re.compile(r"\S+")
# Matches and captures one of the following: ( ) [ ] { } , / : ;
PUNCTUATION_TOKENISER = re.compile(r"([\(\)\[\]\{\}\,/:;])")


def tokenize(sentence: str) -> list[str]:
    """Tokenise an ingredient sentence.

    The sentence is split on whitespace characters into a list of tokens.
    If any of these tokens contains any of the punctuation marks captured by
    PUNCTUATION_TOKENISER, these are then split and isolated as separate
    tokens.

    The returned list of tokens has any empty tokens removed.

    Parameters
    ----------
    sentence : str
        Ingredient sentence to tokenize

    Returns
    -------
    list[str]
        List of tokens from sentence.

    Examples
    --------
    >>> tokenize("2 cups (500 ml) milk")
    ['2', 'cups', '(', '500', 'ml', ')', 'milk']

    >>> tokenize("1-2 mashed bananas: as ripe as possible")
    ['1-2', 'mashed', 'bananas', ':', 'as', 'ripe', 'as', 'possible']
    """
    tokens = [
        PUNCTUATION_TOKENISER.split(tok)
        for tok in WHITESPACE_TOKENISER.findall(sentence)
    ]
    return [tok for tok in chain.from_iterable(tokens) if tok]
This splits the sentence apart wherever there is whitespace or one of the punctuation marks captured by PUNCTUATION_TOKENISER.
>>> from Preprocess import PreProcessor
>>> p = PreProcessor("1/2 cup orange juice, freshly squeezed")
>>> p.tokenized_sentence
['0.5', 'cup', 'orange', 'juice', ',', 'freshly', 'squeezed']
Feature calculation#
The features for each token in each sentence need to be selected and extracted.
There is quite a wide range of features that can be extracted for each token, and it can be difficult to tell whether a particular feature is useful or not.
The Ingredient Phrase Tagger approach to features was to use the following:
The token itself
The position of the token in the sentence, as an index
The number of tokens in the sentence, but rounded down to the nearest group in [4, 8, 12, 16, 20]
Whether the token starts with a capital letter
Whether the token is inside parentheses in the sentence
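As a rough illustration of that approach, the sketch below shows how such a feature dictionary might be built. This is a hypothetical reconstruction, not the Ingredient Phrase Tagger's actual code; the function and key names are invented, and clamping sentences shorter than 4 tokens to the smallest group is an assumption.
import bisect

LENGTH_GROUPS = [4, 8, 12, 16, 20]


def length_group(n: int) -> int:
    # Round the token count down to the nearest group boundary.
    # Assumption: counts below 4 are clamped to the smallest group.
    i = bisect.bisect_right(LENGTH_GROUPS, n) - 1
    return LENGTH_GROUPS[max(i, 0)]


def is_inside_parens(tokens: list[str], index: int) -> bool:
    # True if an unmatched "(" appears before this token.
    return tokens[:index].count("(") > tokens[: index + 1].count(")")


def ipt_style_features(tokens: list[str], index: int) -> dict[str, str | int | bool]:
    # Hypothetical feature names, shown only to make the list above concrete.
    token = tokens[index]
    return {
        "token": token,
        "index": index,
        "length_group": length_group(len(tokens)),
        "is_capitalised": token[:1].isupper(),
        "is_in_parens": is_inside_parens(tokens, index),
    }
For the tokenised sentence ['2', 'cups', '(', '500', 'ml', ')', 'milk'], this would give a length group of 4 and mark '500' and 'ml' as inside parentheses.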
The features used for this model are a little different:
The stem of the token
The part of speech (POS) tag
Whether the token is capitalised
Whether the token is numeric
Whether the token is a unit (determined from the list of units)
Whether the token is a punctuation mark
Whether the token is an ambiguous unit
Whether the token is inside parentheses
Whether the token is after a comma
Whether the token follows a + symbol
Whether the sentence is a short sentence (having less than 3 tokens)
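Several of these are simple boolean checks. As a rough illustration, the sketches below show how a few of them might work; these are hypothetical simplifications rather than PreProcessor's actual helpers (the real _is_numeric, for example, also needs to handle tokens like ranges).
import string


def is_capitalised(token: str) -> bool:
    # True if the token starts with an upper-case letter.
    return token[:1].isupper()


def is_numeric(token: str) -> bool:
    # True if the token parses as a plain number, e.g. "2" or "0.5".
    # A simplification: ranges such as "1-2" are not handled here.
    try:
        float(token)
        return True
    except ValueError:
        return False


def is_punc(token: str) -> bool:
    # True if the token is a single punctuation mark.
    return len(token) == 1 and token in string.punctuation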
If possible, based on the position of the token in the sentence, the following features are also added:
The stem of the previous token
The POS tag for the previous token combined with the POS tag for the current token
The stem of the token before the previous token
The POS tag for the token before the previous token combined with the POS tags for the previous and current tokens
The stem of the next token
The POS tag for the next token combined with the POS tag for the current token
The stem of the token after the next token
The POS tag for the token after the next token combined with the POS tags for the current and next tokens
The _token_features() function of PreProcessor returns all these features as a dictionary.
def _token_features(self, index: int) -> dict[str, str | bool]:
    """Return the features for the token at the given index in the sentence.

    Parameters
    ----------
    index : int
        Index of token to get features for.

    Returns
    -------
    dict[str, str | bool]
        Dictionary of features for token at index
    """
    token = self.tokenized_sentence[index]
    features = {
        # Constant bias feature, the same for every token.
        "bias": "",
        "stem": stem(token),
        "pos": self.pos_tags[index],
        "is_capitalised": self._is_capitalised(token),
        "is_numeric": self._is_numeric(token),
        "is_unit": self._is_unit(token),
        "is_punc": self._is_punc(token),
        "is_ambiguous": self._is_ambiguous_unit(token),
        "is_in_parens": self._is_inside_parentheses(index),
        "is_after_comma": self._follows_comma(index),
        "is_after_plus": self._follows_plus(index),
        "is_short_phrase": len(self.tokenized_sentence) < 3,
    }

    # Only include the raw token as a feature if stemming changed it.
    if token != stem(token):
        features["token"] = token

    if index > 0:
        prev_token = self.tokenized_sentence[index - 1]
        features["prev_pos"] = "+".join(
            (self.pos_tags[index - 1], self.pos_tags[index])
        )
        features["prev_stem"] = stem(prev_token)
        features["prev_is_capitalised"] = self._is_capitalised(prev_token)
        features["prev_is_numeric"] = self._is_numeric(prev_token)
        features["prev_is_unit"] = self._is_unit(prev_token)
        features["prev_is_punc"] = self._is_punc(prev_token)
        features["prev_is_ambiguous"] = self._is_ambiguous_unit(prev_token)
        features["prev_is_in_parens"] = self._is_inside_parentheses(index - 1)
        features["prev_is_after_comma"] = self._follows_comma(index - 1)
        features["prev_is_after_plus"] = self._follows_plus(index - 1)

    if index > 1:
        prev_token2 = self.tokenized_sentence[index - 2]
        features["prev_pos2"] = "+".join(
            (
                self.pos_tags[index - 2],
                self.pos_tags[index - 1],
                self.pos_tags[index],
            )
        )
        features["prev_stem2"] = stem(prev_token2)
        features["prev_is_capitalised2"] = self._is_capitalised(prev_token2)
        features["prev_is_numeric2"] = self._is_numeric(prev_token2)
        features["prev_is_unit2"] = self._is_unit(prev_token2)
        features["prev_is_punc2"] = self._is_punc(prev_token2)
        features["prev_is_ambiguous2"] = self._is_ambiguous_unit(prev_token2)
        features["prev_is_in_parens2"] = self._is_inside_parentheses(index - 2)
        features["prev_is_after_comma2"] = self._follows_comma(index - 2)
        features["prev_is_after_plus2"] = self._follows_plus(index - 2)

    if index < len(self.tokenized_sentence) - 1:
        next_token = self.tokenized_sentence[index + 1]
        features["next_pos"] = "+".join(
            (self.pos_tags[index], self.pos_tags[index + 1])
        )
        features["next_stem"] = stem(next_token)
        features["next_is_capitalised"] = self._is_capitalised(next_token)
        features["next_is_numeric"] = self._is_numeric(next_token)
        features["next_is_unit"] = self._is_unit(next_token)
        features["next_is_punc"] = self._is_punc(next_token)
        features["next_is_ambiguous"] = self._is_ambiguous_unit(next_token)
        features["next_is_in_parens"] = self._is_inside_parentheses(index + 1)
        features["next_is_after_comma"] = self._follows_comma(index + 1)
        features["next_is_after_plus"] = self._follows_plus(index + 1)

    if index < len(self.tokenized_sentence) - 2:
        next_token2 = self.tokenized_sentence[index + 2]
        features["next_pos2"] = "+".join(
            (
                self.pos_tags[index + 2],
                self.pos_tags[index + 1],
                self.pos_tags[index],
            )
        )
        features["next_stem2"] = stem(next_token2)
        features["next_is_capitalised2"] = self._is_capitalised(next_token2)
        features["next_is_numeric2"] = self._is_numeric(next_token2)
        features["next_is_unit2"] = self._is_unit(next_token2)
        features["next_is_punc2"] = self._is_punc(next_token2)
        features["next_is_ambiguous2"] = self._is_ambiguous_unit(next_token2)
        features["next_is_in_parens2"] = self._is_inside_parentheses(index + 2)
        features["next_is_after_comma2"] = self._follows_comma(index + 2)
        features["next_is_after_plus2"] = self._follows_plus(index + 2)

    return features
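Continuing the earlier example, the features for a single token can be inspected directly. The output below is illustrative and abbreviated, and assumes 'cup' appears in the list of units:
>>> p = PreProcessor("1/2 cup orange juice, freshly squeezed")
>>> features = p._token_features(1)  # features for "cup"
>>> features["stem"], features["is_unit"], features["is_numeric"]
('cup', True, False)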
The sentence_features() function of PreProcessor returns the features for all tokens in the sentence in a list.
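A minimal sketch of how this might be implemented, assuming it simply maps _token_features over every token index (the actual implementation may differ):
def sentence_features(self) -> list[dict[str, str | bool]]:
    """Return the features for every token in the sentence."""
    # One feature dictionary per token, in sentence order.
    return [
        self._token_features(index)
        for index in range(len(self.tokenized_sentence))
    ]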
Attention
It is likely that some of these features aren't necessary; determining the most useful set of features is a chunk of work left for the future.