The Featurize module

The Featurize module is the second module in the pipeline, taking care of feature extraction from the output of the Preprocess module. It makes use of ColibriCore to count features, and its output forms the input to the Vectorize module.

Input

If the input to Preprocess_ (.txt or .txtdir) is given as inputfile, this module is ran prior to the Featurize module.

--inputfile
  • The featurize module takes preprocessed documents as input. They can come in four formats:
  1. Extension .tok.txt - File with tokenized text documents on each line.
  2. Extension .tok.txtdir - Directory with tokenized text documents (files ending with .tok.txt).
  3. Extension .frog.json - File with frogged text documents.
  4. Extension .frog.jsondir - Directory with frogged text documents (files ending with .frog.json).

Options

--featuretypes
  • Specify the types of features to extract
  • Options: tokens, lemmas, pos
  • lemmas and pos only apply to input with extension .frog.json and .frog.json.dir
  • multiple options can be given with quotes, divided by a space (for example: ‘tokens lemmas pos’)
  • String parameter; default: tokens
--ngrams
  • Specify the length of the N-grams that you want to include
  • Ngram values should be given within quotes, divided by a space (for example: ‘1 2’)
  • String parameter; default: ‘1 2 3’
--blackfeats
  • In order to exclude words from the feature space, specify them here
  • Each feature to be excluded should be given within quotes, divided by a space (for example: ‘do re mi’)
  • Each ngram within the feature space that includes any of the given blackfeats will be removed
  • String parameter; default: False
--lowercase
  • Choose to lowercase all text and lemma features
  • Boolean parameter; default: False
--minimum-token-frequency
 
  • Option to delete all features that occur below the given threshold
  • Recommended to set to 5 or 10 when applying tfidf or infogain weighting in the Vectorize module
  • Integer parameter; default: 1

Output

.features.npz:Binary file in Numpy format, storing the extracted features per document in sparse format
.vocabulary.txt:
 File that stores the index of each feature

Overview

–inputfile –featuretypes –ngrams –blackfeats –lowercase –minimum-token-frequency Output
docs.tok.txt tokens ‘1 2 3’ False True 2
  • docs.tokens.n_1_2_3.min2.lower_True.black_False.features.npz
  • docs.tokens.n_1_2_3.min2.lower_True.black_False.vocabulary.txt
dos.frog.jsondir ‘tokens lemmas pos’ 1 ‘koala kangaroo’ False 10
  • docs.tokens.lemmas.pos.n_1.min10.lower_False.black_koala_kangaroo.features.npz
  • docs.tokens.lemmas.pos.n_1.min10.lower_False.black_koala_kangaroo.vocabulary.txt

Examples of command line usage

Extract word Ngrams from tokenized text document, lowercasing them and stripping away token Ngrams that occur less than 5 times

$ luiginlp Featurize –module quoll.classification_pipeline.modules.featurize –inputfile docs.tok.txt –lowercase –minimum-token-frequency 5

Extract lemma and pos Ngrams from directory with frogged texts

$ luiginlp Featurize –module quoll.classification_pipeline.modules.featurize –inputfile docs.frog.jsondir –featuretypes ‘lemmas pos’

Frog text document, extract text and pos features and strip away any feature with the word ‘snake’

$ luiginlp Featurize –module quoll.classification_pipeline.modules.featurize –inputfile docs.txt –frogconfig /mylamachinedir/share/frog/nld/frog.cfg –featuretypes ‘tokens pos’ –blackfeats snake