The Featurize module
The Featurize module is the second module in the pipeline, taking care of feature extraction from the output of the Preprocess module. It makes use of ColibriCore to count features, and its output forms the input to the Vectorize module.
Input
If input for the Preprocess module (.txt or .txtdir) is given as inputfile, the Preprocess module is run before the Featurize module.
--inputfile |
|
Options
--featuretypes |
|
--ngrams |
|
--blackfeats |
|
--lowercase |
|
--minimum-token-frequency | |
|
Output
.features.npz: | Binary file in NumPy format, storing the extracted features per document in sparse format |
---|---|
.vocabulary.txt: | File that stores the index of each feature |
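As a rough illustration of how these two files might be read back, the sketch below decodes a sparse feature file into one dictionary per document. It rests on assumptions that are not stated on this page: that the .npz archive holds the CSR components under the keys `data`, `indices` and `indptr`, and that the vocabulary file lists one feature per line with the line number as its index. The file names and helper functions are hypothetical; inspecting `archive.files` on a real output file is a quick way to check the key names.

```python
import os
import tempfile

import numpy as np


def load_vocabulary(path):
    """Read one feature per line; the line number is taken as the feature index (assumption)."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_document_features(path, vocabulary):
    """Decode CSR components into one {feature: value} dict per document.

    Assumes the archive keys are named data, indices and indptr.
    """
    with np.load(path) as archive:
        data = archive["data"]
        indices = archive["indices"]
        indptr = archive["indptr"]
    documents = []
    # Row i of a CSR matrix occupies the slice indptr[i]:indptr[i + 1].
    for start, end in zip(indptr[:-1], indptr[1:]):
        row = {vocabulary[j]: float(v) for j, v in zip(indices[start:end], data[start:end])}
        documents.append(row)
    return documents


# Build a tiny stand-in for the two output files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, "demo.vocabulary.txt")
    feats_path = os.path.join(tmp, "demo.features.npz")
    with open(vocab_path, "w", encoding="utf-8") as handle:
        handle.write("koala\nsleeps\nthe\n")
    np.savez(
        feats_path,
        data=np.array([1, 1, 2]),
        indices=np.array([0, 2, 1]),
        indptr=np.array([0, 2, 3]),
        shape=np.array([2, 3]),
    )
    vocabulary = load_vocabulary(vocab_path)
    documents = load_document_features(feats_path, vocabulary)

print(documents)
```

With SciPy installed, `scipy.sparse.csr_matrix((data, indices, indptr), shape=shape)` rebuilds the full matrix from the same components, which is usually more convenient than decoding rows by hand.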
Overview
--inputfile | --featuretypes | --ngrams | --blackfeats | --lowercase | --minimum-token-frequency | Output |
---|---|---|---|---|---|---|
docs.tok.txt | tokens | '1 2 3' | False | True | 2 | |
docs.frog.jsondir | 'tokens lemmas pos' | 1 | 'koala kangaroo' | False | 10 | |
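To make the interplay of these options more concrete, here is a minimal pure-Python sketch of n-gram counting with lowercasing, a word blacklist and a minimum frequency. It is not the ColibriCore-based implementation used by the module. The function names, the `_` separator between tokens, and the choice to apply the frequency threshold to total counts across all documents are assumptions made for illustration.

```python
from collections import Counter


def extract_ngrams(tokens, orders=(1, 2, 3), lowercase=True, blackfeats=()):
    """Return all n-grams of the given orders for one tokenized document."""
    if lowercase:
        tokens = [token.lower() for token in tokens]
    blocked = set(blackfeats)
    ngrams = []
    for n in orders:
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Skip any n-gram that contains a blacklisted word.
            if blocked.intersection(gram):
                continue
            ngrams.append("_".join(gram))
    return ngrams


def featurize(documents, orders=(1, 2, 3), lowercase=True, blackfeats=(), min_freq=2):
    """Count n-grams per document, keeping those seen at least min_freq times overall."""
    per_doc = [
        Counter(extract_ngrams(doc.split(), orders, lowercase, blackfeats))
        for doc in documents
    ]
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)
    # The vocabulary maps each surviving feature to a column index.
    vocabulary = sorted(feat for feat, count in totals.items() if count >= min_freq)
    index = {feat: i for i, feat in enumerate(vocabulary)}
    # Each document becomes a sparse row: {column index: count}.
    rows = [
        {index[feat]: count for feat, count in counts.items() if feat in index}
        for counts in per_doc
    ]
    return vocabulary, rows


docs = ["The koala sleeps", "the koala eats", "A snake sleeps"]
vocabulary, rows = featurize(docs, orders=(1, 2), blackfeats=("snake",), min_freq=2)
print(vocabulary)
print(rows)
```

In this sketch the blacklist is compared against tokens after lowercasing, so blacklisted words would need to be given in lowercase when lowercasing is switched on.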
Examples of command line usage
Extract word Ngrams from a tokenized text document, lowercasing them and stripping away token Ngrams that occur fewer than 5 times
$ luiginlp Featurize --module quoll.classification_pipeline.modules.featurize --inputfile docs.tok.txt --lowercase --minimum-token-frequency 5
Extract lemma and pos Ngrams from a directory with frogged texts
$ luiginlp Featurize --module quoll.classification_pipeline.modules.featurize --inputfile docs.frog.jsondir --featuretypes 'lemmas pos'
Frog a text document, extract token and pos features, and strip away any feature containing the word 'snake'
$ luiginlp Featurize --module quoll.classification_pipeline.modules.featurize --inputfile docs.txt --frogconfig /mylamachinedir/share/frog/nld/frog.cfg --featuretypes 'tokens pos' --blackfeats snake