The Vectorize module
The Vectorize module is the third module in the pipeline. It weights and prunes features based on their characteristics in the training data. In contrast to the preceding Preprocess and Featurize modules, this module requires separate train and test input.
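The following is a minimal sketch of why separate train and test input matters: feature selection and weights are derived from the training data only and then applied unchanged to the test data. It assumes tf-idf weighting and pruning to the most frequent features, which may differ from what the module actually does. The function names `fit_vocabulary` and `vectorize` and the toy documents are hypothetical and not part of Quoll.

```python
import math
from collections import Counter


def fit_vocabulary(train_docs, prune=2):
    """Select features and compute idf weights from the training data only.

    Assumption: pruning keeps the `prune` features with the highest
    document frequency; the real module may use another criterion.
    """
    n_docs = len(train_docs)
    # Document frequency: number of training documents containing each feature.
    doc_freq = Counter(feat for doc in train_docs for feat in set(doc))
    # Keep the most frequent features; ties are broken alphabetically.
    kept = sorted(doc_freq, key=lambda f: (-doc_freq[f], f))[:prune]
    vocabulary = sorted(kept)
    idf = {f: math.log(n_docs / doc_freq[f]) + 1.0 for f in vocabulary}
    return vocabulary, idf


def vectorize(docs, vocabulary, idf):
    """Map documents to tf-idf vectors using statistics fitted on the training data."""
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[f] * idf[f] for f in vocabulary])
    return vectors


# Hypothetical toy data: each document is a list of already extracted features.
train_docs = [
    ["koala", "eats", "leaves"],
    ["kangaroo", "eats", "grass"],
    ["koala", "sleeps"],
]
test_docs = [["wombat", "eats", "grass"]]

vocabulary, idf = fit_vocabulary(train_docs, prune=2)
train_vectors = vectorize(train_docs, vocabulary, idf)
test_vectors = vectorize(test_docs, vocabulary, idf)
```

Features that were pruned from or never seen in the training data (such as `wombat` and `grass` here) simply do not appear in the test vectors, so train and test instances always share the same columns.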
Input
If input for Preprocess (.txt or .txtdir) or Featurize (.tok.txt, .tok.txtdir, .frog.json or .frog.jsondir) is passed to --traininstances or --testinstances, the corresponding module is run prior to the Vectorize module.
--traininstances
--trainlabels
--testinstances
Options
Options for Preprocess and Featurize also apply, and take effect when input files for these modules are given.
--weight
--prune
--balance
--delimiter
--scale
Output
Output file | Description |
---|---|
.balanced.features.npz | Balanced instances. Only applies when --balance is chosen. |
.balanced.labels | Labels related to balanced instances. Only applies when --balance is chosen. |
.balanced.vocabulary | Vocabulary related to balanced instances. |
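As an illustration of what balanced instances and their labels could look like, here is a minimal sketch that assumes balancing means randomly undersampling every class down to the size of the smallest class. This strategy is an assumption, not a description of the module's actual behavior, and the function `balance` below is hypothetical.

```python
import random
from collections import defaultdict


def balance(instances, labels, seed=0):
    """Undersample every class to the size of the smallest class.

    Assumption: random undersampling; the real module may balance differently.
    """
    rng = random.Random(seed)
    # Group instance indices by their label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    # Every class is reduced to the size of the smallest class.
    target = min(len(indices) for indices in by_label.values())
    kept = []
    for indices in by_label.values():
        kept.extend(rng.sample(indices, target))
    kept.sort()  # preserve the original instance order
    return [instances[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy data: four instances of one class, two of the other.
instances = [[1, 0], [0, 1], [1, 1], [0, 0], [2, 0], [0, 2]]
labels = ["pos", "pos", "pos", "pos", "neg", "neg"]

balanced_instances, balanced_labels = balance(instances, labels)
```

Instances and labels are filtered with the same indices, so they stay aligned, which is why the balanced features and labels would be written out as a matching pair.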
Overview
--inputfile | --featuretypes | --ngrams | --blackfeats | --lowercase | --minimum-token-frequency | Output |
---|---|---|---|---|---|---|
docs.tok.txt | tokens | '1 2 3' | False | True | 2 | |
docs.frog.jsondir | 'tokens lemmas pos' | 1 | 'koala kangaroo' | False | 10 | |
Examples of command line usage
Extract word n-grams from a tokenized text document, lowercasing them and stripping away token n-grams that occur fewer than 5 times:
$ luiginlp Featurize --module quoll.classification_pipeline.modules.featurize --inputfile docs.tok.txt --lowercase --minimum-token-frequency 5
Extract lemma and pos n-grams from a directory with frogged texts:
$ luiginlp Featurize --module quoll.classification_pipeline.modules.featurize --inputfile docs.frog.jsondir --featuretypes 'lemmas pos'
Frog a text document, extract token and pos features, and strip away any feature containing the word 'snake':
$ luiginlp Featurize --module quoll.classification_pipeline.modules.featurize --inputfile docs.txt --frogconfig /mylamachinedir/share/frog/nld/frog.cfg --featuretypes 'tokens pos' --blackfeats snake