The Vectorize module¶

The Vectorize module is the third module in the pipeline. It weights and prunes features based on characteristics of the training data. In contrast to the preliminary Preprocess and Featurize modules, this module requires separate train and test input.

Input¶

If the input to Preprocess (.txt or .txtdir) or Featurize (.tok.txt, .tok.txtdir, .frog.json or .frog.jsondir) is given as input to traininstances or testinstances, the corresponding module is run prior to the Vectorize module.

`--traininstances`
	Traininstances takes featurized or vectorized documents as input. They can come in the following formats:
	Extension .features.npz - Output of the Featurize module.
	Extension .vectors.npz - Output of the Vectorize module (can be used for vectorizing test documents).
	Extension .csv - File with feature vectors formatted as comma-separated values (useful when applying feature extraction that is not accommodated by quoll). When working with ‘.csv’ input, a file with feature names should be created that has the same path and name as the ‘.csv’ file, replacing .csv with .featureselection.txt.
`--trainlabels`
	Extension .labels - File with one label per line. There should be as many labels as there are instances in traininstances, and each label should be on the same position as the instance it belongs to.
`--testinstances`
	Like traininstances, testinstances takes featurized or vectorized documents as input. They can come in the following formats:
	Extension .features.npz - Output of the Featurize module.
	Extension .csv - File with feature vectors formatted as comma-separated values (useful when applying feature extraction that is not accommodated by quoll). Can only be used when the input to traininstances is of the same format; the number of columns should equal that of the traininstances input.
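The two file conventions above (the feature names file that accompanies a ‘.csv’ file, and the one-label-per-line alignment with the instances) can be checked before running the module. The following is a minimal Python sketch, not part of quoll; the helper names `featurenames_path` and `check_alignment` are hypothetical and only illustrate the conventions described in this section.

```python
from pathlib import Path


def featurenames_path(csv_path):
    """Return the companion feature names path for a '.csv' instances file."""
    path = Path(csv_path)
    if path.suffix != ".csv":
        raise ValueError(f"expected a .csv file, got {csv_path!r}")
    # Same directory and base name, with .csv swapped for .featureselection.txt
    return path.with_name(path.stem + ".featureselection.txt")


def check_alignment(instances, labels):
    """Pair each instance with the label on the same position.

    Raises ValueError when the number of labels differs from the
    number of instances.
    """
    if len(instances) != len(labels):
        raise ValueError(
            f"{len(instances)} instances but {len(labels)} labels"
        )
    return list(zip(instances, labels))
```

For example, `featurenames_path("data/train.csv")` points at `train.featureselection.txt` in the same directory.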

Options¶

Options for Preprocess and Featurize also apply and take effect in combination with the input files for these modules.

`--weight`
	Specify the feature weighting. Does not work in combination with a ‘.csv’ file.
	Options: frequency, binary, tfidf, infogain.
	For tfidf or infogain, it is recommended to set the minimum feature frequency to 5 or 10 in the Featurize module.
	String parameter; default: frequency
`--prune`
	Specify the number of features to maintain after pruning. Does not work in combination with a ‘.csv’ file.
	Pruning is done by ranking features based on their feature weight (the total count of a feature is taken in case of ‘frequency’ and ‘binary’ weighting).
	Integer parameter; default: 5000
`--balance`
	Choose to balance the number of train instances. The number of instances for each class label is decreased to the instance count of the least frequent class. Can help in case of strong class skewness.
	Boolean parameter; default: False
`--delimiter`
	Specify the delimiter by which columns in the ‘.csv’ file are separated. Only applies to ‘.csv’ files.
	String parameter; default: ,
`--scale`
	Option to normalize feature values to the same scale. Only applies to ‘.csv’ files. Useful in combination with some classifiers, if the features in the ‘.csv’ file are from different sources and have a wide range of values.
	Boolean parameter; default: False
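To make the interplay of weighting and pruning concrete, here is a small Python sketch. It is not quoll's implementation. It assumes the common tf * log(N / df) form of tf-idf, and it assumes that features are ranked by their weight summed over all documents; the documentation above only states that ranking is based on feature weight. The function names `tfidf` and `prune` are hypothetical.

```python
import math


def tfidf(counts):
    """Weight per-document count rows with tf * log(N / df) (assumed variant)."""
    n_docs = len(counts)
    n_feats = len(counts[0])
    # Document frequency: number of documents in which each feature occurs
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_feats)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_feats)] for row in counts]


def prune(rows, names, keep):
    """Keep the `keep` features with the highest summed weight."""
    totals = [sum(row[j] for row in rows) for j in range(len(names))]
    ranked = sorted(range(len(names)), key=lambda j: totals[j], reverse=True)
    kept = sorted(ranked[:keep])  # restore the original column order
    return [[row[j] for j in kept] for row in rows], [names[j] for j in kept]
```

With plain counts as input, `prune` ranks on the total count of each feature, which matches the behaviour described for ‘frequency’ weighting. A feature that occurs in every document gets a tf-idf weight of zero under this variant, which is one reason a weighted ranking can differ from a count-based one.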

Output¶

.balanced.features.npz:
	Balanced instances. Only applies when ‘balance’ is chosen.
.balanced.labels:
	Labels related to balanced instances. Only applies when ‘balance’ is chosen.
.balanced.vocabulary:
	Vocabulary related to balanced instances
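The balanced output results from decreasing every class to the instance count of the least frequent class. A minimal Python sketch of that idea follows. It is not quoll's code: how quoll selects which instances to keep is not stated here, so the seeded random sampling and the function name `balance` are assumptions.

```python
import random
from collections import Counter


def balance(instances, labels, seed=0):
    """Downsample every class to the size of the least frequent class."""
    target = min(Counter(labels).values())
    rng = random.Random(seed)  # seeded so the selection is reproducible
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)
    keep = []
    for indices in by_label.values():
        # Assumption: instances to keep are drawn at random per class
        keep.extend(rng.sample(indices, target))
    keep.sort()  # preserve the original instance order
    return [instances[i] for i in keep], [labels[i] for i in keep]
```

Instances and labels are filtered with the same indices, so each label stays on the same position as its instance.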

Overview¶

--inputfile	--featuretypes	--ngrams	--blackfeats	--lowercase	--minimum-token-frequency	Output
docs.tok.txt	tokens	'1 2 3'	False	True	2	docs.tokens.n_1_2_3.min2.lower_True.black_False.features.npz docs.tokens.n_1_2_3.min2.lower_True.black_False.vocabulary.txt
docs.frog.jsondir	'tokens lemmas pos'	1	'koala kangaroo'	False	10	docs.tokens.lemmas.pos.n_1.min10.lower_False.black_koala_kangaroo.features.npz docs.tokens.lemmas.pos.n_1.min10.lower_False.black_koala_kangaroo.vocabulary.txt
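The output names in the table follow a regular pattern built from the option values. The Python sketch below reproduces that pattern as inferred from the two rows only; it is not taken from quoll's source, and the function name `featurize_output_name` and its parameter names are hypothetical.

```python
from pathlib import Path


def featurize_output_name(inputfile, featuretypes, ngrams, blackfeats,
                          lowercase, min_freq):
    """Build the .features.npz name following the pattern in the table."""
    # Assumption: the base name is everything before the first dot
    base = Path(inputfile).name.split(".")[0]
    # Blacklisted words are joined with underscores; 'False' when none given
    black = "_".join(blackfeats.split()) if blackfeats else "False"
    parts = [
        base,
        ".".join(featuretypes.split()),
        "n_" + "_".join(ngrams.split()),
        f"min{min_freq}",
        f"lower_{lowercase}",
        f"black_{black}",
    ]
    return ".".join(parts) + ".features.npz"
```

The matching vocabulary file carries the same prefix with `.vocabulary.txt` in place of `.features.npz`.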

Examples of command line usage¶

Extract word Ngrams from a tokenized text document, lowercasing them and stripping away token Ngrams that occur fewer than 5 times

$ luiginlp Featurize --module quoll.classification_pipeline.modules.featurize --inputfile docs.tok.txt --lowercase --minimum-token-frequency 5

Extract lemma and pos Ngrams from directory with frogged texts

$ luiginlp Featurize --module quoll.classification_pipeline.modules.featurize --inputfile docs.frog.jsondir --featuretypes 'lemmas pos'

Frog a text document, extract token and pos features, and strip away any feature containing the word ‘snake’

$ luiginlp Featurize --module quoll.classification_pipeline.modules.featurize --inputfile docs.txt --frogconfig /mylamachinedir/share/frog/nld/frog.cfg --featuretypes 'tokens pos' --blackfeats snake
