The Featurize module

The Featurize module is the second module in the pipeline. It extracts features from the output of the Preprocess module, using ColibriCore to count them, and its output forms the input to the Vectorize module.

Input

If input for the Preprocess module (.txt or .txtdir) is given as inputfile, the Preprocess module is run before the Featurize module.

--inputfile

The Featurize module takes preprocessed documents as input. They can come in four formats:

Extension .tok.txt - File with one tokenized text document per line.
Extension .tok.txtdir - Directory with tokenized text documents (files ending with .tok.txt).
Extension .frog.json - File with frogged text documents.
Extension .frog.jsondir - Directory with frogged text documents (files ending with .frog.json).
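The extension rules above can be illustrated with a small sketch. The helper `detect_format` and the `FORMATS` table are hypothetical names introduced here for illustration; they are not part of Quoll, whose own dispatch logic may differ.

```python
# Hypothetical helper (not part of Quoll): classify an input path by the
# extensions listed in this section.
FORMATS = {
    ".tok.txt": "file with tokenized text documents",
    ".tok.txtdir": "directory with tokenized text documents",
    ".frog.json": "file with frogged text documents",
    ".frog.jsondir": "directory with frogged text documents",
    ".txt": "raw text file (Preprocess runs first)",
    ".txtdir": "directory with raw text files (Preprocess runs first)",
}


def detect_format(path):
    """Return the longest known extension that the path ends with."""
    # Check longer extensions first so that '.tok.txt' wins over '.txt'.
    for ext in sorted(FORMATS, key=len, reverse=True):
        if path.endswith(ext):
            return ext
    raise ValueError("unrecognized input extension: " + path)


if __name__ == "__main__":
    for name in ("docs.tok.txt", "docs.frog.jsondir", "docs.txt"):
        ext = detect_format(name)
        print(name, "->", ext, "-", FORMATS[ext])
```

Matching the longest extension first is what keeps a preprocessed file such as docs.tok.txt from being mistaken for raw .txt input.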

Options

`--featuretypes`	Specify the types of features to extract. Options: tokens, lemmas, pos. The lemmas and pos options only apply to input with extension .frog.json or .frog.jsondir. Multiple options can be given within quotes, separated by spaces (for example: 'tokens lemmas pos'). String parameter; default: tokens
`--ngrams`	Specify the lengths of the N-grams to include. Values should be given within quotes, separated by spaces (for example: '1 2'). String parameter; default: '1 2 3'
`--blackfeats`	Specify words to exclude from the feature space. The words should be given within quotes, separated by spaces (for example: 'do re mi'). Every N-gram in the feature space that includes any of the given blackfeats is removed. String parameter; default: False
`--lowercase`	Lowercase all text and lemma features. Boolean parameter; default: False
`--minimum-token-frequency`	Delete all features whose frequency is below the given threshold. Setting this to 5 or 10 is recommended when applying tfidf or infogain weighting in the Vectorize module. Integer parameter; default: 1
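To make the interaction of these options concrete, here is a minimal pure-Python sketch of N-gram extraction with lowercasing, blackfeats filtering, and a frequency threshold. It is not Quoll's implementation (which counts features with ColibriCore). The function names are hypothetical, and the sketch assumes the threshold applies to the total count of a feature across all documents.

```python
from collections import Counter


def extract_ngrams(tokens, ngrams=(1, 2, 3), lowercase=False, blackfeats=()):
    """Return the N-grams of one tokenized document as space-joined strings."""
    if lowercase:
        tokens = [token.lower() for token in tokens]
    black = set(blackfeats)
    features = []
    for n in ngrams:
        for start in range(len(tokens) - n + 1):
            gram = tokens[start:start + n]
            # Drop every N-gram that contains a blacklisted word.
            if black.intersection(gram):
                continue
            features.append(" ".join(gram))
    return features


def featurize(documents, minimum_token_frequency=1, **options):
    """Count N-grams per document, then drop features that are too rare.

    Assumption: the threshold is compared with the total corpus count.
    """
    per_document = [Counter(extract_ngrams(doc, **options)) for doc in documents]
    totals = Counter()
    for counts in per_document:
        totals.update(counts)
    keep = {feat for feat, count in totals.items() if count >= minimum_token_frequency}
    return [
        {feat: count for feat, count in counts.items() if feat in keep}
        for counts in per_document
    ]


if __name__ == "__main__":
    docs = [["The", "koala", "sleeps"], ["the", "koala", "eats"]]
    print(featurize(docs, minimum_token_frequency=2, ngrams=(1, 2), lowercase=True))
    print(featurize(docs, ngrams=(1, 2), lowercase=True, blackfeats=("koala",)))
```

In the first call only the features shared by both documents survive the threshold of 2. In the second call every N-gram containing 'koala' is removed, including the bigrams.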

Output

.features.npz:	Binary file in Numpy format, storing the extracted features per document in sparse format
.vocabulary.txt:	File that stores the index of each feature

Overview

--inputfile	--featuretypes	--ngrams	--blackfeats	--lowercase	--minimum-token-frequency	Output
docs.tok.txt	tokens	'1 2 3'	False	True	2	docs.tokens.n_1_2_3.min2.lower_True.black_False.features.npz docs.tokens.n_1_2_3.min2.lower_True.black_False.vocabulary.txt
docs.frog.jsondir	'tokens lemmas pos'	1	'koala kangaroo'	False	10	docs.tokens.lemmas.pos.n_1.min10.lower_False.black_koala_kangaroo.features.npz docs.tokens.lemmas.pos.n_1.min10.lower_False.black_koala_kangaroo.vocabulary.txt
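The naming pattern in the overview can be reproduced with a short sketch. The function `output_names` is a hypothetical helper written to match the two rows of the table. It is inferred from those examples only and is not taken from Quoll's source, so other parameter combinations may be named differently in practice.

```python
# Hypothetical helper (not part of Quoll): rebuild the output filenames shown
# in the overview table from the option values.
INPUT_EXTENSIONS = (".frog.jsondir", ".tok.txtdir", ".frog.json", ".tok.txt")


def output_names(inputfile, featuretypes="tokens", ngrams="1 2 3",
                 blackfeats=False, lowercase=False, minimum_token_frequency=1):
    """Return the (.features.npz, .vocabulary.txt) names for the given options."""
    for ext in INPUT_EXTENSIONS:
        if inputfile.endswith(ext):
            stem = inputfile[:-len(ext)]
            break
    else:
        raise ValueError("not a preprocessed input file: " + inputfile)
    # Space-separated option values are joined with '.' or '_' in the name.
    black = "_".join(blackfeats.split()) if blackfeats else "False"
    prefix = ".".join([
        stem,
        ".".join(featuretypes.split()),
        "n_" + "_".join(ngrams.split()),
        "min" + str(minimum_token_frequency),
        "lower_" + str(lowercase),
        "black_" + black,
    ])
    return prefix + ".features.npz", prefix + ".vocabulary.txt"


if __name__ == "__main__":
    print(output_names("docs.tok.txt", lowercase=True, minimum_token_frequency=2))
    print(output_names("docs.frog.jsondir", featuretypes="tokens lemmas pos",
                       ngrams="1", blackfeats="koala kangaroo",
                       minimum_token_frequency=10))
```

Because every option value is encoded in the filename, runs with different settings on the same input write to different output files.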

Examples of command line usage

Extract word N-grams from a tokenized text document, lowercasing them and removing N-grams that occur fewer than 5 times

$ luiginlp Featurize --module quoll.classification_pipeline.modules.featurize --inputfile docs.tok.txt --lowercase --minimum-token-frequency 5

Extract lemma and pos N-grams from a directory with frogged texts

$ luiginlp Featurize --module quoll.classification_pipeline.modules.featurize --inputfile docs.frog.jsondir --featuretypes 'lemmas pos'

Frog a text document, extract token and pos features, and remove any feature that contains the word 'snake'

$ luiginlp Featurize --module quoll.classification_pipeline.modules.featurize --inputfile docs.txt --frogconfig /mylamachinedir/share/frog/nld/frog.cfg --featuretypes 'tokens pos' --blackfeats snake
