The Vectorize module
==================================

The Vectorize module is the third module in the pipeline, taking care of weighting and pruning features based on characteristics in the training data. In contrast to the preliminary Featurize_ and Preprocess_ modules, this module requires separate train and test input.

Input
--------

*If the input to Preprocess_ (.txt or .txtdir) or Featurize (.tok.txt, .tok.txtdir, .frog.json or .frog.jsondir) is given as input to traininstances or testinstances, this module is ran prior to the Vectorize module.* 

--traininstances                            + Traininstances takes featurized or vectorized documents as input. They can come in the following formats:
                                            1. Extension **.features.npz** - Output of the Featurize module. 
                                            2. Extension **.vectors.npz** - Output of the Vectorize module (can be used for vectorizing test documents).
                                            3. Extension **.csv** - File with feature vectors formatted as comma-separated-values (useful when applying feature extraction that is not accomodated by quoll). When working with \'.csv\'-input, a file with featurenames should be created that has the same path and name as the \'.csv\'-file, replacing .csv with .featureselection.txt.

--trainlabels                               + Extension **.labels** - File with a label per line, should be as many instances as traininstances, where the position of the label corresponds to the instance on the same position.  

--testinstances                             + Like traininstances, test instances takes featurized or vectorized documents as input. They can come in the following formats:
                                            1. Extension **.features.npz** - Output of the Featurize module. 
                                            2. Extension **.csv** - File with feature vectors formatted as comma-separated-values (useful when applying feature extraction that is not accomodated by quoll). Can only be used when the input to traininstances is of the same format; the number of columns should be as many as for the traininstances input. Like 


Options
--------
*Options for Preprocess_ and Featurize_ also apply and are effective in combination with the inputfiles for these modules.* 

--weight                    + Specify the feature weighting
                            + Does not work in combination with a \'.csv\'-file
                            + Options: **frequency**, **binary**, **tfidf**, **infogain
                            + For tfidf or infogain, it is recommended to set minimum feature frequency to 5 or 10 in the Featurize module
                            + String parameter; default: **frequency**

--prune                     + Specify the number of features to maintain after pruning
                            + Does not work in combination with a \'.csv\'-file
                            + Pruning is done by ranking features based on their feature weight (total count of a feature is taken in case of \'frequency\' and \'binary\' weighting  
                            + Integer parameter; default: 5000
                            
--balance                   + Choose to balance the number of train instances
                            + The number of instances for each class label is decreased to the instance count of the least frequent class 
                            + Can help in case of strong class skewness
                            + Boolean parameter; default: **False**

--delimiter                 + Specify the delimiter by which columns in the \'csv\'-file are separated
                            + Only applies to \'.csv\'-files
                            + String parameter; default: **,**
                        
--scale                     + Option to normalize feature values to the same scale
                            + Only applies to \'.csv\'-file
                            + Useful in combination with some classifiers, if the features in the \'.csv\'-file are from different sources and have a wide range of values 
                            + Boolean parameter; default: **False**

Output
-------
:.balanced.features.npz:
  Balanced instances 
  Only applied when \'balance\' is chosen
:.balanced.labels:
  Labels related to balanced instances 
  Only applies when \'balance\' is chosen
:.balanced.vocabulary:
  Vocabulary related to balanced instances

Overview
--------

+------------------+-----------------------+---------------+--------------------+------------------+--------------------------------+---------------------------------------------------------------------------------------+
| --inputfile      | --featuretypes        | --ngrams      | --blackfeats       | --lowercase      | --minimum-token-frequency      | Output                                                                                |
+==================+=======================+===============+====================+==================+================================+=======================================================================================+
| docs.tok.txt     | tokens                | \'1 2 3\'     | False              | True             | 2                              | + docs.tokens.n_1_2_3.min2.lower_True.black_False.features.npz                        |
|                  |                       |               |                    |                  |                                | + docs.tokens.n_1_2_3.min2.lower_True.black_False.vocabulary.txt                      |                     
+------------------+-----------------------+---------------+--------------------+------------------+--------------------------------+---------------------------------------------------------------------------------------+
| dos.frog.jsondir | \'tokens lemmas pos\' | 1             | \'koala kangaroo\' | False            | 10                             | + docs.tokens.lemmas.pos.n_1.min10.lower_False.black_koala_kangaroo.features.npz      |
|                  |                       |               |                    |                  |                                | + docs.tokens.lemmas.pos.n_1.min10.lower_False.black_koala_kangaroo.vocabulary.txt    |
+------------------+-----------------------+---------------+--------------------+------------------+--------------------------------+---------------------------------------------------------------------------------------+

Examples of command line usage
--------

**Extract word Ngrams from tokenized text document, lowercasing them and stripping away token Ngrams that occur less than 5 times**

$ luiginlp Featurize --module quoll.classification_pipeline.modules.featurize --inputfile docs.tok.txt --lowercase --minimum-token-frequency 5

**Extract lemma and pos Ngrams from directory with frogged texts**

$ luiginlp Featurize --module quoll.classification_pipeline.modules.featurize --inputfile docs.frog.jsondir --featuretypes \'lemmas pos\'

**Frog text document, extract text and pos features and strip away any feature with the word \'snake\'**

$ luiginlp Featurize --module quoll.classification_pipeline.modules.featurize --inputfile docs.txt --frogconfig /mylamachinedir/share/frog/nld/frog.cfg --featuretypes \'tokens pos\' --blackfeats snake

.. _ColibriCore: https://proycon.github.io/colibri-core/
.. _Preprocess: preprocess.rst
.. _Vectorize: vectorize.rst