The Featurize module
====================

The Featurize module is the second module in the pipeline, taking care of feature extraction from the output of the Preprocess_ module. It makes use of ColibriCore_ to count features, and its output forms the input to the Vectorize_ module.

Input
-----

*If the input to Preprocess_ (.txt or .txtdir) is given as the inputfile, that module is run prior to the Featurize module.*

--inputfile
    + The Featurize module takes preprocessed documents as input. They can come in four formats:

      1. Extension **.tok.txt** - File with one tokenized text document per line.
      2. Extension **.tok.txtdir** - Directory with tokenized text documents (files ending with **.tok.txt**).
      3. Extension **.frog.json** - File with frogged text documents.
      4. Extension **.frog.jsondir** - Directory with frogged text documents (files ending with **.frog.json**).

Options
-------

--featuretypes
    + Specify the types of features to extract
    + Options: **tokens**, **lemmas**, **pos**
    + lemmas and pos only apply to input with extension *.frog.json* or *.frog.jsondir*
    + multiple options can be given within quotes, divided by a space (for example: *'tokens lemmas pos'*)
    + String parameter; default: **tokens**

--ngrams
    + Specify the lengths of the n-grams that you want to include
    + N-gram values should be given within quotes, divided by a space (for example: *'1 2'*)
    + String parameter; default: **'1 2 3'**

--blackfeats
    + In order to exclude words from the feature space, specify them here
    + The features to be excluded should be given within quotes, divided by a space (for example: *'do re mi'*)
    + Each n-gram in the feature space that includes any of the given blackfeats will be removed
    + String parameter; default: **False**

--lowercase
    + Choose to lowercase all token and lemma features
    + Boolean parameter; default: **False**

--minimum-token-frequency
    + Delete all features that occur fewer times than the given threshold
    + Recommended to set to 5 or 10 when applying tfidf or infogain weighting in the Vectorize_ module
    + Integer parameter; default: **1**

Output
------

:.features.npz: Binary file in NumPy format, storing the extracted features per document in sparse format
:.vocabulary.txt: File that stores the index of each feature
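The snippet below is a minimal sketch of how these two output files could be inspected from Python. It assumes that the **.features.npz** file can be read with *scipy.sparse.load_npz* and that **.vocabulary.txt** lists one feature per line, in column order; the actual on-disk layout produced by the pipeline may differ, so adjust the loading calls where needed. ::

    from scipy import sparse

    # Load the sparse document-by-feature matrix
    # (assumption: the file is readable with scipy.sparse.load_npz)
    features = sparse.load_npz('docs.tokens.n_1_2_3.min2.lower_True.black_False.features.npz')

    # Load the vocabulary (assumption: line i holds the feature at column i)
    with open('docs.tokens.n_1_2_3.min2.lower_True.black_False.vocabulary.txt', encoding='utf-8') as infile:
        vocabulary = [line.strip() for line in infile]

    print(features.shape)    # (number of documents, number of features)
    print(vocabulary[:10])   # the first ten features in the feature space

In this way the extracted feature space can be checked before it is handed to the Vectorize_ module.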
Overview
--------

+-------------------+---------------------+----------+------------------+-------------+----------------------------+------------------------------------------------------------------------------------+
| --inputfile       | --featuretypes      | --ngrams | --blackfeats     | --lowercase | --minimum-token-frequency  | Output                                                                             |
+===================+=====================+==========+==================+=============+============================+====================================================================================+
| docs.tok.txt      | tokens              | '1 2 3'  | False            | True        | 2                          | + docs.tokens.n_1_2_3.min2.lower_True.black_False.features.npz                     |
|                   |                     |          |                  |             |                            | + docs.tokens.n_1_2_3.min2.lower_True.black_False.vocabulary.txt                   |
+-------------------+---------------------+----------+------------------+-------------+----------------------------+------------------------------------------------------------------------------------+
| docs.frog.jsondir | 'tokens lemmas pos' | 1        | 'koala kangaroo' | False       | 10                         | + docs.tokens.lemmas.pos.n_1.min10.lower_False.black_koala_kangaroo.features.npz   |
|                   |                     |          |                  |             |                            | + docs.tokens.lemmas.pos.n_1.min10.lower_False.black_koala_kangaroo.vocabulary.txt |
+-------------------+---------------------+----------+------------------+-------------+----------------------------+------------------------------------------------------------------------------------+

Examples of command line usage
------------------------------

**Extract word n-grams from a tokenized text document, lowercasing them and stripping away token n-grams that occur fewer than 5 times**

$ luiginlp Featurize --module quoll.classification_pipeline.modules.featurize --inputfile docs.tok.txt --lowercase --minimum-token-frequency 5

**Extract lemma and pos n-grams from a directory with frogged texts**

$ luiginlp Featurize --module quoll.classification_pipeline.modules.featurize --inputfile docs.frog.jsondir --featuretypes 'lemmas pos'

**Frog a text document, extract token and pos features, and strip away any feature that includes the word 'snake'**

$ luiginlp Featurize --module quoll.classification_pipeline.modules.featurize --inputfile docs.txt --frogconfig /mylamachinedir/share/frog/nld/frog.cfg --featuretypes 'tokens pos' --blackfeats snake

.. _ColibriCore: https://proycon.github.io/colibri-core/
.. _Preprocess: preprocess.rst
.. _Vectorize: vectorize.rst