Text features input format for classification algorithms in scikit-learn

I'm starting to use scikit-learn to do some NLP. I've already used some classifiers from NLTK, and now I want to try the ones implemented in scikit-learn.

My data is basically sentences, and I extract features from some words of those sentences to do some classification task. Most of my features are nominal: part-of-speech (POS) of the word, word-to-the-left, POS word-to-the-left, word-to-the-right, POS word-to-the-right, syntactic relations path from one word to another, etc.

When I ran experiments with the NLTK classifiers (Decision Tree, Naive Bayes), each feature set was just a dictionary mapping feature names to their nominal values, for example: [ {"postag":"noun", "wleft":"house", "path":"VPNPNP",...},.... ]. I just had to pass this to the classifiers and they did their job.

This is part of the code used:

def train_classifier(self):
    if self.reader is None:
        raise ValueError("No reader was provided for accessing training instances.")

    # Get the argument candidates
    argcands = self.get_argcands(self.reader)

    # Extract the necessary features from the argument candidates,
    # labelling each candidate as either "NULL" or "ARG"
    training_argcands = []
    for argcand in argcands:
        if argcand["info"]["label"] == "NULL":
            training_argcands.append( (self.extract_features(argcand), "NULL") )
        else:
            training_argcands.append( (self.extract_features(argcand), "ARG") )

    # Train the appropriate supervised model (NLTK's DecisionTreeClassifier)
    self.classifier = DecisionTreeClassifier.train(training_argcands)


Here's an example of one of the feature sets extracted:

[({'phrase': u'np', 'punct_right': 'NULL', 'phrase_left-sibling': 'NULL', 'subcat': 'fcl=np np vp np pu', 'pred_lemma': u'revelar', 'phrase_right-sibling': u'np', 'partial_path': 'vp fcl', 'first_word-postag': 'Bras\xc3\xadlia PROP', 'last_word-postag': 'Bras\xc3\xadlia PROP', 'phrase_parent': u'fcl', 'pred_context_right': u'um', 'pred_form': u'revela', 'punct_left': 'NULL', 'path': 'vp\xc2\xa1fcl!np', 'position': 0, 'pred_context_left_postag': u'ADV', 'voice': 0, 'pred_context_right_postag': u'ART', 'pred_context_left': u'hoje'}, 'NULL')]

As I mentioned before, most of the features are nominal (a string value).

Now I want to try out the classifiers in the scikit-learn package. As I understand it, feature sets of this type are not accepted by the algorithms implemented in sklearn, since all feature values must be numeric and must be in an array or matrix. Therefore, I transformed the "original" feature sets using the DictVectorizer class. However, when I pass these transformed vectors, I get the following errors:

# With DecisionTreeClassifier
Traceback (most recent call last): 
  File "/usr/local/lib/python2.7/dist-packages/sklearn/tree/tree.py", line 458, in fit
    X = np.asarray(X, dtype=DTYPE, order='F')
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 235, in asarray
    return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number

# With GaussianNB

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py", line 156, in fit
    n_samples, n_features = X.shape
ValueError: need more than 0 values to unpack

I get these errors when I just use DictVectorizer(). However, if I use DictVectorizer(sparse=False), I get the errors even before the code gets to the training part:

Traceback (most recent call last):
train_argcands_feats = self.feat_vectorizer.fit_transform(train_argcands_feats)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/dict_vectorizer.py", line 123, in fit_transform
    return self.transform(X)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/dict_vectorizer.py", line 212, in transform
    Xa = np.zeros((len(X), len(vocab)), dtype=dtype)
ValueError: array is too big.

Because of this error, it's obvious that a sparse representation has to be used.

So the question is: how do I transform my nominal features so as to use the classification algorithms provided by scikit-learn?

Thanks in advance for all the help you could give me.


As suggested by an answer below, I tried to use the NLTK wrapper for scikit-learn. I just changed the code line that creates the classifier:

self.classifier = SklearnClassifier(DecisionTreeClassifier())

Then, when I call the "train" method I get the following:

File "/usr/local/lib/python2.7/dist-packages/nltk/classify/scikitlearn.py", line 100, in train
    X = self._convert(featuresets)
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/scikitlearn.py", line 109, in _convert
    return self._featuresets_to_coo(featuresets)
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/scikitlearn.py", line 126, in _featuresets_to_coo
ValueError: could not convert string to float: np

So, apparently, the wrapper can't create the sparse matrix because the features are nominal. Then, I'm back to the original problem.


ValueError: array is too big. is quite explicit: you cannot allocate a dense array of shape (n_samples, n_features) in memory. It's useless (and, in your case, impossible) to store that many zeros in a contiguous chunk of memory. Use a sparse data structure instead, as in the DictVectorizer documentation.
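
To get a feel for why the dense allocation fails, here is a rough back-of-the-envelope sketch in Python 3. The sample count and vocabulary size are made-up placeholders, not numbers from the question; the 19 non-zeros per row match the example feature set, and the storage cost of one 8-byte float plus two 4-byte indices per stored entry is an assumption about a COO-style layout.

```python
# Hypothetical sizes for illustration only; the question does not state them.
n_samples = 100_000      # assumed number of training instances
n_features = 500_000     # assumed vocabulary size after one-of-K coding
nonzeros_per_row = 19    # one active column per feature in the example dict

BYTES_PER_FLOAT64 = 8
BYTES_PER_INT32 = 4

# Dense: every cell is stored, zeros included.
dense_bytes = n_samples * n_features * BYTES_PER_FLOAT64

# Sparse (COO-like): one value plus a row index and a column index per non-zero.
nnz = n_samples * nonzeros_per_row
sparse_bytes = nnz * (BYTES_PER_FLOAT64 + 2 * BYTES_PER_INT32)

print("dense:  %.1f GB" % (dense_bytes / 1e9))
print("sparse: %.1f MB" % (sparse_bytes / 1e6))
```

Under these assumed sizes the dense array would need about 400 GB, against roughly 30 MB for the sparse form.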

Also, if you prefer the NLTK API, you can use its scikit-learn integration instead of scikit-learn's DictVectorizer:


Have a look at the end of the file.

The problem with the NLTK wrapper for scikit-learn is that it actually wants dicts mapping feature names to numeric values, so that's not going to solve the problem in this case. The DictVectorizer in scikit-learn is more sophisticated, in that it does a "one-of-K" coding when it encounters string feature values; here's how you can use it:

>>> data = [({'first_word-postag': 'Bras\xc3\xadlia PROP',
   'last_word-postag': 'Bras\xc3\xadlia PROP',
   'partial_path': 'vp fcl',
   'path': 'vp\xc2\xa1fcl!np',
   'phrase': u'np',
   'phrase_left-sibling': 'NULL',
   'phrase_parent': u'fcl',
   'phrase_right-sibling': u'np',
   'position': 0,
   'pred_context_left': u'hoje',
   'pred_context_left_postag': u'ADV',
   'pred_context_right': u'um',
   'pred_context_right_postag': u'ART',
   'pred_form': u'revela',
   'pred_lemma': u'revelar',
   'punct_left': 'NULL',
   'punct_right': 'NULL',
   'subcat': 'fcl=np np vp np pu',
   'voice': 0},
  'NULL')]

Break this list into two lists, one containing samples, the other the corresponding labels:

>>> samples, labels = zip(*data)

Pass the samples to DictVectorizer.fit_transform (you can optionally pass the labels as well in a separate argument, but they will be ignored):

>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer()
>>> X = v.fit_transform(samples)
>>> X
<1x19 sparse matrix of type '<type 'numpy.float64'>'
    with 19 stored elements in COOrdinate format>
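
To make the "one-of-K" idea concrete, here is a small hand-rolled sketch in Python 3. It is only an illustration of the encoding scheme, not DictVectorizer's actual implementation; the helper name one_of_k is hypothetical, and the sample is a shortened version of the feature set above.

```python
def one_of_k(sample):
    """Hypothetical helper: encode one feature dict with one-of-K coding."""
    encoded = {}
    for name, value in sample.items():
        if isinstance(value, str):
            # A string value becomes its own indicator feature "name=value".
            encoded["%s=%s" % (name, value)] = 1.0
        else:
            # A numeric value keeps its name and its value.
            encoded[name] = float(value)
    return encoded


# Shortened version of the example feature set (three of the 19 features).
sample = {"phrase": "np", "pred_context_left_postag": "ADV", "position": 0}

for column, value in sorted(one_of_k(sample).items()):
    print(column, value)
```

Each distinct (feature, string value) pair gets its own column, which is why the vocabulary grows so quickly with lexical features such as neighbouring words, and why the result is best kept sparse.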

You should then be able to pass X to a scikit-learn classifier that accepts sparse input. GaussianNB does not do that, as @ogrisel already pointed out. For NLP tasks, you'll want to use MultinomialNB or BernoulliNB, since those are designed specifically for discrete data.
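
Putting the pieces together, a minimal end-to-end sketch might look like the following (Python 3, recent scikit-learn assumed). The four training instances are invented toy data, not taken from the question, and chaining the two steps with make_pipeline is just one convenient way to do it.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration (not from the question).
train = [
    ({"phrase": "np", "pred_context_left_postag": "ADV", "position": 0}, "ARG"),
    ({"phrase": "np", "pred_context_left_postag": "ART", "position": 1}, "ARG"),
    ({"phrase": "vp", "pred_context_left_postag": "ADV", "position": 0}, "NULL"),
    ({"phrase": "vp", "pred_context_left_postag": "ART", "position": 1}, "NULL"),
]
samples, labels = zip(*train)

# DictVectorizer produces a sparse matrix by default, which MultinomialNB accepts.
model = make_pipeline(DictVectorizer(), MultinomialNB())
model.fit(list(samples), list(labels))

# Feature dicts for new instances go through the same fitted vectorizer.
predictions = model.predict([
    {"phrase": "np", "pred_context_left_postag": "ADV", "position": 0},
    {"phrase": "vp", "pred_context_left_postag": "ART", "position": 1},
])
print(list(predictions))
```

Keeping the vectorizer and the classifier in one pipeline means new feature dicts are encoded with the vocabulary learned during training, so the columns always line up.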
