Support for sparse matrices in scikits.learn
I recently added support for sparse matrices (as defined in
scipy.sparse) in some classifiers of scikits.learn.
In those classes, the fit method will perform the algorithm without
converting to a dense representation and will also store parameters in
an efficient format.
Right now, the only classese that implements this is SVC and LinearSVC
in scikits.learn.svm.sparse, although the plan is to add more classes in
the future. These are capable of taking sparse matrices in the fit()
method and will also store support vectors as sparse matrices.
Here is an example. We first create a toy dataset and import relevant
modules:
In [2]: from scikits.learn. import svm
In [3]: X, Y = scipy.sparse.csr_matrix([[0, 0], [0, 1]]), [0, 1]
In [4]: clf = svm.sparse.SVC(kernel='linear')
now we will fit the model and query some of its parameters:
Out[5]:
SVC(kernel='linear', C=1.0, probability=0, shrinking=1, eps=0.001,
cache_size=100.0,
coef0=0.0,
gamma=0.0)
In [6]: clf.support_
Out[6]:
<2x2 sparse matrix of type '<type 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Row format>
In [7]: clf.coef_
Out[7]:
<1x2 sparse matrix of type '<type 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Row format>
For a more complete example, you can look at Classification
of text documents using sparse features, contributed by Olivier Grisel.