Smells like hacker spirit
11 February 2011
Last weekend I was at FOSDEM presenting scikits.learn (here are the slides I used at the Data Analytics Devroom). Kudos to Olivier Grisel and all the people who organized such a fun and authentic meeting!
31 December 2010
The latest release of scikits.learn comes with an awesome collection of examples. These are some of my favorites:
This example by Olivier Grisel downloads a 58MB faces dataset from Labeled Faces in the Wild, then performs PCA for feature extraction and SVC for classification, yielding a very acceptable 0.85 f1-score.
This example by Peter Prettenhofer models the geographical distribution of two South American mammals given past observations and 14 environmental variables.
This example, again by Peter Prettenhofer and based on matplotlib and Tk, lets you draw data points on a canvas and interactively shows the decision function of the SVM classifier. See this video for a small showcase (music by Joe Crepúsculo can be downloaded here).
29 November 2010
Based on the work of libsvm-dense by Ming-Wei Chang, Hsuan-Tien Lin, Ming-Hen Tsai, Chia-Hua Ho and Hsiang-Fu Yu, I patched the libsvm distribution shipped with scikits.learn to allow setting weights for individual instances.
The motivation behind this is to be able to force a classifier to focus its attention on some samples rather than others. This example shows how different weights modify the decision function:
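One way to read per-instance weights is as a rescaling of each sample's penalty in the soft-margin objective. The sketch below is not the patched libsvm code; it only evaluates a weighted hinge loss on made-up decision values, and the function name and all numbers are assumptions chosen for illustration.

```python
def weighted_hinge_loss(scores, labels, weights):
    """Sum of per-sample hinge losses, each scaled by its own weight."""
    return sum(w * max(0.0, 1.0 - y * s)
               for s, y, w in zip(scores, labels, weights))


# Hypothetical decision values f(x_i) and true labels in {-1, +1}
scores = [0.5, -2.0, 0.2]
labels = [1, -1, -1]

# With uniform weights every margin violation costs the same
uniform = weighted_hinge_loss(scores, labels, [1.0, 1.0, 1.0])

# Giving the third sample a weight of 10 makes its violation dominate,
# so an optimizer would be pushed to classify that sample correctly first
boosted = weighted_hinge_loss(scores, labels, [1.0, 1.0, 10.0])
```

The misclassified third sample contributes ten times more to `boosted` than to `uniform`, which is the sense in which a larger weight draws the classifier's attention.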
24 November 2010
Highlights for this release:
* New stochastic gradient descent module by Peter Prettenhofer
* Improved svm module: memory efficiency, automatic class weights.
* Wrap for liblinear’s Multi-class SVC (option multi_class in LinearSVC)
* New features and performance improvements of text feature extraction.
* Improved sparse matrix support, both in the main classes (GridSearch) and in the sparse modules: scikits.learn.svm.sparse and scikits.learn.glm.sparse.
* Lots of cool new examples: (here, here and here)
* New Gaussian Process module by Vincent Dubourg (still to be merged)
* Faster implementation of the LARS algorithm.
* Probability estimates for logistic regression.
* Lots of bug fixes and documentation improvements.
* Probably other things I am forgetting …
19 November 2010
scikits.learn.svm now uses LibSVM-dense instead of LibSVM for some support vector machine related algorithms when input is a dense matrix.
As a result, most of the copies associated with argument passing are avoided. The memory footprint is 50% smaller than before, and several times smaller than that of the python bindings that ship with libsvm, which store data in the very inefficient python list structure. On the performance side I didn’t see any significant difference, although on large datasets a smaller memory footprint can make the difference between swapping or not.
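To get a rough feel for why a python list is a costly container for numeric data, the sketch below compares the bytes used by a list of floats with those of a NumPy float64 array holding the same values. It assumes CPython and NumPy, and it measures neither libsvm nor scikits.learn themselves.

```python
import sys

import numpy as np

n = 1000
values = [float(i) for i in range(n)]

# A list stores one pointer per element plus a separate float object each
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# A float64 array stores the raw 8-byte values contiguously
array_bytes = np.asarray(values, dtype=np.float64).nbytes

print("list: %d bytes, array: %d bytes" % (list_bytes, array_bytes))
```

The exact list figure depends on the interpreter build, so only the comparison between the two numbers is meaningful.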
30 October 2010
For some time now I’ve been missing a function in scipy that exploits the triangular structure of a matrix to efficiently solve the associated system, so I decided to implement it by binding the LAPACK method “trtrs”, which also checks for singularities and is capable of handling several right-hand sides.
Contrary to what I expected, binding Fortran code with f2py is pretty straightforward, even for someone like me who has never programmed in that language: I took a similar example, modified its parameters and it worked! Also, thanks to Pauli Virtanen the review process was really fast and the patch was committed within a few hours.
The high level interface for LAPACK’s trtrs is linalg.solve_triangular, which accepts roughly the same arguments as linalg.solve, but assumes the first argument is a triangular matrix:
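As a small illustration of the call, here is a sketch using a made-up 3x3 lower-triangular system; the matrix and right-hand side are arbitrary values picked so the solution is easy to check by hand.

```python
import numpy as np
from scipy import linalg

# Lower-triangular system A x = b (values chosen for illustration)
A = np.array([[3.0, 0.0, 0.0],
              [2.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
b = np.array([6.0, 5.0, 4.0])

# lower=True tells the solver which triangle of A holds the data
x_tri = linalg.solve_triangular(A, b, lower=True)

# The general solver gives the same answer without using the structure
x_full = linalg.solve(A, b)
```

Both calls return the same vector; the difference is only in how much work is done to get it.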
Simple benchmarks let us clearly appreciate the complexity gap between both methods: solving an (n, n) triangular system is an O(n^2) operation, while solving a full one is at least O(n^3).
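The quadratic cost can be seen directly in forward substitution, sketched here in plain Python. This is not the LAPACK routine and it does no singularity check; the function name and the operation counter are only there for illustration.

```python
def solve_lower_triangular(L, b):
    """Forward substitution; returns (x, number of multiply-subtract steps)."""
    n = len(b)
    x = [0.0] * n
    ops = 0
    for i in range(n):
        acc = b[i]
        # Row i only needs the i unknowns that are already solved
        for j in range(i):
            acc -= L[i][j] * x[j]
            ops += 1
        x[i] = acc / L[i][i]  # assumes a nonzero diagonal
    return x, ops


L = [[3.0, 0.0, 0.0],
     [2.0, 1.0, 0.0],
     [1.0, 0.0, 1.0]]
x, ops = solve_lower_triangular(L, [6.0, 5.0, 4.0])
```

The inner loop runs 0 + 1 + ... + (n - 1) = n(n - 1)/2 times in total, which is where the O(n^2) figure comes from.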
30 September 2010
Lately I’ve been working with Alexandre Gramfort on coding the LARS algorithm in scikits.learn. This algorithm computes the solution to several general linear models used in machine learning: LAR, Lasso, ElasticNet and Forward Stagewise.
Unlike the implementation by coordinate descent, the LARS algorithm gives the full coefficient path along the regularization parameter, and thus it is especially well suited for performing model selection.
The algorithm is coded mostly in python, with some tiny parts in C (because I already had the code for cholesky deletes in C) and a cython interface for the blas function dtrsv, which will be proposed to scipy once I stabilize this code. The algorithm is mostly complete and allows some optimizations, like using a precomputed Gram matrix or specifying a maximum number of features/iterations, but it could still be extended to compute other models, like ElasticNet or Forward Stagewise.
I haven’t done any benchmarks yet, but preliminary ones by Alexandre Gramfort showed that it is roughly equivalent to this Matlab implementation. Using PyMVPA, it shouldn’t be difficult to benchmark it against the R implementation, though.
12 September 2010
Last week the second scikits.learn sprint took place in Paris. It was
two days of insane activity (115 commits, 6 branches, 33 coffees) in
which we did a lot of work, both implementing new algorithms and fixing
or improving old ones. This includes:
* sparse version of Lasso by coordinate descent. Not (yet) merged into master, but it can be viewed in Olivier’s branch.
* new API for Pipeline. An example of this can be found in the document SVM-Anova: SVM with univariate feature selection.
* documentation for the bayesian methods and cross validation: Vincent Michel contributed a lot of documentation, mainly taken from chapters of his thesis.
* Ledoit-Wolf covariance estimation.
* Pure python Fast ICA implementation.
And the family picture, featuring (from left to right): Alexandre Gramfort, Bertrand Thirion, Virginie Fritsch, Gael Varoquaux, Vincent Michel, Olivier Grisel and me (taking the picture).

23 August 2010
I recently added support for sparse matrices (as defined in
scipy.sparse) in some classifiers of scikits.learn.
In those classes, the fit method will perform the algorithm without
converting to a dense representation and will also store parameters in
an efficient format.
Right now, the only classes that implement this are SVC and LinearSVC
in scikits.learn.svm.sparse, although the plan is to add more classes in
the future. These are capable of taking sparse matrices in the fit()
method and will also store support vectors as sparse matrices.
Here is an example. We first create a toy dataset and import relevant
modules:
now we will fit the model and query some of its parameters:
For a more complete example, you can look at Classification
of text documents using sparse features, contributed by Olivier Grisel.
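A toy dataset of the kind described above could be built as follows. This is a sketch that only uses NumPy and scipy.sparse; the values are made up, and the fitting step is left out because it depends on the scikits.learn version at hand.

```python
import numpy as np
from scipy import sparse

# Mostly-zero feature matrix and its labels (values chosen for illustration)
X_dense = np.array([[0.0, 0.0, 1.0],
                    [0.0, 2.0, 0.0],
                    [3.0, 0.0, 0.0],
                    [0.0, 0.0, 4.0]])
y = np.array([0, 0, 1, 1])

# CSR format keeps only the nonzero entries and their positions
X = sparse.csr_matrix(X_dense)

print(X.shape, X.nnz)
```

An object like `X` is what the sparse classifiers would receive in fit(), so the zeros never need to be materialized.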
18 August 2010
I often find myself debugging python C extensions from gdb, but usually some variables are hidden because of the aggressive optimizations that distutils sets by default. What I did not know is that you can prevent those optimizations by passing flags
and your extension becomes much easier to debug from gdb.
30 July 2010
27 May 2010
It is now possible (using the development version as of May 2010) to use Support Vector Machines with custom kernels in scikits.learn.
Using it couldn’t be simpler: you just pass a callable (the kernel) to the class constructor. For example, a linear kernel would be implemented as follows:
The only requirement for defining a kernel is that it should take two numpy arrays as arguments and return a numpy array. Then you would pass the kernel to the classifier’s constructor:
and that’s all. The constructor recognizes this as a custom kernel and you can then use the classifier as any other classifier.
For a complete reference, see the reference manual and an example.
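A linear kernel of the kind described above can be sketched with NumPy alone. The name my_kernel and the sample data are illustrative, and the constructor call is shown only as a comment because the keyword name is an assumption and it needs scikits.learn installed.

```python
import numpy as np


def my_kernel(x, y):
    """Linear kernel: inner products between the rows of x and the rows of y."""
    return np.dot(x, y.T)


X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# gram[i, j] is the inner product of X[i] and X[j]
gram = my_kernel(X, X)

# The callable would then be handed to the classifier, for example:
# clf = svm.SVC(kernel=my_kernel)   # keyword name assumed, not executed here
```

Any function with this shape contract (two arrays in, one array of pairwise values out) fits the requirement stated in the post.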
22 April 2010
If your numpy installation uses system-wide BLAS libraries (this will most likely be the case unless you installed it through prebuilt windows binaries), you can retrieve this information at compile time to link python modules to BLAS.
The function get_info in numpy.distutils.system_info will return a dictionary that contains the needed information to link against BLAS or an empty dict if no system-wide BLAS could be found. For example, MacOSX ships with its own optimized BLAS routines, and get_info correctly reports that:
The following example shows a setup.py that links against system-wide BLAS if possible. If no appropriate BLAS routine could be found, it will print a warning message, but will compile its own BLAS routine and embed it in the python extension.
A real-world example of this can be found in the scipy.odr module and in scikits.learn’s liblinear bindings.
22 March 2010
Today I released a new version of the scikits.learn library for machine learning.
This new release includes the new libsvm bindings, Jake VanderPlas’ BallTree algorithm for *fast* nearest neighbor queries in high dimension, etc. Here is the official announcement.
As usual, it can be downloaded from sourceforge or from PyPI.
17 March 2010
Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support vector machines, a data point is viewed as a p-dimensional vector (2-dimensional in this example), and we want to know whether we can separate such points with a (p-1)-dimensional hyperplane (a line in our case). There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane and the linear classifier it defines is known as a maximum margin classifier.
Using the new svm module in scikits.learn, you can easily plot the maximum margin hyperplane. If clf is an instance of svm.SVC(), the coefficients in the decision function are stored in clf.coef_, and the independent term in clf.rho_. The complete source code is:
And the result is
where the vectors that are closest to the separating line are highlighted with a small ‘+’.
Up-to-date code for this can be found in the examples/ directory of scikits.learn.
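Once the coefficients and the independent term are known, turning them into a line to plot is simple arithmetic. The sketch below uses made-up values for w and b and assumes the decision function has the form f(x) = w[0]*x[0] + w[1]*x[1] + b; how that sign convention maps onto clf.coef_ and clf.rho_ is not asserted here.

```python
import math

# Hypothetical coefficients of a 2-D linear classifier (not fitted values)
w = (1.0, 1.0)
b = -3.0


def decision(point):
    """Signed score of a point; the sign gives the predicted side."""
    return w[0] * point[0] + w[1] * point[1] + b


# Points with decision(point) == 0 form the line x1 = slope * x0 + offset
slope = -w[0] / w[1]
offset = -b / w[1]

# Distance between the two margin lines decision == -1 and decision == +1
margin = 2.0 / math.hypot(*w)
```

Maximizing the margin is therefore the same as making the norm of w as small as possible while keeping every training point on the correct side.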
9 March 2010
LibSVM is a C++ library that implements several Support Vector Machine algorithms that are commonly used in machine learning. It is a fast library that has no dependencies and most machine learning frameworks bind it in some way or another. LibSVM comes with a Python interface written in swig, but this interface is inherently slow as it does not take into account numpy’s array structure. Also, it does not wrap all the library’s functionality. Some projects use these bindings and others (such as PyMVPA) write their own wrapper, binding some methods directly to numpy’s array structure.
My approach was to code all algorithms that convert libsvm’s data structures (sparse) to numpy arrays (dense) in pure C and wrap them in a very thin Cython layer. Special attention was given to minimizing the overhead of converting between libsvm data structures and numpy arrays, as in my opinion this was the main source of bad performance in existing python bindings.
As a first benchmark, I supposed a situation in which the dimension of the subspace is small and there are lots of points to classify. This is typically the case when your data are points in the plane or in space and you want to draw the decision function by classifying every point in a grid. In this case, the bottleneck is not the classification algorithm, but the conversion of data between the dense representation used by python and numpy and the sparse representation used by libsvm. Not surprisingly, we get huge performance gains if we speed up the dense/sparse conversion.
In the case of a huge number of dimensions, the speedup is not so spectacular, but we also get a performance boost by making training somewhat faster.
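To make the dense-to-sparse conversion discussed above concrete, here is a pure-python sketch of what has to happen for every row. It mimics libsvm's (index, value) node layout with 1-based indices and a terminating index of -1; the function name is made up, and the real bindings do this in C rather than in python.

```python
def dense_to_libsvm_nodes(row):
    """Convert one dense row to a list of (index, value) pairs, libsvm style."""
    # Keep only nonzero entries; indices are 1-based by libsvm convention
    nodes = [(i + 1, v) for i, v in enumerate(row) if v != 0.0]
    # libsvm marks the end of a vector with index -1
    nodes.append((-1, 0.0))
    return nodes


nodes = dense_to_libsvm_nodes([0.0, 0.1, 0.2, 0.0])
```

Doing this in an interpreted loop for every point of a dense grid is the kind of per-sample overhead that ends up dominating when the classifier itself is cheap.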
A feature that was needed and that I haven’t found in other implementations is that you can tweak parameters in the SVM class and the classifier will reflect those changes (i.e. parameters are actually copied back and forth, not just passed as an opaque pointer).
Suppose you train an instance of the classifier and are interested in the coefficients that multiply the support vectors in the decision function. In scikits.learn, you can access this array under field .coef_:
>>> import numpy as np
>>> from scikits.learn import svm
>>> clf = svm.SVM()
>>> clf.fit([[1,2], [3,4]], [-1, 1])
>>> clf.coef
clf.coef0 clf.coef_
>>> clf.coef_
array([[ 1., -1.]])
Now, changing the value of these coefficients effectively changes the decision function:
>>> clf.predict([[1,2]])
array([ -1.])
>>> clf.coef_ = np.array([[0.0, -1.0]])
>>> clf.predict([[1,2]])
array([ 1.])
All code can be found in the scikit (you’ll have to get the svn version), in the files scikits/learn/svm.py and scikits/learn/src/. All plots are generated from this script.
In the benchmarks, a Linear Kernel was used, as it is the most common. Other more computationally intensive kernels would probably narrow the difference.
This code should be treated as alpha quality and has not been extensively tested. Please report any bugs that you encounter to the tracker.
4 March 2010
Yesterday we had an extremely productive coding sprint for scikits.learn. The idea was to put people with common interests in a room and make them work on a single codebase.
Alexandre Gramfort and Olivier Grisel worked on GLMNet, Bertrand Thirion and Gaël Varoquaux worked on univariate feature selection and Vincent worked on Bayesian Regression.
I was supposed to work with Vincent, but as soon as Bertrand spotted some bugs in my libsvm bindings, I could not think of anything except that, and eventually the day finished just as I fixed the bug …
You can find some cool examples of the things we did in the examples directory:
1 February 2010
Today I released the first public version of Scikit-Learn (release notes).
It’s a python module implementing some machine learning algorithms, and it’s shaping up quite well. For this release I did not want to make any incompatible changes, so most of them are just bug fixes and updates.
For the next release, however, some more radical changes are planned, and definitely something should be done about the (incredibly long) namespace: having to type from scikits.learn.machine.manifold_learning.regression.neighbors import Neighbors each time you want to run a nearest-neighbor algorithm is just not practical!
Here is a nice screenshot,
7 January 2010
This week we created a sourceforge project to host our development of scikit-learn. Although the project already had a directory in scipy’s repo, we needed more flexibility in the user management and in the mailing list creation, so we opted for SourceForge.
To be honest, after using git and Google Code for bug tracking, I was not very excited about using subversion/sourceforge again. On the other hand, we needed some sort of compromise that would allow a very heterogeneous range of developers to work together, and after some (surprisingly civilized) emails and some chatting with Gael, we agreed that SourceForge was indeed the best choice.
In case you are interested, there’s a (preliminary) web page with more info. You might also want to have a look at the previous project’s web page.
5 January 2010