Fast bindings for LibSVM in scikits.learn

9 March 2010

LibSVM is a C++ library that implements several Support Vector Machine algorithms that are commonly used in machine learning. It is a fast library that has no dependencies and most machine learning frameworks bind it in some way or another. LibSVM comes with a Python interface written in swig, but this interface is inherently slow as it does not take into account numpy’s array structure. Also, it does not wrap all the library’s functionality. Some projects bind it using this bindings and other (such as PyMVPA) make its own wrap, binding some methods directly to numpy’s array structure.

My approach was to code all algorithms that convert libsvm’s data structures (sparse) to numpy arrays (dense) in pure C and wrap them in a very thin Cython layer. Special attention was given to minimize the overhead of converting between libsvm data structures and numpy arrays, as in my opinion this was the main source of bad performance in existing python bindings.

Benchmarks

As a first benchmark, I supposed a situation in which the dimension of the subspace is small and there are lots of points to classify. This is typically the case when your data is points in plane or in space and you want to draw the decision function by classifying every point in the grid. In this case, the bottleneck is not the classification algorithm, but the conversion of data from a dense representation used by python and numpy and a sparse representation used by libsvm. Not surprisingly, we get huge performance gains if we speed up the conversion dense/sparse.

bench1

Course of dimensionality

In the case of a huge number of dimensions, the speedup is not so spectacular, but we also get a performance boost by making training somewhat faster.

Benchmarks of different bindings of LibSVM

Bidirectional mapping

A feature that was needed and that I haven’t found on other implementations is that you can tweak parameters in the SVM class and the classifier will reflect those changes (i.e. parameters are actually copied back and forth, not just passed as an opaque pointer).

Suppose you train an instance of the classifier and are interested in the coefficients that multiply the support vectors in the decision function. In scikits.learn, you can access this array under field .coef_:

>>> import numpy as np
>>> from scikits.learn import svm
>>> clf = svm.SVM()
>>> clf.fit([[1,2], [3,4]], [-1, 1])

>>> clf.coef
clf.coef0 clf.coef_
>>> clf.coef_
array([[ 1., -1.]])

Now, changing the value of these coefficients effectively changes the decision function:

>>> clf.predict([[1,2]])
array([ -1.])
>>> clf.coef_ = np.array([[0.0, -1.0]])
>>> clf.predict([[1,2]])
array([ 1.])

Code

All code can be found in the scikit (you’ll have to get the svn version), in file scikits/learn/svm.py and scikits/learn/src/. All plots are generated from this script

Notes

In the benchmarks, a Linear Kernel was used, as it is the most common. Other more computationally intensive kernels would probably narrow the difference.

Bugs

This code should be treated as alpha quality and has not being extensively tested. Please report any bugs that you encounter to the tracker

scikits.learn coding sprint in Paris

4 March 2010

Yesterday we had an extremely productive coding sprint for the scikits.learn. The idea was to put people with common interests in a room and make them work in a single codebase.

Alexandre Gramfort and Olivier Grisel worked on GLMNet, Bertrand Thirion and Gaël Varoquaux worked on univariate feature selection and Vincent worked on Bayesian Regression.

Scikti learn Sprint

I was supposed to work with Vincent, but as soon as Bertrand spot some bugs in my libsvm bindings, I could not think of anything except that, and eventually the day finished just as I fixed the bug …

You can find some cool examples of the things we did in directory examples:

lasso coordinate descent

regression using support vector machines

Scikit-learn 0.1

1 February 2010

Today I released the first public version of Scikit-Learn (release notes).

It’s a python module implementing some machine learning algorithms, and it’s shaping quite good. For this release I did not want to do any incompatible changes, so most of them are just bug fixes and updates.

For the next release, however, some more radical changes are planned, and definitely something should be done about the (incredibly long) namespace, having to tape

from scikits.learn.machine.manifold_learning.regression.neighbors import Neighbors

each time you want to perform a nearest-neighbor algorithms is just not practical!

Here is a nice screenshot,

Scikit - learn

scikit-learn project on sourceforge

7 January 2010

This week we created a sourceforge project to host our development of scikit-learn. Although the project already had a directory in scipy’s repo, we needed more flexibility in the user management and in the mailing list creation, so we opted for SourceForge.

To be honest, after using git and Google Code for bug tracking, I was not very excited about using subversion/sourceforge again. On the other hand, we needed some sort of compromise that would allow a very heterogeneous range of developers to work together, and after some (surprisingly civilized) emails and some chatting with Gael, we agreed that SourceForge was indeed the best choice.

In case you are interested, there’s a (preliminary) web page with more info. You might also want to have a look at the previous project’s web page.

After holidays

5 January 2010

New job, new code, new city, new colleges. Feels something like this:

Contra Corriente

Winter in Paris is not funny

22 December 2009

This week I arrived to the place where I will be working the following two years: Neurospin.

Neurospin

It’s a research center located 20 km from Paris, and so far things are going smoothly: the place is beautiful, work is great and food is excellent. Well OK, I do miss some things from Spain and weather is horrible, but from now on it can only get better, I suppose.

Last days in Granada

15 December 2009

Nice thing about winter in Granada is, that even in the coldest days, the sky is always blue.

Granada desde la A-92

Learning, Machine Learning

15 December 2009

My new job is about managing an open source package for machine learning in Python. I’ve had some experience with Python now, but I am a total newbie in the field of machine learning, so my first task will be to find a good reference book in the subject and start reading.

The books I’ve come across so far have been:

The classic “Artificial Intelligence, A Modern Approach” by Rusell & Norvig, also known by its initials AIMA has been a very valuable introduction. It has four chapters devoted to Machine Learning, and while it does not go very deeply into the maths nor provide detailed analysis of the algorithm, it can be read fairly easily and has very interesting historical notes at the end of each chapter.

“Data Mining, Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank. It is easy going, a bit more in-depth but does not go very deep into the maths (and when it does, the section is marked as ‘optional’). The second part of the book, called “The Weka Machine Learning Workbench” describes Weka, a machine learning framework written in Java. For me, this is an invaluable resource, as Weka seems to be fairly well designed and will surely provide some design inspiration.

“Pattern Classification”, by Richard O Duda, Peter E Hart, David G Stork. A bit harsh at the beginning, this book is slowly becoming my favorite. Doesn’t hide the maths behind the algorithms, has complete sections on computational complexity, lots of exercises at the end of each chapter … I just wish it came with answers to (selected) exercises.

Interestingly, this books starts describing the most specific methods and finishes with the most general ones. In AIMA, you’ll find the reverse scheme.

Moving to Paris!

12 December 2009

I’m extremely glad that finally I am moving to Paris to work as part of
the INRIA crew. I’ll be working with
Gael Varoquaux and his team in an extremely cool Python related
project (more to come on this in the following weeks).

Granada has been a great place for me on the last 10 years, but now I feel its time to move, and hey, Paris is not a bad place :-)

Summer of Code is over

5 September 2009

Google Summer of Code program is officially over. It has been four months of intense work, exciting benchmarks and patch reviewing. It was a huge pleasure working with you guys!

As for the project, I implemented a complete logic module and then an assumption system for sympy (sympy.logic, sympy.assumptions, sympy.queries). I even had time to make the logic module fast. On top of this, there’s the refine module. It is there where you can see some nice examples and where all the power of sympy.queries and sympy.logic is exposed.

Although this sounds good, there are some things that I did not complete on time. I could not remove the old assumption system. There are simply too many things that depend on this to remove it on one move. However, I agreed with Ondrej that we both would be working on this the days 15-30 September. This has to be done because we definitely do not want to make a sympy release with two different assumption systems!

PD: a more detailed report lives here

Speed improvements for ask() (sympy.queries.ask)

20 August 2009

I managed to overcome the overhead in ask() that arises when converting between symbol and integer representation of sentences in conjunctive normal.

The result went beyond what I expected. The test suite for the query module got 10x times faster in my laptop. From 26 seconds, it descended to an impressive 2.03 secs. There is still room for improvement, but it is no longer “so desperately slow”.

I’ll submit those patches soon to sympy’s trunk, but in the meantime they are in my logic branch:

git pull http://fseoane.net/git/sympy.git logic

Logic module (sympy.logic): improving speed

18 August 2009

Today I’ve been doing some speed improvements for the logic module. More precisely, I implemented an efficient internal representation for clauses in conjunctive normal form.

In practice this means a huge performance boost for all problems that make use the function satisfiable() or dpll_satisfiable(). For example, test_dimacs.py has moved from 2.7 seconds to an impressive 0.3 sec, and ask() runs on average 3x times faster, although both problems still have an overhead because of the conversion to this new representation that can be avoided in most times.

Now, the details. Traditionally, dpll (the algorithm that we use for deciding satisfiability) used to store clauses as arrays of symbols, and this worked fine, but sadly comparing symbols in sympy is slow, and this algorithm does a lot of comparisons … but we can map each sympy symbol to a unique integer, and with minor modifications to the algorithm we get these performance gains.

Now, the code. You can pull from my branch logic:

git pull http://fseoane.net/git/sympy.git logic

There are now some obvious performance tweaks we can do:

- in ask(), we can skip the conversion to integer representation by ‘precompiling’ known_facts_dict into this representation. This should be easy and will probably give performance boosts of several orders of magnitude.

- this integer representation is very similar to the one used in dimacs CNF files, so a parser that directly converts CNF files to this integer representation should make solving CNF files much faster.

I would like to give some credit to Ronan Lamy, who sent a patch some time ago, and although I did not include it (yet) into main sympy branch, it inspired me for these modifications.

Refine module

17 August 2009

This commit introduced a new module in sympy: the refine module.

The purpose of this module is to simplify expressions when they are bound to assumptions. For example, if you know that x>0, then you can simplify abs(x) to x. This code was traditionally embedded into the core, but now this will be part of an external module (sympy.refine) upon which the core has no dependencies.

In a not very original move, I named the main function in this module refine(). It’s syntax is very straightforward: first argument is an expression and second argument are assumptions. Some examples (from isympy):

In [1]: refine(1+abs(x), Assume(x, Q.positive))
Out[1]: 1 + x

In [2]: refine(exp(I*x*pi), Assume(x, Q.odd))
Out[2]: -1

In [3]: refine(exp(I*x*pi), Assume(x, Q.even))
Out[3]: 1

Right now the module lacks some rules, but the design (very similar to the query module) will make adding these rules an easy task.

Query module - finally in trunk

10 August 2009

The query module is finally in the main SymPy repository. I made substantial changes since last post, most of them at the user interface level (thanks to Vinzent and Mateusz for many insightful comments).

Main function is ask(), which replaces the old expression.is_* syntax. You can ask many things. For example, you can ask if a given expression is integer, prime or real:

>>> ask(2, Q.integer)
True
>>> ask(x**2, Q.integer)
None
>>> ask(x**2, Q.integer, Assume(x, Q.integer))
True
>>> ask(sqrt(2)*x, Q.integer, Assume(x, Q.integer))
>>> ask(I*x, Q.real, Assume(x, Q.imaginary))
True
>>> ask(x*y, Q.prime, Assume(x, Q.prime) & Assume(y, Q.prime))
False

as you see, it returns True when it is sure that the expression is an integer, None if it does not know, and False if it is certainly not an integer.

The second argument, which we will call ‘key’ specifies what we want to ask about. For example, Q.integer ask wether expression is an integer, Q.negative wether it’s a negative, etc.
For a complete list of all the keys available, see doc/src/modules/queries.txt in sympy codebase.

It also accepts an optional third argument where you can specify assumptions that symbols in expr satisfy. That can be any kind of boolean expression involving assumptions, for example:

Assume(x, Q.positive) & Assume(x, Q.integer)

or

Assume(x, Q.positive) | Assume(x, Q.negative)

, or even

NAnd(Assume(x, Q.positive), Assume(x, Q.integer)

django, change language settings dynamically

7 August 2009

After some failed attempts, I just found how to change the language settings dynamically in django, and I thought it could be useful to someone. Just use function activate() from django.utils.translation. For example:

from django.utils.translation import activate

activate(‘es-ES’)

will change global language settings to ‘es-ES’ (Spain spanish). I use it because I have two forms and I want that errors appear in different languages, but I do not want to get into gettext

can we merge now, pleeease ?

21 July 2009

Three months after I began to write sympy.queries, I feel it’s about time to include it in sympy’s trunk, so today I sent for review 4 patches that implement the complete query module.

It’s been a lot of fun, but it has also caused me some headaches … specially last month trying to have the code as much bug-free as possible.

Next steps is to improve performace of query() and write the refine module.

PD: This weekend I’ll be at Leipzig for EuroSciPy, hope to meet there some SymPy developers!

Refine module, proof of concept

9 July 2009

The 0.6.5 release of SymPy is taking longer than expected because some bugs in the testing framework, so my query module is not merged into trunk (yet). In the meantime, I am implementing a refine module (very little code is available yet).

The refine module implements a refine() function (better names accepted) that would work in a very similar way as Mathematica’s Refine (http://documents.wolfram.com/mathematica/functions/Refine). It tries to simplify an expression based on it’s assumptions. For example:

>>> refine(abs(x), Assume(x, positive=True))
x
>>> refine(abs(x), Assume(x, negative=True))
-x

Initial code is in my git repo, branch queries.

Preparing a new release

30 June 2009

Last days I’ve been busy preparing the first public beta of SymPy 0.6.5. Most of the time was spent solving a bug that made documentation tests fail under python2.4, but now that this is solved, I hope that by the end of the week we could have a final release.

When this release is published, we’ll merge my query module and work on getting it right for 0.7.

Efficient DPLL algorithm

28 June 2009

Background: DPLL is the algorithm behind SymPy’s implementation of logic.inference.satisfiable

After reading the original papers by Davis & Putnam [1], I managed to implement a more efficient version of the DPLL algorithm. It is 10x times faster on medium-sized problems (40 variables), and solves some wrong result bugs [2].

As a side effect, the query module has become 2x faster

Source code lives in my sympy repo, http:/fseoane.net/git/sympy.git, branch logic

References:
[1]: http://portal.acm.org/citation.cfm?coll=GUIDE&dl=GUIDE&id=321034
[2] http://people.sc.fsu.edu/~burkardt/data/cnf/dubois22.cnf

Queries and performance

23 June 2009

After some hacking on the queries module, I finally got it right without the limitations of past versions. You can check it out from my repo http://fseoane.net/git/sympy.git, branch master.

It now relies even more on logic.inference.satisfiable(), which is just an implementation of the DPLL algorithm. Bad news is that (my implementation of ) dpll_satisfiable() is SLOW, so inevitably queries are SLOW. But everything is not lost, since the algorithm is quite fast, and in fact other variants of the algorithm (MiniSAT) perform 6600x times faster than my implementation on medium-sized problems (60 variables, 170 clauses). So this looks like something smells bad on the programming side …

However, I spent the day profiling the function (link to source code used for profiling) without much success