July in Paris

July 30th, 2010

One of the best things of spending summer in Paris: its parcs (here, with friends @ Parc Montsouris).

Con Julia en Montsouris

Support Vector machines with custom kernels using scikits.learn

May 27th, 2010

It is now possible (using the development version as of may 2010) to use Support Vector Machines with custom kernels in scikits.learn.

How to use it couldn’t be more simple: you just pass a callable (the kernel) to the class constructor). For example, a linear kernel would be implemented as follows:

import numpy as np
def my_kernel(x, y):
    return np.dot(x, y.T)

The only requisites for defining a kernel is that it should take as argument two numpy arrays and return also a numpy array. Then you would pass the kernel to the classifier’s constructor:

from scikits.learn import svm
clf = svm.SVC(kernel=my_kernel)

and that’s all. The construct recognizes this as a custom kernel and you can then use the classifier as any other classifier.

clf.fit([[0, 0], [1, 1]], [0, 1])
print clf.predict([[0, 0]])
> [0.]

For a complete reference, see the the reference manual and an example.

Howto link against system-wide BLAS library using numpy.distutils

April 22nd, 2010

If your numpy installation uses system-wide BLAS libraries (this will most likely be the case unless you installed it through prebuilt windows binaries), you can retrieve this information at compile time to link python modules to BLAS.

The function get_info in numpy.distutils.system_info will return a dictionary that contains the needed information to link against BLAS or an empty dict if no system-wide BLAS could be found. For example, MacOSX ships with it’s own optimized BLAS routines, and get_info correctly reports that:

In [1]: from numpy.distutils.system_info import get_info

In [2]: get_info(‘blas_opt’)
Out[2]:
{‘define_macros’: [(‘NO_ATLAS_INFO’, 3)],
 ‘extra_compile_args’: [‘-msse3′,
      ‘-I/System/Library/Frameworks/vecLib.framework/Headers’],
 ‘extra_link_args’: [‘-Wl,-framework’, ‘-Wl,Accelerate’]}

The following example shows a setup.py that links against system-wide BLAS if possible. If no appropriate BLAS routine could be found, it will print a warning message, but will compile it’s own BLAS routine and embed it in the python extension.

rom os.path import join

def configuration(parent_package=, top_path=None):
    import warnings
    from numpy.distutils.misc_util import Configuration
    from numpy.distutils.system_info import get_info, BlasNotFoundError
    config = Configuration(‘foo’, parent_package, top_path)

    libfoo_files = [‘foo.c’]
    blas_sources = [join(‘blas’, ‘daxpy.c’),
                    join(‘blas’, ‘dscal.c’)]

    blas_info = get_info(‘blas_opt’)
    if not blas_info:
        warnings.warn(BlasNotFoundError.__doc__)
        libfoo_files.append(blas_sources)

    libraries = blas_info.pop(‘libraries’, [])
    include_dirs = blas_info.pop(‘include_dirs’, [])
    config.add_extension(‘foo’,
        sources=sources,
        libraries=libraries,
        include_dirs=include_dirs,
        **blas_info
    )

    return config

if __name__ == ‘__main__’:
    from numpy.distutils.core import setup
    setup(**configuration(top_path=).todict())

A real-word example of this can be found in scipy.odr module and in scikits.learn’s liblinear bindings.

scikits.learn 0.2 release

March 22nd, 2010

Today I released a new version of the scikits.learn library for machine learning.

This new release includes the new libsvm bindings, Jake VanderPlas’ BallTree algorithm for *fast* nearest neighbor queries in high dimension, etc. Here is the official announcement.

As usual, it can be downloaded from sourceforge or from the PyPI.

Plot the maximum margin hyperplane with scikits.learn

March 17th, 2010

Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support vector machines, a data point is viewed as a p-dimensional vector (2-dimensional in this example), and we want to know whether we can separate such points with a p -1-dimensional hyperplane (a line in our case). There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane and the linear classifier it defines is known as a maximum margin classifier.

Using the new svm module in scikits.learn, you can easily plot the maximum margin hyperplane. If clf is an instance of svm.SVC(), the coefficients in the decision function are stored in clf.coef_ , and the independent term in clf.rho_ . The complete source code is:

import numpy as np
import pylab as pl
from scikits.learn import svm

# we create 40 separable points
np.random.seed(0)
X = np.r_[np.random.randn(20, 2) - [2,2], np.random.randn(20, 2) + [2, 2]]
Y = [0]*20 + [1]*20

# fit the model
clf = svm.SVC(kernel=‘linear’)
clf.fit(X, Y)

# get the separating hyperplane
w = np.dot(clf.coef_[0], clf.support_)
a = -w[0]/w[1]
xx = np.linspace(-5, 5)
yy = a*xx + (clf.rho_[0])/w[1]

# plot the paralels to the separating hyperplane that pass through the
# support vectors
b = clf.support_[0]
yy_down = a*xx + (b[1] - a*b[0])
b = clf.support_[-1]
yy_up = a*xx + (b[1] - a*b[0])

# plot the line, the points, and the nearest vectors to the plane
pl.set_cmap(pl.cm.Paired)
pl.plot(xx, yy, ‘k-’)
pl.plot(xx, yy_down, ‘k–’)
pl.plot(xx, yy_up, ‘k–’)
pl.scatter(X[:,0], X[:,1], c=Y)
pl.scatter(clf.support_[:,0], clf.support_[:,1], marker=‘+’)

pl.axis(‘tight’)
pl.show()

And the result is

Separating hyperplane

where the vectors that are closest to the separating line are highlighted with a small ‘+’.

Up-to date code of this can be found in directory examples/ of scikits.learn

Fast bindings for LibSVM in scikits.learn

March 9th, 2010

LibSVM is a C++ library that implements several Support Vector Machine algorithms that are commonly used in machine learning. It is a fast library that has no dependencies and most machine learning frameworks bind it in some way or another. LibSVM comes with a Python interface written in swig, but this interface is inherently slow as it does not take into account numpy’s array structure. Also, it does not wrap all the library’s functionality. Some projects bind it using this bindings and other (such as PyMVPA) make its own wrap, binding some methods directly to numpy’s array structure.

My approach was to code all algorithms that convert libsvm’s data structures (sparse) to numpy arrays (dense) in pure C and wrap them in a very thin Cython layer. Special attention was given to minimize the overhead of converting between libsvm data structures and numpy arrays, as in my opinion this was the main source of bad performance in existing python bindings.

Benchmarks

As a first benchmark, I supposed a situation in which the dimension of the subspace is small and there are lots of points to classify. This is typically the case when your data is points in plane or in space and you want to draw the decision function by classifying every point in the grid. In this case, the bottleneck is not the classification algorithm, but the conversion of data from a dense representation used by python and numpy and a sparse representation used by libsvm. Not surprisingly, we get huge performance gains if we speed up the conversion dense/sparse.

bench1

Curse of dimensionality

In the case of a huge number of dimensions, the speedup is not so spectacular, but we also get a performance boost by making training somewhat faster.

Benchmarks of different bindings of LibSVM

Bidirectional mapping

A feature that was needed and that I haven’t found on other implementations is that you can tweak parameters in the SVM class and the classifier will reflect those changes (i.e. parameters are actually copied back and forth, not just passed as an opaque pointer).

Suppose you train an instance of the classifier and are interested in the coefficients that multiply the support vectors in the decision function. In scikits.learn, you can access this array under field .coef_:

>>> import numpy as np
>>> from scikits.learn import svm
>>> clf = svm.SVM()
>>> clf.fit([[1,2], [3,4]], [-1, 1])

>>> clf.coef
clf.coef0 clf.coef_
>>> clf.coef_
array([[ 1., -1.]])

Now, changing the value of these coefficients effectively changes the decision function:

>>> clf.predict([[1,2]])
array([ -1.])
>>> clf.coef_ = np.array([[0.0, -1.0]])
>>> clf.predict([[1,2]])
array([ 1.])

Code

All code can be found in the scikit (you’ll have to get the svn version), in file scikits/learn/svm.py and scikits/learn/src/. All plots are generated from this script

Notes

In the benchmarks, a Linear Kernel was used, as it is the most common. Other more computationally intensive kernels would probably narrow the difference.

Bugs

This code should be treated as alpha quality and has not being extensively tested. Please report any bugs that you encounter to the tracker

scikits.learn coding sprint in Paris

March 4th, 2010

Yesterday we had an extremely productive coding sprint for the scikits.learn. The idea was to put people with common interests in a room and make them work in a single codebase.

Alexandre Gramfort and Olivier Grisel worked on GLMNet, Bertrand Thirion and Gaël Varoquaux worked on univariate feature selection and Vincent worked on Bayesian Regression.

Scikti learn Sprint

I was supposed to work with Vincent, but as soon as Bertrand spot some bugs in my libsvm bindings, I could not think of anything except that, and eventually the day finished just as I fixed the bug …

You can find some cool examples of the things we did in directory examples:

lasso coordinate descent

regression using support vector machines

Scikit-learn 0.1

February 1st, 2010

Today I released the first public version of Scikit-Learn (release notes).

It’s a python module implementing some machine learning algorithms, and it’s shaping quite good. For this release I did not want to do any incompatible changes, so most of them are just bug fixes and updates.

For the next release, however, some more radical changes are planned, and definitely something should be done about the (incredibly long) namespace, having to tape

from scikits.learn.machine.manifold_learning.regression.neighbors import Neighbors

each time you want to perform a nearest-neighbor algorithms is just not practical!

Here is a nice screenshot,

Scikit - learn

scikit-learn project on sourceforge

January 7th, 2010

This week we created a sourceforge project to host our development of scikit-learn. Although the project already had a directory in scipy’s repo, we needed more flexibility in the user management and in the mailing list creation, so we opted for SourceForge.

To be honest, after using git and Google Code for bug tracking, I was not very excited about using subversion/sourceforge again. On the other hand, we needed some sort of compromise that would allow a very heterogeneous range of developers to work together, and after some (surprisingly civilized) emails and some chatting with Gael, we agreed that SourceForge was indeed the best choice.

In case you are interested, there’s a (preliminary) web page with more info. You might also want to have a look at the previous project’s web page.

After holidays

January 5th, 2010

New job, new code, new city, new colleges. Feels something like this:

Contra Corriente

Winter in Paris is not funny

December 22nd, 2009

This week I arrived to the place where I will be working the following two years: Neurospin.

Neurospin

It’s a research center located 20 km from Paris, and so far things are going smoothly: the place is beautiful, work is great and food is excellent. Well OK, I do miss some things from Spain and weather is horrible, but from now on it can only get better, I suppose.

Last days in Granada

December 15th, 2009

Nice thing about winter in Granada is, that even in the coldest days, the sky is always blue.

Granada desde la A-92

Learning, Machine Learning

December 15th, 2009

My new job is about managing an open source package for machine learning in Python. I’ve had some experience with Python now, but I am a total newbie in the field of machine learning, so my first task will be to find a good reference book in the subject and start reading.

The books I’ve come across so far have been:

The classic “Artificial Intelligence, A Modern Approach” by Rusell & Norvig, also known by its initials AIMA has been a very valuable introduction. It has four chapters devoted to Machine Learning, and while it does not go very deeply into the maths nor provide detailed analysis of the algorithm, it can be read fairly easily and has very interesting historical notes at the end of each chapter.

“Data Mining, Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank. It is easy going, a bit more in-depth but does not go very deep into the maths (and when it does, the section is marked as ‘optional’). The second part of the book, called “The Weka Machine Learning Workbench” describes Weka, a machine learning framework written in Java. For me, this is an invaluable resource, as Weka seems to be fairly well designed and will surely provide some design inspiration.

“Pattern Classification”, by Richard O Duda, Peter E Hart, David G Stork. A bit harsh at the beginning, this book is slowly becoming my favorite. Doesn’t hide the maths behind the algorithms, has complete sections on computational complexity, lots of exercises at the end of each chapter … I just wish it came with answers to (selected) exercises.

Interestingly, this books starts describing the most specific methods and finishes with the most general ones. In AIMA, you’ll find the reverse scheme.

Moving to Paris!

December 12th, 2009

I’m extremely glad that finally I am moving to Paris to work as part of
the INRIA crew. I’ll be working with
Gael Varoquaux and his team in an extremely cool Python related
project (more to come on this in the following weeks).

Granada has been a great place for me on the last 10 years, but now I feel its time to move, and hey, Paris is not a bad place :-)

Summer of Code is over

September 5th, 2009

Google Summer of Code program is officially over. It has been four months of intense work, exciting benchmarks and patch reviewing. It was a huge pleasure working with you guys!

As for the project, I implemented a complete logic module and then an assumption system for sympy (sympy.logic, sympy.assumptions, sympy.queries). I even had time to make the logic module fast. On top of this, there’s the refine module. It is there where you can see some nice examples and where all the power of sympy.queries and sympy.logic is exposed.

Although this sounds good, there are some things that I did not complete on time. I could not remove the old assumption system. There are simply too many things that depend on this to remove it on one move. However, I agreed with Ondrej that we both would be working on this the days 15-30 September. This has to be done because we definitely do not want to make a sympy release with two different assumption systems!

PD: a more detailed report lives here

Speed improvements for ask() (sympy.queries.ask)

August 20th, 2009

I managed to overcome the overhead in ask() that arises when converting between symbol and integer representation of sentences in conjunctive normal.

The result went beyond what I expected. The test suite for the query module got 10x times faster in my laptop. From 26 seconds, it descended to an impressive 2.03 secs. There is still room for improvement, but it is no longer “so desperately slow”.

I’ll submit those patches soon to sympy’s trunk, but in the meantime they are in my logic branch:

git pull http://fseoane.net/git/sympy.git logic

Logic module (sympy.logic): improving speed

August 18th, 2009

Today I’ve been doing some speed improvements for the logic module. More precisely, I implemented an efficient internal representation for clauses in conjunctive normal form.

In practice this means a huge performance boost for all problems that make use the function satisfiable() or dpll_satisfiable(). For example, test_dimacs.py has moved from 2.7 seconds to an impressive 0.3 sec, and ask() runs on average 3x times faster, although both problems still have an overhead because of the conversion to this new representation that can be avoided in most times.

Now, the details. Traditionally, dpll (the algorithm that we use for deciding satisfiability) used to store clauses as arrays of symbols, and this worked fine, but sadly comparing symbols in sympy is slow, and this algorithm does a lot of comparisons … but we can map each sympy symbol to a unique integer, and with minor modifications to the algorithm we get these performance gains.

Now, the code. You can pull from my branch logic:

git pull http://fseoane.net/git/sympy.git logic

There are now some obvious performance tweaks we can do:

- in ask(), we can skip the conversion to integer representation by ‘precompiling’ known_facts_dict into this representation. This should be easy and will probably give performance boosts of several orders of magnitude.

- this integer representation is very similar to the one used in dimacs CNF files, so a parser that directly converts CNF files to this integer representation should make solving CNF files much faster.

I would like to give some credit to Ronan Lamy, who sent a patch some time ago, and although I did not include it (yet) into main sympy branch, it inspired me for these modifications.

Refine module

August 17th, 2009

This commit introduced a new module in sympy: the refine module.

The purpose of this module is to simplify expressions when they are bound to assumptions. For example, if you know that x>0, then you can simplify abs(x) to x. This code was traditionally embedded into the core, but now this will be part of an external module (sympy.refine) upon which the core has no dependencies.

In a not very original move, I named the main function in this module refine(). It’s syntax is very straightforward: first argument is an expression and second argument are assumptions. Some examples (from isympy):

In [1]: refine(1+abs(x), Assume(x, Q.positive))
Out[1]: 1 + x

In [2]: refine(exp(I*x*pi), Assume(x, Q.odd))
Out[2]: -1

In [3]: refine(exp(I*x*pi), Assume(x, Q.even))
Out[3]: 1

Right now the module lacks some rules, but the design (very similar to the query module) will make adding these rules an easy task.

Query module - finally in trunk

August 10th, 2009

The query module is finally in the main SymPy repository. I made substantial changes since last post, most of them at the user interface level (thanks to Vinzent and Mateusz for many insightful comments).

Main function is ask(), which replaces the old expression.is_* syntax. You can ask many things. For example, you can ask if a given expression is integer, prime or real:

>>> ask(2, Q.integer)
True
>>> ask(x**2, Q.integer)
None
>>> ask(x**2, Q.integer, Assume(x, Q.integer))
True
>>> ask(sqrt(2)*x, Q.integer, Assume(x, Q.integer))
>>> ask(I*x, Q.real, Assume(x, Q.imaginary))
True
>>> ask(x*y, Q.prime, Assume(x, Q.prime) & Assume(y, Q.prime))
False

as you see, it returns True when it is sure that the expression is an integer, None if it does not know, and False if it is certainly not an integer.

The second argument, which we will call ‘key’ specifies what we want to ask about. For example, Q.integer ask wether expression is an integer, Q.negative wether it’s a negative, etc.
For a complete list of all the keys available, see doc/src/modules/queries.txt in sympy codebase.

It also accepts an optional third argument where you can specify assumptions that symbols in expr satisfy. That can be any kind of boolean expression involving assumptions, for example:

Assume(x, Q.positive) & Assume(x, Q.integer)

or

Assume(x, Q.positive) | Assume(x, Q.negative)

, or even

NAnd(Assume(x, Q.positive), Assume(x, Q.integer)

django, change language settings dynamically

August 7th, 2009

After some failed attempts, I just found how to change the language settings dynamically in django, and I thought it could be useful to someone. Just use function activate() from django.utils.translation. For example:

from django.utils.translation import activate

activate(‘es-ES’)

will change global language settings to ‘es-ES’ (Spain spanish). I use it because I have two forms and I want that errors appear in different languages, but I do not want to get into gettext