July 30th, 2010
Support Vector machines with custom kernels using scikits.learn
May 27th, 2010
It is now possible (using the development version as of may 2010) to use Support Vector Machines with custom kernels in scikits.learn.
How to use it couldn’t be more simple: you just pass a callable (the kernel) to the class constructor). For example, a linear kernel would be implemented as follows:
def my_kernel(x, y):
return np.dot(x, y.T)
The only requisites for defining a kernel is that it should take as argument two numpy arrays and return also a numpy array. Then you would pass the kernel to the classifier’s constructor:
clf = svm.SVC(kernel=my_kernel)
and that’s all. The construct recognizes this as a custom kernel and you can then use the classifier as any other classifier.
print clf.predict([[0, 0]])
–> [0.]
For a complete reference, see the the reference manual and an example.
Howto link against system-wide BLAS library using numpy.distutils
April 22nd, 2010
If your numpy installation uses system-wide BLAS libraries (this will most likely be the case unless you installed it through prebuilt windows binaries), you can retrieve this information at compile time to link python modules to BLAS.
The function get_info in numpy.distutils.system_info will return a dictionary that contains the needed information to link against BLAS or an empty dict if no system-wide BLAS could be found. For example, MacOSX ships with it’s own optimized BLAS routines, and get_info correctly reports that:
In [2]: get_info(‘blas_opt’)
Out[2]:
{‘define_macros’: [(‘NO_ATLAS_INFO’, 3)],
‘extra_compile_args’: [‘-msse3′,
‘-I/System/Library/Frameworks/vecLib.framework/Headers’],
‘extra_link_args’: [‘-Wl,-framework’, ‘-Wl,Accelerate’]}
The following example shows a setup.py that links against system-wide BLAS if possible. If no appropriate BLAS routine could be found, it will print a warning message, but will compile it’s own BLAS routine and embed it in the python extension.
def configuration(parent_package=”, top_path=None):
import warnings
from numpy.distutils.misc_util import Configuration
from numpy.distutils.system_info import get_info, BlasNotFoundError
config = Configuration(‘foo’, parent_package, top_path)
libfoo_files = [‘foo.c’]
blas_sources = [join(‘blas’, ‘daxpy.c’),
join(‘blas’, ‘dscal.c’)]
blas_info = get_info(‘blas_opt’)
if not blas_info:
warnings.warn(BlasNotFoundError.__doc__)
libfoo_files.append(blas_sources)
libraries = blas_info.pop(‘libraries’, [])
include_dirs = blas_info.pop(‘include_dirs’, [])
config.add_extension(‘foo’,
sources=sources,
libraries=libraries,
include_dirs=include_dirs,
**blas_info
)
return config
if __name__ == ‘__main__’:
from numpy.distutils.core import setup
setup(**configuration(top_path=”).todict())
A real-word example of this can be found in scipy.odr module and in scikits.learn’s liblinear bindings.
scikits.learn 0.2 release
March 22nd, 2010
Today I released a new version of the scikits.learn library for machine learning.
This new release includes the new libsvm bindings, Jake VanderPlas’ BallTree algorithm for *fast* nearest neighbor queries in high dimension, etc. Here is the official announcement.
As usual, it can be downloaded from sourceforge or from the PyPI.
Plot the maximum margin hyperplane with scikits.learn
March 17th, 2010
Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support vector machines, a data point is viewed as a p-dimensional vector (2-dimensional in this example), and we want to know whether we can separate such points with a p -1-dimensional hyperplane (a line in our case). There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane and the linear classifier it defines is known as a maximum margin classifier.
Using the new svm module in scikits.learn, you can easily plot the maximum margin hyperplane. If clf is an instance of svm.SVC(), the coefficients in the decision function are stored in clf.coef_ , and the independent term in clf.rho_ . The complete source code is:
import pylab as pl
from scikits.learn import svm
# we create 40 separable points
np.random.seed(0)
X = np.r_[np.random.randn(20, 2) - [2,2], np.random.randn(20, 2) + [2, 2]]
Y = [0]*20 + [1]*20
# fit the model
clf = svm.SVC(kernel=‘linear’)
clf.fit(X, Y)
# get the separating hyperplane
w = np.dot(clf.coef_[0], clf.support_)
a = -w[0]/w[1]
xx = np.linspace(-5, 5)
yy = a*xx + (clf.rho_[0])/w[1]
# plot the paralels to the separating hyperplane that pass through the
# support vectors
b = clf.support_[0]
yy_down = a*xx + (b[1] - a*b[0])
b = clf.support_[-1]
yy_up = a*xx + (b[1] - a*b[0])
# plot the line, the points, and the nearest vectors to the plane
pl.set_cmap(pl.cm.Paired)
pl.plot(xx, yy, ‘k-’)
pl.plot(xx, yy_down, ‘k–’)
pl.plot(xx, yy_up, ‘k–’)
pl.scatter(X[:,0], X[:,1], c=Y)
pl.scatter(clf.support_[:,0], clf.support_[:,1], marker=‘+’)
pl.axis(‘tight’)
pl.show()
And the result is
where the vectors that are closest to the separating line are highlighted with a small ‘+’.
Up-to date code of this can be found in directory examples/ of scikits.learn
Fast bindings for LibSVM in scikits.learn
March 9th, 2010
LibSVM is a C++ library that implements several Support Vector Machine algorithms that are commonly used in machine learning. It is a fast library that has no dependencies and most machine learning frameworks bind it in some way or another. LibSVM comes with a Python interface written in swig, but this interface is inherently slow as it does not take into account numpy’s array structure. Also, it does not wrap all the library’s functionality. Some projects bind it using this bindings and other (such as PyMVPA) make its own wrap, binding some methods directly to numpy’s array structure.
My approach was to code all algorithms that convert libsvm’s data structures (sparse) to numpy arrays (dense) in pure C and wrap them in a very thin Cython layer. Special attention was given to minimize the overhead of converting between libsvm data structures and numpy arrays, as in my opinion this was the main source of bad performance in existing python bindings.
Benchmarks
As a first benchmark, I supposed a situation in which the dimension of the subspace is small and there are lots of points to classify. This is typically the case when your data is points in plane or in space and you want to draw the decision function by classifying every point in the grid. In this case, the bottleneck is not the classification algorithm, but the conversion of data from a dense representation used by python and numpy and a sparse representation used by libsvm. Not surprisingly, we get huge performance gains if we speed up the conversion dense/sparse.
Curse of dimensionality
In the case of a huge number of dimensions, the speedup is not so spectacular, but we also get a performance boost by making training somewhat faster.
Bidirectional mapping
A feature that was needed and that I haven’t found on other implementations is that you can tweak parameters in the SVM class and the classifier will reflect those changes (i.e. parameters are actually copied back and forth, not just passed as an opaque pointer).
Suppose you train an instance of the classifier and are interested in the coefficients that multiply the support vectors in the decision function. In scikits.learn, you can access this array under field .coef_:
>>> from scikits.learn import svm
>>> clf = svm.SVM()
>>> clf.fit([[1,2], [3,4]], [-1, 1])
>>> clf.coef
clf.coef0 clf.coef_
>>> clf.coef_
array([[ 1., -1.]])
Now, changing the value of these coefficients effectively changes the decision function:
array([ -1.])
>>> clf.coef_ = np.array([[0.0, -1.0]])
>>> clf.predict([[1,2]])
array([ 1.])
Code
All code can be found in the scikit (you’ll have to get the svn version), in file scikits/learn/svm.py and scikits/learn/src/. All plots are generated from this script
Notes
In the benchmarks, a Linear Kernel was used, as it is the most common. Other more computationally intensive kernels would probably narrow the difference.
Bugs
This code should be treated as alpha quality and has not being extensively tested. Please report any bugs that you encounter to the tracker
scikits.learn coding sprint in Paris
March 4th, 2010
Yesterday we had an extremely productive coding sprint for the scikits.learn. The idea was to put people with common interests in a room and make them work in a single codebase.
Alexandre Gramfort and Olivier Grisel worked on GLMNet, Bertrand Thirion and Gaël Varoquaux worked on univariate feature selection and Vincent worked on Bayesian Regression.
I was supposed to work with Vincent, but as soon as Bertrand spot some bugs in my libsvm bindings, I could not think of anything except that, and eventually the day finished just as I fixed the bug …
You can find some cool examples of the things we did in directory examples:
Scikit-learn 0.1
February 1st, 2010
Today I released the first public version of Scikit-Learn (release notes).
It’s a python module implementing some machine learning algorithms, and it’s shaping quite good. For this release I did not want to do any incompatible changes, so most of them are just bug fixes and updates.
For the next release, however, some more radical changes are planned, and definitely something should be done about the (incredibly long) namespace, having to tape
each time you want to perform a nearest-neighbor algorithms is just not practical!
Here is a nice screenshot,
scikit-learn project on sourceforge
January 7th, 2010
This week we created a sourceforge project to host our development of scikit-learn. Although the project already had a directory in scipy’s repo, we needed more flexibility in the user management and in the mailing list creation, so we opted for SourceForge.
To be honest, after using git and Google Code for bug tracking, I was not very excited about using subversion/sourceforge again. On the other hand, we needed some sort of compromise that would allow a very heterogeneous range of developers to work together, and after some (surprisingly civilized) emails and some chatting with Gael, we agreed that SourceForge was indeed the best choice.
In case you are interested, there’s a (preliminary) web page with more info. You might also want to have a look at the previous project’s web page.
After holidays
January 5th, 2010
Winter in Paris is not funny
December 22nd, 2009
This week I arrived to the place where I will be working the following two years: Neurospin.
It’s a research center located 20 km from Paris, and so far things are going smoothly: the place is beautiful, work is great and food is excellent. Well OK, I do miss some things from Spain and weather is horrible, but from now on it can only get better, I suppose.
Last days in Granada
December 15th, 2009
Learning, Machine Learning
December 15th, 2009
My new job is about managing an open source package for machine learning in Python. I’ve had some experience with Python now, but I am a total newbie in the field of machine learning, so my first task will be to find a good reference book in the subject and start reading.
The books I’ve come across so far have been:
The classic “Artificial Intelligence, A Modern Approach” by Rusell & Norvig, also known by its initials AIMA has been a very valuable introduction. It has four chapters devoted to Machine Learning, and while it does not go very deeply into the maths nor provide detailed analysis of the algorithm, it can be read fairly easily and has very interesting historical notes at the end of each chapter.
“Data Mining, Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank. It is easy going, a bit more in-depth but does not go very deep into the maths (and when it does, the section is marked as ‘optional’). The second part of the book, called “The Weka Machine Learning Workbench” describes Weka, a machine learning framework written in Java. For me, this is an invaluable resource, as Weka seems to be fairly well designed and will surely provide some design inspiration.
“Pattern Classification”, by Richard O Duda, Peter E Hart, David G Stork. A bit harsh at the beginning, this book is slowly becoming my favorite. Doesn’t hide the maths behind the algorithms, has complete sections on computational complexity, lots of exercises at the end of each chapter … I just wish it came with answers to (selected) exercises.
Interestingly, this books starts describing the most specific methods and finishes with the most general ones. In AIMA, you’ll find the reverse scheme.
Moving to Paris!
December 12th, 2009
I’m extremely glad that finally I am moving to Paris to work as part of
the INRIA crew. I’ll be working with
Gael Varoquaux and his team in an extremely cool Python related
project (more to come on this in the following weeks).
Granada has been a great place for me on the last 10 years, but now I feel its time to move, and hey, Paris is not a bad place
Summer of Code is over
September 5th, 2009
Google Summer of Code program is officially over. It has been four months of intense work, exciting benchmarks and patch reviewing. It was a huge pleasure working with you guys!
As for the project, I implemented a complete logic module and then an assumption system for sympy (sympy.logic, sympy.assumptions, sympy.queries). I even had time to make the logic module fast. On top of this, there’s the refine module. It is there where you can see some nice examples and where all the power of sympy.queries and sympy.logic is exposed.
Although this sounds good, there are some things that I did not complete on time. I could not remove the old assumption system. There are simply too many things that depend on this to remove it on one move. However, I agreed with Ondrej that we both would be working on this the days 15-30 September. This has to be done because we definitely do not want to make a sympy release with two different assumption systems!
PD: a more detailed report lives here
Speed improvements for ask() (sympy.queries.ask)
August 20th, 2009
I managed to overcome the overhead in ask() that arises when converting between symbol and integer representation of sentences in conjunctive normal.
The result went beyond what I expected. The test suite for the query module got 10x times faster in my laptop. From 26 seconds, it descended to an impressive 2.03 secs. There is still room for improvement, but it is no longer “so desperately slow”.
I’ll submit those patches soon to sympy’s trunk, but in the meantime they are in my logic branch:
Logic module (sympy.logic): improving speed
August 18th, 2009
Today I’ve been doing some speed improvements for the logic module. More precisely, I implemented an efficient internal representation for clauses in conjunctive normal form.
In practice this means a huge performance boost for all problems that make use the function satisfiable() or dpll_satisfiable(). For example, test_dimacs.py has moved from 2.7 seconds to an impressive 0.3 sec, and ask() runs on average 3x times faster, although both problems still have an overhead because of the conversion to this new representation that can be avoided in most times.
Now, the details. Traditionally, dpll (the algorithm that we use for deciding satisfiability) used to store clauses as arrays of symbols, and this worked fine, but sadly comparing symbols in sympy is slow, and this algorithm does a lot of comparisons … but we can map each sympy symbol to a unique integer, and with minor modifications to the algorithm we get these performance gains.
Now, the code. You can pull from my branch logic:
There are now some obvious performance tweaks we can do:
- in ask(), we can skip the conversion to integer representation by ‘precompiling’ known_facts_dict into this representation. This should be easy and will probably give performance boosts of several orders of magnitude.
- this integer representation is very similar to the one used in dimacs CNF files, so a parser that directly converts CNF files to this integer representation should make solving CNF files much faster.
—
I would like to give some credit to Ronan Lamy, who sent a patch some time ago, and although I did not include it (yet) into main sympy branch, it inspired me for these modifications.
Refine module
August 17th, 2009
This commit introduced a new module in sympy: the refine module.
The purpose of this module is to simplify expressions when they are bound to assumptions. For example, if you know that x>0, then you can simplify abs(x) to x. This code was traditionally embedded into the core, but now this will be part of an external module (sympy.refine) upon which the core has no dependencies.
In a not very original move, I named the main function in this module refine(). It’s syntax is very straightforward: first argument is an expression and second argument are assumptions. Some examples (from isympy):
In [1]: refine(1+abs(x), Assume(x, Q.positive))
Out[1]: 1 + x
In [2]: refine(exp(I*x*pi), Assume(x, Q.odd))
Out[2]: -1
In [3]: refine(exp(I*x*pi), Assume(x, Q.even))
Out[3]: 1
Right now the module lacks some rules, but the design (very similar to the query module) will make adding these rules an easy task.
Query module - finally in trunk
August 10th, 2009
The query module is finally in the main SymPy repository. I made substantial changes since last post, most of them at the user interface level (thanks to Vinzent and Mateusz for many insightful comments).
Main function is ask(), which replaces the old expression.is_* syntax. You can ask many things. For example, you can ask if a given expression is integer, prime or real:
True
>>> ask(x**2, Q.integer)
None
>>> ask(x**2, Q.integer, Assume(x, Q.integer))
True
>>> ask(sqrt(2)*x, Q.integer, Assume(x, Q.integer))
>>> ask(I*x, Q.real, Assume(x, Q.imaginary))
True
>>> ask(x*y, Q.prime, Assume(x, Q.prime) & Assume(y, Q.prime))
False
as you see, it returns True when it is sure that the expression is an integer, None if it does not know, and False if it is certainly not an integer.
The second argument, which we will call ‘key’ specifies what we want to ask about. For example, Q.integer ask wether expression is an integer, Q.negative wether it’s a negative, etc.
For a complete list of all the keys available, see doc/src/modules/queries.txt in sympy codebase.
It also accepts an optional third argument where you can specify assumptions that symbols in expr satisfy. That can be any kind of boolean expression involving assumptions, for example:
or
, or even
django, change language settings dynamically
August 7th, 2009
After some failed attempts, I just found how to change the language settings dynamically in django, and I thought it could be useful to someone. Just use function activate() from django.utils.translation. For example:
activate(‘es-ES’)
will change global language settings to ‘es-ES’ (Spain spanish). I use it because I have two forms and I want that errors appear in different languages, but I do not want to get into gettext

Posted in










Hola. Quieres un yogur?