line-by-line memory usage of a Python program

24 April 2012

My newest project is a Python library for monitoring memory consumption of arbitrary process, and one of its most useful features is the line-by-line analysis of memory usage for Python code.

I wrote a basic prototype six months ago after being surprised by the lack of related tools. I wanted to plot memory consumption of a couple of Python functions but did not find a python module to do the job. I came to the conclusion that there is no standard way to get the memory usage of the Python interpreter from within Python, so I resorted to reading for from /proc/$PID/statm. From there on I realized that one the fetching of memory is done, making a line-by-line report wouldn’t be hard.

Back to today. I’ve been using the line-by-line memory monitoring to diagnose poor memory management (hidden temporaries, unused allocation, etc.) for some time. It seems to work on two different computers, so full of confidence as I am, I’ll write a blog post about it …

How to use it?

The easiest way to get it is to install from the Python Package Index:

    $ easy_install -U memory_profiler # pip install -U memory_profiler

but other options include fetching the latests from github or dropping it on your current working directory or somewhere else on your PYTHONPATH since it consist of a single file.

Then next step is to write some python code to profile. It can be just about any function, but for the purpose of this blog post I’ll create a function my_func() with mostly memory allocations and save it to a file named example.py:

import numpy as np

@profile
def my_func():
    a = np.zeros((100, 100))
    b = np.zeros((1000, 1000))
    c = np.zeros((10000, 1000))
    return a, b, c

if __name__ == '__main__':
    my_func()

Note that I’ve decorated the function with @profile. This tells the profiler to look into function my_func and gather the memory consumption for each line.

Wake up the cookie monster

To start profiling and output the result to stdout, run the script as usual and append the options “-m memory_profiler -l -v” to the python interpreter.

$ python -m memory_profiler -l -v example.py
Line #    Mem usage   Line Contents
===================================
     3                @profile
     4     13.68 MB   def my_func():
     5     13.77 MB       a = np.zeros((100, 100))
     6     21.40 MB       b = np.zeros((1000, 1000))
     7     97.70 MB       c = np.zeros((10000, 1000))
     8     97.70 MB       return a, b, c

voilá! Each line is prefixed by the memory usage in MB of the Python interpreter after that line has been executed.

Low rank approximation

6 November 2011

A little experiment to see what low rank approximation looks like. These are the best rank-k approximations (in the Frobenius norm) to the a natural image for increasing values of k and an original image of rank 512.

Python code can be found here. GIF animation made using ImageMagic’s convert script.

qr_multiply function in scipy.linalg

14 October 2011

In scipy’s development version there’s a new function closely related to the QR-decomposition of a matrix and to the least-squares solution of a linear system.

What this function does is to compute the QR-decomposition of a matrix and then multiply the resulting orthogonal factor by another arbitrary matrix. In pseudocode:

def qr_multiply(X, Y):
    Q, R = qr(X)
    return dot(Q.T, Y)

but unlike this naive implementation, qr_multiply is able to do all this without explicitly computing the orthogonal Q matrix, resulting both in memory and time saving. In the following picture I measured the memory consumption as a function of time of running this computation on a 1.000 x 1.000 matrix X and a vector Y (full code can be found here):

It can be seen that not only qr_multiply is almost twice as fast as the naive approach, but also that the memory consumption is significantly reduced, since the orthogonal factor is never explicitly computed.

Credit for implementing the qr_multiply function goes to Martin Teichmann.

scikit-learn 0.9

2 October 2011

Last week we released a new version of scikit-learn. The Changelog is particularly impressive, yet personally this release is important for other reasons.

This will probably be my last release as a paid engineer. I’m starting a PhD next month, and although I plan to continue contributing to the project and make a few more releases, I will certainly have less time to devote to it. Luckily, I received a lot of help from the community while preparing the release, from Changelog writing to build of Windows binaries, thus I expect the transition to go smoothly.

Almost two years have elapsed since the first 0.1 release. During this time, we did a lot of refactoring and broke the API several times. However, I’ve seen some concerns about API stability both at the EuroScipy conference and in the mailing list where I’ve realized we need to provide an API that does not break in every release, and do this in a way that the project remains fun for developers.

That’s why I’m extremely glad to see that although this release is big in changes, these have been made in a more organized manner. Yes, we’ve broken the API once again, but now there’s a compatibility layer that ensures that code written for 0.8 will continue working with the new release.

Reworked example gallery for scikit-learn

4 September 2011

I’ve been working lately in improving the scikit-learn example gallery to show also a small thumbnail of the plotted result. Here is what the gallery looks like now


And the real thing should be already displayed in the development documentation. The next thing is to add a static image to those that don’t generate any result, examples such as the SVM GUI should have an image to display.

scikit-learn’s EuroScipy 2011 coding sprint — day two

25 August 2011

Today’s coding sprint was a bit more crowded, with some notable scipy hackers such as Ralph Gommers, Stefan van der Walt, David Cournapeau or Fernando Perez from Ipython joining in. On what got done:

– We merged Jake‘s new BallTree code. This is a pure Cython implementation of a nearest-neighbor search similar to the KDTree class in scipy.spatial, but much faster. The code looks awesome and it’s a big speedup compared to the older code.

– Vlad is ready to merge his dictionary learning code, something that should happen in the upcoming days.

– Initial support for Python 3. scikit-learn should now at least build and import cleanly under Python 3.

– some bugfixes in the Pipeline object and in docstrings.

So this was the end of the scikit-learn sprint, but EuroScipy has just begun. See you tomorrow at the conference (follow the signs)!

scikit-learn’s EuroScipy 2011 coding sprint — day one

23 August 2011

As a warm-up for the upcoming EuroScipy conference, some of the scikit-learn developers decided to gather and work together for a couple of days.

Today was the first day and there was only a handfull of us, as the real kickoff is expected tomorrow. Some interesting coding happened, although most of us where still preparing material for the EuroScipy tutorials …

– API changes: remove of keyword parameters to fit method, added method set_params (pull request).

– Some bugfixing in NuSVR (pull request)

– Review of Vlad‘s code, developed during his Summer of Code program.

– A lot of discussion about algorithm, code, APIs and buildbot dance !

Ridge regression path

12 July 2011

Ridge coefficients for multiple values of the regularization parameter can be elegantly computed by updating the thin SVD decomposition of the design matrix:

import numpy as np
from scipy import linalg

def ridge(A, b, alphas):
    """Return coefficients for regularized least squares
               min ||A x - b|| + alpha ||x||^2
    """

    U, s, V = linalg.svd(A, full_matrices=False)
    d = np.dot(U.T, b) / (s + alphas[:, np.newaxis] / s)
    return np.dot(d, V)

For a concrete problem it then can be used to efficiently compute it’s path, that is, to plot the coefficients as a function of the regularization parameter.

A variant of this algorithm can then be used to compute the optimal regularization parameter in the sense of leave-one-out cross-validation and is implemented in scikit-learn’s RidgeCV (for which Mathieu Blondel has an excelent post by ). This optimal parameter is denoted with a vertical dotted line in the following picture, full code can be found here.

LLE comes in different flavours

30 June 2011

I haven’t worked in the manifold module since last time, yet thanks to Jake VanderPlas there are some cool features I can talk about.

First of, the ARPACK backend is finally working and gives factor one speedup over the lobcpg + PyAMG approach. The key is to use ARPACK’s shift-invert mode instead of the regular mode, a subtle change that drove me crazy for weeks and that Jake spotted by comparing it to his C++ LLE implementation.

More importantly, some variants of Locally Linear Embedding (LLE) have been added to the module: Modified LLE, Hessian LLE and LTSA. These seem to generate better solutions than the classical LLE with timings that are not far apart. All the LLE variants currently implemented can be seen in this example, where they are applied to an S-shaped dataset.

Manifold learning in scikit-learn

7 June 2011

The manifold module in scikit-learn is slowly progressing: the locally linear embedding implementation was finally merged along with some documentation. At about the same time but in a different timezone, Jake VanderPlas began coding other manifold learning methods and back in Paris Olivier Grisel made my digits example a lot nicer by adding the embedding of different dimensionality reduction techniques from scikit-learn:




Handwritten digits and Locally Linear Embedding

4 May 2011

I decided to test my new Locally Linear Embedding (LLE) implementation against a real dataset. At first I didn’t think this would turn out very well, since LLE seems to be somewhat fragile, yielding largely different results for small differences in parameters such as number of neighbors or tolerance, but as it turns out, results are not bad at all.

The idea is to take a handwritten digit, stored as a 8×8 pixel image and flatten it into a an array of 8×8 = 64 floating-point values.

Then each handwritten digit can be seen as a point in a 64-dimensional space. Of course, visualizing in 64-dimensional spaces is not easy, and that’s where Locally Linear Embedding comes handy. We’ll use this method to reduce the dimension from 64 to 2 with the hope of preserving most of the underlying manifold structure. The following is a plot of the handwritten digits {0, 1, 2, 3, 4} after performing locally linear embedding. As you can see, some groups are nicely clustered, notably the 0 is isolated while other like {4, 5} are closer, precisely those that are more similar.

Source code for this example can be found here but relies on my manifold branch of scikit-learn.

Low-level routines for Support Vector Machines

27 April 2011

I’ve been working lately in improving the low-level API of the libsvm bindings in scikit-learn. The goal is to provide an API that encourages an efficient use of these libraries for expert users.

These are methods that have lower overhead than the object-oriented interface as they are closer to the C implementation, but do not have an interface as polished. Here, all parameters are expected to be of the correct type, and submitting one of the wrong type will make the function exit immediately with a ValueError. For instance, input data is expected to be of type float64, even for class labels!

Another peculiarity of these methods is that they only take and return numpy arrays. No custom objects, all method take and return arrays. That looks something like:

import numpy as np
from scikits.learn import svm, datasets

iris = datasets.load_iris()
iris.target = iris.target.astype(np.float64)

learned_params = svm.libsvm.fit(iris.data, iris.target)
pred = svm.libsvm.predict(iris.data, *learned_params)

Here, I used the fact that the parameters returned by libsvm.fit can just passed to libsvm.predict. However, any other given parameters should be manually passed to both method.

new get_blas_funcs in scipy.linalg

23 April 2011

Today got merged some changes I made to function scipy.linalg.get_blas_funcs(). The main enhacement is that get_blas_funcs() now also accepts a single string as input parameter and a dtype, so that fetching the BLAS function for a specific type becomes more natural.

For example, fetching the gemm routine for a single-precision complex number now looks like this:

gemm = scipy.linalg.get_blas_funcs('gemm', dtype=np.complex64)

compared to the clumsy old syntax:

X = np.empty(0, dtype=np.complex64)
gemm, = scipy.linalg.get_blas_funcs(('gemm',), (X,))

Locally linear embedding and sparse eigensolvers

21 April 2011

I’ve been working for some time on implementing a locally linear embedding algorithm for the upcoming manifold module in scikit-learn.

While several implementations of this algorithm exist in Python, as far as I know none of them is able to use a sparse eigensolver in the last step of the algorithm, falling back to dense routines causing a huge overhead in this step.

To overcome this, my first implementation used scipy.sparse.linalg.eigsh, which is a sparse eigensolver shipped by scipy and based on ARPACK. However, this approach converged extremely slowly, with timings that exceeded largely those of dense solvers.

Recently I found a way that seems to work reasonably well, with timings that win by a factor of 5 on the swiss roll existing routines. This code is able to solve the problem making use of a preconditioner computed by PyAMG.

import numpy as np
from scipy.sparse import linalg, eye
from pyamg import smoothed_aggregation_solver
from scikits.learn import neighbors

def locally_linear_embedding(X, n_neighbors, out_dim, tol=1e-6, max_iter=200):
    W = neighbors.kneighbors_graph(
        X, n_neighbors=n_neighbors, mode='barycenter')

    # M = (I-W)' (I-W)
    A = eye(*W.shape, format=W.format) - W
    A = (A.T).dot(A).tocsr()

    # initial approximation to the eigenvectors
    X = np.random.rand(W.shape[0], out_dim)
    ml = smoothed_aggregation_solver(A, symmetry='symmetric')
    prec = ml.aspreconditioner()

    # compute eigenvalues and eigenvectors with LOBPCG
    eigen_values, eigen_vectors = linalg.lobpcg(
        A, X, M=prec, largest=False, tol=tol, maxiter=max_iter)

    index = np.argsort(eigen_values)
    return eigen_vectors[:, index], np.sum(eigen_values)

Full code for this algorithm applied to the swiss roll can be found here here, and I hope it will soon be part of scikit-learn.

scikits.learn is now part of pythonxy

20 April 2011

The guys behind pythonxy have been kind enough to add the latest scikit-learn as an additional plugin for their distribution. Having scikit-learn being in both pythonxy and EPD will hopefully make it easier to use for Windows users.

pythonxy-logo

For now I will continue to make windows precompiled binaries, but pythonxy users finally have a package that is guaranteed to work with their installation.

Least squares with equality constrain

14 April 2011

The following algorithm computes the Least squares solution || Ax – b|| subject to the equality constrain Bx = d. It’s a classic algorithm that can be implemented only using a QR decomposition and a least squares solver.

This implementation uses numpy and scipy. It makes use of the new linalg.solve_triangular function in scipy 0.9, although degrades to linalg.solve on older versions.

import numpy as np

def lse(A, b, B, d, cond=None):
    """
    Equality-contrained least squares.

    The following algorithm minimizes ||Ax - b|| subject to the
    constrain Bx = d.

    Parameters
    ----------
    A : array-like, shape=[m, n]

    b : array-like, shape=[m]

    B : array-like, shape=[p, n]

    d : array-like, shape=[p]

    cond : float, optional
        Cutoff for 'small' singular values; used to determine effective
        rank of A. Singular values smaller than
        ``rcond * largest_singular_value`` are considered zero.

    Reference
    ---------
    Matrix Computations, Golub & van Loan, algorithm 12.1.2

    Examples
    --------
    >>> A, b = [[0, 2, 3], [1, 3, 4.5]], [1, 1]
    >>> B, d = [[1, 1, 0]], [1]
    >>> lse(A, b, B, d)
    array([-0.5       ,  1.5       , -0.66666667])    
    """

    from scipy import linalg
    if not hasattr(linalg, 'solve_triangular'):
        # compatibility for old scipy
        def solve_triangular(X, y, **kwargs):
            return linalg.solve(X, y)
    else:
        solve_triangular = linalg.solve_triangular
    A, b, B, d = map(np.asanyarray, (A, b, B, d))
    p = B.shape[0]
    Q, R = linalg.qr(B.T)
    y = solve_triangular(R[:p, :p], d, trans='T', lower=False)
    A = np.dot(A, Q)
    z = linalg.lstsq(A[:, p:], b - np.dot(A[:, :p], y),
                         cond=cond)[0].ravel()
    return np.dot(Q[:, :p], y) + np.dot(Q[:, p:], z)

Update: now scipy has a function qr_multiply which would considerably speed up this code

A profiler for Python extensions

6 April 2011

Profiling Python extensions has not been a pleasant experience for me, so I made my own package to do the job. Existing alternatives were either hard to use, forcing you to recompile with custom flags like gprofile or desperately slow like valgrind/callgrind. The package I’ll talk about is called YEP and is designed to be:

  1. Unobtrusive: no recompiling, no custom linking. Just lauch & profile.
  2. Fast: waiting sucks.
  3. Easy to use.

Basic usage

YEP is distributed as a python module and can be downloaded from the pypi. After installation, it is executed by giving the -m yep flags to the interpreter. Without any arguments, it will just print a help message:

    $ python -m yep
Usage: python -m yep [options] scriptfile [arg] ...
 ...

Say you want to profile a script called my_script.py, then the way to quickly get a profiler report is to execute:

    $ python -m yep -v my_script.py

For example, running YEP on this example that makes use of libsvm, a C++ library for Support Vector Machines, outputs

From Screenshots

The last column prints the name of the functions, so just looking at those that start with svm:: gives you an overview of how our libsvm is spending its time.

Other usages

Calling YEP without the -v will create a my_script.py.prof file that can be analyzed with pprof (google-pprof on some systems). pprof has a huge range of options, letting you to filter on some funtions, output to ghostview or print a line-by-line profiling, to mention a few. For example, you can generate a call graph with the command:

 $ pprof --gv /usr/bin/python my_script.py.prof

More control

If you would like to manually start/stop the profiler rather than profile the whole script, you can use the functions yep.start() and yep.stop() inside a python script. This will write the profile to a given filename, so make sure the directory is writable:

import yep
yep.start('out.prof') # will create an out.prof file
# do something ...
yep.stop()

Future work

The -v option showed at the beginning is just a dirty hack that launches pprof and pipes the output into less. A more robust approach would be to read the resulting profile from python and manipulate it from there, either to std or to pstats format. This shouldn’t be too difficult as the pprof format is described here

Acknowledgment

The original idea to use google-perftools to profile Python extensions was given on this Stack overflow question

scikit-learn coding sprint in Paris

2 April 2011

Yesterday was the scikit-learn coding sprint in Paris. It was great to meet with old developers (Vincent Michel) and new ones: some of whom I was already familiar with from the mailing list while others came just to say hi and get familiar with the code. It was really great to have people from such different backgrounds discuss on concrete problems and getting things done.

A lot of work was done, most of it unmerged yet, but if I had to highlight the three most important for me, that would be the the merge of the hcluster2 branch, the awesome work of thouis in replacing the C++ interface to the ball_tree with a Cython one and suppport for Python3 (not bug-free but imports OK).

As for me, I’ve been working mostly in providing efficient cross-validatation for Support Vector Machines. The status of this is: low-level API seems to work fine (scikits.learn.svm.libsvm.cross_validation) but high-level API still needs some work.

This is the picture featuring (most) of the people that were at the sprint around 16h in Logilab’s headquarters.

IMG_0012

py3k in scikit-learn

28 March 2011

One thing I’d really like to see done in this Friday’s scikit-learn sprint is to have full support for Python 3.

There’s a branch were the hard word has been done (porting C extensions, automatic 2to3 conversion, etc.), although joblib still has some bugs and no one has attempted to do anything serious with this branch yet …

Computing the vector norm

15 February 2011

Update: a fast and stable norm was added to scipy.linalg in August 2011 and will be available in scipy 0.10

Last week I discussed with Gael how we should compute the euclidean norm of a vector a using SciPy. Two approaches suggest themselves, either calling scipy.linalg.norm(a) or computing sqrt(a.T a), but as I learned later, both suck.

Note: I use single-precision arithmetic for simplicity, but similar results hold for double-precision.

Overflow and underflow

Both approaches behave terribly in presence of big or small numbers. Take for example an array with a single entry:

In [0]: a = np.array([1e20], dtype=np.float32)
In [1]: a
Out[1]: array([  1.00000002e+20], dtype=float32)
In [2]: scipy.linalg.norm(a)
Out[2]: inf
In [3]: np.sqrt(np.dot(a.T, a))
Out[3]: inf

That is, both methods return Infinity. However, the correct answer is 10^20, which would comfortably fit in a single-precision instruction. Similar examples can be found where numbers underflow.

Stability

Again, scipy.linalg.norm has a terrible behavior in what concerns numerical stability. In presence of different magnitudes severe cancellation can occur. Take for example and array with one 10.000 in the first value and 10.000 ones behind:

a = np.array([1e4] + [1]*10000, dtype=np.float32)

In this case, scipy.linalg.norm will discard all the ones, producing

In [3]: linalg.norm(a) - 1e4
Out[3]: 0.0

when the correct answer is 0.5. Here \sqrt{a^T a} has a much nicer behavior since results of a dot-product in single precision are accumulated using double-precision (but if double-precision is used, results won’t be accumulated using quadruple-precision):

In [4]: np.sqrt(np.dot(a.T, a)) - 1e4
Out[4]: 0.5

BLAS BLAS BLAS …

The BLAS function nrm2 does automatic scaling of parameters rendering it more stable and tolerant to overflow. Luckily, scipy provides a mechanism to call some BLAS functions:

In [5]: nrm2, = scipy.linalg.get_blas_funcs(('nrm2',), (a,))

Using this function, no overflow occurs (hurray!)

In [95]: a = np.array([1e20], dtype=np.float32)
In [96]: nrm2(a)
Out[96]: 1.0000000200408773e+20

and stability is greatly improved

In [99]: nrm2(a) - 1e4
Out[99]: 0.49998750062513864

Timing

Computing the 2-norm of an array is a very cheap operation, thus computations are usually dominated by external factors, such as latency of memory access or overhead in the Python/C layer. Experimental benchmarks on an array of size 10^7 show that nrm2 is marginally slower than \sqrt{a^T a}, because scaling has a cost, but is is also more stable and less prone to overflow and underflow. It also shows that scipy.linalg.norm is the slowest (and numerically worst!) of all.

\sqrt{a^T a} BLAS nrm2(a) scipy.linalg.norm(a)
0.02 0.02 0.16