Hanno Klemm | 22 Feb 16:15
Picon

Newbie Questions about speeding up some Numpy calculations


Hi All,

I am pretty new to Cython, so please bare with me. I am at the moment trying to speed up some code that I use a lot
in calculations, so  thought this might be a good idea to learn how to use Cython when I have already some
working Numpy code. 

After searching the tutorials I came up with the attached solution, however compared with the original
Numpy code this is getting slightly slower rather than faster, so obviously I am doing something very
wrong. Could somebody please give me some pointers where I can find examples that are closer to what I am
trying to speed up than what I can find in the tutorials?

Attached are the original numpy code, as well as the Cython code I have tried to develop. If I run this on my
machine, I get:

Python 2.7.2 |EPD 7.2-2 (32-bit)| (default, Sep  7 2011, 09:16:50) 
Type "copyright", "credits" or "license" for more information.

IPython 0.12 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import be

In [2]: timeit be.referencecase(samples=250000)
1 loops, best of 3: 2.12 s per loop

In [3]: import be1
(Continue reading)

Jake Vanderplas | 22 Feb 19:12
Picon

Re: Newbie Questions about speeding up some Numpy calculations

Hi,
I spent a few minutes looking over your code, and I'm having trouble seeing what you're trying to compute!
A few observations:
 - because your loop only has eight repetitions, and you're using mostly array-based operations, I don't think cythonization will buy you a whole lot of speed for this code.  You could probably gain just as much by being more careful with how you do these calculations in python + numpy only.
 - in the loop in getResult, the array self.cube is not defined anywhere as a numpy array, so slower python indexing is being used
 - in general, using slices (i.e. field[:, 0]) does not take advantage of fast cython indexing.  It would probably be faster to do this in a loop
 - as far as I know, using np.ndarray[np.float_t, ndim=1] etc. only results in a speed improvement if this is used to decorate an object passed as part of a function call: that is, your getResult function should probably define these arrays, and then call a cdef'ed function which does the heavy lifting.

I hope that helps.
  Jake

On Wed, Feb 22, 2012 at 7:15 AM, Hanno Klemm <klemm <at> phys.ethz.ch> wrote:

Hi All,

I am pretty new to Cython, so please bare with me. I am at the moment trying to speed up some code that I use a lot in calculations, so  thought this might be a good idea to learn how to use Cython when I have already some working Numpy code.

After searching the tutorials I came up with the attached solution, however compared with the original Numpy code this is getting slightly slower rather than faster, so obviously I am doing something very wrong. Could somebody please give me some pointers where I can find examples that are closer to what I am trying to speed up than what I can find in the tutorials?

Attached are the original numpy code, as well as the Cython code I have tried to develop. If I run this on my machine, I get:

Python 2.7.2 |EPD 7.2-2 (32-bit)| (default, Sep  7 2011, 09:16:50)
Type "copyright", "credits" or "license" for more information.

IPython 0.12 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import be

In [2]: timeit be.referencecase(samples=250000)
1 loops, best of 3: 2.12 s per loop

In [3]: import be1

In [4]: timeit be1.referencecase(samples=250000)
1 loops, best of 3: 1.82 s per loop





Many thanks,
Hanno






Aronne Merrelli | 23 Feb 07:25
Picon

Re: Newbie Questions about speeding up some Numpy calculations



On Wed, Feb 22, 2012 at 12:12 PM, Jake Vanderplas <jakevdp <at> gmail.com> wrote:
Hi,
I spent a few minutes looking over your code, and I'm having trouble seeing what you're trying to compute!
A few observations:
 - because your loop only has eight repetitions, and you're using mostly array-based operations, I don't think cythonization will buy you a whole lot of speed for this code.  You could probably gain just as much by being more careful with how you do these calculations in python + numpy only.
 - in the loop in getResult, the array self.cube is not defined anywhere as a numpy array, so slower python indexing is being used
 - in general, using slices (i.e. field[:, 0]) does not take advantage of fast cython indexing.  It would probably be faster to do this in a loop
 - as far as I know, using np.ndarray[np.float_t, ndim=1] etc. only results in a speed improvement if this is used to decorate an object passed as part of a function call: that is, your getResult function should probably define these arrays, and then call a cdef'ed function which does the heavy lifting.

This is all true - I'll add one extra observation: I just profiled the python version and got this result for referencecase():

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       48    2.010    0.042    2.010    0.042 be1.py:3(fFunction)
        1    0.850    0.850    2.988    2.988 be1.py:75(getResult)
        1    0.064    0.064    0.127    0.127 be1.py:52(validatePoints)

So I guess if you really need to speed this up I might suggest looking at the interior loop of getResult(), where fFunction is called.

I rewrote the fFunction as follows:
from libc.math cimport log, atan
<at> cython.boundscheck(False)
<at> cython.wraparound(False)
<at> cython.cdivision(True)
def fFunctionc(np.ndarray[np.float_t, ndim=1] x not None,
...

        for n in range(x.shape[0]):
            res[n] = z[n]*atan(x[n]*y[n]/(z[n]*R[n])) - x[n]*log(R[n]+y[n]) - y[n]*log(R[n]+x[n])
...

(everything in the ... is the same as the previous)




Aronne
Aronne Merrelli | 23 Feb 07:32
Picon

Re: Newbie Questions about speeding up some Numpy calculations



On Thu, Feb 23, 2012 at 12:25 AM, Aronne Merrelli <aronne.merrelli <at> gmail.com> wrote:


I rewrote the fFunction as follows:
from libc.math cimport log, atan
<at> cython.boundscheck(False)
<at> cython.wraparound(False)
<at> cython.cdivision(True)
def fFunctionc(np.ndarray[np.float_t, ndim=1] x not None,
...

        for n in range(x.shape[0]):
            res[n] = z[n]*atan(x[n]*y[n]/(z[n]*R[n])) - x[n]*log(R[n]+y[n]) - y[n]*log(R[n]+x[n])
...

(everything in the ... is the same as the previous)



(sorry for the spam - misclicked the stupid laptop touchpad and sent the previous before I finished typing)


The function modified as above gives this timing change (the name of the modified version is getResultc, that calls fFunctionc)

In [30]: %timeit result = tb.getResult(samples)
1 loops, best of 3: 2.93 s per loop

In [31]: %timeit result = tb.getResultc(samples)
1 loops, best of 3: 2.39 s per loop


That's not really that great - my guess is to push it further you would need to combine that block of calls to fFunction into a single large function that was better at reusing intermediate calculations. It appears that many pieces are calculated multiple times (for example, log(rP + t)) - so it could be more efficient to avoid recomputing things. It would make the code much less readable, though, so it might not be worth the effort. I'd guess that would only be worth a factor of 2-3 speed increase at most.


Aronne
Hanno Klemm | 23 Feb 10:27
Picon

Re: Newbie Questions about speeding up some Numpy calculations


Hi,

many thanks for your comments and suggestions. I will implement them and see where I get. Thanks in particular about the clarification where the decorations for arrays make sense and where not. 

Best regards,
Hanno




On Feb 23, 2012, at 7:32, Aronne Merrelli wrote:



On Thu, Feb 23, 2012 at 12:25 AM, Aronne Merrelli <aronne.merrelli <at> gmail.com> wrote:


I rewrote the fFunction as follows:
from libc.math cimport log, atan
<at> cython.boundscheck(False)
<at> cython.wraparound(False)
<at> cython.cdivision(True)
def fFunctionc(np.ndarray[np.float_t, ndim=1] x not None,
...

        for n in range(x.shape[0]):
            res[n] = z[n]*atan(x[n]*y[n]/(z[n]*R[n])) - x[n]*log(R[n]+y[n]) - y[n]*log(R[n]+x[n])
...

(everything in the ... is the same as the previous)



(sorry for the spam - misclicked the stupid laptop touchpad and sent the previous before I finished typing)


The function modified as above gives this timing change (the name of the modified version is getResultc, that calls fFunctionc)

In [30]: %timeit result = tb.getResult(samples)
1 loops, best of 3: 2.93 s per loop

In [31]: %timeit result = tb.getResultc(samples)
1 loops, best of 3: 2.39 s per loop


That's not really that great - my guess is to push it further you would need to combine that block of calls to fFunction into a single large function that was better at reusing intermediate calculations. It appears that many pieces are calculated multiple times (for example, log(rP + t)) - so it could be more efficient to avoid recomputing things. It would make the code much less readable, though, so it might not be worth the effort. I'd guess that would only be worth a factor of 2-3 speed increase at most.


Aronne

Aronne Merrelli | 24 Feb 15:41
Picon

Re: Newbie Questions about speeding up some Numpy calculations



On Thu, Feb 23, 2012 at 3:27 AM, Hanno Klemm <klemm <at> phys.ethz.ch> wrote:

Hi,

many thanks for your comments and suggestions. I will implement them and see where I get. Thanks in particular about the clarification where the decorations for arrays make sense and where not. 

Best regards,
Hanno



Hi,

One other suggestion that occurred to me this morning: since your bottleneck is mainly in calculations on ndarrays, the numexpr module would be useful. It also requires a trivial amount of effort to use. I replaced fFunction like so:

def fFunction(s, t, z, R):
    return z * np.arctan((s * t) / (z * R)) - s * np.log(R + t) - t * np.log(R + s)
def fFunctionE(s, t, z, R):
    return numexpr.evaluate( 'z*arctan(s*t/(z*R)) - s*log(R+t) - t*log(R+s)' )

I like this variant since the equation is just as readable as the original. I put a "method" flag on getResult(), so that method=0 uses the old fFunction, and method=1 uses numexpr. This gets slightly more than a factor of 2 speedup:

In [72]: %timeit result0 = tb.getResult(samples, method = 0)
1 loops, best of 3: 2.73 s per loop

In [73]: %timeit result1 = tb.getResult(samples, method = 1)
1 loops, best of 3: 1.31 s per loop

The two results are equal within floating point precision, but not exactly equal, since numexpr's calculation will have different internal rounding:

In [74]: np.allclose(result0, result1)
Out[74]: True

In [75]: np.max(np.abs(result0-result1))
Out[75]: 4.8250192541215921e-20


numexpr is also nice for these kinds of calculations since the original calculation over numpy.ndarray objects will create extra temporary arrays. numexpr does not (which is a primary reason why it is faster) so the peak memory usage should drop substantially.


HTH,
Aronne
Hanno Klemm | 24 Feb 15:43
Picon

Re: Newbie Questions about speeding up some Numpy calculations

Wow! Thanks a lot! I will give this a go.

Best regards,
Hanno

Hanno Klemm



On Feb 24, 2012, at 15:41, Aronne Merrelli wrote:



On Thu, Feb 23, 2012 at 3:27 AM, Hanno Klemm <klemm <at> phys.ethz.ch> wrote:

Hi,

many thanks for your comments and suggestions. I will implement them and see where I get. Thanks in particular about the clarification where the decorations for arrays make sense and where not. 

Best regards,
Hanno



Hi,

One other suggestion that occurred to me this morning: since your bottleneck is mainly in calculations on ndarrays, the numexpr module would be useful. It also requires a trivial amount of effort to use. I replaced fFunction like so:

def fFunction(s, t, z, R):
    return z * np.arctan((s * t) / (z * R)) - s * np.log(R + t) - t * np.log(R + s)
def fFunctionE(s, t, z, R):
    return numexpr.evaluate( 'z*arctan(s*t/(z*R)) - s*log(R+t) - t*log(R+s)' )

I like this variant since the equation is just as readable as the original. I put a "method" flag on getResult(), so that method=0 uses the old fFunction, and method=1 uses numexpr. This gets slightly more than a factor of 2 speedup:

In [72]: %timeit result0 = tb.getResult(samples, method = 0)
1 loops, best of 3: 2.73 s per loop

In [73]: %timeit result1 = tb.getResult(samples, method = 1)
1 loops, best of 3: 1.31 s per loop

The two results are equal within floating point precision, but not exactly equal, since numexpr's calculation will have different internal rounding:

In [74]: np.allclose(result0, result1)
Out[74]: True

In [75]: np.max(np.abs(result0-result1))
Out[75]: 4.8250192541215921e-20


numexpr is also nice for these kinds of calculations since the original calculation over numpy.ndarray objects will create extra temporary arrays. numexpr does not (which is a primary reason why it is faster) so the peak memory usage should drop substantially.


HTH,
Aronne

Francesc Alted | 24 Feb 17:43
Picon

Re: Newbie Questions about speeding up some Numpy calculations

On Feb 24, 2012, at 8:41 AM, Aronne Merrelli wrote:
> Hi,
> 
> One other suggestion that occurred to me this morning: since your bottleneck is mainly in calculations on
ndarrays, the numexpr module would be useful. It also requires a trivial amount of effort to use. I
replaced fFunction like so:
> 
> def fFunction(s, t, z, R):
>     return z * np.arctan((s * t) / (z * R)) - s * np.log(R + t) - t * np.log(R + s)
> def fFunctionE(s, t, z, R):
>     return numexpr.evaluate( 'z*arctan(s*t/(z*R)) - s*log(R+t) - t*log(R+s)' )
> 
> I like this variant since the equation is just as readable as the original. I put a "method" flag on
getResult(), so that method=0 uses the old fFunction, and method=1 uses numexpr. This gets slightly more
than a factor of 2 speedup:
> 
> In [72]: %timeit result0 = tb.getResult(samples, method = 0)
> 1 loops, best of 3: 2.73 s per loop
> 
> In [73]: %timeit result1 = tb.getResult(samples, method = 1)
> 1 loops, best of 3: 1.31 s per loop
> 
> The two results are equal within floating point precision, but not exactly equal, since numexpr's
calculation will have different internal rounding:
> 
> In [74]: np.allclose(result0, result1)
> Out[74]: True
> 
> In [75]: np.max(np.abs(result0-result1))
> Out[75]: 4.8250192541215921e-20
> 
> 
> numexpr is also nice for these kinds of calculations since the original calculation over numpy.ndarray
objects will create extra temporary arrays. numexpr does not (which is a primary reason why it is faster)
so the peak memory usage should drop substantially.

This is really OT, but in this particular case where the computation is CPU bounded, numexpr is faster just
because it can use several processors simultaneously.  But you are right that numexpr also needs much less
memory for the evaluation than NumPy, although this alone does not explain the speedup in this particular case.

Also, it could be interesting to link numexpr against Intel MKL for achieving still better performance
during the evaluation of transcendental functions (your case).

-- Francesc Alted


Gmane