Pierre Haessig | 1 Feb 2012 19:31
Favicon
Gravatar

autocorrelation computation performance : use of np.correlate

Hi,

[I'm not sure whether this discussion belongs to numpy-discussion or 
scipy-dev]

In day to day time series analysis I regularly need to look at the data 
autocorrelation ("acorr" or "acf" depending on the software package).
The straighforward available function I have is matplotlib.pyplot.acorr. 
However, for a moderately long time series (say of length 10**5) it 
taking a huge time just to just dislays the autocorrelation values 
within a small range of time lags.
The main reason being it is relying on np.correlate(x,x, mode=2) while 
only a few lags are needed.
(I guess mode=2 is an (old fashioned?) way to set mode='full')

I know that np.correlate performance issue has been discussed already, 
and there is a *somehow* related ticket 
(http://projects.scipy.org/numpy/ticket/1260). I noticed in the ticket's 
change number 2 the following comment by Josef : "Maybe a truncated 
convolution/correlation would be good". I'll come back to this soon.

I made an example script "acf_timing.py" to start my point with some 
timing data :

In Ipython:
 >>> run acf_timing.py # it imports statsmodel's acf + define 2 other 
acf implementations + an example data 10**5 samples long

%time l,c = mpl_acf(a, 10)
CPU times: user 8.69 s, sys: 0.00 s, total: 8.69 s
(Continue reading)

Benjamin Root | 2 Feb 2012 00:48
Picon
Favicon

autocorrelation computation performance : use of np.correlate



On Wednesday, February 1, 2012, Pierre Haessig <pierre.haessig <at> crans.org> wrote:
> Hi,
>
> [I'm not sure whether this discussion belongs to numpy-discussion or scipy-dev]
>
> In day to day time series analysis I regularly need to look at the data autocorrelation ("acorr" or "acf" depending on the software package).
> The straighforward available function I have is matplotlib.pyplot.acorr. However, for a moderately long time series (say of length 10**5) it taking a huge time just to just dislays the autocorrelation values within a small range of time lags.
> The main reason being it is relying on np.correlate(x,x, mode=2) while only a few lags are needed.
> (I guess mode=2 is an (old fashioned?) way to set mode='full')
>
> I know that np.correlate performance issue has been discussed already, and there is a *somehow* related ticket (http://projects.scipy.org/numpy/ticket/1260). I noticed in the ticket's change number 2 the following comment by Josef : "Maybe a truncated convolution/correlation would be good". I'll come back to this soon.
>
> I made an example script "acf_timing.py" to start my point with some timing data :
>
> In Ipython:
>>>> run acf_timing.py # it imports statsmodel's acf + define 2 other acf implementations + an example data 10**5 samples long
>
> %time l,c = mpl_acf(a, 10)
> CPU times: user 8.69 s, sys: 0.00 s, total: 8.69 s
> Wall time: 11.18 s # pretty long...
>
>  %time c = sm_acf(a, 10)
> CPU times: user 8.76 s, sys: 0.01 s, total: 8.78 s
> Wall time: 10.79 s # long as well. statsmodel has a similar underlying implementation
> # http://statsmodels.sourceforge.net/generated/scikits.statsmodels.tsa.stattools.acf.html#scikits.statsmodels.tsa.stattools.acf
>
> #Now, better option : use the fft convolution
> %time c=sm_acf(a, 10,fft=True)
> CPU times: user 0.03 s, sys: 0.01 s, total: 0.04 s
> Wall time: 0.07 s
> # Fast, but I'm not sure about the memory implication of using fft though.
>
> #The naive option : just compute the acf lags that are needed
> %time l,c = naive_acf(a, 10)
> CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
> Wall time: 0.01 s
> # Iterative computation. Pretty silly but very fast
> # (Now of course, this naive implementation won't scale nicely for a lot of lags)
>
> Now comes (at last) the question : what should be done about this performance issue ?
>  - should there be a truncated np.convolve/np.correlate function, as Josef suggested ?
>  - or should people in need of autocorrelation find some workarounds because this usecase is not big enough to call for a change in np.convolve ?
>
> I really feel this question is about *where* a change should be implemented  (numpy, scipy.signal, maplotlib ?) so that it makes sense while not breaking 10^10 lines of numpy related code...
>
> Best,
> Pierre
>
>

Speaking for matplotlib, the acorr() (and xcorr()) functions in mpl are merely a convenience.  The proper place for any change would not be mpl (although, we would certainly take advantage of any improved acorr() and xcorr() that are made available in numpy.

Ben Root
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion <at> scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
josef.pktd | 2 Feb 2012 01:58
Picon

Re: autocorrelation computation performance : use of np.correlate

On Wed, Feb 1, 2012 at 6:48 PM, Benjamin Root <ben.root <at> ou.edu> wrote:
>
>
> On Wednesday, February 1, 2012, Pierre Haessig <pierre.haessig <at> crans.org>
> wrote:
>> Hi,
>>
>> [I'm not sure whether this discussion belongs to numpy-discussion or
>> scipy-dev]
>>
>> In day to day time series analysis I regularly need to look at the data
>> autocorrelation ("acorr" or "acf" depending on the software package).
>> The straighforward available function I have is matplotlib.pyplot.acorr.
>> However, for a moderately long time series (say of length 10**5) it taking a
>> huge time just to just dislays the autocorrelation values within a small
>> range of time lags.
>> The main reason being it is relying on np.correlate(x,x, mode=2) while
>> only a few lags are needed.
>> (I guess mode=2 is an (old fashioned?) way to set mode='full')
>>
>> I know that np.correlate performance issue has been discussed already, and
>> there is a *somehow* related ticket
>> (http://projects.scipy.org/numpy/ticket/1260). I noticed in the ticket's
>> change number 2 the following comment by Josef : "Maybe a truncated
>> convolution/correlation would be good". I'll come back to this soon.
>>
>> I made an example script "acf_timing.py" to start my point with some
>> timing data :
>>
>> In Ipython:
>>>>> run acf_timing.py # it imports statsmodel's acf + define 2 other acf
>>>>> implementations + an example data 10**5 samples long
>>
>> %time l,c = mpl_acf(a, 10)
>> CPU times: user 8.69 s, sys: 0.00 s, total: 8.69 s
>> Wall time: 11.18 s # pretty long...
>>
>>  %time c = sm_acf(a, 10)
>> CPU times: user 8.76 s, sys: 0.01 s, total: 8.78 s
>> Wall time: 10.79 s # long as well. statsmodel has a similar underlying
>> implementation
>> #
>> http://statsmodels.sourceforge.net/generated/scikits.statsmodels.tsa.stattools.acf.html#scikits.statsmodels.tsa.stattools.acf
>>
>> #Now, better option : use the fft convolution
>> %time c=sm_acf(a, 10,fft=True)
>> CPU times: user 0.03 s, sys: 0.01 s, total: 0.04 s
>> Wall time: 0.07 s
>> # Fast, but I'm not sure about the memory implication of using fft though.
>>
>> #The naive option : just compute the acf lags that are needed
>> %time l,c = naive_acf(a, 10)
>> CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
>> Wall time: 0.01 s
>> # Iterative computation. Pretty silly but very fast
>> # (Now of course, this naive implementation won't scale nicely for a lot
>> of lags)

I don't think it's silly to have a short python loop, statsmodels
actually uses the loop in the models, for example in yule_walker (and
GLSAR), because in most statistical application I wouldn't expect a
large number of lags. The time series models don't use the acov
directly, but I think most of the time we just loop over the lags.

>>
>> Now comes (at last) the question : what should be done about this
>> performance issue ?
>>  - should there be a truncated np.convolve/np.correlate function, as Josef
>> suggested ?
>>  - or should people in need of autocorrelation find some workarounds
>> because this usecase is not big enough to call for a change in np.convolve ?
>>
>> I really feel this question is about *where* a change should be
>> implemented  (numpy, scipy.signal, maplotlib ?) so that it makes sense while
>> not breaking 10^10 lines of numpy related code...
>>
>> Best,
>> Pierre
>>
>>
>
> Speaking for matplotlib, the acorr() (and xcorr()) functions in mpl are
> merely a convenience.  The proper place for any change would not be mpl
> (although, we would certainly take advantage of any improved acorr() and
> xcorr() that are made available in numpy.

I also think that numpy or scipy would be the natural candidates for a
correlate that works fast for an intermediate number of desired lags
(but still short compared to length of data).

Josef

>
> Ben Root
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion <at> scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
Ralf Gommers | 4 Feb 2012 23:19
Gravatar

Re: autocorrelation computation performance : use of np.correlate



On Thu, Feb 2, 2012 at 1:58 AM, <josef.pktd <at> gmail.com> wrote:
On Wed, Feb 1, 2012 at 6:48 PM, Benjamin Root <ben.root <at> ou.edu> wrote:
>
>
> On Wednesday, February 1, 2012, Pierre Haessig <pierre.haessig <at> crans.org>
> wrote:
>> Hi,
>>
>> [I'm not sure whether this discussion belongs to numpy-discussion or
>> scipy-dev]
>>
>> In day to day time series analysis I regularly need to look at the data
>> autocorrelation ("acorr" or "acf" depending on the software package).
>> The straighforward available function I have is matplotlib.pyplot.acorr.
>> However, for a moderately long time series (say of length 10**5) it taking a
>> huge time just to just dislays the autocorrelation values within a small
>> range of time lags.
>> The main reason being it is relying on np.correlate(x,x, mode=2) while
>> only a few lags are needed.
>> (I guess mode=2 is an (old fashioned?) way to set mode='full')
>>
>> I know that np.correlate performance issue has been discussed already, and
>> there is a *somehow* related ticket
>> (http://projects.scipy.org/numpy/ticket/1260). I noticed in the ticket's
>> change number 2 the following comment by Josef : "Maybe a truncated
>> convolution/correlation would be good". I'll come back to this soon.
>>
>> I made an example script "acf_timing.py" to start my point with some
>> timing data :
>>
>> In Ipython:
>>>>> run acf_timing.py # it imports statsmodel's acf + define 2 other acf
>>>>> implementations + an example data 10**5 samples long
>>
>> %time l,c = mpl_acf(a, 10)
>> CPU times: user 8.69 s, sys: 0.00 s, total: 8.69 s
>> Wall time: 11.18 s # pretty long...
>>
>>  %time c = sm_acf(a, 10)
>> CPU times: user 8.76 s, sys: 0.01 s, total: 8.78 s
>> Wall time: 10.79 s # long as well. statsmodel has a similar underlying
>> implementation
>> #
>> http://statsmodels.sourceforge.net/generated/scikits.statsmodels.tsa.stattools.acf.html#scikits.statsmodels.tsa.stattools.acf
>>
>> #Now, better option : use the fft convolution
>> %time c=sm_acf(a, 10,fft=True)
>> CPU times: user 0.03 s, sys: 0.01 s, total: 0.04 s
>> Wall time: 0.07 s
>> # Fast, but I'm not sure about the memory implication of using fft though.
>>
>> #The naive option : just compute the acf lags that are needed
>> %time l,c = naive_acf(a, 10)
>> CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
>> Wall time: 0.01 s
>> # Iterative computation. Pretty silly but very fast
>> # (Now of course, this naive implementation won't scale nicely for a lot
>> of lags)

I don't think it's silly to have a short python loop, statsmodels
actually uses the loop in the models, for example in yule_walker (and
GLSAR), because in most statistical application I wouldn't expect a
large number of lags. The time series models don't use the acov
directly, but I think most of the time we just loop over the lags.

>>
>> Now comes (at last) the question : what should be done about this
>> performance issue ?
>>  - should there be a truncated np.convolve/np.correlate function, as Josef
>> suggested ?
>>  - or should people in need of autocorrelation find some workarounds
>> because this usecase is not big enough to call for a change in np.convolve ?
>>
>> I really feel this question is about *where* a change should be
>> implemented  (numpy, scipy.signal, maplotlib ?) so that it makes sense while
>> not breaking 10^10 lines of numpy related code...
>>
>> Best,
>> Pierre
>>
>>
>
> Speaking for matplotlib, the acorr() (and xcorr()) functions in mpl are
> merely a convenience.  The proper place for any change would not be mpl
> (although, we would certainly take advantage of any improved acorr() and
> xcorr() that are made available in numpy.

I also think that numpy or scipy would be the natural candidates for a
correlate that works fast for an intermediate number of desired lags
(but still short compared to length of data).

scipy.signal is the right place I think. numpy shouldn't grow too many functions like this.

Ralf

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion <at> scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Pierre Haessig | 15 Feb 2012 11:17
Favicon
Gravatar

Re: autocorrelation computation performance : use of np.correlate

Le 04/02/2012 23:19, Ralf Gommers a écrit :

scipy.signal is the right place I think. numpy shouldn't grow too many functions like this.
[going back in time on the autocorrelation topic]

I see scipy.signal being the good place. However, I have the (possibly wrong) feeling that Matplotlib is not so much depending on scipy.
Would Matplotlib's acorr/xcorr functions benefit from a faster function available in scipy as opposed to numpy ?

Pierre
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion <at> scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Benjamin Root | 15 Feb 2012 12:15
Picon
Favicon

Re: autocorrelation computation performance : use of np.correlate



On Wednesday, February 15, 2012, Pierre Haessig <pierre.haessig <at> crans.org> wrote:
> Le 04/02/2012 23:19, Ralf Gommers a écrit :
>
> scipy.signal is the right place I think. numpy shouldn't grow too many functions like this.
>
> [going back in time on the autocorrelation topic]
>
> I see scipy.signal being the good place. However, I have the (possibly wrong) feeling that Matplotlib is not so much depending on scipy.
> Would Matplotlib's acorr/xcorr functions benefit from a faster function available in scipy as opposed to numpy ?
>
> Pierre
>

Mpl does not depend on scipy at all and it is intended to stay that way.  Mind you, mpl does not "require" acorr/xcorr because we do it using existing numpy tools. And any user needing a faster version can simply call it themselves and simply plot the results themselves.  The current functions are merely convenience functions

So, yes, it would be nice to have it in numpy, but if it fits better in scipy, then put it there.

Ben Root
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion <at> scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Gmane