Alan Jackson | 12 Aug 04:18
Favicon

Problem with F distribution, or with me?

I'm confused. I was working on the documentation for the F-distribution,
and I'm getting results from the example that I don't understand.

I set the numerator degrees of freedom to 1 ( 2 groups)
and the denominator degrees of freedom to 48 (25 members in each group)

If I look up in a table, F(p<.01) = 7.19. I checked on the website
http://davidmlane.com/hyperstat/F_table.html
with 1, 48 and F=7 and got P = 0.01099

But when I run f in numpy, I get no values larger than about 0.65

 s = np.random.f(1, 48, 1000000)
In [40]: max(s)
Out[40]: 0.649036568048

I would have expected to see about 1% of the values > 7.19. Am I missing
something stupid?

--

-- 
-----------------------------------------------------------------------
| Alan K. Jackson            | To see a World in a Grain of Sand      |
| alan <at> ajackson.org          | And a Heaven in a Wild Flower,         |
| www.ajackson.org           | Hold Infinity in the palm of your hand |
| Houston, Texas             | And Eternity in an hour. - Blake       |
-----------------------------------------------------------------------
josef.pktd | 12 Aug 21:39

Re: Problem with F distribution, or with me?


The problem is that the F distribution in distributions.c is missing
a multiplication by the ratio of the degrees of freedom, see the correct rk_noncentral_f


*dfden  / dfnum


http://projects.scipy.org/scipy/numpy/browser/trunk/numpy/random/mtrand/distributions.c

226     double rk_f(rk_state *state, double dfnum, double dfden)
227     {
228         return rk_chisquare(state, dfnum) / rk_chisquare(state, dfden);
229     }
230    
231     double rk_noncentral_f(rk_state *state, double dfnum, double dfden, double nonc)
232     {
233         return ((rk_noncentral_chisquare(state, dfnum, nonc)*dfden) /
234                 (rk_chisquare(state, dfden)*dfnum));


change line 228 to
return (rk_chisquare(state, dfnum)*dfden)  / (rk_chisquare(state, dfden) *dfnum);


correct random variables require normalization:

>>> np.sum(1/2.0*40.0*np.random.f(2, 40, 1000000)>  2.44037)
99891

>>> np.sum(1/1.0*48.0*np.random.f(1, 48, 1000000)>  7.19)
10118
>>> np.sum(1/1.0*48.0*np.random.f(1, 48, 1000000)>  7.19)
10174
>>> np.sum(1/1.0*48.0*np.random.f(1, 48, 1000000)>  7.19)
10043

Josef
_______________________________________________
Scipy-dev mailing list
Scipy-dev <at> scipy.org
http://projects.scipy.org/mailman/listinfo/scipy-dev
Robert Kern | 13 Aug 00:06

Re: Problem with F distribution, or with me?

On Tue, Aug 12, 2008 at 14:43,  <josef.pktd <at> gmail.com> wrote:
>
> The problem is that the F distribution in distributions.c is missing
> a multiplication by the ratio of the degrees of freedom, see the correct
> rk_noncentral_f
>
>
> *dfden  / dfnum

Well that's embarassing. Thank you. Fixed in SVN.

--

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco
josef.pktd | 13 Aug 07:19

Re: Problem with F distribution, or with me?

I wanted to compare the distributions in numpy.random with scipy.stats.distribution.
When I found the kolmogorov_test in test_distributions.py, I was wondering why this
test did not find the bug in the numpy random number generator.

It seems that this test is much too weak, sample size = 30 and parameters between 1 and 2.

After I made the test stricter, increased the power, I get the rejection/test failure for the F-distribution,
but additionally I get 2 to 4 additional failures, in fatiguelife, loggamma in all runs and in genhalflogistic, and genextreme
only sometimes. Test result of an example run are below. I did not see any obvious problem with my change in the test, the parameters that are used in the tests are not ruled out from what I have seen in the doc strings or a quick google search,
and I don't know these distributions at all or not well enough, to tell whether there is anything wrong with these distributions
or with the tests.

Josef

I'm using
>>> numpy.version.version
'1.1.0'
>>> scipy.version.version
'0.6.0'

Failures with changed test_distributions.py
===============================


>>> execfile(r'C:\Programs\Python24\Lib\site-packages\scipy\stats\tests\test_distributions.py')
  Found 73/73 tests for stats.tests.test_distributions
  Found 10/10 tests for stats.tests.test_morestats
  Found 107/107 tests for stats.tests.test_stats
...................FF......F.F...............F.............................Ties preclude use of exact statistic.
..Ties preclude use of exact statistic.
.................................................................................................................
======================================================================
FAIL: check_cdf (stats.tests.test_distributions.test_f)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<string>", line 9, in check_cdf
AssertionError: D = 0.493585929987; pval = 0.0; alpha = 0.01
args = (9.8771486774554127, 1.2819774801876884)

======================================================================
FAIL: check_cdf (stats.tests.test_distributions.test_fatiguelife)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<string>", line 9, in check_cdf
AssertionError: D = 0.101323526498; pval = 0.0; alpha = 0.01
args = (3.3139748541207283,)

======================================================================
FAIL: check_cdf (stats.tests.test_distributions.test_genextreme)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<string>", line 9, in check_cdf
AssertionError: D = 0.02902; pval = 0.0; alpha = 0.01
args = (10.616290590132825,)

======================================================================
FAIL: check_cdf (stats.tests.test_distributions.test_genhalflogistic)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<string>", line 9, in check_cdf
AssertionError: D = 0.02343; pval = 0.0; alpha = 0.01
args = (8.4724627096253382,)

======================================================================
FAIL: check_cdf (stats.tests.test_distributions.test_loggamma)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<string>", line 9, in check_cdf
AssertionError: D = 1.0; pval = 0.0; alpha = 0.01
args = (4.4259066194420793,)

----------------------------------------------------------------------
Ran 190 tests in 5.250s

FAILED (failures=5)
>>>

3 Changes I made to scipy\stats\tests\test_distributions.py
===========================================
 * increase spread for random parameters *10
 * increase sample size N

Note: this is from scipy 0.60, but the same parameters are used in the current trunk

{{{
for dist in dists:
    distfunc = eval('stats.'+dist)
    nargs = distfunc.numargs
    alpha = 0.01
    if dist == 'fatiguelife':
        alpha = 0.001
    if dist == 'erlang':
        args = str((4,)+tuple(rand(2)))
    elif dist == 'frechet':
        args = str(tuple(2*rand(1))+(0,)+tuple(2*rand(2)))
    elif dist == 'triang':
        args = str(tuple(rand(nargs)))
    elif dist == 'reciprocal':
        vals = rand(nargs)
        vals[1] = vals[0] + 1.0
        args = str(tuple(vals))
    else:
        args = str(tuple(1.0+rand(nargs)*10))        # old was without *10
    exstr = r"""
class test_%s(NumpyTestCase):
    def check_cdf(self):
        D,pval = stats.kstest('%s','',args=%s,N=10000)       # old was N=30
        if (pval < %f):
            D,pval = stats.kstest('%s','',args=%s,N=100000)      # old was N=30
            #if (pval < %f):
            #    D,pval = stats.kstest('%s','',args=%s,N=30)
        assert (pval > %f), "D = " + str(D) + "; pval = " + str(pval) + "; alpha = " + str(alpha) + "\nargs = " + str(%s)
""" % (dist,dist,args,alpha,dist,args,alpha,dist,args,alpha,args)
    exec exstr
}}}
_______________________________________________
Scipy-dev mailing list
Scipy-dev <at> scipy.org
http://projects.scipy.org/mailman/listinfo/scipy-dev
josef.pktd | 13 Aug 18:03

Re: Problem with F distribution, or with me?

I looked some more at genextreme and loggamma distributions. Both seem to be ok, but only on a restricted domain. Maybe this could be mentioned in the doc strings.


genextreme: mean and variance of sample
 c        mean            variance
-10.1 2.03183339039e+046 4.10875567714e+097
-9.1 8.73524294338e+046 7.63037062101e+098
-8.1 3.43173786542e+029 3.7041223192e+063
-7.1 2.56715180769e+029 4.83202007123e+063
-6.1 6.38255235148e+025 3.96122540324e+056
-5.1 2.17899531771e+019 2.78413740065e+043
-4.1 2.11131451031e+018 4.45696503066e+041
-3.1 1023550562.61 2.58109364566e+022
-2.1 173604.32446 1.15250794324e+015
-1.1 17.5172037108 348742.573234
-0.1 0.67925866308 2.2049366648
0.9 0.0448402183177 0.929688939344
1.9 -0.413704842079 3.82434734783
2.9 -1.49614654365 58.0798269907
3.9 -5.06385824194 1507.40696605
4.9 -20.0955456516 63494.3016592
5.9 -101.601673556 5611029.51369
6.9 -668.732337727 464527444.003
7.9 -3864.35379097 27915569490.2
8.9 -44833.4387546 2.5781344271e+013
9.9 -156511.731133 5.9148778182e+013
10.9 -4208783.3068 4.1364771468e+017
11.9 -12363752.6511 9.49784745096e+017
12.9 -687931749.834 2.12662176368e+022
13.9 -3564704408.19 1.49396435068e+023
14.9 -55267493180.0 1.75088408867e+026
15.9 -89255193706.2 1.68855387759e+026
16.9 -1.44638018472e+013 1.95168337353e+031
17.9 -2.82021450011e+013 5.28737543263e+031
18.9 -2.70352320915e+014 2.76333229745e+033

genextreme: Kolmogorov test
 c    pval
-10.1 0.0743077659229 fail
-9.1 0.42774246842
-8.1 0.320666171255
-7.1 0.140353007389
-6.1 0.422004148643
-5.1 0.235860485941
-4.1 0.123131615734
-3.1 0.0584578205023 fail
-2.1 0.0806014275316 fail
-1.1 0.551087520596
-0.1 0.443666131702
0.9 0.0897965413418 fail
1.9 0.000687904138691 fail
2.9 0.14533205744
3.9 0.266177157829
4.9 0.248863135101
5.9 0.0525623128538 fail
6.9 0.0189969790597 fail
7.9 1.21972182576e-007 fail
8.9 0.0 fail
9.9 0.0 fail
10.9 0.0 fail
11.9 0.0 fail
12.9 0.0 fail
13.9 0.0 fail
14.9 0.0 fail
15.9 0.0 fail
16.9 0.0 fail
17.9 0.0 fail
18.9 0.0 fail

loggamma: mean and variance of sample
 c        mean            variance
0.1 1.#QNAN 1.#QNAN
0.2 1.#QNAN 1.#QNAN
0.3 1.#QNAN 1.#QNAN
0.4 1.#QNAN 1.#QNAN
0.5 1.#QNAN 1.#QNAN
0.6 1.#QNAN 1.#QNAN
0.7 1.#QNAN 1.#QNAN
0.8 1.#QNAN 1.#QNAN
0.9 1.#QNAN 1.#QNAN
1.0 -0.576668705715 1.63525671992
1.1 -0.525273410576 1.34235414411
1.2 -0.428429138219 1.11702081261
1.3 -0.341788789013 0.974156958424
1.4 -0.241612900595 0.862346072551
1.5 -0.136331803853 0.788179027654
1.6 -0.0313490020812 0.725560746671
1.7 0.0748869989073 0.678337971919
1.8 0.185785501444 0.646038183747
1.9 0.293612948064 0.632854109548
2.0 0.422063781842 0.644406991859
2.1 1.#QNAN 1.#QNAN
2.2 1.#QNAN 1.#QNAN
2.3 1.#QNAN 1.#QNAN
2.4 1.#QNAN 1.#QNAN
2.5 1.#QNAN 1.#QNAN
2.6 1.#QNAN 1.#QNAN
2.7 1.#QNAN 1.#QNAN
2.8 1.#QNAN 1.#QNAN
2.9 1.#QNAN 1.#QNAN
3.0 1.#QNAN 1.#QNAN

loggamma: Kolmogorov test
 c    pval
0.1 0.0 fail
0.2 0.0 fail
0.3 0.0 fail
0.4 0.0 fail
0.5 0.0 fail
0.6 0.0 fail
0.7 0.0 fail
0.8 0.0 fail
0.9 0.0 fail
1.0 0.0944212682068 fail
1.1 0.0721468077268 fail
1.2 0.429110318521
1.3 0.185750416387
1.4 0.12670896322
1.5 0.328821862924
1.6 0.205984847372
1.7 0.283064144052
1.8 0.0290310978597 fail
1.9 0.135995409763
2.0 0.0864826709644 fail
2.1 0.0 fail
2.2 0.0 fail
2.3 0.0 fail
2.4 0.0 fail
2.5 0.0 fail
2.6 0.0 fail
2.7 0.0 fail
2.8 0.0 fail
2.9 0.0 fail
3.0 0.0 fail

This was produced with:

import numpy as np
import scipy.stats as stats


N=100000
#N=1000

print "\ngenextreme: mean and variance of sample"
print ' c        mean            variance'
for i in range(30):
    c = -10.1+i/1.0
    rn = stats.genextreme.rvs(c,size=N)
    print c, np.mean(rn), np.var(rn)

print "\ngenextreme: Kolmogorov test"
print ' c    pval'
for i in range(30):
    c = -10.1+i/1.0
    D, pval = stats.kstest('genextreme','',args=(c,),N=N)
    print c, pval, (pval<0.1 and 'fail' or '')

print "\nloggamma: mean and variance of sample"
print ' c        mean            variance'
for i in range(30):
    c = 0.100+i/10.0
    rn = stats.loggamma.rvs(c,size=N)
    print c, np.mean(rn), np.var(rn)

print "\nloggamma: Kolmogorov test"
print ' c    pval'
for i in range(30):
    c = 0.100+i/10.0
    D, pval = stats.kstest('loggamma','',args=(c,),N=N)
    print c, pval, (pval<0.1 and 'fail' or '')  

_______________________________________________
Scipy-dev mailing list
Scipy-dev <at> scipy.org
http://projects.scipy.org/mailman/listinfo/scipy-dev
josef.pktd | 13 Aug 20:44

Re: Problem with F distribution, or with me?

I think, stats.loggamma.rvs is wrong or uses a definition that I cannot figure out.
It was still bugging me, that I did not figure out what is going on with the loggamma random variables.

When I compare it with the log transformation of a gamma random variable and with the loggamma distribution in R, then it looks llike as if the stats.loggamma.rvs were only correct for a parameter of 2.

In "R", I get the same mean and variance as for np.log(stats.gamma.rvs(k,size=1000000))
and neither R nor np.log(stats.gamma.rvs(...)) have the same domain restriction as stats.loggamma.rvs.

So, to me it seems that there is something fishy with stats.loggamma.rvs.

used package in R: 'VGAM'

Below are some comparison of the loggamma random variables generated by
 * stats.loggamma.rvs
 * np.log(stats.gamma.rvs(...))
 * VGAM.rlgamma in "R"

Josef

Kolmogorov tests
------------------------

for parameter = 2 it looks ok

>>> stats.kstest(np.log(stats.gamma.rvs(2,size=10000)),'loggamma',args=(2,))
(0.010148906122060208, array(0.12659359569633333))
>>> c=2;stats.kstest(np.log(stats.gamma.rvs(c,size=10000)),'loggamma',args=(c,))
(0.0058379222756740345, array(0.50383387183308759))

strange results with c different from 2

>>> c=2.5;stats.kstest(np.log(stats.gamma.rvs(c,size=10000)),'loggamma',args=(c,))
(0.24775311834187796, array(0.0))
>>> c=2.5;stats.kstest(stats.loggamma.rvs(c,size=10000),'loggamma',args=(c,))
(1.0, array(0.0))
>>> c=1.5;stats.kstest(stats.loggamma.rvs(c,size=10000),'loggamma',args=(c,))
(0.0069725721519633965, array(0.3764490603546865))
>>> c=1.5;stats.kstest(np.log(stats.gamma.rvs(c,size=10000)),'loggamma',args=(c,))
(0.12849906523207333, array(0.0))


Comparing mean and variance with "R"
-------------------------------------------------------

for k=2: is ok

>>> np.mean(stats.loggamma.rvs(2,size=10000))
0.42482669940021239
>>> np.mean(np.log(stats.gamma.rvs(2,size=100000)))
0.42284709204863447
in R
> mean(rlgamma(100000, location=0, scale=1, k=2))
[1] 0.4224025


>>> np.var(np.log(stats.gamma.rvs(2,size=10000)))
0.6449444777702128
>>> np.var(stats.loggamma.rvs(2,size=100000))
0.64758634320780184
in R:
> var(rlgamma(100000, location=0, scale=1, k=2))
[1] 0.6360736


for k=5:

>>> np.mean(stats.loggamma.rvs(5,size=1000000))
1.#QNAN
>>> np.mean(np.log(stats.gamma.rvs(5,size=1000000)))
1.506037882165763
in R:
> mean(rlgamma(1000000, location=0, scale=1, k=5))
[1] 1.506325

>>> np.var(stats.loggamma.rvs(5,size=1000000))
1.#QNAN
>>> np.var(np.log(stats.gamma.rvs(5,size=1000000)))
0.22151419947589021
> var(rlgamma(1000000, location=0, scale=1, k=5))
[1] 0.2212415

for other k:

>>> np.var(stats.loggamma.rvs(1.5,size=1000000))
0.78615045130829264
>>> np.var(np.log(stats.gamma.rvs(1.5,size=1000000)))
0.93399967614023915
in R:
> var(rlgamma(1000000, location=0, scale=1, k=1.5))
[1] 0.9332276


>>> np.var(np.log(stats.gamma.rvs(10,size=1000000)))
0.10528311964943039
in R:
> var(rlgamma(1000000, location=0, scale=1, k=10))
[1] 0.1052298





_______________________________________________
Scipy-dev mailing list
Scipy-dev <at> scipy.org
http://projects.scipy.org/mailman/listinfo/scipy-dev
Robert Kern | 24 Aug 09:54

Re: Problem with F distribution, or with me?

On Wed, Aug 13, 2008 at 11:44,  <josef.pktd <at> gmail.com> wrote:
> I think, stats.loggamma.rvs is wrong or uses a definition that I cannot
> figure out.

It isn't related to log(gamma.rvs()). It is the same distribution as
the "standard" version of lgammaff in VGAM:

  http://rss.acs.unt.edu/Rdoc/library/VGAM/html/lgammaff.html

--

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco
josef.pktd | 24 Aug 21:49

Re: Problem with F distribution, or with me?

>On Wed, Aug 13, 2008 at 11:44,  <josef.pktd <at> gmail.com> wrote:
>> I think, stats.loggamma.rvs is wrong or uses a definition that I cannot figure out.
>
>It isn't related to log(gamma.rvs()). It is the same distribution as the "standard" version of lgammaff
in VGAM:
>
>  http://rss.acs.unt.edu/Rdoc/library/VGAM/html/lgammaff.html
>
>--
>Robert Kern

Hi,

I still think there is a problem with the loggamma distribution. I am
attaching a script that compares the random variables generated with
scipy.stats.loggamma.rvs  with the theoretical distribution from
scipy.stats.loggamma.pdf and the explicit formula, which is the same
in scipy.stats.loggamma.pdf  as in the
http://rss.acs.unt.edu/Rdoc/library/VGAM/html/lgammaff.html. The
script produces many graphs for the range of parameters that seem
reasonable to me.

>From the histograms you can see that the fit of the sample to the
correct pdf is very weak and seems to hold only for some parameter
values, e.g. c=1, c=2. For c=1.5 or 1.6 which is in the range of the
kstest in the scipy tests, the fit does not look very good.

On the other hand, the log of a gamma random variable has a good fit
to the theoretical distribution in scipy.stats.loggamma.pdf.

I have not found any good statistics reference for the loggamma
distribution and its relationship to the log of a gamma random
variable, but from my interpretation of the results, they seem to have
the same distribution. But while googling, I saw neither a positive
nor a negative statement for this.

In my previous use of the gamma distribution in R, I think, I used the
correct random number generator VGAM.rlgamma, which is linked to in
your reference.

Note: I'm still using matplotlib-0.90.1 and for compatibility I had to
downgrade numpy to 1.0.4, my scipy version is 0.6.0. But I did not see
any relevant changes to scipy.stats in a quick look at the changelogs
in subversion/trac.

Can you run the attached file, and it should give you a quick overview
of whether my suspicious results are real, or whether something has
changed in newer versions, (or if I am really misinterpreting what is
supposed to be going on here).

Josef
_______________________________________________
Scipy-dev mailing list
Scipy-dev <at> scipy.org
http://projects.scipy.org/mailman/listinfo/scipy-dev
Robert Kern | 24 Aug 22:39

Re: Problem with F distribution, or with me?

On Sun, Aug 24, 2008 at 12:49,  <josef.pktd <at> gmail.com> wrote:
>>On Wed, Aug 13, 2008 at 11:44,  <josef.pktd <at> gmail.com> wrote:
>>> I think, stats.loggamma.rvs is wrong or uses a definition that I cannot figure out.
>>
>>It isn't related to log(gamma.rvs()). It is the same distribution as the "standard" version of lgammaff
in VGAM:
>>
>>  http://rss.acs.unt.edu/Rdoc/library/VGAM/html/lgammaff.html
>>
>>--
>>Robert Kern
>
> Hi,
>
> I still think there is a problem with the loggamma distribution. I am
> attaching a script that compares the random variables generated with
> scipy.stats.loggamma.rvs  with the theoretical distribution from
> scipy.stats.loggamma.pdf and the explicit formula, which is the same
> in scipy.stats.loggamma.pdf  as in the
> http://rss.acs.unt.edu/Rdoc/library/VGAM/html/lgammaff.html. The
> script produces many graphs for the range of parameters that seem
> reasonable to me.
>
> >From the histograms you can see that the fit of the sample to the
> correct pdf is very weak and seems to hold only for some parameter
> values, e.g. c=1, c=2. For c=1.5 or 1.6 which is in the range of the
> kstest in the scipy tests, the fit does not look very good.
>
> On the other hand, the log of a gamma random variable has a good fit
> to the theoretical distribution in scipy.stats.loggamma.pdf.

I believe you are correct. The implementation of the CDF and PPF for
this distribution appears to have numerical problems. The default
implementation of the RVS, uses the PPF to invert U(0,1) random
numbers and messes up significantly. Putting in log(mtrand.gamma(c))
for the RVS appears to match the PDF, but not the CDF or PPF.

<some more fiddling>

Ah. All of the references refer to the CDF as the ratio of the
incomplete gamma function gammainc(c,exp(x)) divided by gamma(c). This
was translated to special.gammainc(c,exp(x))/special.gamma(c).
*However*, it appears that Cephes' gammainc() function does this whole
ratio, not just the numerator. Removing the extraneous gamma()s gives
us stable results across a wide range of shape parameters with or
without log(mtrand.gamma(c)) (which we will use).

Thank you for your attention to these issues. It's greatly appreciated.

--

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco
josef.pktd | 13 Aug 15:39

Re: Problem with F distribution, or with me? - error in stats.fatiguelife.rvs

It looks like that there is an error in stats.fatiguelife.rvs

Kolmogorov test fails
>>> stats.kstest('fatiguelife','',args=(5,),N=1000)
(0.093216666807115545, array(2.5853082230575808e-008))

Mean of sample

>>> stats.fatiguelife.stats(5,moments='m')
array(13.5)
>>> np.mean(stats.fatiguelife.rvs(5,size=1000))
26.683858360164475
>>> np.mean(stats.fatiguelife.rvs(5,size=10000))
26.841525716395847
>>> np.mean(stats.fatiguelife.rvs(5,size=100000))
26.730694604009678
>>> np.mean(stats.fatiguelife.rvs(5,size=100000)/2)
13.469823793800416

>>> stats.fatiguelife.stats(3,moments='m')
array(5.5)
>>> np.mean(stats.fatiguelife.rvs(3,size=100000))
10.922712537094393
>>> np.mean(stats.fatiguelife.rvs(3,size=100000)/2)
5.5340854278553246

Variance of sample

>>> stats.fatiguelife.stats(3,moments='v')
array(110.25)
>>> np.var(stats.fatiguelife.rvs(3,size=1000000))
440.1793356094052
>>> np.var(stats.fatiguelife.rvs(3,size=1000000)/2)
110.445022957997
>>> np.var(stats.fatiguelife.rvs(3,size=10000000)/2)
110.03364894832275

>>> stats.fatiguelife.stats(5,moments='v')
array(806.25)
>>> np.var(stats.fatiguelife.rvs(5,size=1000000))
3222.4271388000293
>>> np.var(stats.fatiguelife.rvs(5,size=1000000)/2)
809.29193071702855

theoretical mean and cdf  look correct, according to http://www.itl.nist.gov/div898/handbook/eda/section3/eda366a.htm
but random number generator, is wrong by approximately the scale of 1/2

Josef
_______________________________________________
Scipy-dev mailing list
Scipy-dev <at> scipy.org
http://projects.scipy.org/mailman/listinfo/scipy-dev
Robert Kern | 24 Aug 10:35

Re: Problem with F distribution, or with me? - error in stats.fatiguelife.rvs

On Wed, Aug 13, 2008 at 06:42,  <josef.pktd <at> gmail.com> wrote:
> It looks like that there is an error in stats.fatiguelife.rvs

Correct. I have a fix. When I run the scipy test suite tomorrow, I'll
check it in.

--

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco
josef.pktd | 24 Aug 22:19

Re: Problem with F distribution, or with me? - error in stats.fatiguelife.rvs

>On Wed, Aug 13, 2008 at 06:42,  <josef.pktd <at> gmail.com> wrote:
>> It looks like that there is an error in stats.fatiguelife.rvs
>
>Correct. I have a fix. When I run the scipy test suite tomorrow, I'll check it in.
>
>--
>Robert Kern

Hi,
Running the scipy test suite looks pretty useless for verifying the
actual distribution except for serious mistakes. It didn't detect
before anything wrong with fatiguelife or loggamma (which I think also
gives incorrect random numbers)

The Kolmogorov test in
http://projects.scipy.org/scipy/scipy/browser/trunk/scipy/stats/tests/test_distributions.py
is pretty powerless to detect mistakes in the actual distribution.
N=30 is too small and the fail threshold for the pval for fatiguelife
is set to alpha = 0.01, while for the other distributions it is at
alpha = 0.1.

The pvalue for N=100 or N=1000  should be a much better indicator
whether the random variable really follows the theoretical
distribution.

Josef
Robert Kern | 24 Aug 23:06

Re: Problem with F distribution, or with me? - error in stats.fatiguelife.rvs

On Sun, Aug 24, 2008 at 13:19,  <josef.pktd <at> gmail.com> wrote:
>>On Wed, Aug 13, 2008 at 06:42,  <josef.pktd <at> gmail.com> wrote:
>>> It looks like that there is an error in stats.fatiguelife.rvs
>>
>>Correct. I have a fix. When I run the scipy test suite tomorrow, I'll check it in.
>>
>>--
>>Robert Kern
>
> Hi,
> Running the scipy test suite looks pretty useless for verifying the
> actual distribution except for serious mistakes.

True. I just don't want to break anything else.

I don't really trust the automated K-S tests anyways. They don't have
good parameter coverage. At the sprint yesterday, I wrote a little GUI
to help me go through all of the distributions with Q-Q plots and a
comparison of the histogram to the theoretical PDF with interactive
control over the parameters. The remaining problems are mostly
failures of the machinery rather than problems with the formulae of
the distributions themselves.

> It didn't detect
> before anything wrong with fatiguelife or loggamma (which I think also
> gives incorrect random numbers)
>
> The Kolmogorov test in
> http://projects.scipy.org/scipy/scipy/browser/trunk/scipy/stats/tests/test_distributions.py
> is pretty powerless to detect mistakes in the actual distribution.
> N=30 is too small and the fail threshold for the pval for fatiguelife
> is set to alpha = 0.01, while for the other distributions it is at
> alpha = 0.1.

No, they are 0.001 and 0.01, but your point is taken.

> The pvalue for N=100 or N=1000  should be a much better indicator
> whether the random variable really follows the theoretical
> distribution.

True. After bumping those up to 1000, the tests still pass with after
my fixes. They will remain bumped up to 1000.

--

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco
Anne Archibald | 26 Aug 22:37

Re: Problem with F distribution, or with me? - error in stats.fatiguelife.rvs

2008/8/24  <josef.pktd <at> gmail.com>:

> Running the scipy test suite looks pretty useless for verifying the
> actual distribution except for serious mistakes. It didn't detect
> before anything wrong with fatiguelife or loggamma (which I think also
> gives incorrect random numbers)
>
> The Kolmogorov test in
> http://projects.scipy.org/scipy/scipy/browser/trunk/scipy/stats/tests/test_distributions.py
> is pretty powerless to detect mistakes in the actual distribution.
> N=30 is too small and the fail threshold for the pval for fatiguelife
> is set to alpha = 0.01, while for the other distributions it is at
> alpha = 0.1.
>
> The pvalue for N=100 or N=1000  should be a much better indicator
> whether the random variable really follows the theoretical
> distribution.

Is there any reason not to include your fuzz tests as a part of scipy?
If they take too long, it may make sense to run them only when
exhaustive testing is requested, but they find a lot of bugs for such
a little bit of code.

Also, for the many tests of scipy that use random numbers, does it
make sense to seed the random number generator ahead of time? That way
debuggers can replicate the test failures... Perhaps nose provides a
way to seed the random number generator before every test?

Anne
josef.pktd | 27 Aug 02:18

Re: Problem with F distribution, or with me? - error in stats.fatiguelife.rvs

On Tue, Aug 26, 2008 at 4:37 PM, Anne Archibald
<peridot.faceted <at> gmail.com> wrote:

>
> Is there any reason not to include your fuzz tests as a part of scipy?
> If they take too long, it may make sense to run them only when
> exhaustive testing is requested, but they find a lot of bugs for such
> a little bit of code.
>
> Also, for the many tests of scipy that use random numbers, does it
> make sense to seed the random number generator ahead of time? That way
> debuggers can replicate the test failures... Perhaps nose provides a
> way to seed the random number generator before every test?
>
> Anne
> _______________________________________________
> Scipy-dev mailing list
> Scipy-dev <at> scipy.org
> http://projects.scipy.org/mailman/listinfo/scipy-dev
>

I see two problems for including these tests when they are intended to
be powerful.

The main problem are false positives, i.e. test failures because the
random sample just gives by chance a failing test statistic.

If you seed all random variables, it takes the fuzziness out of these
tests. Currently the parameters of the distribution are randomly drawn
and then the kstests draw, I guess, a random sample for the actual
test. Only seeding one of them would not remove the possibility of
accidental test failures.

My tests are just a variation of the current test in scipy.stats, I
just included more distributions and increased the power of the tests,
which is now also changed in trunk. The pvalue thresholds are set
pretty low and when a test fails, then a second test is run, I assume
in order to reduce the number of false failures. If the tests are too
weak, then they won't detect anything except very severe errors. But
when running the test suite you don't want to get lots of spurious
test failures.

What I did with these test was, make them very strict and then check
individually whether a test failure is actually caused by an error or
just a weakness of the kstest. For some kstest with low pvalues, I
later did not find anything (obviously) wrong with the distribution in
scipy.stats.

For now, my strict fuzz tests are to coarse, all I know in general is
that some tests don't work for some parameters that are not ruled out.
Maybe these parameters should be ruled out for the current
implementation.

But my tests indicate that there might still be other problems for
scipy.stats to run according to it's own specification, and there are
no tests for many other methods or properties e.g. pdf or pmf,
statistics such as moments, at least not from what I have seen.

Can somebody run:

stats.zipf.pmf(range(10),1.5)
stats.zipf.cdf(10,1.5)
stats.zipf.cdf(range(10),1.5)

I get
>>> stats.zipf.cdf(10,1.5)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Josef\_progs\Subversion\scipy_trunk\dist\Programs\Python24\Lib\site-p
ackages\scipy\stats\distributions.py", line 3549, in cdf
    place(output,cond,self._cdf(*goodargs))
  File "C:\Josef\_progs\Subversion\scipy_trunk\dist\Programs\Python24\Lib\site-p
ackages\scipy\stats\distributions.py", line 3458, in _cdf
    return self._cdfvec(k,*args)
  File "C:\Programs\Python24\Lib\site-packages\numpy\lib\function_base.py", line
 1092, in __call__
    raise ValueError, "mismatch between python function inputs"\
ValueError: mismatch between python function inputs and received arguments

but this might be, because I am running either an older version of
scipy or from my own faulty (?) build from subversion.
If the current version still has this error, then I think it is
related to the generic calculation of the cdf,
similar to ticket 422, changeset 3797 for the moment calculation.

In my tests I get many value errors, but again I don't know whether it
is my version/setup, whether the parameters are not ruled out but
don't make sense, or whether there is actually a bug somewhere.

Josef
josef.pktd | 27 Aug 03:29

Re: Problem with F distribution, or with me? - ticket 422 might apply for discrete distributions

On Tue, Aug 26, 2008 at 8:18 PM,  <josef.pktd <at> gmail.com> wrote:

> Can somebody run:
>
> stats.zipf.pmf(range(10),1.5)
> stats.zipf.cdf(10,1.5)
> stats.zipf.cdf(range(10),1.5)
>
> I get
>>>> stats.zipf.cdf(10,1.5)
> Traceback (most recent call last):
>  File "<stdin>", line 1, in ?
>  File "C:\Josef\_progs\Subversion\scipy_trunk\dist\Programs\Python24\Lib\site-p
> ackages\scipy\stats\distributions.py", line 3549, in cdf
>    place(output,cond,self._cdf(*goodargs))
>  File "C:\Josef\_progs\Subversion\scipy_trunk\dist\Programs\Python24\Lib\site-p
> ackages\scipy\stats\distributions.py", line 3458, in _cdf
>    return self._cdfvec(k,*args)
>  File "C:\Programs\Python24\Lib\site-packages\numpy\lib\function_base.py", line
>  1092, in __call__
>    raise ValueError, "mismatch between python function inputs"\
> ValueError: mismatch between python function inputs and received arguments
>
> but this might be, because I am running either an older version of
> scipy or from my own faulty (?) build from subversion.
> If the current version still has this error, then I think it is
> related to the generic calculation of the cdf,
> similar to ticket 422, changeset 3797 for the moment calculation.
>
> In my tests I get many value errors, but again I don't know whether it
> is my version/setup, whether the parameters are not ruled out but
> don't make sense, or whether there is actually a bug somewhere.
>
> Josef
>

in changeset 3797 this was added to the continuous random variable:
315	        self.generic_moment.nin = self.numargs+1 # Because of the
*args argument
316	        # of _mom0_sc, vectorize cannot count the number of
arguments correctly.
in current trunk these are lines 318 and 319

I don't understand the details of numpy and vectorize, but I think
that the same problem with the number
of arguments also applies to the generic calculation for the discrete
distribution, i.e. lines

3401 	            self._ppf = new.instancemethod(sgf(_drv_ppf,otypes='d'),
3402 	                                           self, rv_discrete)
3403 	            self._pmf = new.instancemethod(sgf(_drv_pmf,otypes='d'),
3404 	                                           self, rv_discrete)
3405 	            self._cdf = new.instancemethod(sgf(_drv_cdf,otypes='d'),
3406 	                                           self, rv_discrete)

since all _drv_??? methods also use the *args argument and the
exception message I get looks the same as in ticket 422.

Josef
josef.pktd | 28 Aug 22:35

Re: Problem with F distribution, or with me? - ticket 422 might apply for discrete distributions

I just saw the report by Alan McIntyre that this fails
check_rvs (scipy.stats.tests.test_distributions.test_rv_discrete)

Looking briefly at the current trunk source, I think now that the
problem with the generic cdf for defined distributions such as zipf is
in

3383 	        self._cdfvec = sgf(self._cdfsingle,otypes='d')
which calls
3469 	    def _cdfsingle(self, k, *args):

If the explanation in ticket 422 applies more generally, then all
calls to vectorize, i.e. to sgf, would need to be checked in
scipy.stats.distribution.

I'm sorry if I'm barking up the wrong tree, I am just looking at the
pattern without knowing the internals.

Josef

On 8/26/08, josef.pktd <at> gmail.com <josef.pktd <at> gmail.com> wrote:
> On Tue, Aug 26, 2008 at 8:18 PM,  <josef.pktd <at> gmail.com> wrote:
>
>> Can somebody run:
>>
>> stats.zipf.pmf(range(10),1.5)
>> stats.zipf.cdf(10,1.5)
>> stats.zipf.cdf(range(10),1.5)
>>
>> I get
>>>>> stats.zipf.cdf(10,1.5)
>> Traceback (most recent call last):
>>  File "<stdin>", line 1, in ?
>>  File
>> "C:\Josef\_progs\Subversion\scipy_trunk\dist\Programs\Python24\Lib\site-p
>> ackages\scipy\stats\distributions.py", line 3549, in cdf
>>    place(output,cond,self._cdf(*goodargs))
>>  File
>> "C:\Josef\_progs\Subversion\scipy_trunk\dist\Programs\Python24\Lib\site-p
>> ackages\scipy\stats\distributions.py", line 3458, in _cdf
>>    return self._cdfvec(k,*args)
>>  File "C:\Programs\Python24\Lib\site-packages\numpy\lib\function_base.py",
>> line
>>  1092, in __call__
>>    raise ValueError, "mismatch between python function inputs"\
>> ValueError: mismatch between python function inputs and received arguments
>>
>> but this might be, because I am running either an older version of
>> scipy or from my own faulty (?) build from subversion.
>> If the current version still has this error, then I think it is
>> related to the generic calculation of the cdf,
>> similar to ticket 422, changeset 3797 for the moment calculation.
>>
>> In my tests I get many value errors, but again I don't know whether it
>> is my version/setup, whether the parameters are not ruled out but
>> don't make sense, or whether there is actually a bug somewhere.
>>
>> Josef
>>
>
> in changeset 3797 this was added to the continuous random variable:
> 315	        self.generic_moment.nin = self.numargs+1 # Because of the
> *args argument
> 316	        # of _mom0_sc, vectorize cannot count the number of
> arguments correctly.
> in current trunk these are lines 318 and 319
>
> I don't understand the details of numpy and vectorize, but I think
> that the same problem with the number
> of arguments also applies to the generic calculation for the discrete
> distribution, i.e. lines
>
> 3401 	            self._ppf = new.instancemethod(sgf(_drv_ppf,otypes='d'),
> 3402 	                                           self, rv_discrete)
> 3403 	            self._pmf = new.instancemethod(sgf(_drv_pmf,otypes='d'),
> 3404 	                                           self, rv_discrete)
> 3405 	            self._cdf = new.instancemethod(sgf(_drv_cdf,otypes='d'),
> 3406 	                                           self, rv_discrete)
>
> since all _drv_??? methods also use the *args argument and the
> exception message I get looks the same as in ticket 422.
>
> Josef
>
josef.pktd | 28 Aug 23:17

Re: Problem with F distribution, or with me? - ticket 422 might apply for discrete distributions

patch that works for me with scipy 0.6.0
follwing the pattern of Changeset 3797

in class rv_discrete add a line, adapted from Changeset 3797

    def _cdf(self, x, *args):     line 3456 in scipy 0.6.0, line 3473 in trunk
        k = floor(x)
+        self._cdfvec.nin = self.numargs+1  #JP
        return self._cdfvec(k,*args)

I could do this without recompile
now zipf works:

>>> stats.zipf._cdf(10,1.5)
array(0.7638016085053122)
>>> stats.zipf.cdf(10,1.5)
array(0.7638016085053122)

all test in scipy 0.6.0 stats still pass (but that's not very reliable)

stats.test()
  Found 73/73 tests for scipy.stats.tests.test_distributions
  Found 10/10 tests for scipy.stats.tests.test_morestats
  Found 107/107 tests for scipy.stats.tests.test_stats
...........................................................................Ties
preclude use of exact statistic.
..Ties preclude use of exact statistic.
.................................................................................................................
----------------------------------------------------------------------
Ran 190 tests in 0.469s

OK
<unittest._TextTestResult run=190 errors=0 failures=0>

Josef

On 8/28/08, josef.pktd <at> gmail.com <josef.pktd <at> gmail.com> wrote:
> I just saw the report by Alan McIntyre that this fails
> check_rvs (scipy.stats.tests.test_distributions.test_rv_discrete)
>
> Looking briefly at the current trunk source, I think now that the
> problem with the generic cdf for defined distributions such as zipf is
> in
>
> 3383 	        self._cdfvec = sgf(self._cdfsingle,otypes='d')
> which calls
> 3469 	    def _cdfsingle(self, k, *args):
>
> If the explanation in ticket 422 applies more generally, then all
> calls to vectorize, i.e. to sgf, would need to be checked in
> scipy.stats.distribution.
>
> I'm sorry if I'm barking up the wrong tree, I am just looking at the
> pattern without knowing the internals.
>
> Josef
>
> On 8/26/08, josef.pktd <at> gmail.com <josef.pktd <at> gmail.com> wrote:
>> On Tue, Aug 26, 2008 at 8:18 PM,  <josef.pktd <at> gmail.com> wrote:
>>
>>> Can somebody run:
>>>
>>> stats.zipf.pmf(range(10),1.5)
>>> stats.zipf.cdf(10,1.5)
>>> stats.zipf.cdf(range(10),1.5)
>>>
>>> I get
>>>>>> stats.zipf.cdf(10,1.5)
>>> Traceback (most recent call last):
>>>  File "<stdin>", line 1, in ?
>>>  File
>>> "C:\Josef\_progs\Subversion\scipy_trunk\dist\Programs\Python24\Lib\site-p
>>> ackages\scipy\stats\distributions.py", line 3549, in cdf
>>>    place(output,cond,self._cdf(*goodargs))
>>>  File
>>> "C:\Josef\_progs\Subversion\scipy_trunk\dist\Programs\Python24\Lib\site-p
>>> ackages\scipy\stats\distributions.py", line 3458, in _cdf
>>>    return self._cdfvec(k,*args)
>>>  File
>>> "C:\Programs\Python24\Lib\site-packages\numpy\lib\function_base.py",
>>> line
>>>  1092, in __call__
>>>    raise ValueError, "mismatch between python function inputs"\
>>> ValueError: mismatch between python function inputs and received
>>> arguments
>>>
>>> but this might be, because I am running either an older version of
>>> scipy or from my own faulty (?) build from subversion.
>>> If the current version still has this error, then I think it is
>>> related to the generic calculation of the cdf,
>>> similar to ticket 422, changeset 3797 for the moment calculation.
>>>
>>> In my tests I get many value errors, but again I don't know whether it
>>> is my version/setup, whether the parameters are not ruled out but
>>> don't make sense, or whether there is actually a bug somewhere.
>>>
>>> Josef
>>>
>>
>> in changeset 3797 this was added to the continuous random variable:
>> 315	        self.generic_moment.nin = self.numargs+1 # Because of the
>> *args argument
>> 316	        # of _mom0_sc, vectorize cannot count the number of
>> arguments correctly.
>> in current trunk these are lines 318 and 319
>>
>> I don't understand the details of numpy and vectorize, but I think
>> that the same problem with the number
>> of arguments also applies to the generic calculation for the discrete
>> distribution, i.e. lines
>>
>> 3401 	            self._ppf = new.instancemethod(sgf(_drv_ppf,otypes='d'),
>> 3402 	                                           self, rv_discrete)
>> 3403 	            self._pmf = new.instancemethod(sgf(_drv_pmf,otypes='d'),
>> 3404 	                                           self, rv_discrete)
>> 3405 	            self._cdf = new.instancemethod(sgf(_drv_cdf,otypes='d'),
>> 3406 	                                           self, rv_discrete)
>>
>> since all _drv_??? methods also use the *args argument and the
>> exception message I get looks the same as in ticket 422.
>>
>> Josef
>>
>

Gmane