James Abel | 6 Aug 2012 01:09

kmeans2 question/issue

Hi,
I'm trying to use scipy.cluster.vq.kmeans2() but I'm getting inconsistent
output.  With a simple test input that should have 3 clusters, I'm getting
good results most of the time but every so often the output creates the
wrong clustering.  If anyone could point to what I'm doing wrong I'd
appreciate it!
Code and sample output below.
Thanks!
James

Code:

import sys
import scipy
from scipy.cluster.vq import *

print sys.version
vals = scipy.array((0.0,0.1,0.5,0.6,1.0,1.1))
print vals
white_vals = whiten(vals)
print white_vals.shape, white_vals

# try it several times to see if we get similar answers
count = 0
while count < 5:
    res, idx = kmeans2(white_vals, 3) # changing iter doesn't seem to matter
    print res, idx
    count += 1

Output:

Gael Varoquaux | 6 Aug 2012 07:30

Re: kmeans2 question/issue

On Sun, Aug 05, 2012 at 04:09:06PM -0700, James Abel wrote:
> I'm trying to use scipy.cluster.vq.kmeans2() but I'm getting inconsistent
> output.  With a simple test input that should have 3 clusters, I'm getting
> good results most of the time but every so often the output creates the
> wrong clustering.

K-means is a non-convex problem: the result depends on the (random)
initialization. In addition, it is not guaranteed to find the 'true'
clusters, because quite often that is not possible from the data.

You are not doing anything wrong, you are just asking for something that
is not possible.
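
To make the dependence on initialization concrete, here is a minimal sketch (not from the thread) that hands kmeans2 explicit starting centroids via minit='matrix' instead of random ones. The two starting guesses are hypothetical values picked for illustration, Python 3 syntax is assumed, and whitening is skipped because rescaling 1-D data by a constant does not change which points group together.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Test input from the original post, as a (6, 1) array of observations.
data = np.array([0.0, 0.1, 0.5, 0.6, 1.0, 1.1]).reshape(-1, 1)

# Two hypothetical starting guesses (assumptions, chosen for illustration).
good_init = np.array([[0.0], [0.5], [1.0]])
bad_init = np.array([[0.0], [0.1], [0.8]])


def run(init):
    # minit='matrix' tells kmeans2 to use `init` as the initial centroids.
    centroids, labels = kmeans2(data, init, minit='matrix')
    # Inertia: sum of squared distances from each point to its centroid.
    inertia = float(np.sum((data - centroids[labels]) ** 2))
    return labels, inertia


good_labels, good_inertia = run(good_init)
bad_labels, bad_inertia = run(bad_init)

print(good_labels, good_inertia)  # labels [0 0 1 1 2 2], inertia about 0.015
print(bad_labels, bad_inertia)    # labels [0 1 2 2 2 2], inertia about 0.26
```

Both runs have converged: in each case reassigning the points leaves the centroids unchanged. The second one is simply stuck in a worse local minimum, which is what a random initialization can occasionally land in.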

HTH,

Gael
James Abel | 9 Aug 2012 05:53

Re: kmeans2 question/issue

Thanks Gael.

BTW, I modified my code to loop until it gets the same clustering twice in a row.  This yields more consistent results.  I don’t know if this is a general solution but it worked for my simple test case.  Code below.

James

import sys
import scipy
import warnings
from scipy.cluster.vq import *

print sys.version
vals = scipy.array((0.0,0.1,0.5,0.6,1.0,1.1))
print vals
white_vals = whiten(vals)
print white_vals.shape, white_vals

# Check for same clustering
def clustering_test(a,b):
    # have to create copies, then sort so we don't modify the original
    ea = a.copy()
    eb = b.copy()
    ea.sort()
    eb.sort()
    r = (ea == eb).all()
    print a,b,ea,eb,r
    return r

# try it until we get the same clustering twice in a row
found = False
prior_idx = None
while not found:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore") # suppress the warning message (happens if it doesn't find the right number of clusters)
        res, idx = kmeans2(white_vals, 3) # changing iter doesn't seem to matter
    #print res, idx
    if prior_idx is not None:
        eq = clustering_test(idx, prior_idx)
        #print eq.all()
        if eq:
            found = True
    prior_idx = idx
print "result", res, idx
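
One caveat on clustering_test above: because it sorts both label arrays before comparing them, it effectively checks only that each label number is used the same number of times, not that the same points are grouped together. A possible alternative, sketched here in Python 3 with hypothetical helper names (canonical_labels and same_partition are not part of SciPy), is to renumber the labels in order of first appearance and compare the results:

```python
import numpy as np


def canonical_labels(labels):
    """Renumber labels as 0, 1, 2, ... in order of first appearance."""
    mapping = {}
    out = []
    for lab in labels:
        lab = int(lab)
        if lab not in mapping:
            mapping[lab] = len(mapping)
        out.append(mapping[lab])
    return np.array(out)


def same_partition(a, b):
    """True if a and b group the points identically, ignoring label names."""
    return np.array_equal(canonical_labels(a), canonical_labels(b))


# Same grouping, different label numbers.
print(same_partition([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))  # True
# Same cluster sizes, different grouping (sorting would call these equal).
print(same_partition([0, 0, 1, 1, 2, 2], [0, 1, 1, 2, 2, 0]))  # False
```

Even with this check, two identical runs in a row only show that the same local solution was reached twice, not that it is the best one.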

_______________________________________________
SciPy-User mailing list
SciPy-User <at> scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user
Nelle Varoquaux | 9 Aug 2012 06:06

Re: kmeans2 question/issue

Hi James,

Usually, we run the optimisation several times and take the solution with the smallest inertia. The technique you use doesn't ensure that you keep the best solution.

There's a full implementation in scikit-learn with several runs. You can have a look at the code to see how it works.
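
The "several runs, keep the smallest inertia" idea can be sketched with scipy alone. This is a rough illustration in Python 3, not the scikit-learn implementation: the function name best_of_n_runs, the choice of 20 runs, the fixed seed, and starting each run from randomly chosen observations are all assumptions made for the example.

```python
import warnings

import numpy as np
from scipy.cluster.vq import kmeans2, vq


def best_of_n_runs(data, k, n_runs=20, seed=0):
    """Run kmeans2 n_runs times and keep the run with the smallest inertia."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_runs):
        # Start each run from k distinct observations picked at random.
        init = data[rng.choice(len(data), size=k, replace=False)]
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")  # hide empty-cluster warnings
            centroids, _ = kmeans2(data, init, minit='matrix')
        # Assign every point to its nearest final centroid.
        labels, dists = vq(data, centroids)
        # Inertia: sum of squared distances to the assigned centroid.
        inertia = float(np.sum(dists ** 2))
        if best is None or inertia < best[0]:
            best = (inertia, centroids, labels)
    return best


data = np.array([0.0, 0.1, 0.5, 0.6, 1.0, 1.1]).reshape(-1, 1)
inertia, centroids, labels = best_of_n_runs(data, k=3)
print(inertia, centroids.ravel(), labels)
```

Unlike looping until two consecutive runs agree, this compares the runs by an explicit quality measure, so a poor local minimum is discarded as soon as any run does better.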

Cheers,
N

James Abel | 11 Aug 2012 20:19

Re: kmeans2 question/issue

Thanks – sklearn works great!  I got exactly what I expected each time I ran it!

James

import sys
import numpy
from sklearn.cluster import *

print sys.version
vals = numpy.array([[0.0],[0.1],[0.5],[0.6],[1.0],[1.1]])
print vals
k_means_ex = KMeans(k=3)
x = k_means_ex.fit_predict(vals)
print x
print k_means_ex.cluster_centers_
print k_means_ex.score(vals)

2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)]
[[ 0. ]
 [ 0.1]
 [ 0.5]
 [ 0.6]
 [ 1. ]
 [ 1.1]]
[1 1 0 0 2 2]
[[ 0.55]
 [ 0.05]
 [ 1.05]]
-0.015
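
For anyone trying this on a current setup: the script above is Python 2, and later scikit-learn releases renamed the k argument of KMeans to n_clusters. A rough Python 3 equivalent, assuming a recent scikit-learn, might look like the following; the n_init and random_state values are illustrative choices, not something taken from the thread.

```python
import numpy as np
from sklearn.cluster import KMeans

vals = np.array([[0.0], [0.1], [0.5], [0.6], [1.0], [1.1]])

# n_init is the number of restarts; the run with the lowest inertia is kept.
# random_state is fixed only to make the example repeatable.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(vals)

print(labels)
print(km.cluster_centers_.ravel())
print(km.inertia_)     # sum of squared distances to the nearest centroid
print(km.score(vals))  # negative of the inertia on this data
```

The label numbers themselves are arbitrary and may differ from the output above; what should match is the grouping of the points into three pairs.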
