Thursday, April 5, 2012

K-means clustering with scipy

K-means clustering is a method for finding clusters and cluster centers in a set of unlabeled data. Intuitively, we might think of a cluster as a group of data points whose inter-point distances are small compared with the distances to points outside the cluster. Given an initial set of K centers, the K-means algorithm alternates between two steps:
  • for each center, we identify the subset of training points (its cluster) that is closer to it than to any other center;
  • the mean of each feature over the data points in each cluster is computed, and this mean vector becomes the new center of that cluster.
These two steps are iterated until the centers no longer move or the assignments no longer change. Then, a new point x can be assigned to the cluster of the closest prototype.
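For intuition, here is a minimal sketch of that loop in plain NumPy; the random initialization, the convergence test and all the names here are my own choices, not a library API (a real implementation would also handle clusters that end up empty):

import numpy as np

def kmeans_sketch(data, k, max_iter=100):
    # start from k points chosen at random as the initial centers
    centers = data[np.random.permutation(len(data))[:k]]
    for _ in range(max_iter):
        # step 1: assign each point to its closest center
        dists = ((data[:, None, :] - centers[None, :, :])**2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # step 2: move each center to the mean of its cluster
        centers_new = np.array([data[labels == j].mean(axis=0)
                                for j in range(k)])
        if np.allclose(centers_new, centers): # centers stopped moving
            break
        centers = centers_new
    return centers, labels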
The SciPy library provides a good implementation of the K-means algorithm. Let's see how to use it:
from pylab import plot,show
from numpy import vstack,array
from numpy.random import rand
from scipy.cluster.vq import kmeans,vq

# data generation
data = vstack((rand(150,2) + array([.5,.5]),rand(150,2)))

# computing K-Means with K = 2 (2 clusters)
centroids,_ = kmeans(data,2)
# assign each sample to a cluster
idx,_ = vq(data,centroids)

# some plotting using numpy's logical indexing
plot(data[idx==0,0],data[idx==0,1],'ob',
     data[idx==1,0],data[idx==1,1],'or')
plot(centroids[:,0],centroids[:,1],'sg',markersize=8)
show()
The result should be as follows:

[Figure: scatter plot of the two clusters, blue and red points with the centroids shown as green squares]

In this case we split the data into 2 clusters: the blue points have been assigned to the first and the red ones to the second. The squares are the centers of the clusters.
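Once the centroids are fixed, a new point can be assigned to the cluster of the closest prototype with the same vq call; the coordinates below are just a made-up example:

# a hypothetical new sample
new_point = array([[.8,.8]])
# index of the cluster whose centroid is closest
cluster,_ = vq(new_point,centroids)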
Now let's try to split the data into 3 clusters:
# now with K = 3 (3 clusters)
centroids,_ = kmeans(data,3)
idx,_ = vq(data,centroids)

plot(data[idx==0,0],data[idx==0,1],'ob',
     data[idx==1,0],data[idx==1,1],'or',
     data[idx==2,0],data[idx==2,1],'og') # third cluster points
plot(centroids[:,0],centroids[:,1],'sm',markersize=8)
show()
This time the result is as follows:

[Figure: scatter plot of the three clusters in blue, red and green, with the centroids shown as magenta squares]

27 comments:

  1. Your blog is awesome and has helped me discover all kinds of useful tidbits in scipy and numpy. Thank you!

  2. Thanks for the post. I have used kmeans to identify clusters (rings) in a matrix of sea surface height. The objective is to identify the rings and to determine their centroids. But kmeans requires the number of clusters as an input parameter. That is a problem because I usually do not know in advance how many rings will be present in the area. So, I was wondering how to get around this kmeans limitation. Do you have any idea?
    Regards,
    Alex

    1. Hello Alexandre,
      how to find a good choice of k is a well-studied problem in machine learning. I suggest you read this paper: http://books.nips.cc/papers/files/nips16/NIPS2003_AA36.pdf
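
      In the meantime, a quick heuristic you could try is the elbow method (not the method of the paper): kmeans also returns the average distortion, so you can run it for a range of values of k and look for the point where the curve flattens. A sketch, with data as in the post and an arbitrary range of k:

      distortions = []
      for k in range(1,10):
          centroids,distortion = kmeans(data,k)
          distortions.append(distortion)
      plot(range(1,10),distortions) # look for the "elbow" of the curve
      show()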

      Thanks for your comment!

  3. Where did you get the module "from scipy.cluster.vq import kmeans,vq"? Where can I get it?

    this stuff looks great, btw.

    1. It comes with the SciPy installation:

      http://docs.scipy.org/doc/scipy/reference/cluster.vq.html#module-scipy.cluster.vq

  4. Excellent blog, JG. I love it. I use it and recommend it to others. One question. Many of the k-means tutorials that I've found rely on self-made, perfectly configured data -- i.e., they'll "generate" numbers in just the form necessary to get an easy k-means clustering. Whereas in the real world many of us are using k-means to cluster documents (using NLP) or other information that requires a great deal of work/formatting before we can even use k-means. For example, on documents first one must create a vector (tf-idf for instance) and then complete a similarity measurement (euclidean for instance).

    Thus, the greatest challenge to performing a k-means is often just getting the data into a format that the kmeans calls can work with. For instance, tutorials always seem to have their data in the following list-within-a-list format:[[number1, number2], [number1, number2], [number1, number2], [number1, number2]... ]. How would you recommend getting documents (i.e., natural language) into that format? Note that cosine similarity performed on a tf-idf vector always returns a single list, which isn't "clusterable" (to my knowledge).

    Any help appreciated and thank you in advance for all of your work.

    1. Hello,

      You could take a look at vector space models. They let you build a fixed-length vector from the frequencies of the words you're interested in. This post is a good introduction:

      http://pyevolve.sourceforge.net/wordpress/?p=1589
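
      As a toy illustration (the documents and the vocabulary are made up), a term-frequency matrix built like this can be fed straight to kmeans, using the imports of the post:

      docs = ['the cat sat on the mat',
              'the dog chased the cat',
              'stocks fell on monday']
      vocabulary = ['cat','dog','stocks','monday']
      # one row per document, one column per vocabulary word
      freq = array([[doc.split().count(w) for w in vocabulary]
                    for doc in docs],dtype=float)
      centroids,_ = kmeans(freq,2) # cluster the documents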

    2. I'll check it out. Thanks much, JG.

  5. I am not able to view the result, what can be done?
    The show() command is not working.

  6. Excellent! Helped me a lot to understand k-means clustering.

    Is it possible to use this in a capacitated VRP (vehicle routing problem)?
    In which each node has a "demand" and there is a fixed vehicle "capacity",
    subject to: the sum of all node demands in a cluster is smaller than or equal to the vehicle's capacity.

    Any help is very much appreciated. Thank you.

    1. I mean as explained here, under "Cluster-First Route-Second Method":
      http://neo.lcc.uma.es/vrp/solution-methods/heuristics/cluster-first-route-second-method/

      Thanks...

  7. This is excellent material, and your code explaining how to use the scipy implementation is beautiful and clear. Just recently I've implemented the algorithm in Python myself; it's a lot of fun to play around with configurations to see the clustering in action: http://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/

  8. Could you please discuss a bit the role of 'whitening', which seems to be highly recommended by the scipy tutorial?

    1. Hi, you can find some general information about the whitening transformation here: http://en.wikipedia.org/wiki/Whitening_transformation

      Usually, when you have a data matrix, you use whitening to give each feature unit variance across the samples. This operation could improve the result of the k-means algorithm.
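
      For example, with the data of the post (whiten lives in the same scipy.cluster.vq module):

      from scipy.cluster.vq import whiten

      wdata = whiten(data) # rescale each feature to unit variance
      centroids,_ = kmeans(wdata,2)
      idx,_ = vq(wdata,centroids)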

  9. Thank you very much for your post, it's very helpful. Here, your dataset has two variables that you partition into 2 and then 3 clusters, so it makes sense to plot the k-means clusters like this, with one variable on the X-axis and one on the Y-axis. But what if your dataset has more dimensions? Do you have any suggestions about ways to look at the output in those circumstances?

    Thanks,

    SS

    1. Hi, if you have more than two variables and you need to visualize your data in a 2D space, you need a dimensionality reduction method. Here's an example of how to use Isomap: http://glowingpython.blogspot.co.uk/2012/05/manifold-learning-on-handwritten-digits.html
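
      A minimal sketch with scikit-learn's Isomap, assuming X is your data matrix with more than two columns and idx holds the cluster labels from vq:

      from sklearn.manifold import Isomap
      from pylab import scatter, show

      X2 = Isomap(n_components=2).fit_transform(X) # project the data to 2D
      scatter(X2[:,0],X2[:,1],c=idx) # color each point by its cluster
      show()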

    2. Interesting, I'll check it out. Thanks for the speedy response!

  10. This comment has been removed by the author.

  11. This post has been very helpful. Thank you very much

  12. If I wanted, say, 100 clusters from k-means, too many to plot by hand as in this example, how would you visualise/plot the results of the clustering?

    1. You have to define a list with a color for each cluster and plot each cluster separately in a for loop, choosing the appropriate color each time.
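
      For example, picking the colors from a matplotlib colormap (k and idx as computed by kmeans and vq):

      from pylab import scatter, show, cm

      for i in range(k):
          color = cm.jet(i/float(k)) # one color per cluster
          scatter(data[idx==i,0],data[idx==i,1],color=color)
      show()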

  13. Great post.

    Do you know of anything similar that can find clusters in 3D space? For example, I have a set of particles each with x, y, z co-ordinates. How would I go about finding clusters (of various densities) in such a dataset?

    Any thoughts or ideas?

    1. kmeans works in 3D space too. You just have to give it an input matrix with three columns.
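
      For example (the particle coordinates are random here, just to make the snippet self-contained):

      from numpy.random import rand
      from scipy.cluster.vq import kmeans,vq

      points = rand(500,3) # x, y, z coordinates, one row per particle
      centroids,_ = kmeans(points,4) # 4 clusters, same call as in 2D
      idx,_ = vq(points,centroids)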

    2. Fantastic. Will investigate it further. Thanks.

  14. Nice demonstration. But can you tell me how to use the scipy.cluster.vq module to generate a codebook for an array of MFCC feature vectors? I've extracted the MFCC feature vectors (13 coefficients) ...now I wish to use vq to perform pattern matching. Any idea?

    1. Of course, you could cluster your MFCC vectors to find similarities in parts of your signal. But the applications you can build depend on the kind of signal you have (speech, music, ...).
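
      A sketch, assuming mfcc is an array with one 13-coefficient frame per row and picking an arbitrary codebook size of 32:

      from scipy.cluster.vq import kmeans,vq

      codebook,_ = kmeans(mfcc,32) # one codeword per cluster center
      codes,_ = vq(mfcc,codebook) # encode each frame as its nearest codeword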
