(This article was originally published at Normal Deviate, and syndicated at StatsBlogs.)
1. The Problem
One of the most popular algorithms for clustering is k-means. We start with data vectors
. We choose
vectors — cluster centers —
to minimize the error
Unfortunately, finding to minimize
is NP-hard. The usual iterative method, \hrefnosnap{http://en.wikipedia.org/wiki/Lloyd is easy to implement but it is unlikely to come close to minimizing the objective function. So finding
isn’t feasible.
To deal with this, many people choose random starting values, run the -means clustering algorithm then rinse, lather and repeat. In general, this may work poorly and there is no theoretical guarantee of getting close to the minimum. Finding a practical method for approximately minimizing
is thus an important practical problem.
2. The Solution
David Arthur and Sergei Vassilvitskii came up with a wonderful solution in 2007 known as k-means++.
The algorithm is simple and comes with a precise theoretical guarantee.
The first step is to choose a data point at random. Call this point . Next, compute the squared distances
Now choose a second point from the data. The probability of choosing
is
. Now recompute the distance as
Now choose a third point from the data where the probability of choosing
is
. We continue until we have
points
. Finally, we run
-means clustering using
as starting values. Call the resulting centers
.
Arthur and Vassilvitskii prove that
The expected value is over the randomness in the algorithm.
There are various improvements to the algorithm, both in terms of computation and in terms of getting a sharper performance bound.
This is quite remarkable. One simple fix, and an intractable problem has become tractable. And the method comes armed with a theorem.
3. Questions
- Is there an R implementation? It is easy enough to code the algorithm but it really should be part of the basic k-means function in R.
- Is there a version for mixture models? If not, it seems like a paper waiting to be written.
- Are there other intractable statistical problems that can be solved using simple randomized algorithms with provable guarantees? (MCMC doesn’t count because there is no finite sample guarantee.)
4. Reference
Arthur, D. and Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. 1027–1035.
Please comment on the article here: Normal Deviate
