A brief history of K-means

Clustering
A short history of the k-means clustering algorithm, tracing its evolution to more modern optimization.
Author

Andrij David

Published

April 28, 2026

Let’s imagine we are running a factory that manufactures trousers. We have a limited number of lines, say 3, and we want to serve as many customers as possible. Unfortunately, we cannot manufacture custom sizes for everyone. Conversely, a one-size-fits-all approach is impractical. How do we know which size to manufacture for each line? If we can only afford to manufacture 3 different sizes, what should be the exact dimensions of those sizes to cover the vast majority of the population?

In 1956, a Polish mathematician, Hugo Steinhaus, formulated a similar problem and presented his solution (Steinhaus 1956). In his paper, he generalised this to solve any similar problem, such as partitioning a material body (or dataset) into \(n\) parts. He argued that his method could identify, from a given population, \(n\) representatives that best represent different groups within this population.

His approach was straightforward:

  1. Start with an arbitrary decomposition \(K^{1}\) of the body into \(n\) parts \(K^{1}_{1},\ldots,K^{1}_{n}\).
  2. At step \(r\), define the points \(\mu^{r}_{i}\) (for \(i=1,2,\ldots,n\)) as the centroids of the partial bodies \(K^{r}_{i}\).
  3. Define the next decomposition \(K^{r+1}\) as the special decomposition relative to the previously calculated centroids \(\mu^{r}_{i}\), whose parts are separated by the planes equidistant from neighbouring centroids.
  4. Repeat the process for all natural numbers \(r\) to converge towards the solution.

This was the first recorded formulation of the idea behind the algorithm that we now call k-means.

To come back to our problem, the “body” is our dataset of customer body measurements (waist, inseam, height, etc.). Each partial body \(K_i\) becomes one cluster of similar customers, and the centroid \(\mu_i\) becomes the ideal representative size for that production line. The mathematics is identical, only the application changes.

Later, in 1957, Stuart Lloyd, while at Bell Labs, was working on Pulse-Code Modulation (PCM) (Lloyd 1982). PCM is a standard method for converting continuous analogue signals (like a human voice) into a digital format. This is done by sampling the signal and mapping its continuous amplitude values onto a limited number of discrete digital values, for example by rounding to the closest value, a process called quantisation.

Lloyd wanted to find a way to optimally choose these discrete values so that the difference, or distortion, between the original analogue voice and the digitised version was minimised. In other words, he wanted to preserve the quality of the digital version.

In an ideal world, we could use an unlimited number of quantisation levels and suffer near-zero quality loss between the analogue and digital versions. However, the transmission bandwidth of digital signals is limited, especially back then. Hence, he needed to partition the signal space into a limited number of representative regions. Specifically, he needed to preserve the quality of the signal by minimising the average squared error between the original signal and the quantised signal.

His solution was deceptively simple:

  1. Take every possible continuous input value and assign it to the nearest discrete quantisation level.
  2. Recalculate each discrete level so it becomes the exact centre of all the values assigned to it.
  3. Repeat until the levels stop changing.

The mathematics is equally simple. Given \(n\) points \(x_i \in \mathbb{R}^d\), we seek \(k\) centroids \(\mu_1,\ldots,\mu_k\) that minimise

\[\phi=\sum_{i=1}^{n}\min_{j\in\{1,\ldots,k\}}\lVert x_i-\mu_j\rVert_2^2.\]

This creates a Voronoi tessellation (Voronoi 1908; Aurenhammer 1991) of the data space where each centroid defines a cell containing all points closer to it than to any other centroid. His algorithm alternates between two steps: assigning points to their nearest centroid, then moving centroids to the mean of their assigned points.
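
To make the objective concrete, here is a minimal NumPy sketch that evaluates \(\phi\) straight from the formula above; the names `X` (the data points) and `centroids` are placeholders, not part of any particular library.

```python
import numpy as np

def kmeans_objective(X, centroids):
    """Sum of squared distances from each point to its nearest centroid (phi)."""
    # Pairwise squared Euclidean distances, shape (n_points, k)
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Keep only the distance to the closest centroid for each point, then sum
    return sq_dists.min(axis=1).sum()
```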

This is exactly the same approach that Steinhaus had proposed the year before, but applied to PCM. Lloyd’s work circulated as an internal memo at Bell Labs and was not officially published until 1982, under the title “Least Squares Quantization in PCM”. The solution is elegant, but computing these partitions on early computers was a real challenge, especially as the data grew.

Fast forward to 1965, when Edward W. Forgy was trying to solve a major computational bottleneck (Forgy 1965). Back then, statisticians and biologists liked to use a method called hierarchical clustering (Johnson 1967; Ward 1963) to find groupings in multivariate data. This method produces a tree-like diagram showing how data points relate to each other, which provided excellent interpretability to scientists.

However, it requires a substantial amount of computational power and memory to store and compare every single data point to every other data point. For a dataset with \(N\) items, this takes roughly \(O(N^2)\) memory and up to \(O(N^3)\) processing time. Computers from the 1960s would struggle with this, often crashing or taking days to process datasets with only 100 points.

Forgy realised that, to process large datasets, scientists had to trade interpretability for computational efficiency. He therefore introduced an iterative partitioning approach (a code sketch follows the list below):

  1. Pick \(k\) random actual data points to serve as the initial cluster centres.
  2. Assign all points to their nearest centre.
  3. Update the centres to the mean of their assigned points.
  4. Repeat until stable.
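
In code, the whole procedure fits in a handful of lines. Below is a minimal NumPy sketch of Forgy’s scheme; the function name and parameters are illustrative rather than taken from any library.

```python
import numpy as np

def kmeans_forgy(X, k, n_iters=100, seed=0):
    """Plain k-means with Forgy initialisation (k random data points as centres)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # step 1
    for _ in range(n_iters):
        # Step 2: assign every point to its nearest centre
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centre to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centres no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```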

Here we have k-means appearing for the third time, although it was only in 1967 that James MacQueen coined the term k-means (MacQueen 1967). The only difference between the three approaches lies in how the cluster centres are initialised at the start of the algorithm. This is a crucial choice: given the inherent randomness, it can dictate how the algorithm behaves and performs on different datasets, and a poor initialisation can leave the algorithm stuck in a mediocre local minimum or even produce degenerate, empty clusters.

Although the essential idea behind k-means had been rediscovered multiple times by the late 60s, the algorithm was more of a theoretical curiosity than a practical tool for large datasets. It was very slow to run and highly sensitive to the choice of initial centroids.

Lloyd preferred to randomly assign each point to a cluster first, and then compute the initial centroids as the means of these randomly formed clusters. Because random groups all average out to roughly the same place, the initial centroids land near the centre of the dataset, so they often begin too close together.

Lloyd’s variant therefore has weak initial separation, needs more iterations, and has a higher chance of converging to a mediocre local minimum. Forgy’s method is simpler and alleviates these problems, but you can still get unlucky and pick outliers or several nearby points, giving poor starts or slow convergence on some runs.
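
The difference is easy to see in a few lines of NumPy (the variable names below are purely illustrative): the random-partition centroids all land near the global mean of the data, while Forgy’s are actual, typically well-separated data points.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
k = 3

# Lloyd-style random partition: label points at random, then average each group
labels = rng.integers(0, k, size=len(X))
partition_init = np.array([X[labels == j].mean(axis=0) for j in range(k)])

# Forgy: pick k actual data points at random
forgy_init = X[rng.choice(len(X), size=k, replace=False)]

print(partition_init)  # all three rows sit close to the overall mean (~0, 0)
print(forgy_init)      # three genuine data points, usually far apart
```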

This changed in the 2000s when David Arthur and Sergei Vassilvitskii introduced a smarter way to initialise the clusters by forcing the algorithm to choose initial centres that are far away from each other (Arthur and Vassilvitskii 2007). This dramatically improved both the speed of the algorithm and the quality of its results, because starting from better initial centres means fewer iterations are needed for convergence.

The procedure, known as k-means++, chooses the initial cluster centres so that they tend to end up far apart, using a simple probabilistic weighting (a sketch follows the list):

  1. The first centroid is chosen randomly from the dataset.
  2. For every remaining data point, calculate the distance to the nearest centroid that has already been selected.
  3. Select the next centroid by sampling a point at random from the remaining points, with probability proportional to its squared distance to the nearest existing centroid.
  4. Repeat steps 2 and 3 until all \(k\) centroids have been chosen.
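
A minimal sketch of this seeding procedure (the so-called \(D^2\) weighting), assuming NumPy; the function name and parameters are illustrative.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: sample centres with probability proportional to D(x)^2."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]        # step 1: first centre chosen uniformly
    for _ in range(k - 1):
        # Step 2: squared distance from every point to its nearest chosen centre
        sq_dists = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Step 3: sample the next centre proportionally to those squared distances
        probs = sq_dists / sq_dists.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```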

Although this approach introduces some computational overhead at initialisation, it typically reduces the number of k-means iterations, reduces the variance between runs, and guarantees an expected approximation ratio of \(O(\log k)\). The overhead is acceptable for small \(k\), but once \(k\) reaches the thousands, choosing each new centre requires yet another full pass over the dataset. To reduce the initialisation time in this case, we can look at k-means|| (Bahmani et al. 2012), which uses oversampling to pick multiple candidate centroids per pass.

So far, every improvement has focused on the initialisation step, i.e. finding a good, fast starting point for the algorithm. But if we look at the steps that follow, the algorithm still has to compute the distance from every point to every cluster centre at every iteration. That is wasteful: if we look more carefully, there are many distances we never actually need.

Charles Elkan showed how the triangle inequality could be used to skip those unnecessary distance calculations without affecting the result (Elkan 2003).

When working in a Euclidean space, the direct path between two points is always the shortest. So for any three points \(x\), \(\mu_i\), \(\mu_j\):

\[d(x, \mu_i) + d(x, \mu_j) \geq d(\mu_i, \mu_j).\]

For our problem, we can rearrange it as:

\[d(x, \mu_j) \geq d(\mu_i, \mu_j) - d(x, \mu_i).\]

Here \(\mu_i\) is the centre \(x\) is currently assigned to, and \(\mu_j\) is any other candidate centre.

The key consequence is the skip rule. If we already know that \(d(x, \mu_i) \leq \tfrac{1}{2}\, d(\mu_i, \mu_j)\), then \(\mu_j\) can provably never be closer to \(x\) than \(\mu_i\), and we can avoid computing \(d(x, \mu_j)\) entirely.
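
Spelling out the one-line justification: substituting the assumption into the rearranged inequality gives

\[d(x, \mu_j) \;\geq\; d(\mu_i, \mu_j) - d(x, \mu_i) \;\geq\; 2\,d(x, \mu_i) - d(x, \mu_i) \;=\; d(x, \mu_i),\]

so \(\mu_j\) can never beat the currently assigned centre, and the comparison is settled without ever computing \(d(x, \mu_j)\).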

Geometrically, imagine drawing a circle around each cluster centre \(\mu_i\) whose radius is exactly half the distance to its closest neighbouring centre, i.e. \(\tfrac{1}{2}\, \min_{j \neq i} d(\mu_i, \mu_j)\). Call this circle the safe zone of \(\mu_i\). Any point that falls inside that zone cannot belong to any other cluster, but the centre can still shift from iteration to iteration. Instead of recalculating the distance to every centre at every iteration, we keep track of a single number \(u(x)\), an upper bound on the distance from the point \(x\) to its currently assigned centre (geometrically, a circle of radius \(u(x)\) around \(x\)). As long as \(u(x)\) stays inside the safe zone of its assigned centre \(\mu_i\), the assignment cannot change and no distance to any other centre needs to be recomputed at all.

When the centres move at the end of an iteration, \(u(x)\) grows by at most the distance the centre moved, so it can be updated very cheaply. We only need to do some computation when this upper bound leaves the safe zone.

When the bound check fails, we first tighten the upper bound by computing the true distance \(d(x, \mu_i)\) to the assigned centre and replace \(u(x)\) with it. Then we re-check the safe zone. If the freshly tightened bound still satisfies \(u(x) \leq \tfrac{1}{2}\, d(\mu_i, \mu_j)\) for every other centre \(\mu_j\), the alarm was false and the assignment stands.

Otherwise, we test the candidate centres one at a time using the same skip rule with the new \(u(x)\), and whole centres are dismissed without a single distance computation. Only for the candidates that survive do we actually compute \(d(x, \mu_j)\), and we reassign \(x\) if (and only if) this distance is strictly smaller than the current bound.

The pay-off is that, as iterations progress and centres start to settle, more and more points stay comfortably inside their safe zone, and the cost per iteration drops dramatically.
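
To make this concrete, here is a simplified NumPy sketch of a single assignment pass that uses only the safe-zone check and the skip rule described above. It is not the full algorithm from Elkan’s paper (which also maintains per-centre lower bounds), and every name in it is illustrative.

```python
import numpy as np

def assign_with_skip_rule(X, centroids, labels, upper):
    """One assignment pass using the safe-zone and skip rules.

    labels -- current assignment of each point
    upper  -- u(x): upper bound on each point's distance to its assigned centre
    """
    # Distances between centres, and each centre's safe-zone radius
    cc = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    cc_masked = cc + np.diag(np.full(len(centroids), np.inf))   # ignore d(mu_i, mu_i)
    safe = 0.5 * cc_masked.min(axis=1)

    for p in range(len(X)):
        i = labels[p]
        if upper[p] <= safe[i]:
            continue                                   # inside the safe zone: nothing to do
        d_xi = np.linalg.norm(X[p] - centroids[i])     # tighten the bound
        upper[p] = d_xi
        if d_xi <= safe[i]:
            continue                                   # false alarm: assignment stands
        for j in range(len(centroids)):
            if j == i or d_xi <= 0.5 * cc[i, j]:
                continue                               # skip rule: mu_j cannot win
            d_xj = np.linalg.norm(X[p] - centroids[j])
            if d_xj < d_xi:                            # genuine reassignment
                i, d_xi = j, d_xj
                labels[p], upper[p] = j, d_xj
    return labels, upper
```

After the centres are updated at the end of an iteration, each \(u(x)\) would simply be inflated by the distance its assigned centre moved, which is what keeps the bounds valid without any extra point-to-centre distance computations.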

In the figure above, we have three points \(A\), \(B\), \(C\) and three cluster centres \(\mu_1\), \(\mu_2\), \(\mu_3\). All three points were initially assigned to \(\mu_1\), and each one ends up exercising a different branch of Elkan’s logic.

Point \(A\) sits inside the safe zone of \(\mu_1\). Its upper bound is strictly less than half the distance to any other centre, so the algorithm trusts the assignment. No other distance is computed.

Point \(B\) is now outside the safe zone, but its upper-bound circle (the dotted circle in the figure) is still smaller than \(\tfrac{1}{2}\, d(\mu_1, \mu_3)\). The skip rule prunes \(\mu_3\) for free. However, the bound is not tight enough to dismiss \(\mu_2\), so the exact distance \(d(B, \mu_2)\) is computed to maintain certainty.

Point \(C\)’s upper bound fails the check against every other centre. There is no distance calculation we can avoid here. We must compute the exact distances to all centres, and once we know them, \(C\) is reassigned from \(\mu_1\) to \(\mu_3\).

These three behaviours cover every branch of the algorithm. The same triangle inequality

\[d(x, \mu_j) \geq d(\mu_i, \mu_j) - u(x)\]

and the same skip rule

\[u(x) \leq \tfrac{1}{2}\, d(\mu_i, \mu_j) \;\Rightarrow\; \mu_i \text{ stays closest, } d(x, \mu_j) \text{ is not computed}\]

are doing all the work, where \(u(x)\) is the upper bound on the distance from \(x\) to its currently assigned centre.

Together with later distributed and mini-batch variants (such as the scalable k-means++ proposed by Bahmani et al. in 2012 (Bahmani et al. 2012)), these advances transformed k-means from an elegant but fragile procedure into one of the fastest and most widely used clustering algorithms in practice.

So why do all these assumptions matter? The k-means algorithm makes a strong assumption about the geometry of the clusters: it assumes they are convex and roughly spherical. There are several reasons for this, but the simplest is that the algorithm uses the mean of the points as the centre of each cluster. That implicitly assumes the ideal cluster is roughly symmetric and evenly spread around its centre, in other words, a rough sphere.

However, if a cluster has the shape of a banana or, even worse, wraps around another cluster, the mean of the points (the centre assumed by the algorithm) ends up outside the cluster itself. For example, take three bananas arranged so that each banana is one true cluster, but their per-cluster means all coincide near the common centre. Three centroids initialised in the most reasonable place imaginable will still draw straight-line decision boundaries that slice each banana into three pieces. No amount of re-initialisation can fix this; the algorithm is fundamentally mismatched to non-convex data.

When dealing with this kind of topology, we ultimately have to avoid k-means or any centroid-based methods entirely. Because this family of algorithms always defines a cluster by its mean, it forces every group to be roughly spherical and convex. When the true clusters are curved or intertwined, the mean ends up outside the actual data, and the algorithm has no way to recover.

Instead, we need other clustering methods that define clusters as continuous regions, allowing them to capture the different shapes and curvatures present in the data manifold. Density-based methods such as DBSCAN (Ester et al. 1996) and HDBSCAN (Campello et al. 2013) identify clusters as contiguous dense regions separated by sparser areas, which naturally captures arbitrary shapes and handles noise automatically. Spectral clustering (Ng et al. 2001; Luxburg 2007), on the other hand, treats the data as a graph and uses the eigenvectors of the graph Laplacian to reveal the underlying manifold geometry, making it particularly effective for non-convex data.
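
As a quick illustration, here is a small sketch using scikit-learn’s two-moons dataset; the parameter values (such as `eps=0.3`) are illustrative and would need tuning for other data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaved half-circles: non-convex clusters that defeat centroid methods
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Agreement with the ground truth, taking the best of the two possible label swaps
agree = lambda lab: max(np.mean(lab == y_true), np.mean(lab != y_true))
print("k-means agreement:", agree(kmeans_labels))   # typically well below 1.0
print("DBSCAN agreement:", agree(dbscan_labels))    # typically close to 1.0
```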

The trousers factory we imagined at the beginning now has its answer. When the population of customers falls into compact, roughly spherical groups, k-means remains one of the fastest and most interpretable clustering tools we have. Yet its long history also teaches us humility, as the algorithm is not a universal hammer. Understanding where it came from, why it works so beautifully in the right cases, and exactly where its assumptions break down is what separates someone who simply runs k-means from someone who knows when to choose something better.

Further Reading

Arthur, David, and Sergei Vassilvitskii. 2007. “k-Means++: The Advantages of Careful Seeding.” Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1027–35.
Aurenhammer, Franz. 1991. “Voronoi Diagrams. A Survey of a Fundamental Geometric Data Structure.” ACM Computing Surveys 23 (3): 345–405. https://doi.org/10.1145/116873.116880.
Bahmani, Bahman, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. 2012. “Scalable k-Means++.” Proceedings of the VLDB Endowment 5 (7): 622–33.
Campello, Ricardo J. G. B., Davoud Moulavi, and Jörg Sander. 2013. “Density-Based Clustering Based on Hierarchical Density Estimates.” Advances in Knowledge Discovery and Data Mining (PAKDD 2013), 160–72. https://doi.org/10.1007/978-3-642-37456-2_14.
Elkan, Charles. 2003. “Using the Triangle Inequality to Accelerate k-Means.” Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), 147–53.
Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), 226–31.
Forgy, Edward W. 1965. “Cluster Analysis of Multivariate Data: Efficiency Versus Interpretability of Classifications.” Biometrics 21: 768–69.
Johnson, Stephen C. 1967. “Hierarchical Clustering Schemes.” Psychometrika 32 (3): 241–54. https://doi.org/10.1007/BF02289588.
Lloyd, Stuart P. 1982. “Least Squares Quantization in PCM.” IEEE Transactions on Information Theory 28 (2): 129–37. https://doi.org/10.1109/TIT.1982.1056489.
Luxburg, Ulrike von. 2007. “A Tutorial on Spectral Clustering.” Statistics and Computing 17 (4): 395–416. https://doi.org/10.1007/s11222-007-9033-z.
MacQueen, James. 1967. “Some Methods for Classification and Analysis of Multivariate Observations.” Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1: 281–97.
Ng, Andrew Y., Michael I. Jordan, and Yair Weiss. 2001. “On Spectral Clustering: Analysis and an Algorithm.” Advances in Neural Information Processing Systems 14 (NIPS 2001), 849–56.
Steinhaus, Hugo. 1956. “Sur La Division Des Corps Matériels En Parties.” Bulletin de l’Académie Polonaise Des Sciences 4 (12): 801–4.
Voronoi, Georges. 1908. “Nouvelles Applications Des Paramètres Continus à La Théorie Des Formes Quadratiques.” Journal für Die Reine Und Angewandte Mathematik 133: 97–178.
Ward, Joe H. 1963. “Hierarchical Grouping to Optimize an Objective Function.” Journal of the American Statistical Association 58 (301): 236–44. https://doi.org/10.1080/01621459.1963.10500845.