COMP20008 · Elements Of Data Processing
Clustering: k-means, VAT & the Elbow
Clustering finds groups in unlabelled data. This chapter covers k-means (alternately assign points to the nearest centroid and recompute centroids, minimising within-cluster SSE), the elbow method for choosing k, k-means' sensitivity to outliers and how to mitigate it, and VAT, which reorders the distance matrix to reveal cluster tendency visually. It is examined as short-answer + critique: the 2024 exam's Q5 asks you to explain the elbow method with one advantage and one limitation, and how a single outlier affects k-means plus a mitigation.
What this chapter covers
- 011. k-means: assign each point to the nearest centroid → recompute centroids → repeat to convergence
- 022. The objective: minimise within-cluster sum of squared errors (SSE / inertia)
- 033. The elbow method: plot SSE vs k and pick the k at the bend
- 044. Elbow advantage (simple, visual) vs limitation (the bend can be ambiguous/subjective)
- 055. Alternatives to the elbow: silhouette, Calinski–Harabasz
- 066. Outlier sensitivity: a far outlier inflates SSE and pulls a centroid toward itself
- 077. Mitigation: remove/cap outliers, scale features, use k-medoids or density-based methods
- 088. VAT: reorder the pairwise-distance matrix as an image; dark diagonal blocks suggest clusters
The elbow method and the effect of an outlier (mirrors 2024 Q5)
- 1 markElbow method: run k-means for k = 1, 2, 3, … and plot the within-cluster SSE against k. SSE falls as k rises; choose the k at the "elbow", where adding more clusters stops reducing SSE much.
- 1 markAdvantage: it is simple and visual. Limitation: the bend is often ambiguous, so the choice of k can be subjective and there is not always a clear elbow.
- 1 markOutlier effect: because k-means minimises squared distance, a far outlier inflates SSE and pulls a centroid toward itself, distorting that cluster's shape — or wasting an entire cluster on the single point.
- 1 markMitigation: detect and remove or cap outliers before clustering, scale the features, or use a more robust method such as k-medoids or density-based clustering that is less swayed by extreme points.
Key terms
- k-means
- Partitions n points into k clusters by iterating two steps until stable: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points. It minimises the within-cluster sum of squared errors (SSE/inertia).
- SSE / inertia
- The within-cluster sum of squared distances from each point to its centroid. k-means minimises it; SSE always falls as k increases (more centroids fit tighter), which is why you need the elbow to choose k rather than just minimising SSE.
- Elbow method
- A heuristic for choosing k: plot SSE against k and pick the value at the "elbow" where the curve bends and extra clusters stop cutting SSE much. Simple and visual, but the bend can be ambiguous, so the choice is subjective.
- k-medoids
- A clustering method like k-means but using actual data points (medoids) as cluster centres and minimising absolute rather than squared distance. It is more robust to outliers, which makes it a standard mitigation when k-means is distorted by extreme points.
- VAT (Visual Assessment of cluster Tendency)
- Reorders the pairwise-distance matrix and displays it as an image; dark blocks along the diagonal indicate groups of mutually close points, suggesting how many clusters exist before you even run k-means.
- Outlier sensitivity
- Because k-means minimises squared distance, a single far point contributes a large squared term, inflating SSE and pulling a centroid toward it. This can distort a cluster or consume a whole cluster on one point, so outliers must be handled first.
Clustering: k-means, VAT & the Elbow FAQ
Why can't I just pick the k that minimises SSE?
Because SSE always decreases as k increases — with k equal to the number of points, every point is its own cluster and SSE is 0. Minimising SSE would always pick the largest k, which is meaningless. The elbow method instead looks for the k where SSE stops falling sharply, balancing fit against the number of clusters; silhouette and Calinski–Harabasz are alternatives that score cluster quality directly.
How exactly does an outlier hurt k-means?
k-means minimises squared distance, so a point far from everything contributes a very large squared term to SSE. To reduce it, a centroid is dragged toward the outlier, distorting that cluster's shape and pulling its mean away from the genuine group — or an entire cluster may be spent on the single outlier. Because the penalty is squared, one extreme value has a disproportionate effect. Fix it by removing/capping outliers, scaling, or using k-medoids/density methods.
What is VAT and when would I use it?
VAT (Visual Assessment of cluster Tendency) reorders the pairwise-distance matrix so that similar points are adjacent, then shows it as a heatmap. Dark blocks along the diagonal reveal groups of mutually close points, giving a visual estimate of how many clusters the data has before you commit to k-means and a value of k. It is a quick sanity check on whether clustering is even appropriate.
How is clustering examined in COMP20008?
As short-answer explanation plus critique, mirroring 2024 Q5: explain the elbow method with one advantage and one limitation, describe how an outlier affects k-means and how to mitigate it, or interpret a VAT matrix. Marks reward the standard limitations (the elbow is subjective; k-means is outlier-sensitive because it squares distance) and a concrete mitigation, not heavy maths.
Exam move
Internalise the k-means loop as two alternating steps (assign to nearest centroid → recompute centroids) and the fact that it minimises SSE — that single word "squared" explains both why SSE always falls with k and why outliers are so damaging. Rehearse the elbow answer as a fixed template: plot SSE vs k, pick the bend, advantage = simple/visual, limitation = ambiguous/subjective, alternatives = silhouette/Calinski–Harabasz. Keep a ready list of outlier mitigations (remove/cap, scale, k-medoids, density methods) because the exam pairs the problem with the fix. Be able to describe VAT in one sentence (reordered distance matrix; dark diagonal blocks = clusters). This chapter and the next (hierarchical clustering) are often examined together, so know how k-means differs from a dendrogram-based approach.