University of Melbourne · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

COMP20008 · Elements Of Data Processing

- one subject, every graph, every model, every mark
50% final exam · hurdle14 Chapters7-page Bible
Our own words - no uploaded lecturer files
Built to mirror S1 2026 · updated this semester
Chapter 6 of 12 · COMP20008

Clustering: k-means, VAT & the Elbow

Clustering finds groups in unlabelled data. This chapter covers k-means (alternately assign points to the nearest centroid and recompute centroids, minimising within-cluster SSE), the elbow method for choosing k, k-means' sensitivity to outliers and how to mitigate it, and VAT, which reorders the distance matrix to reveal cluster tendency visually. It is examined as short-answer + critique: the 2024 exam's Q5 asks you to explain the elbow method with one advantage and one limitation, and how a single outlier affects k-means plus a mitigation.

In this chapter

What this chapter covers

  • 011. k-means: assign each point to the nearest centroid → recompute centroids → repeat to convergence
  • 022. The objective: minimise within-cluster sum of squared errors (SSE / inertia)
  • 033. The elbow method: plot SSE vs k and pick the k at the bend
  • 044. Elbow advantage (simple, visual) vs limitation (the bend can be ambiguous/subjective)
  • 055. Alternatives to the elbow: silhouette, Calinski–Harabasz
  • 066. Outlier sensitivity: a far outlier inflates SSE and pulls a centroid toward itself
  • 077. Mitigation: remove/cap outliers, scale features, use k-medoids or density-based methods
  • 088. VAT: reorder the pairwise-distance matrix as an image; dark diagonal blocks suggest clusters
Worked example · free

The elbow method and the effect of an outlier (mirrors 2024 Q5)

Q [4 marks]. Explain the elbow method for choosing the number of clusters k in k-means, giving one advantage and one limitation. Then explain how a single extreme outlier affects the k-means result and give one way to mitigate it.
  • 1 markElbow method: run k-means for k = 1, 2, 3, … and plot the within-cluster SSE against k. SSE falls as k rises; choose the k at the "elbow", where adding more clusters stops reducing SSE much.
  • 1 markAdvantage: it is simple and visual. Limitation: the bend is often ambiguous, so the choice of k can be subjective and there is not always a clear elbow.
  • 1 markOutlier effect: because k-means minimises squared distance, a far outlier inflates SSE and pulls a centroid toward itself, distorting that cluster's shape — or wasting an entire cluster on the single point.
  • 1 markMitigation: detect and remove or cap outliers before clustering, scale the features, or use a more robust method such as k-medoids or density-based clustering that is less swayed by extreme points.
The elbow picks k at the SSE bend (simple but subjective); a single outlier inflates SSE and drags a centroid, distorting clusters — mitigate by removing/capping outliers or using a robust method like k-medoids.
Sia tip — k-means minimises squared distance, so it is especially sensitive to far points — always say "squared" when explaining outlier sensitivity, because that is why one extreme value has such an outsized effect.
Glossary

Key terms

k-means
Partitions n points into k clusters by iterating two steps until stable: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points. It minimises the within-cluster sum of squared errors (SSE/inertia).
SSE / inertia
The within-cluster sum of squared distances from each point to its centroid. k-means minimises it; SSE always falls as k increases (more centroids fit tighter), which is why you need the elbow to choose k rather than just minimising SSE.
Elbow method
A heuristic for choosing k: plot SSE against k and pick the value at the "elbow" where the curve bends and extra clusters stop cutting SSE much. Simple and visual, but the bend can be ambiguous, so the choice is subjective.
k-medoids
A clustering method like k-means but using actual data points (medoids) as cluster centres and minimising absolute rather than squared distance. It is more robust to outliers, which makes it a standard mitigation when k-means is distorted by extreme points.
VAT (Visual Assessment of cluster Tendency)
Reorders the pairwise-distance matrix and displays it as an image; dark blocks along the diagonal indicate groups of mutually close points, suggesting how many clusters exist before you even run k-means.
Outlier sensitivity
Because k-means minimises squared distance, a single far point contributes a large squared term, inflating SSE and pulling a centroid toward it. This can distort a cluster or consume a whole cluster on one point, so outliers must be handled first.
FAQ

Clustering: k-means, VAT & the Elbow FAQ

Why can't I just pick the k that minimises SSE?

Because SSE always decreases as k increases — with k equal to the number of points, every point is its own cluster and SSE is 0. Minimising SSE would always pick the largest k, which is meaningless. The elbow method instead looks for the k where SSE stops falling sharply, balancing fit against the number of clusters; silhouette and Calinski–Harabasz are alternatives that score cluster quality directly.

How exactly does an outlier hurt k-means?

k-means minimises squared distance, so a point far from everything contributes a very large squared term to SSE. To reduce it, a centroid is dragged toward the outlier, distorting that cluster's shape and pulling its mean away from the genuine group — or an entire cluster may be spent on the single outlier. Because the penalty is squared, one extreme value has a disproportionate effect. Fix it by removing/capping outliers, scaling, or using k-medoids/density methods.

What is VAT and when would I use it?

VAT (Visual Assessment of cluster Tendency) reorders the pairwise-distance matrix so that similar points are adjacent, then shows it as a heatmap. Dark blocks along the diagonal reveal groups of mutually close points, giving a visual estimate of how many clusters the data has before you commit to k-means and a value of k. It is a quick sanity check on whether clustering is even appropriate.

How is clustering examined in COMP20008?

As short-answer explanation plus critique, mirroring 2024 Q5: explain the elbow method with one advantage and one limitation, describe how an outlier affects k-means and how to mitigate it, or interpret a VAT matrix. Marks reward the standard limitations (the elbow is subjective; k-means is outlier-sensitive because it squares distance) and a concrete mitigation, not heavy maths.

Study strategy

Exam move

Internalise the k-means loop as two alternating steps (assign to nearest centroid → recompute centroids) and the fact that it minimises SSE — that single word "squared" explains both why SSE always falls with k and why outliers are so damaging. Rehearse the elbow answer as a fixed template: plot SSE vs k, pick the bend, advantage = simple/visual, limitation = ambiguous/subjective, alternatives = silhouette/Calinski–Harabasz. Keep a ready list of outlier mitigations (remove/cap, scale, k-medoids, density methods) because the exam pairs the problem with the fix. Be able to describe VAT in one sentence (reordered distance matrix; dark diagonal blocks = clusters). This chapter and the next (hierarchical clustering) are often examined together, so know how k-means differs from a dendrogram-based approach.

A+Everything unlocked
Unlocks this Bible + all 24 of your University of Melbourne subjects - and 1,000+ Bibles across every Australian university.
Sia - your COMP20008 tutor, unlimited, worked the way the exam marks it
The full 7-page Bible + practice bank with worked solutions
Chrome extension - sync your LMS so Sia knows your deadlines
Bilingual EN / Chinese on every Bible and every Sia answer
$25/ month
30-day money-back · cancel in one tap · how it works
Unlock the full COMP20008 Bible + 24 University of Melbourne subjects解锁完整 COMP20008 Bible + University of Melbourne 24 门科目
$25/mo