University of Sydney · FACULTY OF COMPUTER SCIENCE

COMP5318 · Machine Learning and Data Mining

- one subject, every graph, every model, every mark
Computer Science14 Chapters11-page Bible
Our own words - no uploaded lecturer files
Updated for this semester
Chapter 9 of 11 · COMP5318

Clustering

This Week 10 topic of COMP5318 Machine Learning and Data Mining at the University of Sydney moves from supervised prediction to unsupervised learning: grouping unlabelled data so that points inside a cluster are similar (high cohesion) and clusters are far apart (high separation). It covers the four algorithm families examined here — K-means (partitional), GMM with the EM algorithm (model-based), agglomerative (hierarchical), and DBSCAN (density-based) — plus how clustering quality is measured. All of it shows up as short by-hand problems in the closed-book final.

In this chapter

What this chapter covers

  • 01Define clustering as unsupervised grouping and explain cohesion vs separation as the quality goal
  • 02Measure similarity with Euclidean, Manhattan and cosine distances, and know when to scale features
  • 03Distinguish a cluster's centroid (the mean) from its medoid (the most central real point)
  • 04Run the K-means loop — assign to nearest centroid, recompute the mean, repeat until centroids stop moving
  • 05Explain why K-means minimises SSE, is sensitive to initialisation, and struggles with non-spherical or unequal clusters
  • 06Describe GMM and the EM algorithm as a soft, probabilistic generalisation of K-means
  • 07Perform agglomerative clustering with single/complete/average linkage and read off a dendrogram
  • 08Label points as core, border or noise in DBSCAN and find clusters from Eps and MinPts
  • 09Evaluate a clustering with SSE, BSE and the silhouette coefficient, and pick k with the elbow method
Worked example · free

Run K-means with k = 2 to convergence

Q [6 marks]. Five points: P₁(1,1), P₂(1,2), P₃(6,5), P₄(7,5), P₅(2,1). Use k = 2 with initial centroids c₁ = (1,1) and c₂ = (6,5) and squared-Euclidean distance. Show the clusters and centroids after each epoch, say when it stops, and give the final SSE.
  • +1Epoch 1 — assign each point to the nearer centroid. Squared distances to c₁(1,1) vs c₂(6,5): P₁ 0 vs 41 →c₁; P₂ 1 vs 34 →c₁; P₃ 41 vs 0 →c₂; P₄ 52 vs 1 →c₂; P₅ 1 vs 32 →c₁.
  • +1Clusters after epoch 1: C₁ = {(1,1),(1,2),(2,1)} and C₂ = {(6,5),(7,5)}.
  • +1Recompute centroids as the means: c₁ = ((1+1+2)/3, (1+2+1)/3) = (4/3, 4/3) ≈ (1.33, 1.33); c₂ = ((6+7)/2, (5+5)/2) = (6.5, 5).
  • +1Epoch 2 — reassign to the new centroids. P₁, P₂, P₅ stay far nearer c₁(1.33,1.33) than c₂(6.5,5), and P₃, P₄ stay nearer c₂ (e.g. P₂: d²(c₁)=0.11+0.44=0.56 vs d²(c₂)=30.25+9=39.25).
  • +1No point changes cluster, so the recomputed centroids equal those from epoch 1 — the stopping condition (centroids do not change) is met, and K-means has converged.
  • +1Final SSE = sum of squared distances to each own centroid: C₁ gives 0.22 + 0.56 + 0.56 = 1.33, C₂ gives 0.25 + 0.25 = 0.50, so SSE = 1.83.
Two clusters: C₁ = {(1,1),(1,2),(2,1)} with centroid ≈ (1.33, 1.33) and C₂ = {(6,5),(7,5)} with centroid (6.5, 5). K-means converges after the epoch-2 re-check (no point is reassigned), with a final within-cluster SSE ≈ 1.83.
Sia tip — Assignment only needs to know which centroid is nearer, and squaring is monotonic, so compare squared Euclidean distances for cleaner arithmetic — take the actual square root only if a distance value is asked for. Lay out the distance-to-each-centroid comparison in a small table so the marker can award the method marks even if one number slips.
Glossary

Key terms

Cohesion vs separation
The two goals of a good clustering: high cohesion means small distances within a cluster (points are similar); high separation means large distances between clusters (clusters are dissimilar).
Centroid / medoid
A cluster's centroid is the coordinate-wise mean of its points (usually not a real data point); its medoid is the most central actual data point. K-means uses centroids.
K-means
A partitional method: choose k centroids, assign each point to the nearest, recompute each centroid as the mean, and repeat until centroids stop changing. It minimises within-cluster SSE.
SSE (within-cluster)
Sum of Squared Errors, Σᵢ Σ(x∈Kᵢ) d(cᵢ,x)² — the squared distance from each point to its cluster centroid, summed; lower SSE means tighter, more cohesive clusters.
GMM / EM algorithm
A model-based method assuming the data comes from k Gaussians. Expectation–Maximization alternates an E-step (probability each point belongs to each cluster) with an M-step (re-estimate the parameters), giving soft, probabilistic assignment.
Agglomerative linkage
Bottom-up hierarchical clustering that repeatedly merges the two closest clusters. Single link uses the minimum pairwise distance, complete link the maximum, average link the average; the merges form a dendrogram.
DBSCAN core / border / noise
Density-based clustering with radius Eps and count MinPts. A core point has ≥ MinPts in its Eps-neighbourhood (including itself); a border point is within Eps of a core point; a noise point is neither and is discarded.
Silhouette coefficient
sᵢ = (bᵢ − aᵢ)/max(aᵢ,bᵢ), where aᵢ is a point's average distance to its own cluster and bᵢ the smallest average distance to another cluster; it runs from −1 to 1, and near +1 means well clustered.
FAQ

Clustering FAQ

How is clustering different from classification?

Classification is supervised — it is given class labels and learns to predict them. Clustering is unsupervised — there are no labels, so it can only group similar examples together, never 'predict a class'. That is why clustering quality is judged by internal measures such as cohesion and separation (or the silhouette coefficient), not by accuracy against labels, unless a separate ground-truth is explicitly provided for an external measure.

When should I use DBSCAN instead of K-means?

Use K-means when you know roughly how many clusters you want and the clusters are compact and roughly spherical — it is simple and efficient, but you must set k in advance and it is dragged around by outliers. Use DBSCAN when clusters have arbitrary shapes or you expect noise: it labels sparse points as noise and finds dense regions of any shape, and it does not need k up front. The trade-off is that DBSCAN is sensitive to its Eps and MinPts parameters and struggles when clusters have very different densities.

Can AI help me with clustering in COMP5318?

Yes, for understanding. Sia can explain each step — how a K-means iteration assigns points and recomputes centroids, how single-link merges build a dendrogram, or how DBSCAN labels core, border and noise points — and walk you through a practice problem with your own numbers so you learn the method. Use it to check your reasoning and rehearse the derivations, not to obtain answers to submitted assignments or the closed-book exam; the unit requires you to acknowledge any AI tools used in assessable work, so keep AI to learning and revision.

Studying with AI? Sia — free AI machine learning tutor works through COMP5318 step by step.

Study strategy

Exam move

Treat clustering as one goal and four algorithms. First lock the goal: high cohesion (tight clusters, low SSE) plus high separation (clusters far apart). Then drill each algorithm as a by-hand loop on small datasets — K-means (assign to nearest centroid, recompute the mean, repeat until nothing changes), a single-link or complete-link agglomerative merge that you turn into a dendrogram, and a DBSCAN pass that labels every point core, border or noise from Eps and MinPts. Keep GMM/EM at the conceptual level: it is a soft, probabilistic generalisation of K-means with elliptical clusters. Practise until a full K-means or dendrogram problem takes only a few minutes, and always show the assignment or merge step before the number. The final is 2 hours, closed book, with a non-programmable calculator only; clustering appears as a short numeric problem, so budget about one minute per mark — a 6-mark K-means question is roughly 6 minutes, which keeps the whole paper inside the two hours. Remember the hurdles: at least 40% on the final exam (or the unit mark is capped at 45) and at least 50% overall to pass, and confirm the exact exam date on the Canvas exam timetable.

A+Everything unlocked
Unlocks this Bible + all 25 of your University of Sydney subjects - and 1,000+ Bibles across every Australian university.
Sia - your COMP5318 tutor, unlimited, worked the way the exam marks it
The full 11-page Bible + practice bank with worked solutions
Chrome extension - sync your LMS so Sia knows your deadlines
Bilingual EN / Chinese on every Bible and every Sia answer
$25/ month
30-day money-back · cancel in one tap · how it works
Unlock the full COMP5318 Bible + 25 University of Sydney subjects解锁完整 COMP5318 Bible + University of Sydney 25 门科目
$25/mo