University of Melbourne · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

COMP20008 · Elements Of Data Processing

- one subject, every graph, every model, every mark
50% final exam · hurdle14 Chapters8-page Bible
Our own words - no uploaded lecturer files
Built to mirror S1 2026 · updated this semester
Chapter 9 of 12 · COMP20008

Supervised Learning: k-NN & Decision Trees

Supervised learning predicts a label (classification) or a value (regression) from labelled examples. This chapter covers k-Nearest Neighbours (classify by majority vote of the k closest points — sensitive to k, to feature scaling and to the cold-start problem in recommenders) and decision trees (recursively split on the feature giving the best purity gain, with depth controlling overfitting). It is examined as short-answer + critique: the 2025 exam's Q6 probes a high-cardinality categorical split that overfits, and Q7 critiques a Jaccard-based k-NN recommender's cold-start failure.

In this chapter

What this chapter covers

  • 011. Classification vs regression: predict a class/label vs a continuous value
  • 022. Train/test split: learn from labelled data, evaluate on held-out data
  • 033. k-NN: classify by majority vote of the k nearest neighbours (distance metric, scale features)
  • 044. k-NN pitfalls: choice of k, irrelevant features, lazy learner, cold-start in recommenders
  • 055. Jaccard-based user k-NN: new-user cold-start (tiny |A∩B|), niche-user irrelevance, drift
  • 066. Decision trees: split on the feature with the best information gain (entropy reduction) or Gini
  • 077. Depth controls complexity/overfitting
  • 088. High-cardinality split: a near-unique feature gives a pure-looking split that memorises and overfits
Worked example · free

High-cardinality split overfits a decision tree (mirrors 2025 Q6)

Q [4 marks]. A decision tree is trained to predict whether a car will sell, and one candidate feature is "colour" with about 80 unique values, almost one per row. The tree splits on "colour" and achieves a perfect fit on the training data. Explain why this split is a problem, and what goes wrong if you instead treat colour as a numeric value 1–80.
  • 1 markBecause "colour" has roughly one unique value per row, splitting on it sends almost every row to its own leaf, so each leaf is trivially pure — the tree effectively memorises the training rows rather than learning a pattern.
  • 1 markInformation gain is biased toward high-cardinality features: a feature with many distinct values can always partition the data into pure subsets, so it looks like the best split even though it carries no generalisable signal.
  • 1 markThe result is severe overfitting: the model scores perfectly on training data but fails on new cars, because a colour it has never seen (or seen once) gives it nothing to predict from.
  • 1 markTreating colour as numeric 1–80 is also wrong: it invents a false ordering (colour 80 is not "more" than colour 1) and arbitrary thresholds, imposing structure that does not exist in a nominal category.
A near-unique feature gives pure-looking splits that memorise the training data (information gain is biased toward high cardinality), so the tree overfits; encoding it as numeric 1–80 invents a false order and arbitrary thresholds.
Sia tip — The phrase the marker wants is "information gain is biased toward high-cardinality features" plus "memorises/overfits" — name both the cause and the symptom, and flag that numeric encoding of a nominal category fabricates an ordering.
Glossary

Key terms

Classification vs regression
Two supervised tasks: classification predicts a discrete class/label (spam or not), regression predicts a continuous value (house price). Both learn from labelled training data and are evaluated on a held-out test set.
k-Nearest Neighbours (k-NN)
A lazy classifier that labels a query point by the majority vote of its k nearest neighbours under some distance metric. It is simple but sensitive to the choice of k, to feature scaling, and to irrelevant features.
Cold-start problem
In a k-NN recommender, a new user has almost no history, so the overlap with other users (e.g. Jaccard similarity |A∩B|/|A∪B|) is tiny and unreliable, and the system cannot find good neighbours to recommend from. Fixes include content/hybrid methods, popularity priors and recency decay.
Decision tree
A model that recursively splits the data on the feature giving the best purity gain (information gain = entropy reduction, or Gini), forming a tree of decisions. Tree depth controls complexity: deeper trees fit more but overfit more.
Information gain
The reduction in entropy achieved by splitting on a feature; decision trees pick the split with the highest information gain. It is biased toward high-cardinality features, which can create pure-looking but overfitting splits.
High-cardinality overfitting
A categorical feature with many distinct values (near one per row) partitions the data into trivially pure leaves, so the tree memorises the training set and fails to generalise. Treating such a feature as numeric also wrongly imposes an order and thresholds.
FAQ

Supervised Learning: k-NN & Decision Trees FAQ

Why does feature scaling matter so much for k-NN?

k-NN decides neighbours by distance, so a feature measured on a large scale (income in dollars) will dominate the distance and swamp a small-scale feature (age in years), purely because of its units. Without scaling, the "nearest" neighbours are chosen mostly by the big-range feature, distorting the classification. Scaling each feature (min-max or z-score) puts them on a comparable footing so distance reflects genuine similarity.

Why does a k-NN recommender struggle with new users?

This is the cold-start problem. A new user has rated almost nothing, so the intersection with any other user's items (the |A∩B| in Jaccard similarity) is tiny, making the similarity scores noisy and unreliable. The system cannot find trustworthy neighbours to recommend from. Mitigations include content-based or hybrid recommenders, popularity priors for new users, recency/decay weighting, and dimensionality reduction or approximate nearest-neighbour search for scale.

Why is a feature with many unique values dangerous in a decision tree?

Because information gain is biased toward high-cardinality features: a feature with nearly one distinct value per row can always split the data into pure leaves, so it looks like the best split even though it has no generalisable signal. The tree ends up memorising training rows and overfits, failing on new data. Encoding the feature as a number to avoid this just creates a different problem — a false ordering and arbitrary thresholds on a nominal category.

How is supervised learning examined in COMP20008?

As short-answer plus critique, mirroring 2025 Q6 (the high-cardinality decision-tree split) and Q7 (the Jaccard k-NN recommender's cold-start). Marks reward naming the right failure mode — k-NN's scaling/cold-start sensitivity, information gain's high-cardinality bias and the resulting overfitting — and proposing a sensible fix, rather than coding a classifier.

Study strategy

Exam move

Anchor the chapter on the classification-vs-regression distinction and the train/test discipline, then learn each model by its standard failure mode, because that is what the exam tests. For k-NN, rehearse two critiques: scaling (a big-range feature dominates distance) and cold-start (new users have tiny overlap, so Jaccard similarity is unreliable), each with a fix. For decision trees, memorise the high-cardinality story — information gain is biased toward many-valued features, giving pure-looking splits that memorise and overfit, and numeric encoding of a nominal category fabricates an order. Keep "information gain = entropy reduction" and "depth controls overfitting" in active recall. These critique patterns recur, so practise stating cause, symptom and mitigation in three crisp lines.

A+Everything unlocked
Unlocks this Bible + all 24 of your University of Melbourne subjects - and 1,000+ Bibles across every Australian university.
Sia - your COMP20008 tutor, unlimited, worked the way the exam marks it
The full 8-page Bible + practice bank with worked solutions
Chrome extension - sync your LMS so Sia knows your deadlines
Bilingual EN / Chinese on every Bible and every Sia answer
$25/ month
30-day money-back · cancel in one tap · how it works
Unlock the full COMP20008 Bible + 24 University of Melbourne subjects解锁完整 COMP20008 Bible + University of Melbourne 24 门科目
$25/mo