COMP20008 · Elements Of Data Processing
Supervised Learning: k-NN & Decision Trees
Supervised learning predicts a label (classification) or a value (regression) from labelled examples. This chapter covers k-Nearest Neighbours (classify by majority vote of the k closest points — sensitive to k, to feature scaling and to the cold-start problem in recommenders) and decision trees (recursively split on the feature giving the best purity gain, with depth controlling overfitting). It is examined as short-answer + critique: the 2025 exam's Q6 probes a high-cardinality categorical split that overfits, and Q7 critiques a Jaccard-based k-NN recommender's cold-start failure.
What this chapter covers
- 011. Classification vs regression: predict a class/label vs a continuous value
- 022. Train/test split: learn from labelled data, evaluate on held-out data
- 033. k-NN: classify by majority vote of the k nearest neighbours (distance metric, scale features)
- 044. k-NN pitfalls: choice of k, irrelevant features, lazy learner, cold-start in recommenders
- 055. Jaccard-based user k-NN: new-user cold-start (tiny |A∩B|), niche-user irrelevance, drift
- 066. Decision trees: split on the feature with the best information gain (entropy reduction) or Gini
- 077. Depth controls complexity/overfitting
- 088. High-cardinality split: a near-unique feature gives a pure-looking split that memorises and overfits
High-cardinality split overfits a decision tree (mirrors 2025 Q6)
- 1 markBecause "colour" has roughly one unique value per row, splitting on it sends almost every row to its own leaf, so each leaf is trivially pure — the tree effectively memorises the training rows rather than learning a pattern.
- 1 markInformation gain is biased toward high-cardinality features: a feature with many distinct values can always partition the data into pure subsets, so it looks like the best split even though it carries no generalisable signal.
- 1 markThe result is severe overfitting: the model scores perfectly on training data but fails on new cars, because a colour it has never seen (or seen once) gives it nothing to predict from.
- 1 markTreating colour as numeric 1–80 is also wrong: it invents a false ordering (colour 80 is not "more" than colour 1) and arbitrary thresholds, imposing structure that does not exist in a nominal category.
Key terms
- Classification vs regression
- Two supervised tasks: classification predicts a discrete class/label (spam or not), regression predicts a continuous value (house price). Both learn from labelled training data and are evaluated on a held-out test set.
- k-Nearest Neighbours (k-NN)
- A lazy classifier that labels a query point by the majority vote of its k nearest neighbours under some distance metric. It is simple but sensitive to the choice of k, to feature scaling, and to irrelevant features.
- Cold-start problem
- In a k-NN recommender, a new user has almost no history, so the overlap with other users (e.g. Jaccard similarity |A∩B|/|A∪B|) is tiny and unreliable, and the system cannot find good neighbours to recommend from. Fixes include content/hybrid methods, popularity priors and recency decay.
- Decision tree
- A model that recursively splits the data on the feature giving the best purity gain (information gain = entropy reduction, or Gini), forming a tree of decisions. Tree depth controls complexity: deeper trees fit more but overfit more.
- Information gain
- The reduction in entropy achieved by splitting on a feature; decision trees pick the split with the highest information gain. It is biased toward high-cardinality features, which can create pure-looking but overfitting splits.
- High-cardinality overfitting
- A categorical feature with many distinct values (near one per row) partitions the data into trivially pure leaves, so the tree memorises the training set and fails to generalise. Treating such a feature as numeric also wrongly imposes an order and thresholds.
Supervised Learning: k-NN & Decision Trees FAQ
Why does feature scaling matter so much for k-NN?
k-NN decides neighbours by distance, so a feature measured on a large scale (income in dollars) will dominate the distance and swamp a small-scale feature (age in years), purely because of its units. Without scaling, the "nearest" neighbours are chosen mostly by the big-range feature, distorting the classification. Scaling each feature (min-max or z-score) puts them on a comparable footing so distance reflects genuine similarity.
Why does a k-NN recommender struggle with new users?
This is the cold-start problem. A new user has rated almost nothing, so the intersection with any other user's items (the |A∩B| in Jaccard similarity) is tiny, making the similarity scores noisy and unreliable. The system cannot find trustworthy neighbours to recommend from. Mitigations include content-based or hybrid recommenders, popularity priors for new users, recency/decay weighting, and dimensionality reduction or approximate nearest-neighbour search for scale.
Why is a feature with many unique values dangerous in a decision tree?
Because information gain is biased toward high-cardinality features: a feature with nearly one distinct value per row can always split the data into pure leaves, so it looks like the best split even though it has no generalisable signal. The tree ends up memorising training rows and overfits, failing on new data. Encoding the feature as a number to avoid this just creates a different problem — a false ordering and arbitrary thresholds on a nominal category.
How is supervised learning examined in COMP20008?
As short-answer plus critique, mirroring 2025 Q6 (the high-cardinality decision-tree split) and Q7 (the Jaccard k-NN recommender's cold-start). Marks reward naming the right failure mode — k-NN's scaling/cold-start sensitivity, information gain's high-cardinality bias and the resulting overfitting — and proposing a sensible fix, rather than coding a classifier.
Exam move
Anchor the chapter on the classification-vs-regression distinction and the train/test discipline, then learn each model by its standard failure mode, because that is what the exam tests. For k-NN, rehearse two critiques: scaling (a big-range feature dominates distance) and cold-start (new users have tiny overlap, so Jaccard similarity is unreliable), each with a fix. For decision trees, memorise the high-cardinality story — information gain is biased toward many-valued features, giving pure-looking splits that memorise and overfit, and numeric encoding of a nominal category fabricates an order. Keep "information gain = entropy reduction" and "depth controls overfitting" in active recall. These critique patterns recur, so practise stating cause, symptom and mitigation in three crisp lines.