COMP5318 · Machine Learning and Data Mining
k-Nearest Neighbour & Rule-Based Classifiers
This chapter covers k-Nearest Neighbour (k-NN) and the rule-based classifiers (1R and PRISM) from the Week 2 classification block of COMP5318 Machine Learning and Data Mining at the University of Sydney. You will learn how k-NN labels a new point by a majority vote of its closest neighbours, how to compute the distances that decide “closest”, and how the choice of k trades bias against variance. It is a dependable source of short numeric and explain questions in the final exam.
What this chapter covers
- 01Why k-NN is a lazy learner: it stores the data and defers all work to prediction time
- 02The four-step k-NN algorithm: distance, sort, take k nearest, majority vote (average for regression)
- 03Computing Euclidean (straight-line) and Manhattan (city-block) distances by hand
- 04Why you must scale features before measuring distance, so no large-range feature dominates
- 05How k controls the decision boundary: small k overfits, large k oversmooths, k=n predicts the majority
- 06Choosing k with validation error, and the rule of thumb k ≤ √(number of training examples)
- 07The speed and memory cost of k-NN, and why eager learners suit real-time prediction
- 08Rule-based baselines: 1R (one attribute) and PRISM (a covering rule set)
Classify a point with k-NN (k=3, Euclidean)
- +1Compute the squared Euclidean distance from (4,4) to each training point (comparing squares is enough to rank, so no square roots are needed): (1,1)→3²+3²=18; (2,3)→2²+1²=5; (5,4)→1²+0²=1; (6,2)→2²+2²=8; (3,5)→1²+1²=2.
- +1Sort ascending and take the three smallest: (5,4)=1, (3,5)=2, (2,3)=5 are the three nearest; (6,2)=8 and (1,1)=18 are further away.
- +1Read off the labels of those three neighbours: (5,4)→B, (3,5)→A, (2,3)→A, giving the set {B, A, A}.
- +1Take the majority vote: two A against one B → predict class A.
Key terms
- k-Nearest Neighbour (k-NN)
- A classifier that labels a new point by a majority vote of the k closest training points under a distance measure; for regression it averages their target values.
- Lazy learner
- A method that does no training — it simply stores the training set and does all computation at prediction time. k-NN is the standard example; Naïve Bayes and decision trees are eager.
- Euclidean distance
- The straight-line distance between two points: the square root of the summed squared coordinate differences, √(Σ(aᵢ−bᵢ)²).
- Manhattan distance
- The city-block distance: the sum of the absolute coordinate differences, Σ|aᵢ−bᵢ| — the total of axis-aligned steps between the points.
- Feature scaling
- Rescaling features (for example min-max to [0,1] or the z-score) so that a feature with a large numeric range does not dominate the distance calculation.
- Bias-variance effect of k
- Small k gives a jagged boundary that overfits noise (low bias, high variance); large k smooths the boundary (high bias, low variance); at k=n every point is labelled with the global majority class.
- 1R
- A rule-based baseline that builds a single rule on one attribute (a decision stump): each attribute value is assigned its majority class, and the attribute with the lowest total error is kept.
- PRISM
- A rule-based covering algorithm that builds a set of IF-THEN rules one class at a time, adding conditions to each rule to push it toward covering only examples of its class.
k-Nearest Neighbour & Rule-Based Classifiers FAQ
Do I have to know both Euclidean and Manhattan distance for the exam?
Yes. Short distance calculations appear in the k-NN style of question, and the paper may specify either metric, so practise both. Euclidean squares each coordinate difference then takes a square root; Manhattan sums the absolute differences. Because the exam is closed book with only a non-programmable calculator, the numbers are kept small enough to work by hand — always show the working, since method marks are awarded even if one slip changes the final vote.
Why is k-NN a poor choice for real-time classification?
k-NN is a lazy learner: it does no training and instead compares each new point to every stored training point, so a single prediction over m training rows of n features costs about m×n distance computations and the whole training set must stay in memory. For a task needing thousands of classifications per second, an eager learner that precomputes a model — Naïve Bayes is the usual answer — predicts far faster. State the reason (lazy versus eager, prediction cost), not just the name.
Can AI help me with k-Nearest Neighbour in COMP5318?
Yes, as a study aid. Sia, the AskSia tutor, can explain k-NN step by step — how to set up a distance table, sort the neighbours, run the majority vote, and reason about how the choice of k changes the boundary — and it can generate fresh practice points so you build the method yourself. It will not do your graded assignments or exam for you, and it cannot promise a mark or a pass; the University also requires you to acknowledge any AI tools used in assessable work. Use it to understand the method, then practise unaided under exam conditions.
Studying with AI? Sia — free AI machine learning tutor works through COMP5318 step by step.
Exam move
Make the k-NN procedure mechanical first: given a test point, build a distance table, sort it, take the k nearest, and vote — rehearse this on small integer coordinates until it is automatic, since that is exactly how the final exam frames it. Layer the concepts on top: be able to write both distance formulas from memory, explain in one sentence why you scale features before measuring distance, and describe how the boundary and the two errors move as k grows (small k overfits, large k oversmooths, k=n predicts the majority). Keep the trade-off ready for the recurring “which classifier?” question — k-NN's slow prediction is its deciding weakness against an eager learner. Budget roughly one minute per mark, so a 4-mark classification takes about four to five minutes, and finish by drilling the 1R and PRISM baselines. The final exam is worth the majority of the unit and carries a minimum-mark hurdle, so this quick-scoring topic is worth banking cleanly; confirm the exact exam date on the Canvas exam timetable.