COMP5318 · Machine Learning and Data Mining
Foundations of Machine Learning & Data Mining
This opening chapter of University of Sydney COMP5318 Machine Learning and Data Mining sets up the vocabulary and rules every later week reuses: the supervised vs unsupervised split, the empirical-risk vs true-risk trade-off that explains overfitting, feature normalisation, and the distance and similarity measures that most algorithms compare points with. These foundations are examined directly in the closed-book final and quietly underpin every bigger question on it.
What this chapter covers
- 01 Tell supervised from unsupervised learning, and classification from regression
- 02 Name what clustering and dimensionality reduction do when there are no labels
- 03 Define empirical (training) risk vs true (generalisation) risk with 0/1 loss
- 04 Read the training-vs-test error U-curve to spot underfitting and overfitting
- 05 Apply min-max normalisation x' = (x - min) / (max - min) to rescale a feature
- 06 Contrast min-max scaling with z-score standardisation and know when each helps
- 07 Compute Euclidean and Manhattan distance between two vectors by hand
- 08 Use cosine similarity to compare direction rather than magnitude
- 09 Explain why distance-based methods must scale features before measuring
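Objectives 05 and 06 can be compared side by side in a few lines. The sketch below is only an illustration: the `spend` values are hypothetical, the function names are made up for this example, and the z-score uses the population standard deviation (`statistics.pstdev`), which is an assumption since the chapter does not say which variant the course uses.

```python
import statistics

def min_max_scale(values):
    """Rescale values to [0, 1] with x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min and max must differ")
    return [(v - lo) / (hi - lo) for v in values]

def z_score_scale(values):
    """Standardise values with z = (x - mean) / std (population std, an assumption)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("standard deviation must be non-zero")
    return [(v - mean) / std for v in values]

# Hypothetical spend values, not taken from the course material.
spend = [0.0, 100.0, 350.0, 500.0]
spend_with_outlier = spend + [5000.0]

print(min_max_scale(spend))               # every value lands in [0, 1]
print(z_score_scale(spend))               # mean 0, standard deviation 1, no fixed range
print(min_max_scale(spend_with_outlier))  # one extreme value squeezes the rest toward 0
```

Min-max gives a fixed [0, 1] range but is sensitive to a single extreme value, because that value sets the maximum. The z-score has no fixed range and instead expresses each value as a number of standard deviations from the mean.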
Normalise two records, then compute Euclidean distance
Given: X = (age 28, spend 100) and Y = (age 48, spend 350), where age ranges from 18 to 68 and spend from 0 to 500.
- +1 Normalise X. age' = (28 - 18)/(68 - 18) = 10/50 = 0.20; spend' = (100 - 0)/(500 - 0) = 100/500 = 0.20. So X' = (0.20, 0.20).
- +1 Normalise Y. age' = (48 - 18)/(68 - 18) = 30/50 = 0.60; spend' = (350 - 0)/(500 - 0) = 350/500 = 0.70. So Y' = (0.60, 0.70).
- +1 Euclidean distance: subtract slot by slot, square, sum, square-root. D = sqrt((0.60 - 0.20)^2 + (0.70 - 0.20)^2) = sqrt(0.16 + 0.25) = sqrt(0.41), which is about 0.640.
- +1 Why normalise first: on the raw values the spend gap (250) would dwarf the age gap (20) purely because spend has the larger range, so the distance would be decided almost entirely by spend. Rescaling to [0, 1] gives each feature an equal say.
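The worked example above can be checked with a short script. This is a minimal sketch using the records and ranges from the example; the function names are hypothetical, and the Manhattan distance is included only for comparison, since the worked example itself asks for Euclidean distance.

```python
import math

def min_max(x, lo, hi):
    """Rescale x from the range [lo, hi] to [0, 1]."""
    if hi == lo:
        raise ValueError("min and max must differ")
    return (x - lo) / (hi - lo)

def euclidean(a, b):
    """Square root of the sum of squared per-feature differences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute per-feature differences (added for comparison)."""
    return sum(abs(p - q) for p, q in zip(a, b))

# Ranges and raw records from the worked example: (age, spend).
AGE_RANGE = (18, 68)
SPEND_RANGE = (0, 500)
x_raw = (28, 100)
y_raw = (48, 350)

def normalise(record):
    age, spend = record
    return (min_max(age, *AGE_RANGE), min_max(spend, *SPEND_RANGE))

x_norm = normalise(x_raw)
y_norm = normalise(y_raw)

print(x_norm, y_norm)                         # (0.2, 0.2) (0.6, 0.7)
print(round(euclidean(x_norm, y_norm), 3))    # 0.64, i.e. sqrt(0.41)
print(round(manhattan(x_norm, y_norm), 3))    # 0.9, i.e. 0.4 + 0.5
print(round(euclidean(x_raw, y_raw), 1))      # 250.8, almost entirely the spend gap
```

The last line shows the point of the final step: without scaling, the raw distance is nearly equal to the spend gap of 250, so age contributes almost nothing.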
Key terms
- Supervised learning
- Learning from labelled input-output pairs to predict a target: classification when the target is a category, regression when it is a number.
- Unsupervised learning
- Finding structure in unlabelled data, such as grouping similar points (clustering) or compressing many features into a few (dimensionality reduction).
- Empirical (training) risk
- Average loss measured on the data the model was trained on; with 0/1 loss it is just the training error rate. It can be driven toward zero by adding complexity.
- True (generalisation) risk
- Expected loss over all unseen data from the same distribution; the quantity you actually want small. It is estimated by the error on a held-out test set.
- Overfitting
- When a model fits training noise so that training error keeps falling while test error rises; seen as a widening gap between the two error curves.
- Min-max normalisation
- Rescaling a feature to [0, 1] via x' = (x - min)/(max - min), so no single large-range feature dominates a distance.
- Euclidean distance
- Straight-line distance between two vectors: the square root of the sum of squared per-feature differences.
- Cosine similarity
- The cosine of the angle between two vectors, A.B / (||A|| ||B||); it ranges from -1 to 1 and compares direction, ignoring magnitude.
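To see what "compares direction, ignoring magnitude" means in practice, the following sketch computes A.B / (||A|| ||B||) for a few hand-picked vectors. The vectors and function names are hypothetical and chosen only to make the three landmark values (1, 0 and -1) easy to verify by hand.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: A.B / (||A|| ||B||)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    """Straight-line distance, shown here for contrast."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

a = (1.0, 2.0)
same_direction = (10.0, 20.0)   # a scaled by 10
perpendicular = (2.0, -1.0)     # dot product with a is 0
opposite = (-1.0, -2.0)         # a scaled by -1

print(round(cosine_similarity(a, same_direction), 6))  # 1.0
print(round(cosine_similarity(a, perpendicular), 6))   # 0.0
print(round(cosine_similarity(a, opposite), 6))        # -1.0
print(round(euclidean(a, same_direction), 3))          # 20.125, large despite identical direction
```

Scaling a vector leaves its cosine similarity unchanged but moves it far away in Euclidean terms, which is why the two measures can disagree about which points are "close".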
Foundations of Machine Learning & Data Mining FAQ
Is machine learning examined by calculation or by explanation in COMP5318?
Both. The final is paper-based, 2 hours, closed book with only a non-programmable calculator, and mixes short-answer and True/False concept questions with small by-hand calculations. Foundations appear as quick items such as classifying a task as supervised or unsupervised, normalising a feature, or computing a Euclidean or Manhattan distance, so you are marked on method plus the correct final number plus a one-line reason.
What is the difference between empirical risk and true risk, and why does it matter?
Empirical (training) risk is the average error on the data you trained on, which you can always compute. True risk is the expected error on unseen data, which you can only estimate with a separate test set. The distinction matters because minimising training error alone leads to overfitting: as complexity grows, training error keeps dropping while test error eventually rises, so you choose the model at the lowest test error, not the lowest training error.
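The gap between the two risks can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: the one-dimensional task, the 20% label noise, and the choice of a 1-nearest-neighbour "memoriser" as the overly complex model. It is not taken from the course material.

```python
import random

def zero_one_risk(predict, data):
    """Average 0/1 loss: the fraction of (x, y) pairs predicted wrongly."""
    return sum(predict(x) != y for x, y in data) / len(data)

def make_data(n, rng, noise=0.2):
    """Hypothetical task: label is 1 when x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

def one_nn(train):
    """1-nearest-neighbour: predicts the label of the closest training point."""
    def predict(x):
        nearest = min(train, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return predict

def threshold_rule(x):
    """A much simpler model that ignores the noise in the training labels."""
    return int(x > 0.5)

rng = random.Random(0)          # fixed seed so the run is repeatable
train = make_data(200, rng)
test = make_data(2000, rng)     # held-out sample used to estimate true risk

memoriser = one_nn(train)
print("1-NN       train:", zero_one_risk(memoriser, train),
      "test:", zero_one_risk(memoriser, test))
print("threshold  train:", zero_one_risk(threshold_rule, train),
      "test:", zero_one_risk(threshold_rule, test))
```

The memoriser reaches zero empirical risk because each training point is its own nearest neighbour, yet its test error is well above zero because it has also memorised the flipped labels. The simple threshold rule typically shows similar error on both sets, which is what a small generalisation gap looks like.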
Can AI help me learn the foundations of machine learning in COMP5318?
Yes, for understanding rather than for handing in answers. A study tool like Sia can explain, step by step, why you normalise before computing a distance, walk you through a Euclidean-vs-Manhattan calculation on practice numbers, or check your reasoning about overfitting. Use it to build the method so you can reproduce it under closed-book exam conditions; do not use it to obtain answers for assessed quizzes or assignments, and always acknowledge any AI tools you use, as the University's academic-integrity policy requires.
Exam move
Treat this chapter as the toolkit you will reach for all semester, so drill it until it is automatic. First, be able to classify any task in one line as supervised (classification or regression) or unsupervised (clustering or dimensionality reduction) with the reason. Second, understand the empirical-vs-true-risk U-curve well enough to explain underfitting, overfitting and the sweet spot in your own words. Third, make the calculations mechanical: min-max normalise a feature, then compute Euclidean and Manhattan distances by hand, remembering to scale before you measure. On a 2-hour (120-minute) paper, budget about one minute per mark, so a five-mark normalise-and-distance question deserves roughly five minutes. Keep the final exam's two hurdles in view: you need at least 40% in the exam to avoid your mark being capped at 45, and at least 50% overall to pass. Confirm the exact exam date on the Canvas exam timetable.