University of Sydney · FACULTY OF COMPUTER SCIENCE

COMP5318 · Machine Learning and Data Mining

- one subject, every graph, every model, every mark

Computer Science · 14 Chapters · 7-page Bible

Our own words - no uploaded lecturer files

Updated for this semester

Chapter 1 of 11 · COMP5318

Foundations of Machine Learning & Data Mining

This opening chapter of University of Sydney COMP5318 Machine Learning and Data Mining sets up the vocabulary and rules every later week reuses: the supervised vs unsupervised split, the empirical-risk vs true-risk trade-off that explains overfitting, feature normalisation, and the distance and similarity measures that most algorithms compare points with. These foundations are examined directly in the closed-book final and quietly underpin every bigger question on it.

In this chapter

What this chapter covers

01 Tell supervised from unsupervised learning, and classification from regression
02 Name what clustering and dimensionality reduction do when there are no labels
03 Define empirical (training) risk vs true (generalisation) risk with 0/1 loss
04 Read the training-vs-test error U-curve to spot underfitting and overfitting
05 Apply min-max normalisation x' = (x - min) / (max - min) to rescale a feature
06 Contrast min-max scaling with z-score standardisation and know when each helps
07 Compute Euclidean and Manhattan distance between two vectors by hand
08 Use cosine similarity to compare direction rather than magnitude
09 Explain why distance-based methods must scale features before measuring

Worked example

Normalise two records, then compute Euclidean distance

Q [4 marks]. Two customers are described by age (years) and annual spend ($). Across the dataset, age ranges 18-68 and spend ranges 0-500. The customers are X = (age 28, spend 100) and Y = (age 48, spend 350). Apply min-max normalisation to both, then compute the Euclidean distance on the normalised values. Why must you normalise first?

+1 Normalise X. age' = (28 - 18)/(68 - 18) = 10/50 = 0.20; spend' = (100 - 0)/(500 - 0) = 0.20. So X' = (0.20, 0.20).
+1 Normalise Y. age' = (48 - 18)/(68 - 18) = 30/50 = 0.60; spend' = (350 - 0)/(500 - 0) = 0.70. So Y' = (0.60, 0.70).
+1 Euclidean distance: subtract slot by slot, square, sum, square-root. D = sqrt((0.60 - 0.20)^2 + (0.70 - 0.20)^2) = sqrt(0.16 + 0.25) = sqrt(0.41), which is about 0.640.
+1 Why normalise first: on the raw values the spend gap (250) would dwarf the age gap (20) purely because spend has the larger range, so the distance would be decided almost entirely by spend. Rescaling to [0, 1] gives each feature an equal say.

X' = (0.20, 0.20), Y' = (0.60, 0.70); Euclidean distance = sqrt(0.41), about 0.640. Normalising first stops the larger-range feature (spend) from dominating the distance.
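The hand calculation above can be checked with a short Python sketch. The helper names `min_max` and `euclidean` are illustrative choices, not taken from the course materials; the ranges and customer values are the ones given in the question.

```python
import math

def min_max(x, lo, hi):
    """Rescale x to [0, 1] using the feature's dataset minimum and maximum."""
    return (x - lo) / (hi - lo)

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Dataset ranges from the question: age 18-68, spend 0-500.
ranges = [(18, 68), (0, 500)]
x_raw = (28, 100)   # customer X: (age, spend)
y_raw = (48, 350)   # customer Y: (age, spend)

x_norm = tuple(min_max(v, lo, hi) for v, (lo, hi) in zip(x_raw, ranges))
y_norm = tuple(min_max(v, lo, hi) for v, (lo, hi) in zip(y_raw, ranges))

d_norm = euclidean(x_norm, y_norm)  # distance after normalising
d_raw = euclidean(x_raw, y_raw)     # distance on the raw values, for contrast

print("X' =", x_norm, " Y' =", y_norm)
print("normalised distance:", round(d_norm, 3))
print("raw distance:", round(d_raw, 1))
```

The normalised distance comes out at about 0.640, matching the worked answer. The raw distance is about 250.8, barely more than the spend gap of 250 on its own, which is the domination effect the last mark asks you to explain.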

Sia tip: Min-max uses (x - min)/(max - min), not x/max. If you forget to subtract the minimum, the smallest value no longer maps to 0 and every rescaled value comes out wrong. And keep the distance formulae apart: Euclidean squares and roots, Manhattan just sums absolute differences.
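Both points in the tip can be seen on the numbers from the worked example. This is a minimal sketch; the function name `manhattan` and the variable names are illustrative, and the normalised vectors are copied from the worked answer.

```python
def manhattan(a, b):
    """Sum of absolute per-feature differences (no squaring, no root)."""
    return sum(abs(p - q) for p, q in zip(a, b))

# Normalised customers from the worked example.
x_norm = (0.20, 0.20)
y_norm = (0.60, 0.70)
d_manhattan = manhattan(x_norm, y_norm)  # |0.40| + |0.50|

# Min-max versus the x/max mistake, for age 28 with dataset range 18-68.
correct = (28 - 18) / (68 - 18)  # subtracts the minimum first
wrong = 28 / 68                  # forgets the minimum

# With x/max, even the youngest customer (age 18) does not map to 0.
wrong_at_min = 18 / 68

print("Manhattan distance:", round(d_manhattan, 2))
print("min-max:", round(correct, 3), " x/max:", round(wrong, 3))
print("x/max at the minimum age:", round(wrong_at_min, 3))
```

The Manhattan distance here is 0.90, larger than the Euclidean 0.640 on the same pair, so mixing up the two formulae changes the final number.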

Glossary

Key terms

Supervised learning: Learning from labelled input-output pairs to predict a target: classification when the target is a category, regression when it is a number.
Unsupervised learning: Finding structure in unlabelled data, such as grouping similar points (clustering) or compressing many features into a few (dimensionality reduction).
Empirical (training) risk: Average loss measured on the data the model was trained on; with 0/1 loss it is just the training error rate. It can be driven toward zero by adding complexity.
True (generalisation) risk: Expected loss over all unseen data from the same distribution; the quantity you actually want small. It is estimated by the error on a held-out test set.
Overfitting: When a model fits training noise so that training error keeps falling while test error rises; seen as a widening gap between the two error curves.
Min-max normalisation: Rescaling a feature to [0, 1] via x' = (x - min)/(max - min), so no single large-range feature dominates a distance.
Euclidean distance: Straight-line distance sqrt(sum of squared per-feature differences) between two vectors.
Cosine similarity: The cosine of the angle between two vectors, A.B / (||A|| ||B||); it ranges from -1 to 1 and compares direction, ignoring magnitude.
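As a quick illustration of the cosine similarity entry, the sketch below applies A.B / (||A|| ||B||) to a few made-up vectors. The function name and the example vectors are hypothetical, chosen only to show the three landmark values 1, 0 and -1.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; undefined if either is the zero vector."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

a = (1.0, 2.0)
same_direction = (10.0, 20.0)  # a scaled by 10: same direction, larger magnitude
perpendicular = (2.0, -1.0)    # dot product with a is zero
opposite = (-1.0, -2.0)        # a reversed

sim_same = cosine_similarity(a, same_direction)
sim_perp = cosine_similarity(a, perpendicular)
sim_opp = cosine_similarity(a, opposite)

print(round(sim_same, 6), round(sim_perp, 6), round(sim_opp, 6))
```

Scaling a vector by 10 leaves the similarity at 1, which is what "ignoring magnitude" means in practice; the Euclidean distance between the same two vectors would be large.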

FAQ

Foundations of Machine Learning & Data Mining FAQ

Is machine learning examined by calculation or by explanation in COMP5318?

Both. The final is paper-based, 2 hours, closed book with only a non-programmable calculator, and mixes short-answer and True/False concept questions with small by-hand calculations. Foundations appear as quick items such as classifying a task as supervised or unsupervised, normalising a feature, or computing a Euclidean or Manhattan distance, so you are marked on method plus the correct final number plus a one-line reason.

What is the difference between empirical risk and true risk, and why does it matter?

Empirical (training) risk is the average error on the data you trained on, which you can always compute. True risk is the expected error on unseen data, which you can only estimate with a separate test set. The distinction matters because minimising training error alone leads to overfitting: as complexity grows, training error keeps dropping while test error eventually rises, so you choose the model with the lowest test error, not the lowest training error.
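To make the gap between the two risks concrete, here is a small sketch under stated assumptions. It uses a k-nearest-neighbour classifier, which is swapped in purely for illustration and is not introduced in this chapter, on synthetic data with 20% of labels flipped. All function names, the data generator and the choices k = 1 and k = 15 are hypothetical. Risk is measured with 0/1 loss, so it is just the error rate.

```python
import random

def make_data(n, rng, noise=0.2):
    """Points in the unit square; label is 1 when x0 + x1 > 1, then flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        y = int(x[0] + x[1] > 1)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (squared Euclidean distance)."""
    nearest = sorted(
        train,
        key=lambda item: (item[0][0] - x[0]) ** 2 + (item[0][1] - x[1]) ** 2,
    )[:k]
    votes = sum(y for _, y in nearest)
    return int(2 * votes > k)

def zero_one_risk(train, data, k):
    """Average 0/1 loss of the k-NN model built on `train`, evaluated on `data`."""
    errors = sum(knn_predict(train, x, k) != y for x, y in data)
    return errors / len(data)

rng = random.Random(0)       # fixed seed so the run is repeatable
train = make_data(200, rng)
test = make_data(200, rng)   # held-out data from the same distribution

for k in (1, 15):
    train_risk = zero_one_risk(train, train, k)
    test_risk = zero_one_risk(train, test, k)
    print(f"k={k:>2}  training risk={train_risk:.3f}  test risk={test_risk:.3f}")
```

With k = 1 each training point is its own nearest neighbour, so the model memorises the noisy labels and the training risk is zero, yet the test risk is not. A smoother model such as k = 15 gives up some training accuracy and typically narrows the gap, which is the trade-off the U-curve describes.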

Can AI help me learn the foundations of machine learning in COMP5318?

Yes, for understanding rather than for handing in answers. A study tool like Sia can explain, step by step, why you normalise before computing a distance, walk you through a Euclidean-vs-Manhattan calculation on practice numbers, or check your reasoning about overfitting. Use it to build the method so you can reproduce it under closed-book exam conditions; do not use it to obtain answers for assessed quizzes or assignments, and always acknowledge any AI tools you use, as the University's academic-integrity policy requires.


Study strategy

Exam move

Treat this chapter as the toolkit you will reach for all semester, so drill it until it is automatic. First, be able to classify any task in one line as supervised (classification or regression) or unsupervised (clustering or dimensionality reduction) with the reason. Second, understand the empirical-vs-true-risk U-curve well enough to explain underfitting, overfitting and the sweet spot in your own words. Third, make the calculations mechanical: min-max normalise a feature, then compute Euclidean and Manhattan distances by hand, remembering to scale before you measure. On a 2-hour (120-minute) paper, budget about one minute per mark, so a five-mark normalise-and-distance question deserves roughly five minutes. Keep the final exam's two hurdles in view: you need at least 40% in the exam to avoid your mark being capped at 45, and at least 50% overall to pass, and you should confirm the exact exam date on the Canvas exam timetable.
