University of Sydney · FACULTY OF COMPUTER SCIENCE

COMP5318 · Machine Learning and Data Mining

- one subject, every graph, every model, every mark
Computer Science · 14 Chapters · 7-page Bible
Our own words - no uploaded lecturer files
Updated for this semester
Chapter 1 of 11 · COMP5318

Foundations of Machine Learning & Data Mining

This opening chapter of University of Sydney COMP5318 Machine Learning and Data Mining sets up the vocabulary and rules that every later week reuses: the supervised vs unsupervised split, the gap between empirical risk and true risk that explains overfitting, feature normalisation, and the distance and similarity measures that many algorithms use to compare points. These foundations are examined directly in the closed-book final and quietly underpin every bigger question on it.

In this chapter

What this chapter covers

  • 01 Tell supervised from unsupervised learning, and classification from regression
  • 02 Name what clustering and dimensionality reduction do when there are no labels
  • 03 Define empirical (training) risk vs true (generalisation) risk with 0/1 loss
  • 04 Read the training-vs-test error U-curve to spot underfitting and overfitting
  • 05 Apply min-max normalisation x' = (x - min) / (max - min) to rescale a feature
  • 06 Contrast min-max scaling with z-score standardisation and know when each helps
  • 07 Compute Euclidean and Manhattan distance between two vectors by hand
  • 08 Use cosine similarity to compare direction rather than magnitude
  • 09 Explain why distance-based methods must scale features before measuring
Worked example · free

Normalise two records, then compute Euclidean distance

Q [4 marks]. Two customers are described by age (years) and annual spend ($). Across the dataset, age ranges 18-68 and spend ranges 0-500. The customers are X = (age 28, spend 100) and Y = (age 48, spend 350). Apply min-max normalisation to both, then compute the Euclidean distance on the normalised values. Why must you normalise first?
  • +1 Normalise X. age' = (28 - 18)/(68 - 18) = 10/50 = 0.20; spend' = (100 - 0)/(500 - 0) = 0.20. So X' = (0.20, 0.20).
  • +1 Normalise Y. age' = (48 - 18)/50 = 30/50 = 0.60; spend' = (350 - 0)/500 = 0.70. So Y' = (0.60, 0.70).
  • +1 Euclidean distance: subtract slot by slot, square, sum, square-root. D = sqrt((0.60 - 0.20)^2 + (0.70 - 0.20)^2) = sqrt(0.16 + 0.25) = sqrt(0.41) which is about 0.640.
  • +1 Why normalise first: on the raw values the spend gap (250) would dwarf the age gap (20) purely because spend has the larger range, so the distance would be decided almost entirely by spend. Rescaling to [0, 1] gives each feature an equal say.
X' = (0.20, 0.20), Y' = (0.60, 0.70); Euclidean distance = sqrt(0.41), about 0.640. Normalising first stops the larger-range feature (spend) from dominating the distance.
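Sia tip: Min-max uses (x - min)/(max - min), not x/max. If you forget to subtract the minimum, the smallest value no longer maps to 0 and every normalised value comes out wrong. Keep the distance formulae apart too: Euclidean squares the differences, sums them and takes the square root; Manhattan just sums the absolute differences.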
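The worked example can be checked with a short Python sketch. The helper names (min_max, euclidean, manhattan) are illustrative and not taken from any course material, and the exam itself is done by hand, so treat this only as a way to verify practice answers.

```python
import math

def min_max(x, lo, hi):
    """Rescale x to [0, 1] given the feature's dataset minimum and maximum."""
    if hi == lo:
        raise ValueError("max equals min; the feature has no range to rescale")
    return (x - lo) / (hi - lo)

def euclidean(a, b):
    """Square the per-feature differences, sum them, take the square root."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def manhattan(a, b):
    """Sum the absolute per-feature differences."""
    return sum(abs(p - q) for p, q in zip(a, b))

# Ranges and customers from the worked example, ordered as (age, spend).
ranges = [(18, 68), (0, 500)]
x_raw, y_raw = (28, 100), (48, 350)

x_norm = tuple(min_max(v, lo, hi) for v, (lo, hi) in zip(x_raw, ranges))
y_norm = tuple(min_max(v, lo, hi) for v, (lo, hi) in zip(y_raw, ranges))

d_euclid = euclidean(x_norm, y_norm)      # sqrt(0.41), about 0.640
d_manhattan = manhattan(x_norm, y_norm)   # 0.40 + 0.50 = 0.90
d_raw = euclidean(x_raw, y_raw)           # about 250.8, almost all of it from spend

print(x_norm, y_norm)
print(round(d_euclid, 3), round(d_manhattan, 2), round(d_raw, 1))
```

Comparing d_raw with d_euclid shows the point of the last mark: before scaling, the 250-dollar spend gap accounts for nearly the whole distance, while after scaling both features contribute on the same [0, 1] footing.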
Glossary

Key terms

Supervised learning
Learning from labelled input-output pairs to predict a target: classification when the target is a category, regression when it is a number.
Unsupervised learning
Finding structure in unlabelled data, such as grouping similar points (clustering) or compressing many features into a few (dimensionality reduction).
Empirical (training) risk
Average loss measured on the data the model was trained on; with 0/1 loss it is just the training error rate. It can be driven toward zero by adding complexity.
True (generalisation) risk
Expected loss over all unseen data from the same distribution; the quantity you actually want small. It is estimated by the error on a held-out test set.
Overfitting
When a model fits training noise so that training error keeps falling while test error rises; seen as a widening gap between the two error curves.
Min-max normalisation
Rescaling a feature to [0, 1] via x' = (x - min)/(max - min), so no single large-range feature dominates a distance.
Euclidean distance
Straight-line distance sqrt(sum of squared per-feature differences) between two vectors.
Cosine similarity
The cosine of the angle between two vectors, A.B / (||A|| ||B||); it ranges from -1 to 1 and compares direction, ignoring magnitude.
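To make the cosine similarity entry concrete, here is a minimal Python sketch of A.B / (||A|| ||B||). The function name and the three test vectors are made up for illustration; they show that only direction matters, not length.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot product over the product of norms."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical vectors chosen to hit the three landmark values.
same_direction = cosine_similarity((1, 2), (2, 4))    # close to 1: (2, 4) is just 2 x (1, 2)
right_angle = cosine_similarity((1, 0), (0, 1))       # 0: the vectors are perpendicular
opposite = cosine_similarity((1, 2), (-1, -2))        # close to -1: opposite directions

print(same_direction, right_angle, opposite)
```

Note that (1, 2) and (2, 4) are a Euclidean distance of sqrt(5) apart yet have a cosine similarity of 1, which is the sense in which the measure ignores magnitude.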
FAQ

Foundations of Machine Learning & Data Mining FAQ

Is machine learning examined by calculation or by explanation in COMP5318?

Both. The final is paper-based, 2 hours, closed book with only a non-programmable calculator, and mixes short-answer and True/False concept questions with small by-hand calculations. Foundations appear as quick items such as classifying a task as supervised or unsupervised, normalising a feature, or computing a Euclidean or Manhattan distance, so you are marked on method plus the correct final number plus a one-line reason.

What is the difference between empirical risk and true risk, and why does it matter?

Empirical (training) risk is the average error on the data you trained on, which you can always compute. True risk is the expected error on unseen data, which you can only estimate with a separate held-out set. The distinction matters because minimising training error alone leads to overfitting: as complexity grows, training error keeps dropping while held-out error eventually rises, so you choose the model with the lowest error on held-out data, not the lowest training error.
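The gap between the two risks can be seen in a small Python sketch. It uses a 1-nearest-neighbour classifier purely as an illustration, because that model memorises its training set; the chapter does not prescribe it, and the toy points and labels below are hypothetical, with one deliberately noisy training label.

```python
def zero_one_risk(y_true, y_pred):
    """Average 0/1 loss: the fraction of predictions that are wrong."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def predict_1nn(train_X, train_y, x):
    """Return the label of the closest training point (squared Euclidean distance)."""
    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(train_X[i], x))
    nearest = min(range(len(train_X)), key=sq_dist)
    return train_y[nearest]

# Hypothetical data: the point (0.45, 0.5) carries a noisy "high" label.
train_X = [(0.1, 0.2), (0.2, 0.1), (0.45, 0.5), (0.8, 0.9), (0.9, 0.8)]
train_y = ["low", "low", "high", "high", "high"]
test_X = [(0.15, 0.15), (0.4, 0.45), (0.85, 0.85)]
test_y = ["low", "low", "high"]

train_pred = [predict_1nn(train_X, train_y, x) for x in train_X]
test_pred = [predict_1nn(train_X, train_y, x) for x in test_X]

train_risk = zero_one_risk(train_y, train_pred)  # 0.0: every point is its own neighbour
test_risk = zero_one_risk(test_y, test_pred)     # 1/3: the noisy label misleads one test point

print(train_risk, test_risk)
```

The training risk of zero says nothing about quality here: the model has simply memorised its data, noise included, and the held-out estimate is what exposes that.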

Can AI help me learn the foundations of machine learning in COMP5318?

Yes, for understanding rather than for handing in answers. A study tool like Sia can explain, step by step, why you normalise before computing a distance, walk you through a Euclidean-vs-Manhattan calculation on practice numbers, or check your reasoning about overfitting. Use it to build the method so you can reproduce it under closed-book exam conditions; do not use it to obtain answers for assessed quizzes or assignments, and always acknowledge any AI tools you use, as the University's academic-integrity policy requires.


Study strategy

Exam move

Treat this chapter as the toolkit you will reach for all semester, and drill it until it is automatic. First, be able to classify any task in one line as supervised (classification or regression) or unsupervised (clustering or dimensionality reduction), with the reason. Second, understand the empirical-vs-true-risk U-curve well enough to explain underfitting, overfitting and the sweet spot in your own words. Third, make the calculations mechanical: min-max normalise a feature, then compute Euclidean and Manhattan distances by hand, remembering to scale before you measure. On a 2-hour (120-minute) paper, budget about one minute per mark, so a five-mark normalise-and-distance question deserves roughly five minutes. Keep the two pass hurdles in view: you need at least 40% in the final exam to avoid your mark being capped at 45, and at least 50% overall to pass. Confirm the exact exam date on the Canvas exam timetable.
