COMP5318 · Machine Learning and Data Mining
COMP5318 Machine Learning and Data Mining is the University of Sydney's postgraduate introduction to the algorithms that learn from data, ranging from k-nearest neighbour, Naïve Bayes and decision trees to support vector machines, neural networks and reinforcement learning. This free study guide re-derives every core method and works each calculation by hand, matched to how COMP5318 is assessed: a 60% closed-book final exam alongside weekly Canvas quizzes and two group assignments. It organises the syllabus method by method so that each exam-style calculation can be practised on its own.
What COMP5318 covers
Eleven chapters trace COMP5318 from foundations and distance measures through regression, Naïve Bayes, trees, SVMs and PCA, neural and deep networks, and on to clustering, Markov models and reinforcement learning.
How COMP5318 is assessed
| Component | Weight | Format |
|---|---|---|
| Weekly Homework Quizzes (×10) | 5% | Canvas quizzes weeks 2-11, direct application of that week's lecture; due Tuesdays |
| Assignment 1 (group) | 15% | A program to solve a given task + short report, ~Week 7 |
| Assignment 2 (group) | 20% | A program + report discussing results, ~Week 11 |
| Final Exam | 60% | Paper-based, 2 hours, closed book (non-programmable calculator only), formal exam period |
Decision-tree information gain of a split
- Parent entropy: the node is 8 Buy and 8 No-buy (50/50), so H(S) = −0.5·log₂(0.5) − 0.5·log₂(0.5) = 1.0 bit (maximum uncertainty).
- Child A entropy (6 Buy, 2 No-buy): H = −0.75·log₂(0.75) − 0.25·log₂(0.25) = 0.75·0.415 + 0.25·2 = 0.811 bits; by symmetry child B (2 Buy, 6 No-buy) is also 0.811 bits.
- Weighted child entropy = (8/16)·0.811 + (8/16)·0.811 = 0.811 bits.
- Information gain = H(S) − weighted child entropy = 1.0 − 0.811 = 0.189 bits. The gain is positive, so the split reduces impurity; among candidate attributes, choose the one with the highest gain.
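The four steps above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function names `entropy` and `information_gain` are illustrative choices, not part of any course material.

```python
import math

def entropy(counts):
    """Shannon entropy in bits of a node, given its class counts."""
    total = sum(counts)
    # Skip empty classes: the term 0 * log2(0) is taken as 0.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [8, 8]                # 8 Buy, 8 No-buy
children = [[6, 2], [2, 6]]    # child A, child B

gain = information_gain(parent, children)
print(round(entropy(parent), 3), round(entropy(children[0]), 3), round(gain, 3))
# prints: 1.0 0.811 0.189
```

Changing the child counts, for example to `[[8, 0], [0, 8]]`, is a quick way to see how a purer split raises the gain.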
Key terms
- Supervised learning: Learning from labelled input–output pairs to predict a target: classification when the target is a category, regression when it is a number.
- Unsupervised learning: Finding structure in unlabelled data, e.g. clustering points into groups or reducing dimensions with PCA.
- Overfitting: When a model fits noise in the training data, so training error keeps falling while true (test) error rises; the best model sits at the lowest test error, not the lowest training error.
- Euclidean distance: Straight-line distance √(Σ(aᵢ−bᵢ)²) between two feature vectors; features must be scaled first so no large-range feature dominates.
- k-Nearest Neighbour (kNN): A lazy classifier that labels a query by the majority class of its k closest training points; small k overfits, large k over-smooths.
- Naïve Bayes: A probabilistic classifier picking the class that maximises P(class)·∏P(featureⱼ|class), assuming features are conditionally independent given the class.
- Entropy: H(S) = −Σpᵢlog₂pᵢ, the impurity of a node in bits: 0 when pure, 1 for a two-class 50/50 split.
- Information gain: The drop in entropy from a split, H(S) − Σ(|Sₖ|/|S|)H(Sₖ); a decision tree splits on the attribute with the highest gain.
- Support vector: A training point lying exactly on an SVM's ±1 margin plane; only the support vectors determine the maximum-margin boundary.
- Principal Component Analysis (PCA): An unsupervised method that finds orthogonal directions of maximum variance for dimensionality reduction and compression; it ignores class labels.
- Backpropagation: Training a neural net by propagating error back: output δ = (t−o)o(1−o) for sigmoid, and each weight updates by Δw = η·δ·(source output).
- Regularization: A penalty on the weights to curb overfitting: Ridge (L2) shrinks weights toward zero, while Lasso (L1) drives some exactly to zero for feature selection.
- Confusion matrix: The 2×2 table of TP, FP, FN, TN from which accuracy, precision, recall and F1 are computed; on imbalanced data prefer precision/recall over accuracy.
- Q-learning: A reinforcement-learning update Q(s,a) ← Q(s,a) + α[r + γ·maxₐ′Q(s′,a′) − Q(s,a)] that learns action values from rewards, not labelled examples.
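Two of the update rules defined above, the backpropagation step and the Q-learning update, can be written as plain arithmetic. The sketch below is illustrative only: the function names and every input number are hypothetical, chosen so the results are easy to verify by hand.

```python
def sigmoid_output_delta(target, output):
    """Error term of a sigmoid output unit: delta = (t - o) * o * (1 - o)."""
    return (target - output) * output * (1.0 - output)

def weight_update(eta, delta, source_output):
    """Change applied to one weight: delta_w = eta * delta * source_output."""
    return eta * delta * source_output

def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """One Q-learning step: Q + alpha * (r + gamma * max Q(s', a') - Q)."""
    return q_sa + alpha * (reward + gamma * max_q_next - q_sa)

# Hypothetical backpropagation numbers: target 1.0, output 0.5,
# learning rate 0.1, and a source unit whose output is 0.8.
delta = sigmoid_output_delta(target=1.0, output=0.5)          # 0.5 * 0.5 * 0.5
dw = weight_update(eta=0.1, delta=delta, source_output=0.8)   # 0.1 * delta * 0.8

# Hypothetical Q-learning numbers: current value 2.0, reward 1.0,
# best next-state value 4.0, alpha 0.5, gamma 0.9.
new_q = q_update(q_sa=2.0, reward=1.0, max_q_next=4.0, alpha=0.5, gamma=0.9)

print(delta, round(dw, 4), round(new_q, 4))
```

By hand, the same inputs give δ = 0.125, Δw = 0.01 and a new Q-value of 2 + 0.5·(1 + 3.6 − 2) = 3.3.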
COMP5318 FAQ
Can AI help me study COMP5318?
Yes — used as a study aid, AI is well suited to a method-heavy unit like COMP5318 Machine Learning and Data Mining. Sia, AskSia's AI tutor, explains each method step by step: it can walk you through a kNN vote, a Naïve Bayes product, an entropy and information-gain calculation, or a single backpropagation step, and check your reasoning. It is there to help you understand the material, not to complete your Canvas quizzes or assignments for you.
Where can I find past exam papers or practice for COMP5318?
Official past papers and sample exams for COMP5318 are released through Canvas and the University of Sydney library's exam-paper collection — always start there. This free guide adds a practice bank of exam-style problems with full worked solutions across the whole unit, and you can ask Sia to explain any step or generate similar practice so you can rehearse the method, not just read the answer.
What can Sia do that a textbook can't?
A textbook shows one worked example; Sia adapts to you. Ask it to re-explain a step a different way, vary the numbers so you can practise, or pinpoint where your working went wrong on a decision-tree or backpropagation question. Sia explains step by step and is a study aid — it will not sit your COMP5318 exam or complete your assessments, and it cannot promise a grade.
Is COMP5318 hard?
COMP5318 Machine Learning and Data Mining is a quantitative postgraduate unit at the University of Sydney: it assumes basic linear algebra, probability and competent programming, and it moves fast across many algorithms. Most students find the breadth — and the 40% exam hurdle — the real challenge rather than any single topic, so steady weekly work on the quizzes and the by-hand calculations pays off. Confirm current prerequisites and details on Canvas and the handbook.
Is the COMP5318 exam open or closed book?
The final exam is closed book and paper-based, worth 60% of the unit, with 2 hours of writing time; a non-programmable calculator is the only permitted aid (no notes, laptops or phones). Answers are written in the exam booklet, and there is no coding in the exam. Confirm the exact date and venue on the Canvas exam timetable.
What are the COMP5318 hurdle requirements?
There are two hurdles at the University of Sydney: you must score at least 40% in the final exam or your final mark is capped at a maximum of 45, and you must reach an overall mark of at least 50 to pass the unit. Because the exam is 60% and hurdled, strong coursework alone cannot rescue a weak exam — always check the current rules on Canvas.
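The interaction between the component weights and the exam hurdle can be sketched as a small calculation. This assumes the weights from the assessment table above and the cap as described in this answer; the names `WEIGHTS`, `final_mark` and `passes` are hypothetical, and official rounding or moderation rules are not modelled.

```python
# Component weights as listed in the assessment table (they sum to 1.0).
WEIGHTS = {"quizzes": 0.05, "assignment1": 0.15, "assignment2": 0.20, "exam": 0.60}

def final_mark(scores):
    """Weighted mark out of 100; each score is a percentage from 0 to 100.

    Assumption: an exam score below 40% caps the final mark at 45.
    """
    raw = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    if scores["exam"] < 40:
        raw = min(raw, 45)
    return raw

def passes(scores):
    """A pass needs an overall mark of at least 50."""
    return final_mark(scores) >= 50

# Perfect coursework but a 35% exam: the weighted sum is 61, capped to 45.
strong_coursework = {"quizzes": 100, "assignment1": 100, "assignment2": 100, "exam": 35}
print(final_mark(strong_coursework), passes(strong_coursework))
```

The example shows why coursework alone cannot rescue a weak exam: the uncapped total would be 61, but the hurdle reduces it to 45, which is below the pass mark.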
Do I need strong maths and programming for COMP5318?
The unit lists basic linear algebra, probability theory and competent programming in a high-level language as assumed knowledge, and Python (Jupyter) is used in the practicals and assignments. You do not need to be an expert, but comfort with vectors, probabilities and writing code will make COMP5318 Machine Learning and Data Mining much smoother — see the University of Sydney handbook for the official prerequisites.
How to study for the exam
Treat COMP5318 Machine Learning and Data Mining at the University of Sydney as a breadth unit assessed mostly by a 60% closed-book final exam. Keep pace with the weekly Canvas quizzes, start the two group assignments early, and above all rehearse the by-hand calculations that recur on the paper: a distance, a kNN vote, a Naïve Bayes product, entropy and information gain, precision/recall/F1, a PCA judgement, a backpropagation step, an HMM Forward pass and a Q-learning update. For each, name the method, do one clean calculation, and write the one-line reason. Because the exam carries a 40% hurdle, practise every topic rather than perfecting one, and use Sia to explain any step you get stuck on rather than to do the work for you.
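Several of the calculations listed above are short enough to check in code after working them by hand. The sketch below covers a Euclidean distance, a kNN vote and precision/recall/F1. The training points, the query and the confusion-matrix counts are made-up illustrative values, the function names are hypothetical, and the features are assumed to be already scaled.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k):
    """Majority label among the k training points closest to the query.

    train is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative data: three Buy points near the query, two No-buy points far away.
train = [
    ((1.0, 1.0), "Buy"),
    ((1.5, 2.0), "Buy"),
    ((2.0, 1.0), "Buy"),
    ((5.0, 5.0), "No-buy"),
    ((6.0, 5.5), "No-buy"),
]
label = knn_predict(train, query=(1.8, 1.5), k=3)
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(label, round(p, 3), round(r, 3), round(f1, 3))
```

Using an odd k for a two-class problem avoids tied votes, which this sketch does not otherwise handle in any principled way.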