University of Melbourne · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

COMP20008 · Elements Of Data Processing

- one subject, every graph, every model, every mark

50% final exam · hurdle · 14 Chapters · 7-page Bible

Our own words - no uploaded lecturer files

Built to mirror S1 2026 · updated this semester

Chapter 10 of 12 · COMP20008

Regression, Evaluation & Feature Selection

This chapter covers how to fit, evaluate and trust a predictive model. Linear regression fits a line by least squares; k-fold cross-validation gives an honest performance estimate; the confusion matrix yields accuracy, precision, recall and F1; you choose between precision and recall by the cost of each error; and feature selection keeps only the useful predictors. The chapter ends on data leakage, the classic trap where using the target as a feature drives training error to zero. The topic is examined as calculation plus critique: the 2024 exam's Q6 computes confusion-matrix metrics and asks which to optimise, and the 2025 exam's Q8 is the target-as-feature leakage question.

In this chapter

What this chapter covers

1. Linear regression: ŷ = b₀ + b₁x by least squares; interpret slope; residuals; R²
2. Experimental design: train / validation / test split, hold-out
3. k-fold cross-validation: rotate the held-out fold to estimate performance more reliably
4. Confusion matrix: TP, FP, FN, TN
5. Metrics from the formula sheet: accuracy, precision, recall, F1
6. Precision vs recall trade-off: choose by the cost of false positives vs false negatives
7. Feature selection: filter (correlation/MI), wrapper, embedded methods
8. Data leakage: using info unavailable at prediction time; target-as-feature gives training MSE = 0
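The first topic in the list above (least-squares fit, slope, residuals, R²) can be sketched in a few lines of plain Python. This is a minimal illustration, not the subject's official code: the function names and the `hours`/`marks` data are made up for the example.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) that minimise the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance-style sum over variance-style sum
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def r_squared(xs, ys, b0, b1):
    """R^2 = 1 - SS_res / SS_tot: share of variation in y explained."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data: hours studied vs exam mark
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 68]

b0, b1 = fit_simple_linear_regression(hours, marks)
residuals = [y - (b0 + b1 * x) for x, y in zip(hours, marks)]
r2 = r_squared(hours, marks, b0, b1)

print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}, R^2 = {r2:.3f}")
```

Here the slope reads as "each extra hour of study is associated with about b₁ more marks", and the residuals from a least-squares line with an intercept always sum to (approximately) zero.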

Worked example

Confusion-matrix metrics and which to optimise (mirrors 2024 Q6 + formula sheet)

Q [5 marks]. A fraud classifier is run on 1000 transactions and gives TP = 30, FP = 20, FN = 10, TN = 940. Compute accuracy, precision, recall and F1, then state which metric matters most for fraud detection and why.

[1 mark] Accuracy = (TP + TN) / total = (30 + 940) / 1000 = 0.97.
[1 mark] Precision = TP / (TP + FP) = 30 / (30 + 20) = 30 / 50 = 0.60.
[1 mark] Recall = TP / (TP + FN) = 30 / (30 + 10) = 30 / 40 = 0.75.
[1 mark] F1 = 2 · (Precision · Recall) / (Precision + Recall) = 2 · (0.60 · 0.75) / (0.60 + 0.75) = 0.90 / 1.35 ≈ 0.667.
[1 mark] For fraud, a missed fraud (false negative) is the costly error, so prioritise recall. Note that the 0.97 accuracy is misleading because the data is heavily imbalanced: almost all transactions are legitimate, so a model could score high accuracy while catching little fraud.

Accuracy = 0.97, Precision = 0.60, Recall = 0.75, F1 ≈ 0.667; optimise recall for fraud because the cost of missing fraud dominates and accuracy is misleading on imbalanced data.
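To check the arithmetic above, the four formulas can be wrapped in a small helper. This is a sketch only: `confusion_metrics` is a name invented for this example, and it assumes every denominator is non-zero (true for these counts).

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the four formula-sheet metrics from confusion-matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0, so no division-by-zero guard.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts from the worked example
metrics = confusion_metrics(tp=30, fp=20, fn=10, tn=940)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.970
# precision: 0.600
# recall: 0.750
# f1: 0.667
```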

Sia tip: On imbalanced data, accuracy lies. Always quote precision, recall and F1, and justify the choice by the cost of each error type (false negatives in fraud/cancer screening → recall; false positives in a high-stakes hire → precision).

Glossary

Key terms

Linear regression: Fits ŷ = b₀ + b₁x (or several predictors) by least squares, minimising the squared residuals. The slope b₁ is the change in ŷ per unit of x, and R² is the proportion of variation in y the model explains.
k-fold cross-validation: Splits the data into k folds, trains on k−1 and tests on the held-out fold, rotating through all k folds and averaging the scores. It gives a more reliable performance estimate than a single train/test split by using all data for both training and testing across folds.
Confusion matrix: A 2×2 table of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN) from which accuracy, precision, recall and F1 are computed.
Precision vs recall: Precision = TP/(TP+FP) is how many predicted positives are correct; recall = TP/(TP+FN) is how many actual positives were caught. You trade them off by the cost of errors — recall when missing a positive is costly (fraud, cancer), precision when a false positive is costly (a high-stakes hire).
Feature selection: Choosing a subset of useful features. Filter methods rank features by correlation or mutual information (MI) with the target, wrapper methods search subsets by model performance, and embedded methods select during model fitting (e.g. tree splits). It reduces overfitting, cost and noise.
Data leakage: Using information at training time that will not be available (or that encodes the target) at prediction time, giving optimistic, invalid performance. The extreme case is adding the target y as a feature, which lets the model copy it and gives a training MSE of 0.
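As a rough illustration of the filter family of feature selection from the glossary, the sketch below ranks features by the absolute Pearson correlation with the target. The feature names and values are invented for the example, and correlation is only one possible filter score (mutual information is another).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def rank_features_by_correlation(features, target):
    """Filter method: score each feature on its own, strongest first."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: one informative feature, one irrelevant one
features = {
    "hours_studied": [1, 2, 3, 4, 5],
    "shoe_size": [9, 7, 8, 7, 9],
}
target = [52, 55, 61, 64, 68]

ranking = rank_features_by_correlation(features, target)
for name, score in ranking:
    print(f"{name}: |r| = {score:.3f}")
```

A filter method like this never fits the final model, which is what separates it from wrapper methods (which score subsets by model performance) and embedded methods (which select during fitting).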

FAQ

Regression, Evaluation & Feature Selection FAQ

Why can a model with 97% accuracy still be bad?

Because accuracy is misleading on imbalanced data. If 95% of transactions are legitimate, a model that predicts "not fraud" for everything scores 95% accuracy while catching zero fraud. The metric you trust depends on the task: for fraud or disease screening, recall (catching the rare positives) matters far more, and you should report precision, recall and F1 rather than accuracy alone.
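The "predict not fraud for everything" scenario is easy to reproduce. The sketch below builds 1000 hypothetical labels (950 legitimate, 50 fraud) and scores a classifier that always says "not fraud"; the data is invented purely to mirror the 95% figure in the answer.

```python
# Hypothetical labels: 0 = legitimate, 1 = fraud
y_true = [0] * 950 + [1] * 50
# A "model" that always predicts the majority class
y_pred = [0] * len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
# Precision is undefined here: the model makes no positive predictions (0 / 0)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

Accuracy comes out high while recall is zero, which is exactly why accuracy alone should not be trusted on imbalanced data.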

How do I decide whether to optimise precision or recall?

By the relative cost of the two error types. If a false negative is the expensive mistake — missing a fraud, missing a cancer — optimise recall so you catch as many true positives as possible. If a false positive is the expensive mistake — wrongly hiring a CEO, flagging a good customer — optimise precision so your positive predictions are trustworthy. State the costlier error first, then name the metric it implies.
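One common way the trade-off shows up in practice is through the decision threshold on a model's score. The sketch below assumes a classifier that outputs a score per case and flags a positive when the score is at or above a threshold; the scores, labels and thresholds are all made up for illustration.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when 'score >= threshold' means predict positive."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l == 1)
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    return precision, recall


# Hypothetical model scores and true labels (1 = positive)
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

strict_precision, strict_recall = precision_recall_at(0.85, scores, labels)
lenient_precision, lenient_recall = precision_recall_at(0.35, scores, labels)

print(f"threshold 0.85: precision = {strict_precision:.2f}, recall = {strict_recall:.2f}")
print(f"threshold 0.35: precision = {lenient_precision:.2f}, recall = {lenient_recall:.2f}")
```

On this toy data the strict threshold gives trustworthy positives but misses half of them, while the lenient threshold catches every positive at the price of more false alarms. Which end to pick follows from the costlier error, as described above.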

What is data leakage and how do I spot it?

Data leakage is when the model is trained on information it won't have at prediction time, or that secretly encodes the answer, giving over-optimistic, invalid results. The textbook case (2025 Q8) is adding the target y itself as a feature: the model sets that feature's coefficient to 1 (and the others to 0), copies y, and achieves a training MSE of 0, yet it is useless because y is unavailable when predicting new cases. The tell is performance that is "too good to be true": audit every feature for whether it would actually be available at prediction time and whether it leaks the target.
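The target-as-feature story can be reproduced with a one-feature least-squares fit. This is a simplified sketch: it uses a single predictor rather than a full multi-feature model, and the `hours`/`marks` data and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def training_mse(xs, ys):
    """Mean squared error of the fitted line on the data it was fitted to."""
    b0, b1 = fit_line(xs, ys)
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical data
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 68]

# Honest feature: something known before the exam
honest_mse = training_mse(hours, marks)

# Leaky feature: a copy of the target itself
leaky_feature = list(marks)
leaky_b0, leaky_b1 = fit_line(leaky_feature, marks)
leaky_mse = training_mse(leaky_feature, marks)

print(f"honest feature: training MSE = {honest_mse:.2f}")
print(f"leaky feature:  b0 = {leaky_b0:.2f}, b1 = {leaky_b1:.2f}, training MSE = {leaky_mse:.2f}")
```

The leaky fit has slope 1, intercept 0 and zero training error, but it cannot be used on a new student because their mark is the very thing being predicted.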

Why use k-fold cross-validation instead of a single train/test split?

A single split gives a performance estimate that depends heavily on which rows happened to land in the test set, so it can be lucky or unlucky. k-fold cross-validation rotates the held-out fold through all k folds and averages the results, so every row is tested exactly once and used for training in the other k−1 rounds. This gives a more stable, lower-variance estimate of how the model generalises, which is why it is a standard evaluation protocol.
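The rotation described above can be sketched in plain Python. This is an illustration under simplifying assumptions: the model is a one-feature least-squares line, the data is made up, and the folds are assigned by interleaving indices without shuffling so the result is deterministic (in practice the rows are usually shuffled first).

```python
def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def make_folds(n, k):
    """Assign row indices 0..n-1 to k folds by interleaving (no shuffle)."""
    return [list(range(i, n, k)) for i in range(k)]


def k_fold_mse(xs, ys, k):
    """Return (mean test MSE, per-fold test MSEs) over k rotations."""
    fold_scores = []
    for held_out in make_folds(len(xs), k):
        held = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held]
        # Fit only on the k-1 training folds
        b0, b1 = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        # Score only on the held-out fold
        errors = [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in held_out]
        fold_scores.append(sum(errors) / len(errors))
    return sum(fold_scores) / k, fold_scores


# Hypothetical data: ten rows, roughly linear
xs = list(range(1, 11))
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.2, 16.9, 19.1, 21.0]

mean_mse, fold_mses = k_fold_mse(xs, ys, k=5)
print(f"per-fold MSE: {[round(m, 3) for m in fold_mses]}")
print(f"mean MSE over 5 folds: {mean_mse:.3f}")
```

The spread of the per-fold scores shows how much a single train/test split could have varied; the average is the figure to report.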

Study strategy

Exam move

Make the confusion-matrix calculations second nature — accuracy, precision, recall, F1 — and always finish a metrics question with the "which to optimise" judgement, because that interpretation carries the final mark. Tie precision/recall to the cost of errors with two stock examples (recall for fraud/cancer; precision for a high-stakes hire) and remember accuracy lies on imbalanced data. Treat data leakage as a guaranteed exam thread: rehearse the target-as-feature story (coefficient 1, copies y, training MSE 0, useless at deployment) and the general tell of "too good to be true" scores. Know why k-fold beats a single split (lower-variance estimate, all data used) and be able to name the three families of feature selection. The formula sheet gives the formulas, so practise applying and interpreting them, not memorising them.
