COMP20008 · Elements of Data Processing
Regression, Evaluation & Feature Selection
This chapter covers how to fit, evaluate and trust a predictive model. Linear regression fits a line by least squares; k-fold cross-validation gives an honest performance estimate; the confusion matrix yields accuracy, precision, recall and F1; and you choose between precision and recall by the cost of each error. The chapter ends on data leakage — the classic trap where using the target as a feature drives training error to zero. It is examined as calculation + critique: the 2024 exam's Q6 computes confusion-matrix metrics and asks which to optimise, and the 2025 exam's Q8 is the target-as-feature leakage question.
What this chapter covers
- 1. Linear regression: ŷ = b₀ + b₁x by least squares; interpret slope; residuals; R²
- 2. Experimental design: train / validation / test split, hold-out
- 3. k-fold cross-validation: rotate the held-out fold to estimate performance more reliably
- 4. Confusion matrix: TP, FP, FN, TN
- 5. Metrics from the formula sheet: accuracy, precision, recall, F1
- 6. Precision vs recall trade-off: choose by the cost of false positives vs false negatives
- 7. Feature selection: filter (correlation/MI), wrapper, embedded methods
- 8. Data leakage: using info unavailable at prediction time; target-as-feature gives training MSE = 0
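As a rough illustration of the first topic in the list, the sketch below fits ŷ = b₀ + b₁x by least squares in plain Python and reports R². The function name `fit_line` and the toy data are assumptions made up for this example; they are not taken from the course material.

```python
def fit_line(xs, ys):
    """Least-squares fit of y-hat = b0 + b1 * x; returns (b0, b1, r2)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_mean - b1 * x_mean
    # Residual = observed y minus predicted y-hat.
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    # R^2 = proportion of the variation in y explained by the line.
    r2 = 1 - ss_res / ss_tot
    return b0, b1, r2


# Toy data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1, r2 = fit_line(xs, ys)
print(f"b0={b0:.2f}, b1={b1:.2f}, R^2={r2:.2f}")
```

On this toy data the slope comes out at 0.6, which you would interpret as "ŷ rises by about 0.6 for each one-unit increase in x", and R² of 0.6 says the line explains about 60% of the variation in y.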
Confusion-matrix metrics and which to optimise (mirrors 2024 Q6 + formula sheet)
- Counts implied by the working below: TP = 30, FP = 20, FN = 10, TN = 940, total = 1000.
- [1 mark] Accuracy = (TP + TN) / total = (30 + 940) / 1000 = 0.97.
- [1 mark] Precision = TP / (TP + FP) = 30 / (30 + 20) = 30 / 50 = 0.60.
- [1 mark] Recall = TP / (TP + FN) = 30 / (30 + 10) = 30 / 40 = 0.75.
- [1 mark] F1 = 2 · (Precision · Recall) / (Precision + Recall) = 2 · (0.60 · 0.75) / (0.60 + 0.75) = 0.90 / 1.35 ≈ 0.667.
- [1 mark] For fraud, a missed fraud (false negative) is the costly error, so prioritise recall. Note that the 0.97 accuracy is misleading because the data is heavily imbalanced: almost all transactions are legitimate, so a model could score high accuracy while catching little fraud.
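To check the arithmetic, the four calculations above can be sketched in a few lines of Python. The counts are the ones that appear in the working (TP = 30, FP = 20, FN = 10, TN = 940); the function name `confusion_metrics` is an assumption for this example.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0; otherwise precision or recall
    is undefined and this sketch would raise ZeroDivisionError.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


metrics = confusion_metrics(tp=30, fp=20, fn=10, tn=940)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```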
Key terms
- Linear regression: Fits ŷ = b₀ + b₁x (or several predictors) by least squares, minimising the sum of squared residuals. The slope b₁ is the change in ŷ per unit increase in x, and R² is the proportion of variation in y the model explains.
- k-fold cross-validation: Splits the data into k folds, trains on k−1 and tests on the held-out fold, rotating through all k folds and averaging the scores. It gives a more reliable performance estimate than a single train/test split because every row is used for both training and testing across folds.
- Confusion matrix: A 2×2 table of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN) from which accuracy, precision, recall and F1 are computed.
- Precision vs recall: Precision = TP/(TP+FP) is the fraction of predicted positives that are correct; recall = TP/(TP+FN) is the fraction of actual positives that were caught. You trade them off by the cost of errors: favour recall when missing a positive is costly (fraud, cancer), and precision when a false positive is costly (a high-stakes hire).
- Feature selection: Choosing a subset of useful features. Filter methods rank features by correlation or mutual information (MI) with the target, wrapper methods search subsets by model performance, and embedded methods select during model fitting (e.g. tree splits). It reduces overfitting, cost and noise.
- Data leakage: Using information at training time that will not be available at prediction time, or that encodes the target, which gives optimistic, invalid performance. The extreme case is adding the target y as a feature, which lets the model copy it and gives a training MSE of 0.
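Of the three feature-selection families, the filter approach is the easiest to sketch. The example below ranks features by the absolute Pearson correlation with the target; the feature names, the data and the helper names `pearson` and `rank_features` are all hypothetical, and wrapper and embedded methods are not shown.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (a constant column would
    cause a division by zero here).
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_spread = sqrt(sum((x - x_mean) ** 2 for x in xs))
    y_spread = sqrt(sum((y - y_mean) ** 2 for y in ys))
    return cov / (x_spread * y_spread)


def rank_features(features, target):
    """Filter method: score each feature by |correlation| with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data, invented for illustration only.
features = {
    "hours_studied": [1, 2, 3, 4, 5],
    "shoe_size": [7, 9, 8, 7, 9],
}
target = [52, 58, 65, 70, 78]

ranking = rank_features(features, target)
for name, score in ranking:
    print(f"{name}: |r| = {score:.3f}")
```

A filter like this scores each feature on its own, so it is cheap but can miss features that only help in combination; that is the gap wrapper methods try to close by scoring subsets with an actual model.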
Regression, Evaluation & Feature Selection FAQ
Why can a model with 97% accuracy still be bad?
Because accuracy is misleading on imbalanced data. If 95% of transactions are legitimate, a model that predicts "not fraud" for everything scores 95% accuracy while catching zero fraud. The metric you trust depends on the task: for fraud or disease screening, recall (catching the rare positives) matters far more, and you should report precision, recall and F1 rather than accuracy alone.
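The "predict not fraud for everything" baseline is easy to verify with a small sketch. The 1000-row dataset below is hypothetical and simply encodes the 95% / 5% split used in the answer above.

```python
# Hypothetical labels: 1 = fraud, 0 = legitimate (5% fraud, 95% legitimate).
y_true = [1] * 50 + [0] * 950

# Baseline model that predicts "not fraud" for every transaction.
y_pred = [0] * len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
# Precision is undefined here (TP + FP = 0), so it is deliberately not computed.

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

The baseline scores 0.95 accuracy with a recall of 0, which is exactly why accuracy alone cannot be trusted on imbalanced data.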
How do I decide whether to optimise precision or recall?
By the relative cost of the two error types. If a false negative is the expensive mistake — missing a fraud, missing a cancer — optimise recall so you catch as many true positives as possible. If a false positive is the expensive mistake — wrongly hiring a CEO, flagging a good customer — optimise precision so your positive predictions are trustworthy. State the costlier error first, then name the metric it implies.
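One way to make the cost argument concrete is to attach a price to each error type and see which model is cheaper. The sketch below does that for two hypothetical classifiers; the confusion counts, the cost values and the names `total_cost` and `cheapest_model` are all invented for illustration.

```python
def total_cost(fp, fn, fp_cost, fn_cost):
    """Total cost of a model's errors given a price per error type."""
    return fp * fp_cost + fn * fn_cost


# Two hypothetical models evaluated on the same 40 actual positives.
models = {
    # Catches almost every positive but raises many false alarms.
    "high_recall": {"tp": 38, "fp": 60, "fn": 2},
    # Rarely raises a false alarm but misses many positives.
    "high_precision": {"tp": 25, "fp": 5, "fn": 15},
}


def cheapest_model(fp_cost, fn_cost):
    """Name of the model with the lowest total error cost."""
    return min(
        models,
        key=lambda name: total_cost(
            models[name]["fp"], models[name]["fn"], fp_cost, fn_cost
        ),
    )


# Fraud-style setting: a false negative is assumed 100x worse than a false positive.
print(cheapest_model(fp_cost=1, fn_cost=100))

# Hiring-style setting: a false positive is assumed 100x worse than a false negative.
print(cheapest_model(fp_cost=100, fn_cost=1))
```

With these made-up numbers the first call picks `high_recall` and the second picks `high_precision`, which mirrors the rule of thumb: name the costlier error first, then the metric it implies.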
What is data leakage and how do I spot it?
Data leakage is when the model is trained on information it won't have at prediction time, or that secretly encodes the answer — giving over-optimistic, invalid results. The textbook case (2025 Q8) is adding the target y itself as a feature: the model sets that coefficient to 1, copies y, and achieves a training MSE of 0, yet it is useless because y is unavailable when predicting new cases. The tell is performance that is "too good to be true" — audit every feature for whether it would actually be available and whether it leaks the target.
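The target-as-feature story can be reproduced in a few lines. The sketch below is a simplified version that uses a single predictor: once with a genuine feature and once with a copy of y as the feature. The toy data and the helper names `fit_line` and `training_mse` are assumptions for this example; in a multi-feature regression the leaked column would similarly receive a coefficient of 1.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y-hat = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    b1 = s_xy / s_xx
    return y_mean - b1 * x_mean, b1


def training_mse(xs, ys):
    """Fit on (xs, ys) and return (slope, MSE) measured on the same rows."""
    b0, b1 = fit_line(xs, ys)
    mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
    return b1, mse


# Toy data, invented for illustration only.
y = [2, 4, 5, 4, 5]
honest_x = [1, 2, 3, 4, 5]  # a genuine feature
leaky_x = list(y)           # the target itself, leaked in as a feature

honest_slope, honest_mse = training_mse(honest_x, y)
leaky_slope, leaky_mse = training_mse(leaky_x, y)

print(f"honest feature: slope={honest_slope:.2f}, training MSE={honest_mse:.2f}")
print(f"leaked target:  slope={leaky_slope:.2f}, training MSE={leaky_mse:.2f}")
```

The leaked run gets a slope of 1 and a training MSE of 0, yet the fitted model cannot be used on new cases because it needs y as an input.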
Why use k-fold cross-validation instead of a single train/test split?
A single split gives a performance estimate that depends heavily on which rows happened to land in the test set, so it can be lucky or unlucky. k-fold cross-validation rotates the held-out fold through all k folds and averages the results, so every row is used for both training and testing across folds. This gives a more stable, lower-variance estimate of how the model generalises, which is why it is a standard evaluation protocol.
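The rotation described above can be sketched without any libraries. In the example below, `k_fold_splits` and `cross_validated_mse` are hypothetical helper names, the data is invented, and the "model" is deliberately trivial (it predicts the mean of the training rows) so that the fold mechanics stay in focus. The rows are not shuffled here; in practice you would usually shuffle before splitting.

```python
def k_fold_splits(n_rows, k):
    """Yield (train_indices, test_indices) for each of k folds, in row order."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validated_mse(ys, k):
    """Average test MSE of a predict-the-training-mean model over k folds."""
    fold_scores = []
    for train, test in k_fold_splits(len(ys), k):
        prediction = sum(ys[i] for i in train) / len(train)
        mse = sum((ys[i] - prediction) ** 2 for i in test) / len(test)
        fold_scores.append(mse)
    return sum(fold_scores) / k


# Toy target values, invented for illustration only.
ys = [3, 5, 4, 6, 8, 7, 9, 10, 12, 11]
splits = list(k_fold_splits(len(ys), k=5))

for train, test in splits:
    print(f"train on {train}, test on {test}")
print(f"mean MSE over folds: {cross_validated_mse(ys, k=5):.2f}")
```

Each row index appears in exactly one test fold and in the training set of every other fold, which is the "all data used for both training and testing" property.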
Exam move
Make the confusion-matrix calculations second nature — accuracy, precision, recall, F1 — and always finish a metrics question with the "which to optimise" judgement, because that interpretation carries the final mark. Tie precision/recall to the cost of errors with two stock examples (recall for fraud/cancer; precision for a high-stakes hire) and remember accuracy lies on imbalanced data. Treat data leakage as a guaranteed exam thread: rehearse the target-as-feature story (coefficient 1, copies y, training MSE 0, useless at deployment) and the general tell of "too good to be true" scores. Know why k-fold beats a single split (lower-variance estimate, all data used) and be able to name the three families of feature selection. The formula sheet gives the formulas, so practise applying and interpreting them, not memorising them.