STAT7038 · Regression Modelling
Model Selection
Adding predictors can never make R² worse, so R² alone always picks the biggest model and is useless for comparison. Every selection criterion is really R² (or SSE) plus a penalty for size; they differ only in how hard they penalise, and the goal is the parsimony / bias-variance trade-off — the simplest model that still fits. The chapter compares the standard criteria: adjusted R² (maximise; weak penalty), Mallows' Cp (minimise, want Cp ≈ p; a bias gauge), AIC (minimise; penalty 2p), BIC (minimise; heavier penalty, leans smaller), and PRESS (minimise; leave-one-out predictive ability). It then surveys the selection procedures — best-subset, forward, backward, and stepwise ('both') driven by AIC — and teaches you to read a step() trace, where the move with the lowest resulting AIC is taken and the search stops when <none> sits at the top. The closing caution is essential exam material: a stepwise-selected model's p-values and CIs are over-optimistic because the data already chose the terms.
What this chapter covers
- 01 Why R² cannot rank models of different size
- 02 Adjusted R², Mallows' Cp, AIC, BIC, PRESS: formulas and directions
- 03 Cp ≈ p as a bias gauge; PRESS as leave-one-out prediction
- 04 Selection procedures: best-subset, forward, backward, stepwise
- 05 Reading a step() trace: lowest resulting AIC wins; <none> stops
- 06 The parsimony / bias-variance trade-off
- 07 Why a stepwise model's p-values are over-optimistic
Worked example: describe a step() refinement
- +1 How step() chooses. At each block it evaluates every single-term add or drop and ranks them by the AIC that would result, then takes the move with the lowest resulting AIC.
- +1 Block 1. Dropping brain gives AIC 81.6, below the current 84.2, so brain is removed: it overlapped with body (high VIF) and added little once the others were present.
- +1 Block 2. From the four-predictor model, dropping body lowers AIC further to 79.9, so body goes too.
- +1 Block 3. Now <none> sits at the top (no add or drop lowers AIC below 79.9), so the search stops.
- +1 Final model. The retained predictors are lifespan + gestation + predation; AIC fell monotonically 84.2 → 81.6 → 79.9 at every accepted step.
- +1 Caveat. The final model's p-values are over-optimistic because the data chose the terms, so check residual plots and ideally validate out of sample.
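The decision rule in the worked example can be sketched in a few lines of Python. This is only an illustration of "take the lowest resulting AIC, stop when <none> wins"; it is not how R's step() is implemented, since step() refits a model for every candidate. The function names `pick_move` and `run_trace` are hypothetical. The AIC values are the ones stated in the example, plus two implied by it (re-adding a dropped term returns to the previous model's AIC); rows the example does not report are left out rather than invented.

```python
def pick_move(candidates):
    """Return the (move, aic) pair with the lowest resulting AIC."""
    return min(candidates.items(), key=lambda kv: kv[1])


def run_trace(blocks):
    """Walk the blocks in order; stop once <none> is the best move."""
    path = []
    for candidates in blocks:
        move, aic = pick_move(candidates)
        path.append((move, aic))
        if move == "<none>":
            break  # no add or drop beats the current model
    return path


# Partial trace: only rows whose AIC the worked example gives or implies.
blocks = [
    {"- brain": 81.6, "<none>": 84.2},
    # re-adding brain would return to the starting model (AIC 84.2)
    {"- body": 79.9, "<none>": 81.6, "+ brain": 84.2},
    # re-adding body would return to the four-predictor model (AIC 81.6)
    {"<none>": 79.9, "+ body": 81.6},
]

path = run_trace(blocks)
for move, aic in path:
    print(f"{move:10s} AIC = {aic}")
```

Reading `path` top to bottom gives the same story as the template: brain dropped, body dropped, then <none> on top, so the search stops.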
Key terms
- Mallows' Cp
- Cp = SSEp/MSEfull − (n − 2p), minimised, with a good model having Cp ≈ p. It reads as a bias gauge: Cp ≈ p signals low bias (the systematic structure is captured), Cp >> p signals omitted terms, and Cp < p can flag overfitting noise.
- AIC vs BIC
- AIC = n·log(SSE/n) + 2p and BIC = n·log(SSE/n) + p·log n, both minimised. BIC's penalty p·log n is heavier than AIC's 2p once n ≥ 8, so BIC leans toward smaller models; when criteria disagree, BIC picks the smallest and adjusted R² the largest.
- PRESS statistic
- PRESS = Σ[ei/(1−hii)]², the leave-one-out predictive sum of squares, minimised. Computed from the full-data residuals and leverages without refitting, it estimates genuine out-of-sample error; a model with small SSE but large PRESS is overfitting.
- Stepwise selection
- A procedure ('both' direction) that, at each step, considers adding or dropping a single term and takes the move with the best criterion (AIC by default in R's step()). It re-checks earlier variables for removal as it goes, and stops when no move improves the criterion (<none> at the top).
- Bias–variance trade-off
- Too few terms biases the estimates (underfitting); too many fits noise, inflating variance and destabilising coefficients (overfitting). Every selection criterion formalises a penalty on model size to balance the two — prefer the simplest model that adequately fits (parsimony / Occam).
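The PRESS entry above says the statistic comes from the full-data residuals and leverages without refitting. A minimal sketch, assuming a simple linear regression with one predictor (where the leverage is hii = 1/n + (xi − x̄)²/Sxx), checks that shortcut against a brute-force leave-one-out refit. The data are made-up toy numbers and the function names are hypothetical; only the Python standard library is used.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def sse(x, y):
    """Residual sum of squares of the full-data fit."""
    b0, b1 = fit_line(x, y)
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


def press_shortcut(x, y):
    """PRESS = sum of [e_i / (1 - h_ii)]^2 from a single fit."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    total = 0.0
    for xi, yi in zip(x, y):
        e = yi - (b0 + b1 * xi)              # ordinary residual
        h = 1 / n + (xi - xbar) ** 2 / sxx   # leverage of point i
        total += (e / (1 - h)) ** 2
    return total


def press_refit(x, y):
    """PRESS by dropping each point, refitting, and predicting it."""
    total = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total


# Toy data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 7.0]

print(press_shortcut(x, y), press_refit(x, y), sse(x, y))
```

The two PRESS values agree up to rounding, and both exceed SSE because every residual is divided by 1 − hii, which is less than 1.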
Model Selection FAQ
Why can't I just use R² to pick the best model?
Adding any predictor, even pure noise, cannot raise SSE, so R² never falls as terms are added and always favours the biggest model. It cannot referee a comparison between models of different size. Adjusted R² divides the sums of squares by their degrees of freedom, so it falls when an added term does not pull its weight; Mallows' Cp, AIC, BIC and PRESS each penalise size in their own way. Use those, not plain R².
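A short arithmetic sketch of that answer, assuming the usual form adjusted R² = 1 − [SSE/(n − p)] / [SST/(n − 1)] with p counting the intercept. The sample size and sums of squares are hypothetical numbers chosen so that the extra term barely reduces SSE.

```python
def r2(sse, sst):
    """Plain R-squared."""
    return 1 - sse / sst


def adj_r2(sse, sst, n, p):
    """Adjusted R-squared; p counts all parameters including the intercept."""
    return 1 - (sse / (n - p)) / (sst / (n - 1))


# Hypothetical values: n = 20 observations, SST = 100.
n, sst = 20, 100.0
small = {"p": 3, "sse": 40.0}   # smaller model
big = {"p": 4, "sse": 39.5}     # one extra, nearly useless term

for label, m in (("small", small), ("big", big)):
    print(label, round(r2(m["sse"], sst), 4),
          round(adj_r2(m["sse"], sst, n, m["p"]), 4))
```

R² creeps up for the bigger model while adjusted R² drops, because the tiny fall in SSE does not pay for the lost degree of freedom.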
Which criteria do I maximise and which do I minimise?
Maximise adjusted R². Minimise Cp, AIC, BIC and PRESS (and for Cp, look for Cp ≈ p with the fewest parameters). Mixing up the directions is an easy mark to lose, so write the direction next to each number before you compare. When the criteria disagree, BIC leans to the smallest model and adjusted R² to the largest.
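One way to keep the directions straight is to attach the direction to each criterion in code. The sketch below uses the formulas from the key terms (AIC = n·log(SSE/n) + 2p, BIC = n·log(SSE/n) + p·log n, Cp = SSEp/MSEfull − (n − 2p)) with three hypothetical candidate models; the labels S, M, L and all SSE values are invented for illustration, and L is treated as the full model.

```python
import math


def aic(sse, n, p):
    return n * math.log(sse / n) + 2 * p


def bic(sse, n, p):
    return n * math.log(sse / n) + p * math.log(n)


def mallows_cp(sse, n, p, mse_full):
    return sse / mse_full - (n - 2 * p)


def adj_r2(sse, sst, n, p):
    return 1 - (sse / (n - p)) / (sst / (n - 1))


# Hypothetical candidates: name -> (p, SSE); p includes the intercept.
n, sst = 50, 200.0
models = {"S": (3, 100.0), "M": (4, 95.0), "L": (6, 90.0)}
p_full, sse_full = models["L"]
mse_full = sse_full / (n - p_full)

table = {
    name: {
        "adj_r2": adj_r2(sse, sst, n, p),
        "cp": mallows_cp(sse, n, p, mse_full),
        "aic": aic(sse, n, p),
        "bic": bic(sse, n, p),
    }
    for name, (p, sse) in models.items()
}

# Direction written next to each criterion: maximise or minimise.
DIRECTION = {"adj_r2": max, "cp": min, "aic": min, "bic": min}
best = {
    crit: pick(table, key=lambda m: table[m][crit])
    for crit, pick in DIRECTION.items()
}
print(best)
```

With these made-up numbers the criteria disagree in the pattern described above: adjusted R² prefers the largest model (L), AIC and Cp the middle one (M), and BIC the smallest (S). The full model's Cp equals its own p, as it always does.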
How do I read a step() trace?
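Each block lists the candidate add (+) and drop (−) moves, together with <none> (make no change), ranked by the AIC that would result. The move with the lowest resulting AIC is taken, and the search stops when <none> sits at the top.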
Can I trust the p-values from a stepwise-selected model?
Not at face value. Stepwise selection is data-driven, so it has already used the data to cherry-pick significant-looking terms, making the final model's p-values and CIs over-optimistic. Different criteria (AIC vs BIC) can choose different models. Treat the selected model as a hypothesis, not confirmed truth: run residual diagnostics on the winner and, ideally, validate it out of sample.
Exam move
Write the criteria table on your A4 sheet with three columns: criterion, formula, and direction (maximise adjusted R²; minimise Cp, AIC, BIC, PRESS; want Cp ≈ p). Annotate each candidate with its direction before comparing, because mixing up max/min is the easy mark to lose. Drill the step() reading template: state the starting AIC, then for each block name the move taken (lowest resulting AIC), explain it (often a collinear term dropped), and stop when <none>