Australian National University · S1 2026 · FACULTY OF SCIENCE

STAT7038 · Regression Modelling

50% final exam · hurdle · 14 Chapters · 4-page Bible

Chapter 7 of 7 · STAT7038

Model Selection

Adding predictors can never make R² worse, so R² alone always favours the biggest model and is useless for comparing models of different size. Most selection criteria are a measure of fit (R² or SSE) plus a penalty for size; they differ mainly in how hard they penalise, and the goal is the parsimony / bias-variance trade-off: the simplest model that still fits. The chapter compares the standard criteria: adjusted R² (maximise; weak penalty), Mallows' C_p (minimise, want C_p ≈ p; a bias gauge), AIC (minimise; penalty 2p), BIC (minimise; penalty p·log n, heavier, leans smaller), and PRESS (minimise; leave-one-out predictive ability). It then surveys the selection procedures (best-subset, forward, backward, and stepwise 'both', driven by AIC) and teaches you to read a step() trace, where the move with the lowest resulting AIC is taken and the search stops when <none> sits at the top. The closing caution is essential exam material: a stepwise-selected model's p-values and CIs are over-optimistic because the data already chose the terms.

In this chapter

What this chapter covers

01 Why R² cannot rank models of different size
02 Adjusted R², Mallows' C_p, AIC, BIC, PRESS: formulas and directions
03 C_p ≈ p as a bias gauge; PRESS as leave-one-out prediction
04 Selection procedures: best-subset, forward, backward, stepwise
05 Reading a step() trace: lowest resulting AIC wins; <none> on top stops the search
06 The parsimony / bias-variance trade-off
07 Why a stepwise model's p-values are over-optimistic

Worked example

Worked example: describe a step() refinement

Q [6 marks]. A stepwise search (direction = 'both', AIC-driven) starts from a full five-predictor model (body, brain, lifespan, gestation, predation) with AIC = 84.2 and prints three blocks: Block 1 takes − brain → AIC 81.6 (vs <none> 84.2); Block 2 takes − body → AIC 79.9 (vs <none> 81.6); Block 3 has <none> 79.9 at the top. Describe the model-refinement process and state the final model.

+1 How step() chooses. At each block it evaluates every single-term add or drop and ranks them by the AIC that would result, then takes the move with the lowest resulting AIC.
+1 Block 1. Dropping brain gives AIC 81.6, below the current 84.2, so brain is removed: it overlapped with body (high VIF) and added little once the others were present.
+1 Block 2. From the four-predictor model, dropping body lowers AIC further to 79.9, so body goes too.
+1 Block 3. Now <none> sits at the top (no add or drop lowers AIC below 79.9), so the search stops.
+1 Final model. The retained predictors are lifespan + gestation + predation; AIC fell monotonically 84.2 → 81.6 → 79.9 at every accepted step.
+1 Caveat. The final model's p-values are over-optimistic because the data chose the terms, so check residual plots and ideally validate out of sample.

step() ranks each candidate move by its resulting AIC and takes the lowest: drop brain (84.2→81.6), then drop body (81.6→79.9), then stop when <none> is best, returning lifespan + gestation + predation. Two collinear predictors were trimmed; the final p-values are over-optimistic and need validating.
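
The selection rule in this worked example can be replayed in a few lines. The sketch below is a Python illustration, not R's step() itself: it holds only the trace rows quoted in the question (a real trace would list every candidate move), writes the drop sign as an ASCII "-", and uses the hypothetical names BLOCKS and replay.

```python
# Replay of the worked example's trace. Only the rows quoted in the question
# are included; a real step() trace lists every candidate add/drop move.
PREDICTORS = ["body", "brain", "lifespan", "gestation", "predation"]

BLOCKS = [
    {"- brain": 81.6, "<none>": 84.2},
    {"- body": 79.9, "<none>": 81.6},
    {"<none>": 79.9},
]


def replay(blocks, start_terms):
    """Take the lowest-AIC move in each block until <none> is the best row."""
    terms = list(start_terms)
    aic_path = [blocks[0]["<none>"]]  # starting AIC is the first <none> row
    for block in blocks:
        move, aic = min(block.items(), key=lambda row: row[1])
        if move == "<none>":
            break  # no add or drop lowers AIC, so the search stops
        sign, term = move.split()
        if sign == "-":
            terms.remove(term)  # a drop move removes the term
        else:
            terms.append(term)  # a "+" move adds the term
        aic_path.append(aic)
    return terms, aic_path


final_terms, aic_path = replay(BLOCKS, PREDICTORS)
print(" + ".join(final_terms))  # lifespan + gestation + predation
print(aic_path)                 # [84.2, 81.6, 79.9]
```

The AIC path it prints is the same monotone sequence the model answer narrates.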

Glossary

Key terms

Mallows' C_p: C_p = SSE_p/MSE_full − (n − 2p), where SSE_p is the error sum of squares of the p-parameter candidate and MSE_full is the mean squared error of the full model. It is minimised, with a good model having C_p ≈ p. It reads as a bias gauge: C_p ≈ p signals low bias (the systematic structure is captured), C_p >> p signals omitted terms, and C_p < p can flag overfitting noise.
AIC vs BIC: AIC = n·log(SSE/n) + 2p and BIC = n·log(SSE/n) + p·log n, both minimised. BIC's per-parameter penalty log n exceeds AIC's 2 once n ≥ 8 (log n > 2 when n > e² ≈ 7.4), so BIC leans toward smaller models; when criteria disagree, BIC tends to pick the smallest model and adjusted R² the largest.
PRESS statistic: PRESS = Σ[e_i/(1−h_ii)]², the leave-one-out predictive sum of squares, minimised. Computed from the full-data residuals and leverages without refitting, it estimates genuine out-of-sample error; a model with small SSE but large PRESS is overfitting.
Stepwise selection: A procedure ('both' direction) that, at each step, considers adding or dropping a single term and takes the move with the best criterion (AIC by default in R's step()). It re-checks earlier variables for removal as it goes, and stops when no move improves the criterion (<none> at the top).
Bias–variance trade-off: Too few terms biases the estimates (underfitting); too many fits noise, inflating variance and destabilising coefficients (overfitting). Every selection criterion formalises a penalty on model size to balance the two — prefer the simplest model that adequately fits (parsimony / Occam).
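
The glossary formulas can be coded directly from SSE, n and p. The following is a minimal Python sketch, assuming p counts all estimated parameters including the intercept and using the usual adjusted R² = 1 − [SSE/(n − p)] / [SST/(n − 1)]. The sample size and sums of squares are hypothetical, chosen only to exercise the functions.

```python
import math


def adjusted_r2(sse, sst, n, p):
    # Each sum of squares is divided by its degrees of freedom.
    return 1 - (sse / (n - p)) / (sst / (n - 1))


def mallows_cp(sse_p, mse_full, n, p):
    return sse_p / mse_full - (n - 2 * p)


def aic(sse, n, p):
    return n * math.log(sse / n) + 2 * p


def bic(sse, n, p):
    return n * math.log(sse / n) + p * math.log(n)


# Hypothetical values (not from the chapter): a full model and one submodel.
n, sst = 40, 500.0
sse_full, p_full = 120.0, 6
sse_sub, p_sub = 126.0, 4
mse_full = sse_full / (n - p_full)

cp_full = mallows_cp(sse_full, mse_full, n, p_full)  # equals p_full by construction
cp_sub = mallows_cp(sse_sub, mse_full, n, p_sub)

for label, sse, p in [("full", sse_full, p_full), ("sub", sse_sub, p_sub)]:
    print(label, adjusted_r2(sse, sst, n, p), aic(sse, n, p), bic(sse, n, p))
print(cp_full, cp_sub)
```

For the full model C_p always equals p, because SSE_full/MSE_full = n − p. That is why C_p is only informative for the submodels.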

FAQ

Model Selection FAQ

Why can't I just use R² to pick the best model?

Adding any predictor, even pure noise, cannot raise SSE, so R² never falls and always favours the biggest model. It cannot referee a comparison. Adjusted R² divides the sums of squares by their degrees of freedom, so it falls when an added term does not pull its weight; Mallows' C_p, AIC, BIC and PRESS each penalise size in their own way. Use those, not plain R².
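
A small numerical check makes the point concrete. The Python sketch below compares an intercept-only fit with a straight-line fit on a predictor unrelated to the response. It uses the closed-form least-squares SSE and assumes the usual adjusted R² = 1 − [SSE/(n − p)] / [SST/(n − 1)]; the six data points and the junk predictor are made up for illustration.

```python
def r2_and_adjusted(y, x=None):
    """R² and adjusted R² for an intercept-only fit (x=None) or a
    straight-line fit of y on x, via the closed-form least-squares SSE."""
    n = len(y)
    y_bar = sum(y) / n
    sst = sum((yi - y_bar) ** 2 for yi in y)
    if x is None:
        sse, p = sst, 1  # intercept only: fitted value is the mean
    else:
        x_bar = sum(x) / n
        sxx = sum((xi - x_bar) ** 2 for xi in x)
        sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        sse, p = sst - sxy ** 2 / sxx, 2  # SSE can only shrink or stay put
    r2 = 1 - sse / sst
    adj = 1 - (sse / (n - p)) / (sst / (n - 1))
    return r2, adj


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
junk = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0]  # hypothetical predictor unrelated to y

r2_small, adj_small = r2_and_adjusted(y)
r2_big, adj_big = r2_and_adjusted(y, junk)
print(round(r2_small, 4), round(adj_small, 4))  # 0.0 0.0
print(round(r2_big, 4), round(adj_big, 4))      # 0.0095 -0.2381
```

R² creeps up even though the added column is useless, while adjusted R² drops below its starting value.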

Which criteria do I maximise and which do I minimise?

Maximise adjusted R². Minimise C_p, AIC, BIC and PRESS (and for C_p, look for C_p ≈ p with the fewest parameters). Mixing up the directions is an easy mark to lose, so write the direction next to each number before you compare. When the criteria disagree, BIC leans to the smallest model and adjusted R² to the largest.
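
One way to avoid the direction slip is to store the direction alongside each criterion and let it drive the comparison. The Python sketch below does that for three candidate models. All criterion values are hypothetical and purely illustrative, and the names MAXIMISE, CANDIDATES and best_by are not from the course.

```python
# True means "bigger is better"; every other criterion is minimised.
MAXIMISE = {"adj_r2": True, "cp": False, "aic": False, "bic": False, "press": False}

# Hypothetical criterion values for three candidate models (illustration only).
CANDIDATES = {
    "x1": {"adj_r2": 0.61, "cp": 9.8, "aic": 61.0, "bic": 64.4, "press": 210.0},
    "x1 + x2": {"adj_r2": 0.72, "cp": 3.1, "aic": 52.3, "bic": 57.4, "press": 171.0},
    "x1 + x2 + x3": {"adj_r2": 0.73, "cp": 4.0, "aic": 53.1, "bic": 59.9, "press": 178.0},
}


def best_by(criterion, candidates):
    """Return the candidate that wins on one criterion, respecting its direction."""
    pick = max if MAXIMISE[criterion] else min
    return pick(candidates, key=lambda name: candidates[name][criterion])


winners = {criterion: best_by(criterion, CANDIDATES) for criterion in MAXIMISE}
for criterion, model in winners.items():
    direction = "max" if MAXIMISE[criterion] else "min"
    print(f"{criterion} ({direction}): {model}")
```

In this made-up table adjusted R² prefers the largest model while the other criteria settle on the two-predictor one, the kind of disagreement described above. The sketch only minimises C_p, so the C_p ≈ p check still has to be made by eye.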

How do I read a step() trace?

Each block lists candidate add (+) and drop (−) moves ranked by the resulting AIC; the move with the lowest AIC is taken. <none> is the current model, and when <none> sits at the top (no move lowers AIC) the search stops. A good narration states the starting AIC, names the term dropped or added at each block and why (lowest resulting AIC), and notes the AIC falling at every accepted step until <none> wins.

Can I trust the p-values from a stepwise-selected model?

Not at face value. Stepwise selection is data-driven, so it has already used the data to cherry-pick significant-looking terms, making the final model's p-values and CIs over-optimistic. Different criteria (AIC vs BIC) can choose different models. Treat the selected model as a hypothesis, not confirmed truth: run residual diagnostics on the winner and, ideally, validate it out of sample.
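
A cheap stand-in for out-of-sample validation is the leave-one-out PRESS statistic from the glossary, PRESS = Σ[e_i/(1−h_ii)]². The Python sketch below covers a straight-line fit only, where the leverage is h_ii = 1/n + (x_i − x̄)²/Sxx. It checks that the no-refit shortcut matches the total obtained by actually deleting each point and refitting; the data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a straight-line fit."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1


def press_shortcut(x, y):
    """PRESS from full-data residuals and leverages, with no refitting."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    total = 0.0
    for xi, yi in zip(x, y):
        h = 1 / n + (xi - x_bar) ** 2 / sxx  # leverage h_ii for a straight line
        e = yi - (b0 + b1 * xi)              # ordinary residual
        total += (e / (1 - h)) ** 2
    return total


def press_refit(x, y):
    """PRESS by brute force: drop each point, refit, predict the dropped point."""
    total = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]  # hypothetical responses

b0, b1 = fit_line(x, y)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
print(sse, press_shortcut(x, y), press_refit(x, y))
```

PRESS is never smaller than SSE, because every residual is divided by a factor 1 − h_ii below one. A large gap between them is the overfitting warning.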

Study strategy

Exam move

Write the criteria table on your A4 sheet with three columns: criterion, formula, and direction (maximise adjusted R²; minimise C_p, AIC, BIC, PRESS; want C_p ≈ p). Annotate each candidate with its direction before comparing, because mixing up max/min is the easy mark to lose. Drill the step() reading template: state the starting AIC, then for each block name the move taken (lowest resulting AIC), explain it (often a collinear term dropped), and stop when <none> is best, narrating the monotone AIC decrease. Tie selection back to the bias-variance trade-off (too few = bias, too many = variance) and always close with the over-optimism caveat: the selected model's p-values are too small and its CIs too narrow, so validate it and run residual diagnostics on the winner.
