MAST20034 · Critical Thinking With Data
Statistical Modelling
Week 8 teaches you to read a model, not fit one — the verb the exam rewards is interpret, because there is no software in the room. The anchor idea is George Box's: all models are wrong, but some are useful. A model is a deliberate simplification that splits the world into signal + noise; the art is judging whether the signal it captures is useful, not whether it is ‘true’. You learn to read a regression output on three axes — the sign (direction of the relationship), the size (the coefficient's practical magnitude in context) and the significance (is it distinguishable from zero?) — and to say what each coefficient means in plain words. The chapter generalises to GLMs (the same linear idea with a link on the front, e.g. logistic regression for a yes/no outcome), then to diagnostics: what a violated assumption looks like (patterned residuals, non-constant spread). It closes on the two great cautions — never extrapolate beyond the data, and a regression coefficient is not a causal effect unless the design earned it. Exam prompts hand you output and ask you to read and critique it.
What this chapter covers
- 8.1 What a model is: signal + noise, and ‘all models are wrong’
- 8.2 Reading a regression: sign, size, significance
- 8.3 GLMs: the same idea with a link function on the front
- 8.4 Diagnostics: what an assumption violation looks like
- 8.5 The two great cautions: extrapolation and causation
Reading a regression coefficient in context, mark by mark
- +1 Read sign + size: the slope −4.2 (with price evidently in $'000) means each extra km from the city is associated with a ~$4,200 lower price on average; the sign is negative and the magnitude modest.
- +1 Read significance + fit: p < 0.001 says the slope is reliably non-zero, but R² = 0.31 means distance explains only ~31% of price variation, so much is left to other factors.
- +1 Catch the causal error: this is an observational regression, so the slope is an association; ‘moving a house’ implies a causal intervention the data do not support (location is confounded with suburb amenities, schools, size).
- +1 Catch the extrapolation/wording: the claim also treats a between-house comparison as a within-house change (you cannot literally move a house) and may extrapolate beyond the observed distance range.
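The four-mark read above can be rehearsed on numbers you generate yourself. The following is a minimal sketch, not the course's method: the data are synthetic, and the intercept (900), noise level (70), sample size (200) and distance range (2 to 40 km) are assumptions chosen only so that the output resembles the example (slope near −4.2 in $'000 per km, R² near 0.3). The helper `fit_line` is a hypothetical name.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares with one predictor.

    Returns (slope, intercept, r_squared, t_statistic_of_slope).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals: the "noise" left after the fitted line.
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - sse / syy
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return slope, intercept, r_squared, slope / se_slope


# Synthetic data (assumed values, for illustration only).
rng = random.Random(0)
distance_km = [rng.uniform(2, 40) for _ in range(200)]
price_k = [900 - 4.2 * d + rng.gauss(0, 70) for d in distance_km]

slope, intercept, r_squared, t_stat = fit_line(distance_km, price_k)
direction = "lower" if slope < 0 else "higher"

print(f"Sign and size: each extra km is associated with a "
      f"${abs(slope) * 1000:,.0f} {direction} price on average.")
print(f"Significance: t = {t_stat:.1f} "
      f"(|t| well above about 2 suggests a slope distinguishable from zero).")
print(f"Fit: R^2 = {r_squared:.2f}, so distance accounts for about "
      f"{r_squared:.0%} of price variation.")
print(f"Observed range: {min(distance_km):.0f} to {max(distance_km):.0f} km; "
      f"predictions outside it are extrapolation.")
```

The printed sentences mirror the marks: sign and size in real units, then significance and fit, then the range outside which the line should not be trusted. Nothing in the output licenses the causal reading; that caution comes from knowing how the data arose, not from the numbers.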
Key terms
- All models are wrong
- Box's dictum: every model simplifies reality, so none is literally true; the question is whether its captured signal is useful for the purpose. Frees you to judge usefulness, not correctness.
- Signal + noise
- The decomposition a model assumes: a systematic part (signal, what the model explains) plus random variation (noise, the residuals). Good modelling captures real signal without mistaking noise for signal; fitting the noise is overfitting.
- Regression coefficient (sign / size / significance)
- The three things to read off a slope: its direction, its magnitude in context (per one-unit change), and whether it is reliably different from zero. Significance without size, or size without context, is an incomplete read.
- GLM (generalised linear model)
- An extension of linear regression in which a link function connects the mean of the outcome to the same kind of linear predictor, so that non-normal outcomes can be modelled: e.g. logistic regression for binary data, Poisson regression for counts.
- Extrapolation
- Using a model to predict outside the range of the data it was fitted on, where the fitted relationship may not hold. One of the two great modelling cautions, alongside reading association as causation.
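To make the GLM entry concrete, here is a small sketch of what the link does in logistic regression: the linear predictor can be any real number, and the inverse of the logit link turns it into a probability between 0 and 1. The coefficients and the predictor ("hours") are hypothetical, invented for illustration; they do not come from any fitted model.

```python
import math


def inverse_logit(eta):
    """Map a linear predictor (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients on the log-odds scale (illustration only).
intercept = -3.0
slope_per_hour = 0.5

for hours in (0, 4, 6, 8, 12):
    eta = intercept + slope_per_hour * hours  # the familiar linear part
    prob = inverse_logit(eta)                 # back to the outcome scale
    print(f"hours={hours:2d}  linear predictor={eta:5.1f}  P(yes)={prob:.3f}")

# On the log-odds scale the slope is additive; exponentiating it gives
# the multiplicative change in the odds per one-unit increase.
odds_ratio = math.exp(slope_per_hour)
print(f"odds multiply by about {odds_ratio:.2f} per extra hour")
```

The point for reading output is that a logistic coefficient lives on the log-odds scale, so its sign reads as usual but its size is not a change in probability per unit.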
Statistical Modelling FAQ
What does ‘all models are wrong but some are useful’ mean for the exam?
That you judge a model by usefulness for its purpose, not truth. So a good answer interprets what the model usefully tells us and names where it simplifies or fails — never claims the model is ‘right’.
How do I read a regression coefficient?
On three axes: sign (direction), size (magnitude per unit, stated in real-world context) and significance (is it distinguishable from zero?). Then sanity-check fit (R²) and the two cautions — causation and extrapolation.
When is a regression slope a causal effect?
Only when the design earns it — a randomised experiment, or thorough adjustment for confounders. In ordinary observational regression the slope is an association; causal language (‘changing X by 1 raises Y’) is an over-claim.
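A simulation can show why the design, not the regression, earns the causal reading. In this sketch (all values assumed, purely illustrative) a hidden confounder `z` drives both `x` and `y`, while `x` itself has no effect on `y` at all. The observational slope is clearly non-zero anyway; when `x` is assigned at random, independent of `z`, the slope falls to about zero. The helper `ols_slope` is a hypothetical name.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x (single predictor)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(1)
n = 2000

# Hidden confounder; y depends on z only, never on x.
z = [rng.gauss(0, 1) for _ in range(n)]
y = [2 * zi + rng.gauss(0, 1) for zi in z]

# Observational setting: x is partly determined by the confounder.
x_observed = [zi + rng.gauss(0, 1) for zi in z]

# Randomised setting: x is assigned independently of everything else.
x_randomised = [rng.gauss(0, 1) for _ in range(n)]

slope_observed = ols_slope(x_observed, y)
slope_randomised = ols_slope(x_randomised, y)

print(f"observational slope: {slope_observed:.2f}  (association, no causal effect)")
print(f"randomised slope:    {slope_randomised:.2f}  (close to the true effect, zero)")
```

Both regressions are computed the same way; only the way `x` came about differs, which is exactly what a critique of causal wording should point to.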
Exam move
Treat model questions as reading exercises: rehearse the three-axis read (sign / size-in-context / significance) plus the fit (R²) until it is a single fluent sentence. Put the two great cautions (no extrapolation; association ≠ causation) at the top of the modelling section of your notes, because they catch the over-claims examiners plant. Know GLMs conceptually as ‘linear model + a link’, and be able to say in one line what a diagnostic violation (patterned residuals, fanning spread) signals. Always interpret coefficients in real units and real context, never as bare numbers.
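For the one-line diagnostic read, it may help to see what "patterned residuals" means numerically. The sketch below, on synthetic data with assumed values, fits a straight line to a relationship that is actually curved and then averages the residuals over the lower, middle and upper thirds of x. If the straight-line assumption held, all three averages would sit near zero; here they do not. The helper names are hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares slope and intercept for one predictor."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean(values):
    return sum(values) / len(values)


# Synthetic curved relationship (assumed form, for illustration only).
rng = random.Random(2)
x = [i / 10 for i in range(101)]  # 0.0, 0.1, ..., 10.0, already sorted
y = [0.5 * xi ** 2 + rng.gauss(0, 1) for xi in x]

slope, intercept = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Because x is sorted, slicing the residuals splits them by x range.
third = len(x) // 3
low_mean = mean(residuals[:third])
mid_mean = mean(residuals[third:2 * third])
high_mean = mean(residuals[2 * third:])

print(f"mean residual, low x:  {low_mean:+.2f}")
print(f"mean residual, mid x:  {mid_mean:+.2f}")
print(f"mean residual, high x: {high_mean:+.2f}")
```

Residuals that are positive at both ends and negative in the middle are the numerical version of a U-shaped residual plot: the signal the line missed has leaked into what should be noise.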