Australian National University · S1 2026 · FACULTY OF SCIENCE

STAT7038 · Regression Modelling

50% final exam · hurdle · 14 Chapters · 4-page Bible

Our own words - no uploaded lecturer files

Built to mirror S1 2026 · updated this semester

Chapter 4 of 7 · STAT7038

Matrix Form & Multiple Regression

Writing the n equations one at a time is clumsy; stacking them into vectors and matrices collapses the entire least-squares machinery into a handful of expressions that work for any number of predictors. The model becomes y = Xβ + ε with ε ~ N(0, σ²I), the least-squares solution is the single formula β̂ = (X'X)⁻¹X'y, the fitted values are ŷ = Hy through the symmetric, idempotent hat matrix H = X(X'X)⁻¹X' whose diagonal entries are the leverages, and the full uncertainty is one clean expression Var(β̂) = σ²(X'X)⁻¹. The same formula serves simple and multiple regression — only the width of X changes. Multiple linear regression adds one genuinely new idea: a coefficient is now a partial effect, the change in y per unit of x_j holding the other predictors constant, which can even flip sign relative to the simple correlation. The chapter closes with the two inference tools — the overall F (joint usefulness), the individual t (marginal-given-the-rest) — and adjusted R² for comparing models of different size.
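As a rough illustration of the expressions above, the following sketch fits a simple regression in matrix form using plain Python lists. The four data points and the helper names (`transpose`, `matmul`, `inv2`) are made up for this example; in practice R's lm() does this work for you.

```python
# Simple regression in matrix form with plain Python lists.
# The data are invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]
n = len(x)

# Design matrix X: a column of 1s for the intercept, then the predictor.
X = [[1.0, xi] for xi in x]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inv2(M):
    # Inverse of a 2x2 matrix; divides by zero when M is singular,
    # which is the exact-collinearity case.
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

Xt = transpose(X)
XtX_inv = inv2(matmul(Xt, X))
y_col = [[yi] for yi in y]

# beta_hat = (X'X)^{-1} X'y
beta_hat = [row[0] for row in matmul(XtX_inv, matmul(Xt, y_col))]

# Hat matrix H = X (X'X)^{-1} X'; its diagonal holds the leverages.
H = matmul(matmul(X, XtX_inv), Xt)
y_fit = [row[0] for row in matmul(H, y_col)]
leverages = [H[i][i] for i in range(n)]

print("beta_hat:", beta_hat)
print("fitted:", y_fit)
print("leverages:", leverages, "trace:", sum(leverages))
```

For these four points the intercept comes out at 0.5 and the slope at 1.4 (up to floating-point rounding), and the leverages sum to p = 2, the trace of H.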

In this chapter

What this chapter covers

01 The stacked model y = Xβ + ε and the design matrix X
02 The least-squares solution β̂ = (X'X)⁻¹X'y for any number of predictors
03 The hat matrix H: symmetric, idempotent, diagonal = leverages
04 The variance of the estimates, Var(β̂) = σ²(X'X)⁻¹, and reading vcov(lm)
05 The MLR model and the 'partial' (holding-others-constant) coefficient
06 Overall F vs individual t: two different questions
07 Adjusted R² for comparing models of different size

Worked example · free

Worked example: read an MLR summary and interpret a partial coefficient

Q [6 marks]. A model cost ~ size + age + dist on n = 30 houses (p = 4) reports: size b = 0.580 (se 0.090, t = 6.44, p < 0.001), age b = −0.310 (t = −1.24, p = 0.226), dist b = −0.140 (t = −0.78, p = 0.443), residual SE 8.4 on 26 df, R² = 0.74, F = 24.7 on (3, 26) df, p ≈ 10⁻⁷. (a) Is the model jointly useful? (b) Which predictors earn their place? (c) Interpret the size coefficient and recover MSE.

+1 (a) Overall F. F = 24.7 on (3, 26) df, p ≈ 10⁻⁷ ⇒ reject H₀ (all three slopes are zero): at least one slope is non-zero, so the model is jointly useful.
+1 (b) Individual t's. Only size is significant (t = 6.44, p < 0.001); age (p = 0.226) and dist (p = 0.443) are not, so each adds little given the others.
+1 (b) Note the pattern. A large F alongside small t's can signal multicollinearity among the predictors, so it is worth checking VIFs and the correlation matrix.
+1 (c) Interpret size. Holding age and distance fixed, each extra unit of size adds 0.58 to expected cost: a partial effect, not a raw pairwise one.
+1 (c) CI for size. 95% CI: 0.58 ± t₂₆(0.975) × 0.090 = 0.58 ± 2.056 × 0.090 = (0.395, 0.765).
+1 (c) Recover MSE. MSE = (residual SE)² = 8.4² ≈ 70.6 on n − p = 26 df.

F = 24.7 (p ≈ 10⁻⁷) makes the model jointly useful; only size is individually significant (t = 6.44). Holding age and distance fixed, each extra unit of size adds 0.58 to expected cost, 95% CI (0.395, 0.765); MSE = 8.4² ≈ 70.6 on 26 df.
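The arithmetic in this worked example can be re-checked with a few lines of Python. This is only a sketch: the critical value 2.056 is copied from the solution rather than computed, and the variable names are our own.

```python
# Numbers quoted in the worked example (n = 30 houses, p = 4 parameters).
n, p = 30, 4
b_size, se_size = 0.580, 0.090
resid_se = 8.4
r2 = 0.74
t_crit = 2.056  # t_26(0.975), taken from the worked solution

df_resid = n - p                      # residual degrees of freedom
t_size = b_size / se_size             # t statistic for size
ci_low = b_size - t_crit * se_size    # 95% CI lower limit
ci_high = b_size + t_crit * se_size   # 95% CI upper limit
mse = resid_se ** 2                   # MSE is the squared residual SE

# The overall F can also be recovered from R^2:
# F = (R^2 / (p - 1)) / ((1 - R^2) / (n - p))
f_stat = (r2 / (p - 1)) / ((1 - r2) / df_resid)

print(f"df = {df_resid}, t = {t_size:.2f}")
print(f"95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print(f"MSE = {mse:.1f}, F = {f_stat:.1f}")
```

The last line is a useful consistency check: the reported R² = 0.74 reproduces the reported F = 24.7 on (3, 26) df.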

Glossary

Key terms

Design matrix (X): The n×p matrix of predictor values: a column of 1s for the intercept, then one column per predictor. The least-squares solution β̂ = (X'X)⁻¹X'y exists only when X's columns are linearly independent — exact collinearity makes X'X singular.
Hat matrix (H): H = X(X'X)⁻¹X', the matrix that maps y to the fitted values, ŷ = Hy. It is symmetric (H' = H) and idempotent (H² = H, a projection); its trace is p, and its diagonal entries h_ii are the leverages. The residuals are e = (I − H)y.
Variance–covariance matrix of β̂: Var(β̂) = σ²(X'X)⁻¹, estimated by MSE·(X'X)⁻¹ and printed by R's vcov(model). The square roots of its diagonal are the Std. Error column; its off-diagonals are the covariances Cov(b_j, b_k).
Partial coefficient: In multiple regression β_j is the expected change in y per one-unit increase in x_j holding all other predictors constant — a conditional effect. It can differ in sign from the simple pairwise correlation, which is a hallmark of correlated predictors, not an error.
Adjusted R²: R²_adj = 1 − (SSE/(n−p))/(SST/(n−1)) = 1 − (1−R²)(n−1)/(n−p). Unlike plain R² it penalises each added parameter, so it can fall when a term does not pull its weight — which is why it, not R², can referee bigger-vs-smaller models.

FAQ

Matrix Form & Multiple Regression FAQ

Why is β̂ = (X'X)⁻¹X'y worth memorising?

It is the same least-squares solution for simple and multiple regression; only the width of the design matrix X changes. In simple regression it reduces algebraically to b₁ = S_xy/S_xx and b₀ = ȳ − b₁x̄. The exam asks you to read matrix output, not to invert by hand, but you must know what each factor is and that the solution exists only when X'X is invertible (no exact collinearity).
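For reference, the reduction to the simple-regression formulas can be sketched as follows, assuming the design matrix has a column of 1s followed by the single predictor x, and writing S_xx = Σ(x_i − x̄)² and S_xy = Σ(x_i − x̄)(y_i − ȳ).

```latex
X'X = \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix},
\qquad
X'y = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix},
\qquad
\det(X'X) = n\sum x_i^2 - \Bigl(\sum x_i\Bigr)^2 = n\,S_{xx}

(X'X)^{-1} = \frac{1}{n\,S_{xx}}
\begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix}

b_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\,S_{xx}} = \frac{S_{xy}}{S_{xx}},
\qquad
b_0 = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n\,S_{xx}}
    = \bar{y} - b_1\,\bar{x}
```

The determinant n·S_xx is zero only when every x_i is identical, which is the simple-regression version of exact collinearity with the intercept column.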

How can a coefficient's sign flip between the simple and the multiple regression?

Because the multiple-regression coefficient is a partial effect (holding the other predictors constant) while the simple correlation is a marginal association. With correlated predictors the two can genuinely differ in sign. That is the difference between a conditional and an unconditional relationship — not a mistake to fix — so always state the 'holding the others constant' clause when you interpret.
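A tiny constructed example may make the sign flip concrete. The data below are invented so that y = −1·x1 + 2·x2 exactly and x1, x2 are positively correlated; the two-predictor slope formulas are the standard closed form from the centred normal equations, and the helper names are our own.

```python
# Invented data: y depends negatively on x1 once x2 is held fixed,
# but x1 and x2 move together, so the marginal slope is positive.
x1 = [1, 2, 3, 4, 5]
x2 = [1, 3, 2, 5, 4]
y = [2 * b - a for a, b in zip(x1, x2)]  # y = -1*x1 + 2*x2 exactly

def centred(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def s(u, v):
    # Corrected sum of cross-products, e.g. s(x1, x1) is S_11.
    return sum(a * b for a, b in zip(centred(u), centred(v)))

# Simple regression of y on x1 alone (marginal association).
simple_slope = s(x1, y) / s(x1, x1)

# Multiple regression of y on x1 and x2 (partial effects).
det = s(x1, x1) * s(x2, x2) - s(x1, x2) ** 2
partial_x1 = (s(x2, x2) * s(x1, y) - s(x1, x2) * s(x2, y)) / det
partial_x2 = (s(x1, x1) * s(x2, y) - s(x1, x2) * s(x1, y)) / det

print("simple slope of y on x1:", simple_slope)
print("partial slopes (x1, x2):", partial_x1, partial_x2)
```

Here the simple slope on x1 is 0.6 while its partial slope is −1: both are correct answers to different questions.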

Where do the standard errors come from in matrix form?

From Var(β̂) = σ²(X'X)⁻¹, estimated by MSE·(X'X)⁻¹. R's vcov(model) prints exactly that matrix; the square roots of its diagonal are the Std. Error column in summary(). Examiners hand you the matrix and ask for a particular se(b_j) or a covariance — pull the right entry and, for a standard error, take the square root. No inversion by hand.
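The lookup can be sketched as below. The matrix is hypothetical: only the size variance (0.090² = 0.0081) is tied to the worked example on this page, and every other entry is invented to show the mechanics of reading a vcov table.

```python
import math

# Hypothetical vcov(model) output for cost ~ size + age + dist.
# Rows and columns follow the coefficient order in `names`.
names = ["(Intercept)", "size", "age", "dist"]
V = [
    [25.0,   -0.135,  -0.25,   -0.09],
    [-0.135,  0.0081,  0.0020, -0.0010],
    [-0.25,   0.0020,  0.0625,  0.0050],
    [-0.09,  -0.0010,  0.0050,  0.0324],
]

# Standard errors are the square roots of the diagonal entries.
se = {name: math.sqrt(V[i][i]) for i, name in enumerate(names)}

# A covariance is read straight off the matching off-diagonal entry,
# with no square root.
i, j = names.index("size"), names.index("age")
cov_size_age = V[i][j]

for name in names:
    print(f"se({name}) = {se[name]:.4f}")
print("Cov(b_size, b_age) =", cov_size_age)
```

Because the matrix is symmetric, V[i][j] and V[j][i] give the same covariance, so either entry will do.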

Why use adjusted R² instead of R² to compare models?

Plain R² never decreases when you add a predictor — even pure noise — so it always favours the biggest model and cannot referee a comparison. Adjusted R² divides the sums of squares by their degrees of freedom, adding a penalty for each parameter, so it falls when an added term does not pull its weight. Maximise adjusted R² (or minimise an information criterion) to compare models of different size.
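A short sketch of the penalty at work, using R²_adj = 1 − (1 − R²)(n − 1)/(n − p) with p counting the intercept. The first call uses the worked example's figures (R² = 0.74, n = 30, p = 4); the second is a hypothetical fifth coefficient that lifts R² only to 0.742.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2, where p counts every coefficient including the intercept."""
    return 1 - (1 - r2) * (n - 1) / (n - p)

# Worked-example model: three predictors plus intercept.
base = adjusted_r2(0.74, n=30, p=4)

# Hypothetical: add one near-useless predictor, R^2 creeps up to 0.742.
bigger = adjusted_r2(0.742, n=30, p=5)

print(f"adjusted R^2, 3 predictors: {base:.3f}")
print(f"adjusted R^2, 4 predictors: {bigger:.3f}")
print("bigger model preferred?", bigger > base)
```

Plain R² rises from 0.740 to 0.742, yet adjusted R² falls from about 0.710 to about 0.701, so the smaller model is preferred.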

Study strategy

Exam move

Memorise the four matrix expressions as one connected story and put them on your sheet: β̂ = (X'X)⁻¹X'y (the estimate), H = X(X'X)⁻¹X' with ŷ = Hy (the fit, whose diagonal is leverage), e = (I−H)y with Var(e) = σ²(I−H) (the residuals), and Var(β̂) = σ²(X'X)⁻¹ (the uncertainty). Be fluent at reading the matrices R supplies — pull se(b_j) as the root of the j-th diagonal of vcov, recover MSE = (residual SE)² on n−p df. For interpretation, always say 'holding the other predictors constant' and expect sign-flips from correlated predictors. Distinguish the overall F (joint) from the individual t (marginal-given-the-rest), and use adjusted R² — never plain R² — to compare models of different size.
