Australian National University · S1 2026 · FACULTY OF SCIENCE

STAT7038 · Regression Modelling

50% final exam · hurdle · 14 Chapters · 6-page Bible
Chapter 6 of 7 · STAT7038

Advanced Diagnostics

This chapter sharpens the diagnostic toolkit for the multiple-regression setting and adds a dedicated treatment of multicollinearity. Raw residuals are not comparable because Var(eᵢ) = σ²(1−hᵢᵢ) varies with leverage, so they are rescaled into standardised (internally studentised) and studentised (externally studentised, or deletion) residuals; the latter give an exact outlier test against the t distribution with n−p−1 degrees of freedom. The PRESS residual feeds the leave-one-out predictive statistic. Then comes the deletion question: how much does removing a case change the fit? The standard influence measures are Cook's D (all fitted values), DFFITS (the case's own fitted value), DFBETAS (one coefficient) and COVRATIO (precision). Each combines a residual term with a leverage term, confirming influence ≈ outlier × leverage, and each comes with a course cut-off. Finally, multicollinearity, meaning near-linear dependence among the predictors, inflates the coefficient variances. Its giveaway is a highly significant overall F with no individual t significant, and it is diagnosed by the variance inflation factor VIFⱼ = 1/(1−R²ⱼ) and the condition number κ.

In this chapter

What this chapter covers

  • 01 Three ideas kept straight: outlier, leverage, influence
  • 02 Types of residuals: raw, standardised, studentised, PRESS
  • 03 The studentised-residual outlier test and the Bonferroni correction
  • 04 Influence measures: Cook's D, DFFITS, DFBETAS, COVRATIO with cut-offs
  • 05 Reading an influence.measures() table and the never-auto-delete rule
  • 06 Multicollinearity: inflated SEs, the F/t split, sign-flips
  • 07 VIF = 1/(1−R²ⱼ), √VIF as the SE inflation factor, and the condition number κ
Worked example · free

Worked example: read an influence.measures() row

Q [6 marks]. A model on n = 30 cases with p = 3 parameters flags case 15: hat = 0.28, dffit = 0.98, dfb.H2S = 0.61, dfb.Lact = −0.74, cook.d = 0.30, cov.r = 1.31. Apply the course cut-offs and give a verdict.
  • +1 Leverage. Cut-off 2p/n = 2(3)/30 = 0.20; hat = 0.28 > 0.20 ⇒ high leverage.
  • +1 DFFITS. Cut-off 2√(p/n) = 2√(3/30) = 0.632; |0.98| > 0.632 ⇒ moves its own fitted value.
  • +1 DFBETAS. Cut-off 2/√n = 2/√30 = 0.365; both |0.61| and |0.74| exceed it ⇒ moves the H2S and Lactic coefficients.
  • +1 COVRATIO. Cut-off |COVRATIO − 1| > 3p/n = 3(3)/30 = 0.30; |1.31 − 1| = 0.31 > 0.30 ⇒ just flagged (a value above 1 means the case improves precision, so deleting it would inflate the coefficient variances).
  • +1 Cook's D. Cut-off 4/n = 4/30 = 0.133; D = 0.30 > 0.133 (but < 1) ⇒ influential on the overall fit.
  • +1 Verdict & action. Influential, high-leverage. The cut-offs are large-sample approximations and n = 30 is small, so do not delete automatically: refit without case 15, report both fits, and check whether the conclusions survive.
Every measure crosses its threshold, so case 15 is a genuinely influential, high-leverage point. But the cut-offs are large-sample approximations and n = 30 is small, so do not delete it automatically: refit with and without it, report both fits, and see whether the H2S and Lactic conclusions hold.
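
The cut-off arithmetic in the worked example can be reproduced in a few lines. This is a minimal Python sketch, not the course's own code: the function names `course_cutoffs` and `flag_case` and the dictionary layout are illustrative assumptions, while the numbers for case 15 are taken from the question above.

```python
import math

def course_cutoffs(n, p):
    """Course cut-offs for n cases and p parameters (helper name is hypothetical)."""
    return {
        "hat": 2 * p / n,                # leverage: h_ii > 2p/n
        "dffits": 2 * math.sqrt(p / n),  # |DFFITS| > 2*sqrt(p/n)
        "dfbetas": 2 / math.sqrt(n),     # |DFBETAS| > 2/sqrt(n)
        "covratio": 3 * p / n,           # |COVRATIO - 1| > 3p/n
        "cooks_d": 4 / n,                # D_i > 4/n
    }

def flag_case(row, n, p):
    """Compare one influence.measures()-style row against every cut-off."""
    cut = course_cutoffs(n, p)
    return {
        "hat": row["hat"] > cut["hat"],
        "dffit": abs(row["dffit"]) > cut["dffits"],
        "dfbetas": {name: abs(v) > cut["dfbetas"] for name, v in row["dfb"].items()},
        "cov.r": abs(row["cov.r"] - 1) > cut["covratio"],
        "cook.d": row["cook.d"] > cut["cooks_d"],
    }

# Case 15 from the worked example (n = 30, p = 3).
case_15 = {
    "hat": 0.28,
    "dffit": 0.98,
    "dfb": {"H2S": 0.61, "Lact": -0.74},
    "cook.d": 0.30,
    "cov.r": 1.31,
}

cutoffs = course_cutoffs(n=30, p=3)
flags = flag_case(case_15, n=30, p=3)

for name, value in cutoffs.items():
    print(f"{name:9s} cut-off = {value:.3f}")
print(flags)
```

Every flag comes back true for case 15, matching the hand calculation. The flags only say which thresholds are crossed; the decision about what to do with the case still has to be argued in words.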
Glossary

Key terms

Standardised vs studentised residual
Both rescale eᵢ for the unequal variance σ²(1−hᵢᵢ). The standardised (internally studentised) residual rᵢ uses σ̂ from all cases; the studentised (externally studentised, or deletion) residual tᵢ uses σ̂₍ᵢ₎, estimated with case i deleted, so a genuine outlier no longer inflates its own yardstick. Under the model, tᵢ follows a t distribution with n−p−1 degrees of freedom, which gives an exact outlier test.
PRESS residual
The leave-one-out residual e₍ᵢ₎ = eᵢ/(1−hᵢᵢ), computed from the full-data fit without refitting. Its sum of squares, PRESS = Σ[eᵢ/(1−hᵢᵢ)]², estimates genuine out-of-sample error; a small SSE but large PRESS signals overfitting.
Cook's distance (Dᵢ)
Dᵢ = (rᵢ²/p)·hᵢᵢ/(1−hᵢᵢ), a single number for a case's effect on all fitted values at once. It is a product of a residual term and a leverage term, which is the algebra behind influence ≈ outlier × leverage. Flag Dᵢ > 4/n (or > 1).
Variance inflation factor (VIF)
VIFⱼ = 1/(1−R²ⱼ), where R²ⱼ comes from regressing xⱼ on all the other predictors. √VIFⱼ is the factor by which se(bⱼ) is inflated relative to orthogonal predictors. Flag VIF > 5 (concerning) and > 10 (serious).
Multicollinearity
Near-linear dependence among the predictors: X'X is near-singular, so Var(β̂) = σ²(X'X)⁻¹ blows up and coefficients become unstable with ballooning standard errors. Predictions and R² within the data range are unharmed; the giveaway is a significant overall F with no significant individual t.
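
The PRESS and Cook's distance entries can be checked numerically. The sketch below is an illustration in plain Python for a simple linear regression (p = 2), where the hat values have the closed form hᵢᵢ = 1/n + (xᵢ − x̄)²/Sxx. The data are made up for the example, and the helper names `fit_line` and `leverages` are hypothetical. It confirms that the shortcut eᵢ/(1−hᵢᵢ) equals the residual obtained by actually refitting without case i, then computes Dᵢ from rᵢ and hᵢᵢ.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope

def leverages(x):
    """Hat values for simple regression: h_ii = 1/n + (x_i - xbar)^2 / Sxx."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

# Hypothetical data: the last case sits far out in x (high leverage).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 7.0]
n, p = len(x), 2

b0, b1 = fit_line(x, y)
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]  # raw residuals
h = leverages(x)

# PRESS residuals from the full-data fit, no refitting needed.
press_shortcut = [ei / (1 - hi) for ei, hi in zip(e, h)]

# Brute-force check: drop case i, refit, predict the dropped case.
press_refit = []
for i in range(n):
    a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    press_refit.append(y[i] - (a0 + a1 * x[i]))

sse = sum(ei ** 2 for ei in e)
press = sum(ri ** 2 for ri in press_shortcut)

# Cook's distance: D_i = (r_i^2 / p) * h_ii / (1 - h_ii).
s2 = sse / (n - p)
r = [ei / (s2 * (1 - hi)) ** 0.5 for ei, hi in zip(e, h)]
cooks_d = [(ri ** 2 / p) * hi / (1 - hi) for ri, hi in zip(r, h)]

print("SSE =", round(sse, 4), " PRESS =", round(press, 4))
print("Cook's D:", [round(d, 3) for d in cooks_d])
```

PRESS always exceeds SSE because every residual is divided by a factor below 1. In this made-up data set the far-out case dominates Cook's D, which is the outlier × leverage product at work.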
FAQ

Advanced Diagnostics FAQ

What is the multicollinearity giveaway?

A highly significant overall F with no individual t significant. The model jointly explains a lot, but no single predictor can be credited because they overlap. Partner symptoms: implausible coefficient signs or signs that flip when other predictors enter, very wide CIs, and sequential anova() verdicts that swing under reordering. When you see the F/t split, immediately check vif() and the correlation matrix.

How do I read a VIF, and when is it a problem?

VIFⱼ = 1/(1−R²ⱼ): VIF = 1 means orthogonal, and √VIFⱼ is how many times se(bⱼ) is inflated versus that ideal. Flag VIF > 5 as concerning and > 10 as serious; a condition number κ > 30 is also problematic. But high VIFs that come purely from deliberately included higher-order terms (a variable and its square, or main effects and their interaction) are expected and benign; centring the predictors removes most of that artificial collinearity.
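
The effect of centring on a variable and its square can be seen in a small sketch. It relies on one fact: with exactly two predictors, R²ⱼ is the squared correlation between them, so the VIF follows directly from that correlation. The data (x = 1, …, 10) and the function names `correlation` and `vif_two_predictors` are illustrative assumptions.

```python
def correlation(u, v):
    """Pearson correlation between two equal-length lists."""
    n = len(u)
    ubar, vbar = sum(u) / n, sum(v) / n
    suv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    suu = sum((a - ubar) ** 2 for a in u)
    svv = sum((b - vbar) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

def vif_two_predictors(u, v):
    """VIF = 1/(1 - R^2_j); with two predictors R^2_j is their squared correlation."""
    r2 = correlation(u, v) ** 2
    return 1 / (1 - r2)

# Hypothetical predictor and its square, uncentred.
x = [float(i) for i in range(1, 11)]
x_sq = [xi ** 2 for xi in x]

# Centre x first, then square the centred values.
xbar = sum(x) / len(x)
xc = [xi - xbar for xi in x]
xc_sq = [v ** 2 for v in xc]

vif_raw = vif_two_predictors(x, x_sq)
vif_centred = vif_two_predictors(xc, xc_sq)

print("VIF, raw x and x^2:      ", round(vif_raw, 2))
print("sqrt(VIF), raw:          ", round(vif_raw ** 0.5, 2))
print("VIF, centred x and x^2:  ", round(vif_centred, 2))
```

Here the raw VIF is roughly 20, well past the "serious" threshold, while the centred VIF is 1. The collinearity disappears completely only because these x values are symmetric about their mean; with skewed data centring reduces the VIF but need not bring it all the way down to 1.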

Should I delete a case that all the influence measures flag?

No, not automatically. The DFFITS / DFBETAS / COVRATIO thresholds are large-sample approximations, so in a small sample (say n = 30) they are indicative rather than decisive — reason from the relative magnitudes and the gap to the next point. Keep valid observations, refit with and without the case, report both fits, and delete only for a genuine data error or a clearly different population.

Why must raw residuals be rescaled before I judge them?

Because Var(eᵢ) = σ²(1−hᵢᵢ) depends on leverage: a high-leverage point pulls the line toward itself and shrinks its own residual, so raw residuals are not comparable across cases. Standardising (dividing by σ̂√(1−hᵢᵢ)) makes them comparable; the externally studentised version, which re-estimates σ with the case deleted, gives an honest outlier test.
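
The three residual types can be compared side by side. This is a sketch in plain Python for a simple linear regression with made-up data containing one deliberate outlier (the fifth case). The helper name `fit_line` is hypothetical. The deletion estimate uses the standard identity σ̂₍ᵢ₎² = (SSE − eᵢ²/(1−hᵢᵢ))/(n−p−1), so no refitting is needed.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope, plus xbar and Sxx for the hat values."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope, xbar, sxx

# Hypothetical data: case 5 (index 4) lies well above the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.3, 1.8, 3.4, 3.7, 9.0, 5.6, 7.4, 7.7]
n, p = len(x), 2

b0, b1, xbar, sxx = fit_line(x, y)
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]   # raw residuals
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]    # leverages
sse = sum(ei ** 2 for ei in e)
sigma_hat = math.sqrt(sse / (n - p))

# Standardised (internally studentised): sigma estimated from all cases.
r = [ei / (sigma_hat * math.sqrt(1 - hi)) for ei, hi in zip(e, h)]

# Studentised (externally studentised): sigma estimated with case i deleted.
t_direct = []
for ei, hi in zip(e, h):
    sigma_del = math.sqrt((sse - ei ** 2 / (1 - hi)) / (n - p - 1))
    t_direct.append(ei / (sigma_del * math.sqrt(1 - hi)))

# Equivalent shortcut from r_i alone: t_i = r_i * sqrt((n-p-1) / (n-p-r_i^2)).
t_shortcut = [ri * math.sqrt((n - p - 1) / (n - p - ri ** 2)) for ri in r]

for i in range(n):
    print(f"case {i + 1}: e = {e[i]: .3f}  r = {r[i]: .3f}  t = {t_direct[i]: .3f}")
```

For the outlying case the studentised value is much larger in magnitude than the standardised one, because the outlier no longer inflates the σ estimate used to judge it. Each tᵢ would then be compared with the t distribution on n−p−1 degrees of freedom.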

Study strategy

Exam move

This chapter is cut-off-driven, so put the whole table on your A4 sheet, each formula with its threshold:
  • Leverage: hᵢᵢ > 2p/n
  • Studentised residual: |tᵢ| > tₙ₋ₚ₋₁(0.975) (informally 2; use the Bonferroni quantile 1−α/(2n) when testing all n cases)
  • DFFITS: |DFFITS| > 2√(p/n)
  • DFBETAS: |DFBETAS| > 2/√n
  • COVRATIO: |COVRATIO−1| > 3p/n
  • Cook's D: Dᵢ > 4/n
  • VIF: > 5 (concerning), > 10 (serious)
  • Condition number: κ > 30
Practise reading an influence.measures() row by computing each cut-off from n and p, then deciding. Memorise the multicollinearity giveaway (F significant, no t significant) and the chain that follows: check vif(), look for sign-flips, note that predictions and R² are unharmed, then remedy by dropping a redundant predictor or combining variables. And state the never-auto-delete rule: refit with and without, report both.
