Australian National University · S1 2026 · FACULTY OF SCIENCE

STAT7038 · Regression Modelling

50% final exam · hurdle · 14 Chapters · 5-page Bible
Chapter 3 of 7 · STAT7038

Regression Diagnostics

Least squares always returns a line, even through data that has no business being modelled linearly, and the t-tests, F-tests, CIs and PIs are valid only when the error assumptions hold. Diagnostics are the plots and statistics that interrogate those assumptions rather than take them on trust. The workhorse is the residuals-vs-fitted plot: a flat, formless band of constant width is consistent with both linearity and equal variance, while a bend signals the wrong functional form and a funnel signals changing variance. The normal Q–Q plot checks normality (points on the 45° line). The chapter then draws the classic exam distinction between three different ideas: an outlier (unusual in y, large residual), a high-leverage point (unusual in x-space, large hii), and an influential point (its removal moves the fit, flagged by Cook's D, DFFITS, DFBETAS), with the mental model influence ≈ outlier × leverage. It closes with transformations: the power ladder, the log, and the Box–Cox family, used to bend the data back towards the assumptions.

In this chapter

What this chapter covers

  • 01 Why diagnostics: inference is only valid if LINE (Linearity, Independence, Normality, Equal variance) holds
  • 02 The residuals-vs-fitted plot: curvature (L) and funnel (E)
  • 03 The normal Q–Q plot for the normality assumption
  • 04 Reading R's four plot(lm) panels
  • 05 Outlier ≠ leverage ≠ influence: three distinct ideas
  • 06 Leverage hii, studentised residuals, and influence measures
  • 07 Transformations: the power ladder, the log, and Box–Cox
Worked example · free

Worked example: diagnose and classify an unusual point

Q [5 marks]. In a fit with n = 25 cases and p = 2 parameters, one observation has hat value hii = 0.28 and externally studentised residual ti = 1.1. (a) Is it high leverage? (b) Is it a vertical outlier? (c) Is it likely to be influential, and what should you do?
  • +1 (a) Leverage cut-off. 2p/n = 2(2)/25 = 0.16. Since hii = 0.28 > 0.16, the point is high leverage: unusual in x-space.
  • +1 (b) Outlier check. The externally studentised residual is ti = 1.1, and |1.1| < 2, so it is not a vertical outlier: its y is in line with its x.
  • +1 (c) Influence logic. Influence ≈ outlier × leverage; here leverage is high but the residual is modest, so the point sits near the line and is unlikely to be strongly influential.
  • +1 Confirm with a measure. Cook's D = (ri²/p) × hii/(1−hii), where ri is the internally studentised residual. Taking ri ≈ ti = 1.1 gives D ≈ (1.21/2)(0.28/0.72) ≈ 0.24, which is above the rough 4/n = 0.16 screen but well below the 0.5 and 1 contours that R draws, so the point deserves a look but is not a serious concern.
  • +1 Action. Do not delete it: a high-leverage point that agrees with the line reduces the standard error of the slope. Keep it, and only consider removal for a genuine data error or a clearly different population.
High leverage (0.28 > 2p/n = 0.16) but not an outlier (|ti| = 1.1 < 2) and therefore unlikely to be influential, since influence ≈ outlier × leverage. Keep the point; do not auto-delete.
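The arithmetic in this worked example can be checked with a few lines of code. The sketch below uses Python purely for illustration (the chapter itself works in R), and the variable names are our own. It applies the 2p/n and |ti| > 2 screens, then converts ti back to the internally studentised ri before computing Cook's D.

```python
import math

# Values taken from the worked example above.
n, p = 25, 2          # cases and fitted parameters
h = 0.28              # hat value h_ii
t_ext = 1.1           # externally studentised residual t_i

# (a) Leverage screen: flag h_ii above twice the average leverage p/n.
leverage_cutoff = 2 * p / n
high_leverage = h > leverage_cutoff

# (b) Outlier screen: informal |t_i| > 2 rule.
outlier = abs(t_ext) > 2

# Cook's D is defined with the internally studentised residual r_i, so
# invert t_i = r_i * sqrt((n - p - 1) / (n - p - r_i^2)) to recover r_i^2.
r_sq = t_ext**2 * (n - p) / (n - p - 1 + t_ext**2)
cooks_d = (r_sq / p) * h / (1 - h)

print(f"2p/n = {leverage_cutoff:.2f}, high leverage: {high_leverage}")
print(f"|t_i| = {abs(t_ext)}, outlier: {outlier}")
print(f"Cook's D = {cooks_d:.3f}, 4/n = {4 / n:.2f}")
```

Cook's D comes out at about 0.23 (about 0.24 if ri is simply approximated by ti). That is above the rough 4/n = 0.16 screen but far below 0.5, so the point merits a with-and-without refit rather than deletion.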
Glossary

Key terms

Residuals-vs-fitted plot
A scatter of each residual ei against its fitted value ŷi. Under the assumptions it is a structureless band of constant width about zero. A bend signals a failure of linearity (wrong functional form); a funnel or megaphone signals non-constant variance (heteroscedasticity).
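To see what a bend looks like numerically, the sketch below fits a straight line to made-up data whose true relationship is quadratic. The data and the helper name fit_line are illustrative assumptions, and Python stands in for R here. The residuals run positive, negative, positive across the fitted values, which is the curved pattern the plot would show.

```python
def fit_line(x, y):
    """Least-squares (intercept, slope) for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up data: the truth is quadratic, but we fit a straight line.
x = [1, 2, 3, 4, 5, 6, 7]
y = [xi**2 for xi in x]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Positive at both ends and negative in the middle: a bend, not a band.
for f, e in zip(fitted, residuals):
    print(f"fitted = {f:6.1f}   residual = {e:5.1f}")
```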
Normal Q–Q plot
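Plots the ordered standardised residuals against the standard-normal quantiles they would have if the errors were normal. Points on the 45° line support normality; a consistent bow to one side of the line signals skew, and an S-shape, with the two ends curling away from the line in opposite directions, signals heavy or light tails. Mild departures matter less for tests and confidence intervals on the coefficients in large samples (the CLT).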
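The coordinates behind a Q–Q plot are easy to build by hand. This is a minimal Python sketch, not R's own routine: the plotting positions (i − 0.5)/n are one common convention and are an assumption here, and the residual values are hypothetical.

```python
from statistics import NormalDist

def qq_points(std_residuals):
    """Pair each ordered residual with its standard-normal quantile."""
    n = len(std_residuals)
    ordered = sorted(std_residuals)
    # Plotting positions (i - 0.5)/n are one common choice; software may
    # use a slightly different convention.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical standardised residuals, for illustration only.
r = [-1.4, -0.7, -0.3, 0.0, 0.2, 0.6, 1.6]
points = qq_points(r)
for q, value in points:
    print(f"normal quantile = {q:6.2f}   ordered residual = {value:5.2f}")
```

Plotting the second column against the first gives the Q–Q plot; roughly normal residuals track the line y = x.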
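Leverage (hii)
For simple linear regression, hii = 1/n + (xi − x̄)²/Sxx measures how far a point's x is from the centre. With an intercept in the model it lies between 1/n and 1, the leverages sum to p, and the average leverage is p/n. Flag hii > 2p/n; high leverage on its own does not distort the fit, it only gives the point the potential to.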
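The simple-regression leverage formula can be applied directly. The sketch below (Python for illustration, made-up x values, helper name our own) computes every hii, confirms that they sum to p = 2, and flags anything above 2p/n.

```python
def leverages(x):
    """Hat values h_ii = 1/n + (x_i - x_bar)^2 / Sxx for simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Made-up predictor values with one point far from the rest.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
p = 2  # intercept + slope
h = leverages(x)
cutoff = 2 * p / len(x)
flagged = [xi for xi, hi in zip(x, h) if hi > cutoff]

print(f"sum of leverages = {sum(h):.3f} (should equal p = {p})")
for xi, hi in zip(x, h):
    flag = "  <- high leverage" if hi > cutoff else ""
    print(f"x = {xi:2d}   h = {hi:.3f}{flag}")
```

Only x = 30 crosses the cut-off of 0.4; leverage depends on x alone, so this says nothing yet about whether the point fits the line.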
Studentised residual
A residual rescaled so that all cases have comparable variance. The externally studentised ti = ei/[σ̂(i)√(1−hii)] re-estimates σ with case i deleted; under the model ti follows a t distribution with n − p − 1 degrees of freedom, giving an exact t-test for a mean-shift outlier (informal flag |ti| > 2).
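A minimal sketch of both kinds of studentised residual for a simple linear regression, written in Python for illustration with made-up data. It uses the closed-form link ti = ri√((n − p − 1)/(n − p − ri²)), which avoids refitting the model n times; the function name is our own.

```python
import math

def studentised_residuals(x, y):
    """Internally (r_i) and externally (t_i) studentised residuals
    for a simple linear regression with p = 2 parameters."""
    n, p = len(x), 2
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    e = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    sigma2 = sum(ei**2 for ei in e) / (n - p)   # full-data variance estimate

    r = [ei / math.sqrt(sigma2 * (1 - hi)) for ei, hi in zip(e, h)]
    # Deleting case i is equivalent to this closed-form rescaling of r_i.
    t = [ri * math.sqrt((n - p - 1) / (n - p - ri**2)) for ri in r]
    return r, t

# Made-up data: case 5 (y = 9.0) sits well above the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 1.9, 3.2, 3.9, 9.0, 6.1, 6.8, 8.2]
r, t = studentised_residuals(x, y)
for i, (ri, ti) in enumerate(zip(r, t), start=1):
    flag = "  <- |t| > 2" if abs(ti) > 2 else ""
    print(f"case {i}: r = {ri:6.2f}   t = {ti:6.2f}{flag}")
```

Case 5 has ri of about 2.4 but a far larger ti, because the outlier inflates the full-data σ̂ while the deleted-case σ̂(i) is not contaminated by it.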
Box–Cox transformation
A family y(λ) = (y^λ − 1)/λ for positive y (with λ = 0 defined as log y, the limit as λ → 0) whose power λ is chosen by maximum likelihood. R's boxcox(), from the MASS package, draws a profile-likelihood curve; read λ off the peak and round to a tidy value such as 0 (the log).
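The idea behind the profile-likelihood curve can be sketched by hand. The Python code below is an illustration under stated assumptions, not a reimplementation of R's boxcox(): it handles only a simple linear regression, uses a coarse grid of λ values, and runs on made-up data in which log y is linear in x. The function names are our own.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value; lam = 0 is the log limit."""
    return math.log(y) if lam == 0 else (y**lam - 1) / lam

def profile_loglik(x, y, lam):
    """Profile log-likelihood of lam for a simple linear regression
    (up to an additive constant)."""
    n = len(y)
    z = [box_cox(yi, lam) for yi in y]
    x_bar, z_bar = sum(x) / n, sum(z) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z)) / sxx
    rss = sum((zi - z_bar - slope * (xi - x_bar)) ** 2 for xi, zi in zip(x, z))
    # Jacobian term makes likelihoods comparable across different lam.
    jacobian = (lam - 1) * sum(math.log(yi) for yi in y)
    return -0.5 * n * math.log(rss / n) + jacobian

# Made-up data: log(y) is linear in x plus a small deterministic wobble.
x = list(range(1, 11))
wobble = [0.01 if i % 2 == 0 else -0.01 for i in range(10)]
y = [math.exp(0.5 + 0.3 * xi + w) for xi, w in zip(x, wobble)]

grid = [k / 2 for k in range(-4, 5)]   # -2.0, -1.5, ..., 2.0
best = max(grid, key=lambda lam: profile_loglik(x, y, lam))
print(f"lambda on the grid with the highest profile likelihood: {best}")
```

On this data the peak is at λ = 0, which matches the way the data were built: the log is the transformation that makes the relationship linear.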
FAQ

Regression Diagnostics FAQ

What is the difference between an outlier, a high-leverage point and an influential point?

They answer different questions. An outlier is unusual in y given its x (a large residual, detected by the studentised ti). High leverage is unusual in x-space, far from x̄ (a large hii). An influential point is one whose removal moves the fit (Cook's D, DFFITS, DFBETAS). A point can be any combination — the rule is influence ≈ outlier × leverage, so it takes both a big residual and high leverage to be influential.

Does R labelling a point mean it is an outlier?

No. R automatically prints the case numbers of the three most extreme points on every diagnostic panel — that is just labelling, not a verdict. A point flagged on the Q–Q plot but with |studentised residual| < 2 is not an outlier. Always judge against the cut-off and the size of the gap to the next point, never against the label alone.

My residual plot shows a funnel — what should I do?

A funnel (spread growing or shrinking with the fitted value) is heteroscedasticity, a failure of equal variance. The standard fix is to transform y — the log when the standard deviation grows with the mean, √y for count-like data, arcsin√y for proportions — or to use weighted least squares. A bend instead means the wrong functional form: add a term such as x² or transform x.
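Why the log cures a funnel when the standard deviation grows with the mean can be shown with a small made-up example (Python for illustration; the numbers are not course data). Each group has the same multiplicative scatter, so the raw spread scales with the level while the spread of log y stays constant.

```python
import math
from statistics import stdev

# Made-up groups: each shares the same multiplicative scatter, so the
# standard deviation grows in proportion to the mean.
factors = [0.8, 0.9, 1.0, 1.1, 1.25]
levels = [10, 100, 1000]

raw_sds, log_sds = [], []
for level in levels:
    values = [level * f for f in factors]
    raw_sds.append(stdev(values))
    log_sds.append(stdev([math.log(v) for v in values]))
    print(f"level {level:5d}: sd(y) = {raw_sds[-1]:8.2f}   sd(log y) = {log_sds[-1]:.4f}")
```

The raw standard deviations grow tenfold at each step, which is the funnel; on the log scale they are identical, which is the flat band.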

Should I delete a point that all the influence measures flag?

Not automatically. Removal is justified only for a genuine data error or a clearly different population. Otherwise keep the point, refit with and without it, and report both fits so the reader can see whether the conclusions are robust. The cut-offs (DFFITS, DFBETAS, COVRATIO) are large-sample approximations, so in small samples reason from the relative magnitudes and the gap to the next point, not blind threshold-crossing.
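Refitting with and without the flagged case is mechanical. A minimal Python sketch on made-up data, with the helper name fit_line as our own assumption: five points lie exactly on y = 1 + 2x, and one case at x = 10 pulls the line down.

```python
def fit_line(x, y):
    """Least-squares (intercept, slope) for simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up data: five points on y = 1 + 2x plus one flagged case at x = 10.
x = [1, 2, 3, 4, 5, 10]
y = [3, 5, 7, 9, 11, 5]
flagged = 5  # list index of the case under scrutiny

fit_with = fit_line(x, y)
fit_without = fit_line(x[:flagged] + x[flagged + 1:],
                       y[:flagged] + y[flagged + 1:])

print(f"with the point:    intercept = {fit_with[0]:.2f}, slope = {fit_with[1]:.2f}")
print(f"without the point: intercept = {fit_without[0]:.2f}, slope = {fit_without[1]:.2f}")
```

Here the slope moves from about 0.16 to 2 when the case is dropped, so the conclusions are not robust to it; reporting both fits makes that visible to the reader.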

Study strategy

Exam move

Make 'plot the residuals before you trust a single p-value' a reflex: every fit yields a line, but the residual plots tell you whether it should have. Learn the two readings of the residuals-vs-fitted plot at once — a bend means non-linearity (transform x), a funnel means non-constant variance (transform y or use WLS). For the three-ideas question, write a one-row table linking each idea to its statistic: outlier → studentised ti, leverage → hii > 2p/n, influence → Cook's D / DFFITS / DFBETAS, with the slogan influence ≈ outlier × leverage. Keep the leverage and outlier cut-offs and the transformation menu (log for a funnel, √y for counts, transform x for curvature, Box–Cox for the optimal power) on your A4 sheet, and never extrapolate a back-transformed model outside the observed x-range.
