STAT7038 · Regression Modelling
Regression Diagnostics
Least squares always returns a line, even through data that has no business being modelled linearly, and the t-tests, F-tests, CIs and PIs are valid only when the error assumptions hold. Diagnostics are the plots and statistics that interrogate those assumptions rather than take them on trust. The workhorse is the residuals-vs-fitted plot: a flat, formless band of constant width is consistent with both linearity and equal variance, while a bend signals the wrong functional form and a funnel signals changing variance. The normal Q–Q plot checks normality (points close to the 45° reference line). The chapter then draws the classic exam distinction between three different ideas: an outlier (unusual in y, large residual), high leverage (unusual in x-space, large hii), and an influential point (its removal moves the fit, flagged by Cook's D, DFFITS, DFBETAS), with the mental model influence ≈ outlier × leverage. It closes with transformations: the power ladder, the log, and the Box–Cox family, used to bend the data back towards the assumptions.
What this chapter covers
- 01 Why diagnostics: inference is only valid if LINE holds
- 02 The residuals-vs-fitted plot: curvature (L) and funnel (E)
- 03 The normal Q–Q plot for the normality assumption
- 04 Reading R's four plot(lm) panels
- 05 Outlier ≠ leverage ≠ influence: three distinct ideas
- 06 Leverage hii, studentised residuals, and influence measures
- 07 Transformations: the power ladder, the log, and Box–Cox
Worked example: diagnose and classify an unusual point
- +1 (a) Leverage cut-off. 2p/n = 2(2)/25 = 0.16. Since hii = 0.28 > 0.16, the point is high leverage: unusual in x-space.
- +1 (b) Outlier check. The externally studentised residual is ti = 1.1, and |1.1| < 2, so it is not a vertical outlier: its y is in line with its x.
- +1 (c) Influence logic. Influence ≈ outlier × leverage; here leverage is high but the residual is modest, so the point sits near the line and is unlikely to be influential.
- +1 Confirm with a measure. Cook's D = (ri²/p) · hii/(1 − hii), where ri is the internally studentised residual, so a modest residual keeps D well below 1 even at high leverage. Here D ≈ 0.23, which is above the stricter 4/n = 0.16 screen, so treat that crossing as a prompt to refit without the point rather than as a verdict.
- +1 Action. Do not delete it: a high-leverage point that sits on the line reduces the standard error of the slope. Keep it, and consider removal only for a genuine data error or a clearly different population.
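The arithmetic in this worked example can be checked with a short sketch. It is written in Python rather than R purely as a calculator, and the function names are hypothetical. One step is an assumption not stated in the text: the squared internally studentised residual ri² is recovered from the external ti through the standard identity ti = ri·√((n − p − 1)/(n − p − ri²)).

```python
def leverage_cutoff(p: int, n: int) -> float:
    """Rule-of-thumb leverage flag 2p/n (p = number of coefficients)."""
    return 2 * p / n


def internal_r2_from_external_t(t: float, n: int, p: int) -> float:
    """Squared internally studentised residual implied by an external one.

    Assumption: uses t_i = r_i * sqrt((n - p - 1) / (n - p - r_i^2)),
    solved for r_i^2.
    """
    return t * t * (n - p) / (n - p - 1 + t * t)


def cooks_d(r2: float, h: float, p: int) -> float:
    """Cook's distance D_i = (r_i^2 / p) * h_ii / (1 - h_ii)."""
    return (r2 / p) * h / (1 - h)


# Numbers taken from the worked example: n = 25, p = 2, h_ii = 0.28, t_i = 1.1.
n, p, h, t = 25, 2, 0.28, 1.1

cutoff = leverage_cutoff(p, n)
high_leverage = h > cutoff      # part (a)
outlier = abs(t) > 2            # part (b)
r2 = internal_r2_from_external_t(t, n, p)
d = cooks_d(r2, h, p)           # the "confirm with a measure" step

print(f"leverage cut-off 2p/n = {cutoff:.2f}, high leverage: {high_leverage}")
print(f"|t_i| > 2: {outlier}")
print(f"Cook's D = {d:.3f}, 4/n = {4 / n:.2f}")
```

With these inputs the sketch gives D of about 0.23, which is above 4/n = 0.16 but well below 1. In a sample of 25 the 4/n screen is best read as a reason to refit with and without the point, not as proof of influence.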
Key terms
- Residuals-vs-fitted plot
- A scatter of each residual ei against its fitted value ŷi. Under the assumptions it is a structureless band of constant width about zero. A bend signals a failure of linearity (wrong functional form); a funnel or megaphone signals non-constant variance (heteroscedasticity).
- Normal Q–Q plot
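- Plots the ordered standardised residuals against the standard-normal quantiles they should have if normal. Points close to the 45° line support normality; a consistent bow to one side signals skew, and an S-shape, with the two ends curling away from the line in opposite directions, signals heavy or light tails. Mild departures matter less for tests and CIs in large samples (the CLT), though prediction intervals still rely on normal errors.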
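- Leverage (hii)
- In simple linear regression, hii = 1/n + (xi − x̄)²/Sxx measures how far a point's x is from the centre. With an intercept in the model it lies between 1/n and 1, the leverages sum to p (the number of coefficients), and the average leverage is p/n. Flag hii > 2p/n; high leverage is not a problem in itself, only in combination with a large residual.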
- Studentised residual
- A residual rescaled to have comparable variance. The externally studentised ti = ei/[σ̂(i)√(1−hii)] re-estimates σ with case i deleted, so under the model ti follows a t distribution on n − p − 1 degrees of freedom, giving an exact t-test for a mean-shift outlier (informal flag |ti| > 2).
- Box–Cox transformation
- A family y^(λ) = (y^λ − 1)/λ for y > 0 (with λ = 0 defined as log y, the limit as λ → 0) whose power λ is chosen by maximum likelihood. boxcox() in R's MASS package draws a profile-likelihood curve; read λ off the peak and round to a tidy value such as 0 (the log).
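To make the Box–Cox entry concrete, here is a minimal Python sketch of the profile log-likelihood that a tool such as boxcox() plots. It is an illustration under stated assumptions, not R's implementation: a simple linear regression of the transformed y on x, a coarse grid of λ values, and a small hypothetical data set built so that log y is linear in x. All function names are made up for the example.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of one positive value; lam = 0 is the log limit."""
    if y <= 0:
        raise ValueError("Box-Cox needs y > 0")
    return math.log(y) if lam == 0 else (y ** lam - 1) / lam


def rss_simple_ols(x, z):
    """Residual sum of squares from regressing z on x with an intercept."""
    n = len(x)
    xbar, zbar = sum(x) / n, sum(z) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxz = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z))
    szz = sum((zi - zbar) ** 2 for zi in z)
    return szz - sxz ** 2 / sxx


def profile_loglik(x, y, lam):
    """Box-Cox profile log-likelihood, up to an additive constant."""
    n = len(y)
    z = [box_cox(yi, lam) for yi in y]
    # Jacobian term that makes likelihoods comparable across lambda.
    jacobian = (lam - 1) * sum(math.log(yi) for yi in y)
    return -0.5 * n * math.log(rss_simple_ols(x, z) / n) + jacobian


# Hypothetical data: log y is linear in x with small alternating noise.
x = [float(i) for i in range(1, 11)]
y = [math.exp(0.3 * xi + (0.05 if i % 2 else -0.05))
     for i, xi in enumerate(x, start=1)]

grid = [k / 4 for k in range(-4, 9)]  # lambda from -1 to 2 in steps of 0.25
best_lambda = max(grid, key=lambda lam: profile_loglik(x, y, lam))
print("lambda at the peak of the profile:", best_lambda)
```

Because the data were built on the log scale, the peak lands at λ = 0. With real data the peak rarely sits on a tidy value, which is why the notes suggest rounding to the nearest interpretable power.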
Regression Diagnostics FAQ
What is the difference between an outlier, a high-leverage point and an influential point?
They answer different questions. An outlier is unusual in y given its x (a large residual, detected by the studentised ti). High leverage is unusual in x-space, far from x̄ (a large hii). An influential point is one whose removal moves the fit (Cook's D, DFFITS, DFBETAS). A point can be any combination — the rule is influence ≈ outlier × leverage, so it takes both a big residual and high leverage to be influential.
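The three ideas can be separated numerically. The sketch below, in Python with hypothetical data and function names, adds one extra point to the same ten-point data set in three different ways. For that point it reports the leverage hii, the externally studentised residual ti, and how far the slope moves when the point is deleted. In R the corresponding quantities would come from hatvalues(), rstudent() and a refit without the case.

```python
import math


def fit(x, y):
    """Simple linear regression; returns (intercept, slope)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def diagnose_last(x, y):
    """Leverage, external studentised residual and slope shift for the last case."""
    n, p = len(x), 2
    b0, b1 = fit(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    h = 1 / n + (x[-1] - xbar) ** 2 / sxx
    # Refit with the last case deleted to get sigma-hat(i) and the deleted slope.
    d0, d1 = fit(x[:-1], y[:-1])
    rss_del = sum((yi - d0 - d1 * xi) ** 2 for xi, yi in zip(x[:-1], y[:-1]))
    sigma_del = math.sqrt(rss_del / (n - 1 - p))
    e = y[-1] - b0 - b1 * x[-1]
    t = e / (sigma_del * math.sqrt(1 - h))
    return h, t, b1 - d1


# Hypothetical data: ten points near y = 2 + 0.5x with small alternating noise.
x_base = [float(i) for i in range(1, 11)]
y_base = [2 + 0.5 * xi + (0.1 if i % 2 else -0.1)
          for i, xi in enumerate(x_base, start=1)]

scenarios = {
    "outlier only": (5.5, 7.75),         # central x, far above the line
    "leverage only": (20.0, 12.0),       # extreme x, on the line
    "outlier and leverage": (20.0, 6.0)  # extreme x, far below the line
}
cutoff = 2 * 2 / 11  # 2p/n with p = 2 and n = 11
results = {}
for name, (x_new, y_new) in scenarios.items():
    h, t, shift = diagnose_last(x_base + [x_new], y_base + [y_new])
    results[name] = (h, t, shift)
    print(f"{name:22s} h = {h:.3f}  t = {t:7.2f}  slope shift = {shift:+.3f}")
```

Only the third point moves the slope noticeably. The first has a large residual but sits at the mean of x, so it shifts the intercept and leaves the slope alone. The second is far out in x but agrees with the trend.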
Does R labelling a point mean it is an outlier?
No. R automatically prints the case numbers of the three most extreme points on every diagnostic panel — that is just labelling, not a verdict. A point flagged on the Q–Q plot but with |studentised residual| < 2 is not an outlier. Always judge against the cut-off and the size of the gap to the next point, never against the label alone.
My residual plot shows a funnel — what should I do?
A funnel (spread growing or shrinking with the fitted value) is heteroscedasticity, a failure of equal variance. The standard fix is to transform y — the log when the standard deviation grows with the mean, √y for count-like data, arcsin√y for proportions — or to use weighted least squares. A bend instead means the wrong functional form: add a term such as x² or transform x.
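For the weighted-least-squares option, a minimal Python sketch of the closed-form fit for one predictor is given below. The data are hypothetical, built so that the spread grows in proportion to x, and the weights 1/x² are an assumption that matches that construction. In practice the weights have to be justified from the data or the design. In R the same fit is usually obtained by passing weights to lm().

```python
def wls_fit(x, y, w):
    """Weighted least squares for y = b0 + b1*x; returns (b0, b1)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


# Hypothetical funnel: deviations from y = 1 + 2x grow in proportion to x.
x = [float(i) for i in range(1, 11)]
y = [1 + 2 * xi + 0.3 * xi * (1 if i % 2 else -1)
     for i, xi in enumerate(x, start=1)]

ols = wls_fit(x, y, [1.0] * len(x))             # equal weights reproduce OLS
wls = wls_fit(x, y, [1 / xi ** 2 for xi in x])  # assumes Var(error) grows as x^2

print("OLS intercept, slope:", ols)
print("WLS intercept, slope:", wls)
```

Down-weighting the noisy high-x cases lets the precise low-x cases drive the fit. After fitting, the check is the same as before: plot the weighted residuals against the fitted values and look for a band of constant width.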
Should I delete a point that all the influence measures flag?
Not automatically. Removal is justified only for a genuine data error or a clearly different population. Otherwise keep the point, refit with and without it, and report both fits so the reader can see whether the conclusions are robust. The cut-offs (DFFITS, DFBETAS, COVRATIO) are large-sample approximations, so in small samples reason from the relative magnitudes and the gap to the next point, not blind threshold-crossing.
Exam move
Make 'plot the residuals before you trust a single p-value' a reflex: every fit yields a line, but the residual plots tell you whether it should have. Learn the two readings of the residuals-vs-fitted plot at once — a bend means non-linearity (transform x), a funnel means non-constant variance (transform y or use WLS). For the three-ideas question, write a one-row table linking each idea to its statistic: outlier → studentised ti, leverage → hii > 2p/n, influence → Cook's D / DFFITS / DFBETAS, with the slogan influence ≈ outlier × leverage. Keep the leverage and outlier cut-offs and the transformation menu (log for a funnel, √y for counts, transform x for curvature, Box–Cox for the optimal power) on your A4 sheet, and never extrapolate a back-transformed model outside the observed x-range.