Australian National University · S1 2026 · FACULTY OF SCIENCE

STAT7038 · Regression Modelling

50% final exam · hurdle · 14 Chapters · 5-page Bible

Built to mirror S1 2026 · updated this semester

Chapter 3 of 7 · STAT7038

Regression Diagnostics

Least squares always returns a line, even through data that has no business being modelled linearly, and the t-tests, F-tests, CIs and PIs are valid only when the error assumptions hold. Diagnostics are the plots and statistics that test those assumptions rather than take them on trust. The workhorse is the residuals-vs-fitted plot: a flat, formless band of constant width is consistent with both linearity and equal variance, while a bend signals the wrong functional form and a funnel signals changing variance. The normal Q–Q plot checks normality (points close to the 45° reference line). The chapter then draws the classic exam distinction between three different ideas: an outlier (unusual in y, large residual), high leverage (unusual in x-space, large h_ii), and an influential point (its removal moves the fit, flagged by Cook's D, DFFITS, DFBETAS), with the mental model influence ≈ outlier × leverage. It closes with transformations: the power ladder, the log, and the Box–Cox family, used to bend the data back towards the assumptions.

In this chapter

What this chapter covers

01 Why diagnostics: inference is only valid if LINE (Linearity, Independence, Normality, Equal variance) holds
02 The residuals-vs-fitted plot: curvature (L) and funnel (E)
03 The normal Q–Q plot for the normality assumption
04 Reading R's four plot(lm) panels
05 Outlier ≠ leverage ≠ influence: three distinct ideas
06 Leverage h_ii, studentised residuals, and influence measures
07 Transformations: the power ladder, the log, and Box–Cox

Worked example · free

Worked example: diagnose and classify an unusual point

Q [5 marks]. In a fit with n = 25 cases and p = 2 parameters, one observation has hat value h_ii = 0.28 and externally studentised residual t_i = 1.1. (a) Is it high leverage? (b) Is it a vertical outlier? (c) Is it likely to be influential, and what should you do?

+1 (a) Leverage cut-off. 2p/n = 2(2)/25 = 0.16. Since h_ii = 0.28 > 0.16, the point is high leverage: unusual in x-space.
+1 (b) Outlier check. The externally studentised residual is t_i = 1.1, and |1.1| < 2, so it is not a vertical outlier: its y is in line with its x.
+1 (c) Influence logic. Influence ≈ outlier × leverage; here leverage is high but the residual is modest, so the point sits near the line and is unlikely to be strongly influential.
+1 Confirm with a measure. Cook's D = (r_i²/p) · h_ii/(1−h_ii), so a modest residual holds D down even at high leverage. Taking r_i ≈ t_i = 1.1 gives D ≈ (1.21/2)(0.28/0.72) ≈ 0.24. That is above the strict screening value 4/n = 0.16 but well below the commonly quoted concern level of 1, so it calls for a with-and-without refit rather than alarm.
+1 Action. Do not delete it: a high-leverage point that sits on the line anchors the fit and tightens the slope estimate. Keep it, and consider removal only for a genuine data error or a clearly different population.

High leverage (0.28 > 2p/n = 0.16) but not an outlier (|t_i| = 1.1 < 2) and therefore unlikely to be influential, since influence ≈ outlier × leverage. Keep the point; do not auto-delete.
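The arithmetic in this worked example can be checked with a short sketch. It is illustrative only: the function name classify_point and its dictionary keys are hypothetical, the cut-offs are the rules of thumb quoted above, and Cook's D is written in its standard form D = (r_i²/p) · h_ii/(1 − h_ii), with the internally studentised r_i recovered from the externally studentised t_i.

```python
def classify_point(n, p, h_ii, t_ext):
    """Apply the rule-of-thumb cut-offs from the worked example (hypothetical helper)."""
    leverage_cutoff = 2 * p / n
    # Recover the internally studentised residual from the external one:
    # t_i^2 = r_i^2 (n - p - 1) / (n - p - r_i^2)  =>  solve for r_i^2.
    r_sq = t_ext**2 * (n - p) / (n - p - 1 + t_ext**2)
    # Cook's distance in terms of r_i and the hat value.
    cooks_d = (r_sq / p) * h_ii / (1 - h_ii)
    return {
        "leverage_cutoff": leverage_cutoff,
        "high_leverage": h_ii > leverage_cutoff,
        "outlier": abs(t_ext) > 2,          # informal |t_i| > 2 flag
        "cooks_d": cooks_d,
        "cooks_screen": 4 / n,              # strict screening value
    }

# Numbers from the question: n = 25, p = 2, h_ii = 0.28, t_i = 1.1.
result = classify_point(n=25, p=2, h_ii=0.28, t_ext=1.1)
for key, value in result.items():
    print(f"{key}: {value}")
```

Under these numbers D comes out between 0.23 and 0.24, which is above the strict screening value 4/n = 0.16 but well below the commonly quoted concern level of 1, so the check supports a with-and-without refit rather than deletion.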

Glossary

Key terms

Residuals-vs-fitted plot: A scatter of each residual e_i against its fitted value ŷ_i. Under the assumptions it is a structureless band of constant width about zero. A bend signals a failure of linearity (wrong functional form); a funnel or megaphone signals non-constant variance (heteroscedasticity).
Normal Q–Q plot: Plots the ordered standardised residuals against the standard-normal quantiles they should match if the errors are normal. Points close to the 45° line support normality; a consistent bow to one side signals skew, and an S-shape, with both ends peeling away from the line, signals heavy or light tails. Mild departures matter less in large samples (the CLT).
Leverage (h_ii): In simple linear regression, h_ii = 1/n + (x_i − x̄)²/S_xx measures how far a point's x is from the centre. It lies between 1/n and 1 (for a model with an intercept), the leverages sum to p, and the average leverage is p/n. Flag h_ii > 2p/n; high leverage alone does not make a point harmful.
Studentised residual: A residual rescaled to have comparable variance. The externally studentised t_i = e_i/[σ̂_(i)√(1−h_ii)] re-estimates σ with case i deleted. Under the model it follows a t distribution on n − p − 1 degrees of freedom, giving an exact t-test for a mean-shift outlier (informal flag |t_i| > 2).
Box–Cox transformation: A family y^(λ) = (y^λ − 1)/λ (with λ = 0 giving log y) whose power λ is chosen by maximum likelihood. R's boxcox() (in the MASS package) draws a profile-likelihood curve; read λ off the peak and round to a tidy value such as 0 (the log).
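The leverage and studentised-residual definitions above can be computed by hand for a simple linear regression. The sketch below is a minimal illustration, not a replacement for R's built-in diagnostics: the helper name slr_diagnostics and the ten data points are made up, with the last case deliberately placed far out in x.

```python
import math

def slr_diagnostics(x, y):
    """Per-case leverage, studentised residuals and Cook's D for y = b0 + b1*x."""
    n, p = len(x), 2
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)
    rows = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - x_bar) ** 2 / s_xx            # leverage h_ii
        r = e / math.sqrt(sse / (n - p) * (1 - h))      # internally studentised
        s2_del = (sse - e * e / (1 - h)) / (n - p - 1)  # sigma^2 with case i deleted
        t = e / math.sqrt(s2_del * (1 - h))             # externally studentised
        d = (r * r / p) * h / (1 - h)                   # Cook's distance
        rows.append({"h": h, "r": r, "t": t, "cooks_d": d})
    return rows

# Hypothetical data: nine clustered x values plus one far from the centre.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 30.0]
rows = slr_diagnostics(x, y)

cutoff = 2 * 2 / len(x)  # 2p/n
high_leverage = [i for i, row in enumerate(rows) if row["h"] > cutoff]
total_leverage = sum(row["h"] for row in rows)
print("high-leverage cases (0-based):", high_leverage)
print("sum of leverages:", round(total_leverage, 6))
```

Only the far-out case clears 2p/n here, and the leverages add up to p = 2, matching the properties listed in the glossary entry.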

FAQ

Regression Diagnostics FAQ

What is the difference between an outlier, a high-leverage point and an influential point?

They answer different questions. An outlier is unusual in y given its x (a large residual, detected by the studentised t_i). High leverage is unusual in x-space, far from x̄ (a large h_ii). An influential point is one whose removal moves the fit (Cook's D, DFFITS, DFBETAS). A point can be any combination — the rule is influence ≈ outlier × leverage, so it takes both a big residual and high leverage to be influential.

Does R labelling a point mean it is an outlier?

No. R automatically prints the case numbers of the three most extreme points on every diagnostic panel — that is just labelling, not a verdict. A point flagged on the Q–Q plot but with |studentised residual| < 2 is not an outlier. Always judge against the cut-off and the size of the gap to the next point, never against the label alone.

My residual plot shows a funnel — what should I do?

A funnel (spread growing or shrinking with the fitted value) is heteroscedasticity, a failure of equal variance. The standard fix is to transform y — the log when the standard deviation grows with the mean, √y for count-like data, arcsin√y for proportions — or to use weighted least squares. A bend instead means the wrong functional form: add a term such as x² or transform x.
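Why the log cures a funnel whose standard deviation grows with the mean can be seen in a tiny numerical sketch. The three groups below are invented for illustration: each has values at 80%, 100% and 120% of its mean, so the raw spread scales with the mean while the spread of log y does not.

```python
import math
import statistics

# Hypothetical data: spread proportional to the group mean (a funnel).
groups = {
    10: [8.0, 10.0, 12.0],
    100: [80.0, 100.0, 120.0],
    1000: [800.0, 1000.0, 1200.0],
}

# Sample standard deviation on the raw scale and on the log scale.
sd_raw = {m: statistics.stdev(v) for m, v in groups.items()}
sd_log = {m: statistics.stdev([math.log(y) for y in v]) for m, v in groups.items()}

for m in groups:
    print(f"mean {m:>4}: sd(y) = {sd_raw[m]:8.2f}   sd(log y) = {sd_log[m]:.4f}")
```

On the raw scale the standard deviation rises a hundredfold across the groups; on the log scale it is the same in every group, which is the constant-width band the residual plot should show.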

Should I delete a point that all the influence measures flag?

Not automatically. Removal is justified only for a genuine data error or a clearly different population. Otherwise keep the point, refit with and without it, and report both fits so the reader can see whether the conclusions are robust. The cut-offs (DFFITS, DFBETAS, COVRATIO) are large-sample approximations, so in small samples reason from the relative magnitudes and the gap to the next point, not blind threshold-crossing.
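The "refit with and without it, and report both fits" advice can be sketched in a few lines. The data and the helper fit_line are hypothetical: five points lie exactly on y = 2x, and a sixth point at (10, 5) has both high leverage and a large residual.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

# Hypothetical data; the last case is the suspect point.
x = [1, 2, 3, 4, 5, 10]
y = [2, 4, 6, 8, 10, 5]
suspect = 5  # 0-based index of the flagged case

b0_all, b1_all = fit_line(x, y)
x_wo = [xi for i, xi in enumerate(x) if i != suspect]
y_wo = [yi for i, yi in enumerate(y) if i != suspect]
b0_wo, b1_wo = fit_line(x_wo, y_wo)

# Report both fits side by side so the reader can judge robustness.
print(f"with point:    intercept = {b0_all:.3f}, slope = {b1_all:.3f}")
print(f"without point: intercept = {b0_wo:.3f}, slope = {b1_wo:.3f}")
```

Here the slope moves from about 0.28 to 2 when the point is dropped, so the conclusions clearly depend on that single case; whether to drop it still rests on the data-error or different-population argument, not on the size of the change.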

Study strategy

Exam move

Make 'plot the residuals before you trust a single p-value' a reflex: every fit yields a line, but the residual plots tell you whether it should have. Learn the two readings of the residuals-vs-fitted plot at once — a bend means non-linearity (transform x), a funnel means non-constant variance (transform y or use WLS). For the three-ideas question, write a one-row table linking each idea to its statistic: outlier → studentised t_i, leverage → h_ii > 2p/n, influence → Cook's D / DFFITS / DFBETAS, with the slogan influence ≈ outlier × leverage. Keep the leverage and outlier cut-offs and the transformation menu (log for a funnel, √y for counts, transform x for curvature, Box–Cox for the optimal power) on your A4 sheet, and never extrapolate a back-transformed model outside the observed x-range.
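For the "Box–Cox for the optimal power" item, the idea behind the profile-likelihood curve can be sketched with a grid search. This is a simplified stand-in for R's boxcox(), not its implementation: the helper names, the grid and the data are all assumptions, and the data are built so that log y is nearly linear in x.

```python
import math

def boxcox(y, lam):
    """Box-Cox transform: (y^lam - 1)/lam, or log y when lam = 0."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v**lam - 1) / lam for v in y]

def sse_line(x, z):
    """Residual sum of squares from a least-squares line of z on x."""
    n = len(x)
    x_bar, z_bar = sum(x) / n, sum(z) / n
    s_xx = sum((a - x_bar) ** 2 for a in x)
    b1 = sum((a - x_bar) * (b - z_bar) for a, b in zip(x, z)) / s_xx
    b0 = z_bar - b1 * x_bar
    return sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, z))

def profile_loglik(x, y, lam):
    """Profile log-likelihood for lam, up to an additive constant."""
    n = len(y)
    sse = sse_line(x, boxcox(y, lam))
    # The (lam - 1) * sum(log y) term is the Jacobian of the transformation.
    return -0.5 * n * math.log(sse / n) + (lam - 1) * sum(math.log(v) for v in y)

# Hypothetical data: exponential growth with small alternating multiplicative noise.
x = list(range(1, 11))
y = [math.exp(0.5 * xi + (0.05 if xi % 2 else -0.05)) for xi in x]

grid = [k / 10 for k in range(-20, 21)]  # lam from -2 to 2 in steps of 0.1
lam_hat = max(grid, key=lambda lam: profile_loglik(x, y, lam))
print("lambda at the peak of the profile likelihood:", lam_hat)
```

For these data the peak sits at λ = 0, the log, which is the tidy value the transformation menu would suggest for exponential growth.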
