QBUS5001 · Foundation In Data Analytics For Business
Regression Diagnostics & Inference
Module 11 asks whether a fitted line can be trusted. The four L.I.N.E. assumptions — Linearity, Independence of errors, Normality of errors, Equal variance — are checked through residual plots. You then do inference: a t-test on the slope (usually testing β₁ = 0), confidence and prediction intervals for Y, and a t-test on the correlation coefficient.
For time-series data, autocorrelation is diagnosed with the Durbin–Watson statistic: a value near 2 suggests no autocorrelation, a value well below 2 suggests positive autocorrelation, and a value well above 2 suggests negative autocorrelation. The module closes with the classic pitfalls: extrapolation, confusing correlation with causation, and influential outliers.
What this chapter covers
- 01 The four L.I.N.E. assumptions and residual-plot checks
- 02 t-test on the slope: T = (b₁ − β₁)/s(b₁) ~ t(n−2)
- 03 Confidence interval for the slope
- 04 t-test on the correlation coefficient
- 05 Confidence interval for mean Y at a given X
- 06 Prediction interval for an individual Y
- 07 Autocorrelation and the Durbin–Watson statistic
- 08 Pitfalls: extrapolation, causation vs correlation, outliers
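The t-test on the correlation coefficient (item 04) is commonly written as t = r·√(n−2)/√(1−r²) with n−2 degrees of freedom, testing H₀: ρ = 0. The sketch below assumes that form; the function name and the values r = 0.6 and n = 27 are hypothetical, chosen only to make the arithmetic clean.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom.

    Assumes the usual textbook form t = r * sqrt(n - 2) / sqrt(1 - r^2).
    """
    if n <= 2:
        raise ValueError("need n > 2 so that n - 2 degrees of freedom is positive")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Hypothetical inputs (not from the module): r = 0.6 from n = 27 observations.
t_stat = correlation_t_statistic(0.6, 27)
print(round(t_stat, 2))
```

The result is compared with the two-tailed critical value t(α/2, n−2) from the t table, exactly as for the slope test.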
t-test on a regression slope
Worked solution for n = 22, b₁ = 1.40 and s(b₁) = 0.45, tested at the 5% significance level (two-tailed):
- (1 mark) Hypotheses: H₀: β₁ = 0 versus H₁: β₁ ≠ 0. Under H₀, X has no linear relationship with Y.
- (1 mark) Degrees of freedom: n − 2 = 22 − 2 = 20.
- (2 marks) Test statistic: T = (b₁ − 0)/s(b₁) = 1.40/0.45 = 3.1111.
- (1 mark) Critical value: reject H₀ if |T| > t(0.025, 20) ≈ 2.086.
- (1 mark) Decision and conclusion: |T| = 3.1111 > 2.086, so reject H₀. The slope differs significantly from zero, so X is a significant linear predictor of Y at the 5% level.
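The steps above can be reproduced in a few lines. This is a minimal sketch using only the numbers in the worked solution (b₁ = 1.40, s(b₁) = 0.45, critical value 2.086 read from the t table); the function name is hypothetical, and the critical value is passed in rather than computed. The same quantities also give the confidence interval for the slope, b₁ ± t·s(b₁).

```python
def slope_t_test(b1, se_b1, t_crit, beta1_null=0.0):
    """Two-tailed t-test on a regression slope plus the matching confidence interval.

    t_crit is the table value t(alpha/2, n - 2); it is supplied by the caller.
    """
    t_stat = (b1 - beta1_null) / se_b1   # T = (b1 - beta1) / s(b1)
    margin = t_crit * se_b1              # half-width of the interval for the slope
    return {
        "t_stat": t_stat,
        "reject_h0": abs(t_stat) > t_crit,
        "ci": (b1 - margin, b1 + margin),
    }


# Values from the worked solution: n = 22, so df = 20 and t(0.025, 20) is about 2.086.
result = slope_t_test(b1=1.40, se_b1=0.45, t_crit=2.086)
print(round(result["t_stat"], 4))
print(result["reject_h0"])
print(tuple(round(bound, 4) for bound in result["ci"]))
```

The 95% interval for the slope comes out at roughly (0.46, 2.34). It excludes zero, which agrees with rejecting H₀ in the two-tailed test.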
Key terms
- L.I.N.E. assumptions
- The four conditions for valid simple regression inference: Linearity, Independence of errors, Normality of errors, and Equal variance (homoscedasticity).
- Homoscedasticity
- The assumption that the error variance is constant across all values of X; its violation (a fan-shaped residual plot) is heteroscedasticity.
- Durbin–Watson statistic
- A diagnostic for autocorrelation in the residuals that takes values in [0, 4]: a value near 2 indicates no autocorrelation, a value well below 2 indicates positive autocorrelation, and a value well above 2 indicates negative autocorrelation.
- Prediction interval
- An interval for an individual future Y at a given X, wider than the confidence interval for the mean Y because it adds the irreducible scatter of single observations.
- Influential outlier
- An observation with an extreme X or large residual that disproportionately changes the fitted line; flagged in residual analysis as a pitfall.
Regression Diagnostics & Inference FAQ
What is the difference between a confidence interval and a prediction interval here?
The confidence interval estimates the mean of Y at a given X; the prediction interval estimates a single new Y at that X. The prediction interval is always wider because it includes the variability of an individual observation (the “1 +” term under the root).
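For reference, the two intervals in their usual textbook form. The notation is an assumption, since the module's own symbols may differ: S_{YX} is the standard error of the estimate, X_0 is the X value of interest, and the sum runs over the n sample points.

```latex
% Confidence interval for the mean of Y at X = X_0
\hat{Y}_0 \pm t_{\alpha/2,\,n-2}\; S_{YX}
  \sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}

% Prediction interval for an individual Y at X = X_0
\hat{Y}_0 \pm t_{\alpha/2,\,n-2}\; S_{YX}
  \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}
```

The only difference is the leading 1 under the root, and both intervals widen as X_0 moves away from the sample mean of X.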
How do I read the Durbin–Watson statistic?
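It ranges from 0 to 4. A value near 2 means no autocorrelation; a value well below 2 indicates positive autocorrelation (common in time-series data); a value well above 2 indicates negative autocorrelation. When testing for positive autocorrelation with the table bounds, a statistic below dL is evidence of autocorrelation, a statistic above dU is not, and a statistic between dL and dU leaves the test inconclusive.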
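The statistic itself is the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal sketch of that calculation follows; the function name and both residual series are hypothetical, made up only to show the two extremes.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2).

    Residuals must be in time order for the result to be meaningful.
    """
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("all residuals are zero, so the statistic is undefined")
    return numerator / denominator


# Hypothetical residuals that drift slowly: neighbours are alike (positive autocorrelation).
drifting = [-2.0, -1.0, 0.0, 1.0, 2.0]
# Hypothetical residuals that flip sign every period (negative autocorrelation).
alternating = [1.0, -1.0, 1.0, -1.0, 1.0]

print(durbin_watson(drifting))     # well below 2
print(durbin_watson(alternating))  # well above 2
```

The drifting series gives 0.4 and the alternating series gives 3.2, on either side of the no-autocorrelation value of 2.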
Does a significant slope prove X causes Y?
No. Regression establishes a statistical association, not causation. A significant slope can arise from confounding, reverse causation or coincidence, and causal claims need experimental design or strong theory. Treating correlation as causation is one of the key pitfalls the module emphasises.
Exam move
Memorise L.I.N.E. as a checklist and pair each letter with the residual plot that diagnoses it (curvature → linearity, runs/patterns → independence, QQ-plot → normality, funnel → equal variance). For inference, note that the slope t-test uses n−2 df, and practise reading Excel regression output directly, since the exam often hands you the coefficient table and asks you to test and interpret rather than compute from raw data.