ECMT1010 · Introduction To Economic Statistics
Simple Linear Regression
Week 10 models a straight-line relationship between two quantitative variables. It covers the least-squares line ŷ = b₀ + b₁x with slope b₁ = r·(sy/sx), the meaning of the slope and intercept, residuals, inference for the slope (t with df = n − 2), and goodness of fit via R² = r² and the ANOVA decomposition. It is examined as short-answer questions: interpret the slope, predict (and avoid extrapolation), test H₀: β₁ = 0, and report R² in words.
What this chapter covers
- The least-squares line ŷ = b₀ + b₁x, fitted to minimise Σ(yᵢ − ŷᵢ)²
- Slope b₁ = r·(sy/sx) and intercept b₀ = ȳ − b₁·x̄
- Interpreting the slope (change in ŷ per 1-unit x) and intercept (ŷ at x = 0)
- Residuals eᵢ = yᵢ − ŷᵢ and the extrapolation warning (only predict within the x range)
- Inference for the slope: t = b₁/SE(b₁), df = n − 2, and a CI for β₁
- The correlation test t = r√(n − 2)/√(1 − r²) on df = n − 2
- R² = r² (simple regression): the share of variation in y explained by x
- The ANOVA decomposition SST = SSModel + SSE and the standard error of the regression sₑ
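The slope and intercept formulas listed above can be sketched in a few lines of Python. This is a minimal illustration only: the five data points are made up for the sketch (they are not from the unit), and the variable names are arbitrary.

```python
from math import sqrt

# Hypothetical illustrative data (NOT from the unit):
# x = weekly ad spend, y = weekly revenue, both in $000.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [11.0, 12.5, 13.0, 15.5, 16.5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products around the means.
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Sample standard deviations (n - 1 divisor) and the correlation r.
s_x = sqrt(sxx / (n - 1))
s_y = sqrt(syy / (n - 1))
r = sxy / sqrt(sxx * syy)

# Least-squares slope and intercept, using the formulas from the list above.
b1 = r * (s_y / s_x)
b0 = y_bar - b1 * x_bar

# Residuals e_i = y_i - y_hat_i; for a least-squares line they sum to zero.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print(f"b1 = {b1:.2f}, b0 = {b0:.2f}, r = {r:.3f}")
print("residuals:", [round(e, 2) for e in residuals])
```

The slope computed as r·(sy/sx) agrees with the equivalent form Σ(xᵢ − x̄)(yᵢ − ȳ)/Σ(xᵢ − x̄)², which is a useful cross-check when working by hand.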
Regression slope, prediction, fit and a slope test
- (a) [2 marks] Interpret the slope: each extra $1,000 of advertising spend is associated with $1,400 more weekly revenue on average (slope 1.40, both variables measured in $000).
- (b) [2 marks] Predict at x = 5: ŷ = 9.6 + 1.40(5) = 9.6 + 7.0 = 16.6 → about $16,600 weekly revenue.
- (c) [2 marks] Goodness of fit: R² = r² = 0.72² = 0.518, so about 52% of the variation in revenue is explained by advertising spend.
- (d) [2 marks] Test the slope: t = b₁/SE(b₁) = 1.40/0.38 ≈ 3.68 on df = n − 2 = 20; the two-sided 5% critical value is t(20) ≈ 2.086. Since 3.68 > 2.086, reject H₀: there is strong evidence that revenue is positively related to ad spend.
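The arithmetic in parts (b) to (d) can be checked with a short Python sketch. All numbers are taken from the worked answers; n = 22 is inferred from df = n − 2 = 20, and the confidence interval at the end is an extra step that applies the formula b₁ ± t*·SE(b₁) rather than part of the marked answer.

```python
# Values from the worked answers; n = 22 is inferred from df = n - 2 = 20.
b0, b1 = 9.6, 1.40   # intercept and slope, both variables in $000
se_b1 = 0.38         # standard error of the slope
r = 0.72             # correlation between ad spend and revenue
n = 22
t_crit = 2.086       # two-sided 5% critical value for t(20), as quoted in part (d)

# (b) Prediction at x = 5 (i.e. $5,000 of ad spend).
x_new = 5
y_hat = b0 + b1 * x_new

# (c) Goodness of fit in simple regression: R^2 = r^2.
r_squared = r ** 2

# (d) Slope test of H0: beta1 = 0.
df = n - 2
t_stat = b1 / se_b1
reject_h0 = abs(t_stat) > t_crit

# Extra: 95% confidence interval for beta1, b1 +/- t* x SE(b1).
ci_low = b1 - t_crit * se_b1
ci_high = b1 + t_crit * se_b1

print(f"(b) y_hat = {y_hat:.1f} ($000)")
print(f"(c) R^2 = {r_squared:.3f}")
print(f"(d) t = {t_stat:.2f} on df = {df}, reject H0: {reject_h0}")
print(f"95% CI for the slope: ({ci_low:.2f}, {ci_high:.2f})")
```

Because the interval for β₁ excludes 0, it leads to the same conclusion as the two-sided test at the 5% level.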
Key terms
- Least-squares line: The fitted line ŷ = b₀ + b₁x chosen to minimise the sum of squared residuals Σ(yᵢ − ŷᵢ)². It is the best straight-line summary of how y depends on x.
- Slope (b₁): The predicted change in y per one-unit increase in x, computed as b₁ = r·(sy/sx). Its sign matches the sign of the correlation, and inference about it tests whether x and y are linearly related.
- Residual: The vertical gap between an observed and a predicted value, eᵢ = yᵢ − ŷᵢ. Residual plots are used to check the model assumptions of linearity and constant spread.
- R² (coefficient of determination): The proportion of the variation in y explained by the regression, between 0 and 1. In simple linear regression R² = r², so an r of 0.72 gives R² ≈ 0.52, meaning about 52% of the variation in y is explained by x.
- Slope inference: Testing H₀: β₁ = 0 with t = b₁/SE(b₁) on df = n − 2, or building a CI b₁ ± t*(df, α/2)·SE(b₁). Rejecting H₀ is evidence of a real linear relationship.
- ANOVA decomposition: The split of total variation SST = SSModel + SSE into the part explained by the model and the leftover error. R² = SSModel/SST, and the standard error of the regression is sₑ = √(SSE/(n − 2)).
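The ANOVA decomposition and its links to R² and sₑ can be verified numerically. The sketch below uses a small made-up data set (an assumption for illustration, not data from the unit) and computes the slope as Σ(xᵢ − x̄)(yᵢ − ȳ)/Σ(xᵢ − x̄)², which is algebraically the same as r·(sy/sx).

```python
from math import sqrt, isclose

# Hypothetical illustrative data (NOT from the unit), both variables in $000.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [11.0, 12.5, 13.0, 15.5, 16.5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares fit.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# ANOVA pieces: total, explained by the model, and leftover error.
sst = sum((yi - y_bar) ** 2 for yi in y)
ss_model = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

# Goodness of fit and the standard error of the regression.
r_squared = ss_model / sst
s_e = sqrt(sse / (n - 2))

# Correlation, to confirm R^2 = r^2 in simple regression.
r = sxy / sqrt(sxx * sst)

print(f"SST = {sst:.2f}, SSModel = {ss_model:.2f}, SSE = {sse:.2f}")
print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}, s_e = {s_e:.4f}")
```

The two checks worth noticing are that SSModel and SSE add back up to SST, and that SSModel/SST matches r² to rounding error.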
Simple Linear Regression FAQ
How do I interpret the slope and intercept correctly?
The slope is the predicted change in y for a one-unit increase in x, stated in the real units of both variables and with the right sign — e.g. '+$1,400 of revenue per extra $1,000 of ad spend'. The intercept is the predicted y when x = 0; it is only meaningful if x = 0 is sensible and within the data range, otherwise it is just where the line crosses the axis (often an extrapolation) and should not be over-interpreted.
What does R² actually mean and why does R² = r²?
R² is the proportion of the variation in y that the regression explains, ranging from 0 (no explanatory power) to 1 (a perfect fit). You report it as 'X% of the variation in y is explained by x'. In simple linear regression — one predictor — it equals the square of the correlation, R² = r², because the single predictor's linear association with y is exactly what r measures. With more predictors (a later unit) this identity no longer holds.
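For readers who want the algebra behind R² = r², here is a short derivation sketch. It uses only the formulas from this chapter, and assumes sx and sy are sample standard deviations with an n − 1 divisor, so that Σ(xᵢ − x̄)² = (n − 1)sx² and SST = (n − 1)sy².

```latex
\begin{aligned}
\hat{y}_i - \bar{y} &= b_1 (x_i - \bar{x})
  && \text{since } b_0 = \bar{y} - b_1 \bar{x} \\
\mathrm{SSModel} &= \sum_i (\hat{y}_i - \bar{y})^2
  = b_1^2 \sum_i (x_i - \bar{x})^2
  = b_1^2 (n-1) s_x^2 \\
&= r^2 \frac{s_y^2}{s_x^2} (n-1) s_x^2
  = r^2 (n-1) s_y^2
  = r^2 \, \mathrm{SST}
  && \text{using } b_1 = r \frac{s_y}{s_x} \\
R^2 &= \frac{\mathrm{SSModel}}{\mathrm{SST}} = r^2
\end{aligned}
```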
Why is the slope test on n − 2 degrees of freedom?
Because fitting a line estimates two parameters from the data, the intercept b₀ and the slope b₁, two degrees of freedom are 'used up', leaving n − 2 for the error. That is the df you use in the t table when testing H₀: β₁ = 0 with t = b₁/SE(b₁), and n − 2 is also the divisor in the standard error of the regression sₑ = √(SSE/(n − 2)). State the df explicitly so the marker can check the table line.
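To see where n − 2 enters, the sketch below computes the slope test from raw data and compares it with the correlation test t = r√(n − 2)/√(1 − r²). Two things here are assumptions beyond this chapter: the data are made up for illustration, and SE(b₁) is computed with the standard textbook formula sₑ/√Σ(xᵢ − x̄)², which this chapter does not state.

```python
from math import sqrt, isclose

# Hypothetical illustrative data (NOT from the unit).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [11.0, 12.5, 13.0, 15.5, 16.5]

n = len(x)
df = n - 2  # two parameters (b0 and b1) are estimated from the data

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Standard error of the regression uses the n - 2 divisor.
s_e = sqrt(sse / df)

# Standard error of the slope (standard formula, assumed here).
se_b1 = s_e / sqrt(sxx)

# Slope test statistic and the equivalent correlation test statistic.
t_slope = b1 / se_b1
r = sxy / sqrt(sxx * syy)
t_corr = r * sqrt(n - 2) / sqrt(1 - r ** 2)

print(f"df = {df}, s_e = {s_e:.4f}, SE(b1) = {se_b1:.4f}")
print(f"t from slope = {t_slope:.3f}, t from correlation = {t_corr:.3f}")
```

In simple regression the two statistics coincide, which is why the slope test and the correlation test share the same df and always reach the same conclusion.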
What is extrapolation and why is it a problem?
Extrapolation is using the fitted line to predict y for an x value outside the range observed in the data. It is risky because you have no evidence the linear relationship continues beyond that range — the true relationship could bend, flatten or reverse. The line is only validated where there are data, so predictions inside the x range are reasonable, but a prediction far outside should be flagged as unreliable rather than reported as fact.
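One way to make the extrapolation check routine is to test the new x against the observed range before reporting a prediction. The helper below is a hypothetical sketch: the function name is invented for illustration, the line ŷ = 9.6 + 1.40x comes from the worked example, and the observed range of 1 to 8 ($000) is an assumption, since the range of the data is not given.

```python
def predict_with_range_check(b0, b1, x_new, x_min, x_max):
    """Return (y_hat, is_extrapolation) for the fitted line y_hat = b0 + b1 * x.

    Hypothetical helper: is_extrapolation is True when x_new lies outside
    the observed range [x_min, x_max].
    """
    y_hat = b0 + b1 * x_new
    is_extrapolation = not (x_min <= x_new <= x_max)
    return y_hat, is_extrapolation


# Line from the worked example; the observed x range (1 to 8, in $000)
# is a made-up assumption for this sketch.
inside = predict_with_range_check(9.6, 1.40, x_new=5, x_min=1, x_max=8)
outside = predict_with_range_check(9.6, 1.40, x_new=20, x_min=1, x_max=8)

for y_hat, flagged in (inside, outside):
    note = "extrapolation, treat as unreliable" if flagged else "within the data range"
    print(f"y_hat = {y_hat:.1f} ({note})")
```

The formula still returns a number for x = 20, which is exactly the trap: the calculation works, but nothing in the data supports the answer.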
Exam move
Treat a regression question as a checklist of standard sub-tasks the exam strings together: interpret the slope (units + direction), predict a value (and check it is not extrapolation), report R² in words, and test the slope with t = b₁/SE(b₁) on n − 2 df. Practise writing slope interpretations as full sentences in the variables' real units, because a bare number rarely earns the mark. Memorise that R² = r² only in simple regression and that fit means 'share of variation in y explained'. Keep the ANOVA picture in mind — SST splits into explained (SSModel) and leftover (SSE) — so you can connect R², sₑ and the slope test. Always flag extrapolation and finish a slope test with a strength-of-evidence sentence about the relationship.