ECON2515 · Intermediate Applied Econometrics II
Simple Linear Regression and the OLS Estimator
Week 2 is the hinge of ECON 2515: it moves from describing data to modelling a relationship. The population model is y = β₁ + β₂x + u, with population line E[y|x] = β₁ + β₂x, and it is estimated from one sample by ordinary least squares (OLS): the slope and intercept that minimise the sum of squared residuals Σû². The examined skills are separating the population line (β, u) from the estimated one (β̂, û), computing β̂₂ and β̂₁ by hand, listing the SLR.1–SLR.5 assumptions, and explaining unbiasedness and the standard error of the slope. The exam rewards interpretation and judgement over formula recall, since a formula sheet and statistical tables are provided.
What this chapter covers
- 1. Population regression function: y = β₁ + β₂x + u with E[y|x] = β₁ + β₂x; β are fixed unknown parameters, u holds all other factors
- 2. Linear in the parameters: β₁ + β₂ln(x) is fine; β₂²x is not (you may transform x, not the β's)
- 3. Sample regression function: ŷ = β̂₁ + β̂₂x; β̂ are statistics that vary sample to sample, û = y − ŷ
- 4. Error u vs residual û: u is the gap to the true line (unseen); û is the gap to the fitted line (observed)
- 5. Least-squares criterion: choose β̂₁, β̂₂ to minimise Σû² = Σ(y − ŷ)²; the line runs through (x̄, ȳ)
- 6. OLS slope & intercept: β̂₂ = Σ(x−x̄)(y−ȳ) / Σ(x−x̄)², β̂₁ = ȳ − β̂₂x̄; interpret in the variables' units
- 7. Assumptions SLR.1–SLR.5: linearity, random sampling, variation in x, zero conditional mean E[u|x]=0, homoskedasticity
- 8. Unbiasedness (SLR.1–4) & variance (needs SLR.5): Var(β̂₂) = σ² / Σ(x−x̄)², se = √(σ̂²/Σ(x−x̄)²), σ̂² = Σû²/(n−k), with k = 2 in simple regression
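Point 2 is easy to see in code: a model such as β₁ + β₂ln(x) is handled by transforming the regressor and then running the ordinary formulas. The sketch below is illustrative only. The data are hypothetical and generated exactly from y = 3 + 2·ln(x) with no error term, and the helper name `ols_line` is an assumption, not course notation.

```python
import math

def ols_line(x, y):
    """Return (intercept, slope) from the simple OLS formulas."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data built from y = 3 + 2*ln(x), so OLS should recover 3 and 2.
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [3 + 2 * math.log(xi) for xi in x]

log_x = [math.log(xi) for xi in x]   # transform x, never the betas
b1, b2 = ols_line(log_x, y)
print(round(b1, 6), round(b2, 6))
```

The model stays linear in β₁ and β₂ because only the regressor was changed; a term like β₂²x could not be handled this way.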
Hand-compute the OLS line and read the residual
Data, as read from the working below: five days with foot traffic x = 2, 3, 4, 6, 5 (hundreds of passers-by) and loaves sold y = 18, 21, 30, 39, 27.
- [2 marks] Means: x̄ = (2+3+4+6+5)/5 = 4; ȳ = (18+21+30+39+27)/5 = 27.
- [2 marks] Deviation table (x−x̄, y−ȳ): (−2,−9), (−1,−6), (0,3), (2,12), (1,0). Products (x−x̄)(y−ȳ): 18, 6, 0, 24, 0 → Σ = 48. Squares (x−x̄)²: 4, 1, 0, 4, 1 → Σ = 10.
- [2 marks] (a) β̂₂ = 48 ÷ 10 = 4.8; β̂₁ = ȳ − β̂₂x̄ = 27 − 4.8·4 = 7.8. So ŷ = 7.8 + 4.8x. Check the centroid: 7.8 + 4.8·4 = 27 = ȳ ✓.
- [1 mark] (b) β̂₂ = 4.8: each extra 100 passers-by is associated with about 4.8 more loaves sold, on average. β̂₁ = 7.8: predicted sales at zero foot-traffic, which is an anchor only (x = 0 is outside the data) and not economically meaningful.
- [1 mark] (c) At x = 4: ŷ = 7.8 + 4.8·4 = 27.0; residual û = y − ŷ = 30 − 27.0 = +3.0 (that day sold 3 loaves above the line).
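The hand computation above can be checked in a few lines of Python. This is a minimal sketch using only the standard library; the function name `ols_line` and the variable names are illustrative, and x is taken to be in hundreds of passers-by, as the interpretation in (b) implies.

```python
def ols_line(x, y):
    """Return (intercept, slope) of the simple OLS line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula (the deviation-table sums).
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

x = [2, 3, 4, 6, 5]          # foot traffic, hundreds of passers-by
y = [18, 21, 30, 39, 27]     # loaves sold

b1, b2 = ols_line(x, y)
fitted_at_4 = b1 + b2 * 4
residual_at_4 = 30 - fitted_at_4     # observed y at x = 4 is 30
print(round(b1, 2), round(b2, 2), round(residual_at_4, 2))   # 7.8 4.8 3.0
```

The two sums inside `ols_line` are the Σ = 48 and Σ = 10 columns of the deviation table, so the code is the same calculation rather than a different method.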
Key terms
- Population regression function (PRF)
- The true relationship in the whole population, y = β₁ + β₂x + u, with conditional mean E[y|x] = β₁ + β₂x. β₁ and β₂ are fixed, unknown parameters; the error u collects every other factor affecting y and is never observed.
- Sample regression function (SRF)
- The line OLS fits to one sample, ŷ = β̂₁ + β̂₂x. The estimators β̂₁, β̂₂ are sample statistics — a different sample gives a different line — and û = y − ŷ is the residual.
- Error u vs residual û
- u is the population error: a point's vertical gap to the true (PRF) line, unobservable. û is the sample residual: the gap to the fitted (SRF) line, observable. They are not interchangeable, and OLS never sees u.
- OLS / least-squares criterion
- Ordinary least squares chooses β̂₁, β̂₂ to minimise the sum of squared residuals Σû² = Σ(y − ŷ)². Squaring treats over- and under-predictions symmetrically and yields closed-form estimators; the resulting line always passes through the centroid (x̄, ȳ).
- OLS slope and intercept
- β̂₂ = Σ(xᵢ−x̄)(yᵢ−ȳ) / Σ(xᵢ−x̄)² (the sample covariance of x and y over the sample variance of x) and β̂₁ = ȳ − β̂₂x̄. The slope is the estimated change in mean y per one-unit rise in x; it has a ceteris-paribus reading only if SLR.4 holds. The intercept is often just a mathematical anchor.
- SLR.1–SLR.5 assumptions
- SLR.1 linear in parameters; SLR.2 random sampling; SLR.3 variation in x (needed to estimate a slope); SLR.4 zero conditional mean E[u|x]=0 (gives unbiasedness and a causal reading); SLR.5 homoskedasticity Var(u|x)=σ² (validates the usual standard errors).
- Unbiasedness
- Under SLR.1–SLR.4, E[β̂] = β: averaged over many samples the estimates centre on the truth. It is a property of the estimator across repeated samples, NOT a claim that any single estimate equals the true parameter.
- Variance and standard error of β̂₂
- Under SLR.1–SLR.5, Var(β̂₂) = σ² / Σ(xᵢ−x̄)². It rises with the error variance σ² and falls as the spread in x or the sample size grows, since both enlarge Σ(xᵢ−x̄)². Since σ² is unknown it is estimated by σ̂² = Σû²/(n−k), where k is the number of estimated parameters (k = 2 in simple regression), and se(β̂₂) = √(σ̂²/Σ(xᵢ−x̄)²) is the denominator of every slope t-test.
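As a worked illustration of the last term, the standard error of the slope can be computed for the five bakery observations used in the hand-computation example. This is a sketch with illustrative variable names, not a required exam layout.

```python
import math

x = [2, 3, 4, 6, 5]
y = [18, 21, 30, 39, 27]
n, k = len(x), 2     # k = 2 estimated parameters: intercept and slope

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b2 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b1 = y_bar - b2 * x_bar

residuals = [yi - (b1 + b2 * xi) for xi, yi in zip(x, y)]
ssr = sum(r ** 2 for r in residuals)    # sum of squared residuals
sigma2_hat = ssr / (n - k)              # estimate of the error variance
se_b2 = math.sqrt(sigma2_hat / sxx)     # standard error of the slope
print(round(ssr, 2), round(sigma2_hat, 2), round(se_b2, 3))   # 39.6 13.2 1.149
```

By hand the same numbers follow from the residuals 0.6, −1.2, 3.0, 2.4 and −4.8: Σû² = 39.6, σ̂² = 39.6/3 = 13.2, and se(β̂₂) = √(13.2/10) ≈ 1.149.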
Simple Linear Regression and the OLS Estimator FAQ
What is the difference between the PRF and the SRF?
The population regression function (PRF) is the true, unseen relationship y = β₁ + β₂x + u, with line E[y|x] = β₁ + β₂x, whose coefficients β are fixed unknown constants. The sample regression function (SRF) ŷ = β̂₁ + β̂₂x is the line OLS estimates from one sample, whose coefficients β̂ are statistics that change from sample to sample. In one phrase: β is the truth, β̂ is the guess from data. This distinction shows up on almost every ECON 2515 paper, so state it in symbols and in words.
Is the residual û the same as the error u?
No, and mixing them up is a classic mark-loser. The error u is the vertical gap from a point to the TRUE population line, and it is unobservable. The residual û = y − ŷ is the gap to the FITTED sample line, and it is observable and sample-specific. OLS minimises Σû²; it never observes u. Also, 'OLS estimates u' is wrong — OLS estimates the parameters β (producing β̂).
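A simulation makes the distinction concrete, because only in simulated data can u be seen at all. The sketch below assumes hypothetical true parameters (β₁ = 2, β₂ = 0.5) and standard normal errors; none of these values come from the course.

```python
import random

random.seed(0)
beta1, beta2 = 2.0, 0.5     # assumed "true" parameters (hypothetical)
n = 50

x = [random.uniform(0, 10) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]    # errors: visible only because we simulate
y = [beta1 + beta2 * xi + ui for xi, ui in zip(x, u)]

# Fit the sample line with the usual OLS formulas.
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b2 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b1 = y_bar - b2 * x_bar
u_hat = [yi - (b1 + b2 * xi) for xi, yi in zip(x, y)]   # residuals: gaps to the fitted line

print("sum of errors u:      ", round(sum(u), 4))
print("sum of residuals u-hat:", round(sum(u_hat), 4))
```

The residuals sum to zero by construction of OLS, while the errors in any one sample generally do not, which is one quick way to see that û is a different object from u.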
How do I compute β̂₂ and β̂₁ by hand?
Build one deviation table. Find x̄ and ȳ, then the columns (x−x̄), (y−ȳ), their product, and (x−x̄)². The slope is β̂₂ = Σ(product) ÷ Σ(x−x̄)², and the intercept is β̂₁ = ȳ − β̂₂x̄. Finish with the free check: β̂₁ + β̂₂x̄ should equal ȳ exactly, because the OLS line always passes through the centroid (x̄, ȳ).
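The centroid check is not a coincidence. A short derivation, written in the same symbols as above, shows where both formulas come from: set the two partial derivatives of the sum of squared residuals to zero.

```latex
\begin{aligned}
\min_{\hat\beta_1,\hat\beta_2}\; & \sum_{i=1}^{n}\bigl(y_i-\hat\beta_1-\hat\beta_2 x_i\bigr)^2 \\
\frac{\partial}{\partial\hat\beta_1}:\quad & -2\sum_{i=1}^{n}\bigl(y_i-\hat\beta_1-\hat\beta_2 x_i\bigr)=0
  \;\Longrightarrow\; \hat\beta_1=\bar y-\hat\beta_2\bar x \\
\frac{\partial}{\partial\hat\beta_2}:\quad & -2\sum_{i=1}^{n}x_i\bigl(y_i-\hat\beta_1-\hat\beta_2 x_i\bigr)=0
  \;\Longrightarrow\; \hat\beta_2=\frac{\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^{n}(x_i-\bar x)^2}
\end{aligned}
```

The first condition says the residuals sum to zero, which is the same statement as ȳ = β̂₁ + β̂₂x̄. Substituting that intercept into the second condition gives the slope formula.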
What does each SLR assumption actually buy me?
SLR.1–SLR.3 (linearity, random sampling, variation in x) let you fit a line at all — without variation in x the slope is undefined. SLR.4, zero conditional mean E[u|x]=0, is the one that makes β̂ unbiased and lets you read the slope causally; its failure (endogeneity) is the headline bias. SLR.5, homoskedasticity, makes the usual variance formula and standard errors valid. Unbiasedness needs SLR.1–4; the variance formula also needs SLR.5.
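The role of SLR.3 can be shown directly: with no variation in x the denominator Σ(x−x̄)² is zero and the slope formula breaks down. The sketch below uses a hypothetical helper `ols_slope` that raises an error in that case; the constant-x data are made up for illustration.

```python
def ols_slope(x, y):
    """OLS slope; raises ValueError when x has no variation (SLR.3 fails)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("no variation in x: the OLS slope is undefined")
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx

# Every day has the same foot traffic, so there is nothing to identify a slope from.
try:
    ols_slope([4, 4, 4, 4], [18, 21, 30, 39])
    slope_defined = True
except ValueError:
    slope_defined = False

print("slope defined with constant x:", slope_defined)
```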
Does 'unbiased' mean my single estimate equals the true value?
No. Unbiasedness (E[β̂] = β) is a property of the estimator over MANY hypothetical samples: if you re-sampled endlessly and averaged the slope estimates they would centre on the truth. It says nothing about your one sample's number being 'right'. In the exam, phrase it as 'on average, across repeated samples' — that wording earns the mark.
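The repeated-sampling idea can be mimicked with a small Monte Carlo experiment. The sketch below assumes hypothetical true parameters (β₁ = 1, β₂ = 2), normal errors with E[u|x] = 0 by construction, and 2,000 simulated samples of 30 observations; all of these settings are illustrative choices.

```python
import random
import statistics

random.seed(1)
beta1, beta2 = 1.0, 2.0     # assumed true parameters (hypothetical values)
n, reps = 30, 2000

def ols_slope(x, y):
    m = len(x)
    x_bar, y_bar = sum(x) / m, sum(y) / m
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx

slopes = []
for _ in range(reps):
    x = [random.uniform(0, 10) for _ in range(n)]
    # The error is drawn independently of x, so SLR.4 holds in every sample.
    y = [beta1 + beta2 * xi + random.gauss(0, 3) for xi in x]
    slopes.append(ols_slope(x, y))

mean_slope = statistics.mean(slopes)
print("mean of the slope estimates:", round(mean_slope, 3))
print("smallest and largest single estimate:", round(min(slopes), 3), round(max(slopes), 3))
```

The average of the estimates sits close to the assumed true slope of 2, while individual estimates land noticeably above or below it, which is exactly the 'on average, across repeated samples' wording.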
When does a regression slope give a causal effect?
Only when SLR.4 (E[u|x]=0) holds. If an omitted variable — like ability in a wage-on-education regression — is correlated with x and also affects y, the estimate is biased and the slope is just an association. To sign the omitted-variable bias, sign two channels: (omitted → x) and (omitted → y); same sign gives upward bias, opposite signs give downward bias. Say 'associated with', not 'causes', until you can defend exogeneity.
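The sign rule can be checked with simulated data in which ability is deliberately left out. Everything in this sketch is hypothetical: the true education effect (1.0), the ability effect on wages (2.0), and the positive link from ability to education (1.5) are assumptions chosen so that both channels have the same sign.

```python
import random

random.seed(2)
n = 5000
beta2 = 1.0     # assumed true effect of education on wage (hypothetical)
gamma = 2.0     # assumed effect of ability on wage (hypothetical, positive)

ability = [random.gauss(0, 1) for _ in range(n)]
# Education rises with ability: the (omitted -> x) channel is positive.
educ = [12 + 1.5 * a + random.gauss(0, 1) for a in ability]
# Ability also raises wages: the (omitted -> y) channel is positive.
wage = [5 + beta2 * e + gamma * a + random.gauss(0, 1) for e, a in zip(educ, ability)]

def ols_slope(x, y):
    m = len(x)
    x_bar, y_bar = sum(x) / m, sum(y) / m
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx

short_slope = ols_slope(educ, wage)   # regression of wage on education only
print("true effect:", beta2, " estimated slope omitting ability:", round(short_slope, 3))
```

In this design the slope from the regression that omits ability should settle near 1 + 2·(1.5/3.25) ≈ 1.92, well above the assumed true effect of 1: same-sign channels, upward bias.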
Exam move
Drill three moves until they are automatic.
- (1) The vocabulary check: for any statement, decide whether it is about the population (β, u, PRF) or the sample (β̂, û, SRF) and use the matching symbol. This alone protects easy MCQ and short-answer marks.
- (2) The hand computation: from a small dataset, build the deviation table, get β̂₂ = Σ(x−x̄)(y−ȳ)/Σ(x−x̄)² and β̂₁ = ȳ − β̂₂x̄, then verify the centroid. This is the recurring Part-B calculation.
- (3) The assumption map: memorise which SLR condition does which job (SLR.3 estimability, SLR.4 unbiasedness/causality, SLR.5 valid standard errors) so you can answer 'which assumption is violated and what breaks?', including the trap that heteroskedasticity leaves β̂ unbiased and only wrecks the SEs.
Because the exam is application-focused with a formula sheet provided, spend your revision on interpreting and defending results, not on memorising formulae.
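The heteroskedasticity trap in move (3) can be seen in a small simulation. The sketch below is an illustration under assumed settings, not part of the course material: hypothetical true parameters (β₁ = 1, β₂ = 2), and an error whose standard deviation is |x − 5|, so SLR.5 fails while SLR.4 still holds.

```python
import math
import random
import statistics

random.seed(3)
beta1, beta2 = 1.0, 2.0     # assumed true parameters (hypothetical values)
n, reps = 50, 2000

def slope_and_usual_se(x, y):
    """OLS slope and the usual (homoskedasticity-based) standard error."""
    m = len(x)
    x_bar, y_bar = sum(x) / m, sum(y) / m
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b1 = y_bar - b2 * x_bar
    ssr = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))
    return b2, math.sqrt(ssr / (m - 2) / sxx)

slopes, usual_ses = [], []
for _ in range(reps):
    x = [random.uniform(0, 10) for _ in range(n)]
    # Error spread grows with distance from the middle of x, but its mean is zero.
    y = [beta1 + beta2 * xi + random.gauss(0, abs(xi - 5)) for xi in x]
    b2, se = slope_and_usual_se(x, y)
    slopes.append(b2)
    usual_ses.append(se)

mean_slope = statistics.mean(slopes)          # should stay near the true slope
true_spread = statistics.stdev(slopes)        # actual sampling spread of the slope
mean_usual_se = statistics.mean(usual_ses)    # what the usual formula reports on average
print(round(mean_slope, 3), round(true_spread, 3), round(mean_usual_se, 3))
```

The slope estimates still centre on the assumed true value, but the usual standard error understates how much the slope actually varies across samples in this design, so the point estimate is fine and the inference is what breaks.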