DATA1001 · Foundations Of Data Science
Regression
Regression fits a straight line to a scatter of two quantitative variables and turns the relationship into a prediction. It starts with the correlation r, a unit-free number in [−1, 1] measuring the strength and direction of a linear relationship. DATA1001's signature distinction is between the SD line (slope ±SDᵧ/SDₓ, through the point of averages) and the regression line, which is flatter: its slope is r·SDᵧ/SDₓ. That flattening is regression to the mean: a point that is +2 SD on x is predicted to be only +r×2 SD on y, not +2. Misreading this as a real effect is the regression fallacy, one of the most examined traps in the subject. The fit's quality is summarised by r², the fraction of the variance in y explained by x, and checked with a residual plot (a good fit shows a structureless band; a funnel or a curve signals trouble).
What this chapter covers
- 01 Correlation r: strength and direction of a linear relationship
- 02 The SD line vs the regression line (slope r·SDᵧ/SDₓ)
- 03 Regression to the mean and the regression fallacy
- 04 r²: the fraction of variance explained
- 05 Residuals, homoscedasticity, and predicting with the line
Worked example: regression and regression to the mean
- +2 (a) Regression slope = r·SDᵧ/SDₓ = 0.6×10/12 = 0.5; the SD-line slope = SDᵧ/SDₓ = 10/12 ≈ 0.83, so the regression line is flatter.
- +2 (b) The line passes through (60, 65): ŷ = 65 + 0.5×(72 − 60) = 65 + 6 = 71.
- +1 (c) +2 SD on x → predicted r×2 = 0.6×2 = +1.2 SD on y, not +2.
- +1 (c) This pull-back toward the mean is regression to the mean; treating it as a real effect is the regression fallacy.
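The arithmetic in parts (a) to (c) can be checked with a short Python sketch. The numbers (r = 0.6, SD of x = 12, SD of y = 10, point of averages (60, 65), x = 72) are the ones appearing in the worked example; the function and variable names are illustrative choices, not course notation.

```python
# Figures taken from the worked example; all names are illustrative.
r = 0.6                       # correlation
sd_x, sd_y = 12.0, 10.0       # SD of x and SD of y
mean_x, mean_y = 60.0, 65.0   # point of averages


def regression_slope(r, sd_x, sd_y):
    """Slope of the regression line: r * SD_y / SD_x."""
    return r * sd_y / sd_x


def sd_line_slope(r, sd_x, sd_y):
    """Slope of the SD line: SD_y / SD_x, carrying the sign of r."""
    sign = 1.0 if r >= 0 else -1.0
    return sign * sd_y / sd_x


def predict(x, r, sd_x, sd_y, mean_x, mean_y):
    """Prediction from the regression line through the point of averages."""
    return mean_y + regression_slope(r, sd_x, sd_y) * (x - mean_x)


reg_slope = regression_slope(r, sd_x, sd_y)            # part (a)
sd_slope = sd_line_slope(r, sd_x, sd_y)                # part (a)
y_hat = predict(72.0, r, sd_x, sd_y, mean_x, mean_y)   # part (b)
z_y_pred = r * 2.0                                     # part (c): +2 SD on x

print(f"regression slope = {reg_slope:.2f}, SD-line slope = {sd_slope:.2f}")
print(f"predicted y at x = 72: {y_hat:.1f}")
print(f"+2 SD on x -> {z_y_pred:+.1f} SD on y")
```

Working in SD units makes part (c) a one-line multiplication, which is why the conversion is worth practising separately from the slope formula.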
Key terms
- Correlation (r)
- A unit-free number in [−1, 1] measuring the strength and direction of a linear relationship between two quantitative variables. r near ±1 means a tight line; r near 0 means no linear relationship (though a curved one may still exist). r measures linear association only, and association is not causation.
- SD line
- The line through the point of averages with slope ±SDᵧ/SDₓ (taking the sign of r), running along the main diagonal of the cloud. Whenever |r| < 1 it is steeper than the regression line; the gap between the two is exactly regression to the mean.
- Regression line
- The least-squares line of best fit, through the point of averages with slope r·SDᵧ/SDₓ. Because |r| ≤ 1 it is never steeper than the SD line, and whenever |r| < 1 it is strictly flatter, which builds regression to the mean into every prediction.
- Regression to the mean
- The tendency for a value extreme on x to be predicted less extreme on y: +k SD on x maps to only +r·k SD on y. It is a mathematical consequence of |r| < 1, not a causal effect; mistaking it for one is the regression fallacy.
- r-squared (r²)
- The fraction of the variance in y explained by the regression on x; r = 0.6 gives r² = 0.36, i.e. 36% of the variation explained. It summarises fit quality but says nothing about whether the linear model is appropriate — check the residual plot for that.
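To make these definitions concrete, here is a minimal Python sketch that computes r as the average product of standard units and then confirms that r² matches the share of variation left unexplained by the residuals. The five data points are invented purely for illustration, and the helper names (`mean`, `pop_sd`, `correlation`) are assumptions of this sketch. It uses the population SD (dividing by n); the value of r is the same whichever SD convention is used, as long as it is applied consistently.

```python
from math import sqrt

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def mean(values):
    return sum(values) / len(values)


def pop_sd(values):
    """Population SD (divide by n)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


def correlation(xs, ys):
    """r = average of the products of the standard units of x and y."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pop_sd(xs), pop_sd(ys)
    products = [((a - mx) / sx) * ((b - my) / sy) for a, b in zip(xs, ys)]
    return sum(products) / len(products)


r = correlation(x, y)
slope = r * pop_sd(y) / pop_sd(x)         # regression slope
intercept = mean(y) - slope * mean(x)     # line passes through the point of averages

# r squared computed a second way: 1 - (residual variation / total variation).
fitted = [intercept + slope * a for a in x]
sse = sum((b - f) ** 2 for b, f in zip(y, fitted))
sst = sum((b - mean(y)) ** 2 for b in y)
r_squared_from_residuals = 1 - sse / sst

print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
print(f"1 - SSE/SST = {r_squared_from_residuals:.3f}")
```

The two routes to r² agree for a least-squares line, which is what "fraction of variance explained" means in practice.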
Regression FAQ
What's the difference between the SD line and the regression line?
Both pass through the point of averages, but the SD line has slope ±SDᵧ/SDₓ (taking the sign of r) while the regression line has slope r·SDᵧ/SDₓ, which is flatter by the factor |r|. The regression line is the one you use to predict y from x; the gap between the two lines is exactly regression to the mean. DATA1001 treats this distinction as a signature idea, so be ready to compute both slopes.
What is regression to the mean, and why is it a trap?
A point that is k SD above average on x is predicted only r×k SD above average on y, because |r| < 1 flattens the line. So extreme cases tend to be followed by less extreme ones, with no cause at all. The regression fallacy is reading this built-in pull-back as a real effect — for example, crediting a coaching course when high (or low) scorers naturally drift back toward the mean on retest.
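A small simulation can show the pull-back appearing with no intervention at all. The sketch below is an illustration under stated assumptions, not course material: each simulated student has a fixed "ability" plus independent luck on two sittings of a test, and nothing happens between the sittings. The model, the parameter values, and all names are hypothetical.

```python
import random

random.seed(1001)  # fixed seed so the run is repeatable

N = 5000
# Hypothetical model: score = stable ability + independent luck on each sitting.
ability = [random.gauss(60, 8) for _ in range(N)]
test1 = [a + random.gauss(0, 6) for a in ability]
test2 = [a + random.gauss(0, 6) for a in ability]


def mean(values):
    return sum(values) / len(values)


# Rank students by their first score and take the bottom and top 10%.
ranked = sorted(range(N), key=lambda i: test1[i])
k = N // 10
bottom, top = ranked[:k], ranked[-k:]

top_first = mean([test1[i] for i in top])
top_second = mean([test2[i] for i in top])
bottom_first = mean([test1[i] for i in bottom])
bottom_second = mean([test2[i] for i in bottom])
overall_second = mean(test2)

print(f"top 10%:    first {top_first:.1f} -> second {top_second:.1f}")
print(f"bottom 10%: first {bottom_first:.1f} -> second {bottom_second:.1f}")
print(f"overall mean on second sitting: {overall_second:.1f}")
```

The top group falls back and the bottom group rises, yet both stay on their own side of the average. Attributing either shift to a course, a reprimand, or a reward would be the regression fallacy, because the simulation contains no such cause.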
What does r-squared tell me, and what doesn't it?
r² is the fraction of the variance in y explained by x; r = 0.6 means 36% explained, leaving 64% to other factors and chance. It is a measure of how much the line accounts for, but a high r² does not prove the linear model is right, does not prove causation, and can be inflated by outliers. Always read it alongside the residual plot.
How do I check whether the line is a good fit?
Plot the residuals (observed minus predicted) against the fitted values. A good fit shows a structureless horizontal band of roughly constant width (homoscedastic). A funnel that widens signals non-constant variance; a curve signals that a straight line is the wrong model. The residual plot, not r² alone, tells you whether the linear model is appropriate.
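As a rough illustration of why the residuals matter more than r² alone, the sketch below fits a regression line to deliberately curved data (y = x², invented for this example) and prints the residuals. The helper names are assumptions of the sketch, and it prints the residuals rather than plotting them so that it needs no plotting library.

```python
from math import sqrt

# Hypothetical curved data: y = x squared, invented for illustration.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [a ** 2 for a in x]


def mean(values):
    return sum(values) / len(values)


def pop_sd(values):
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


def fit_regression_line(xs, ys):
    """Return (slope, intercept, r) for the regression of y on x."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pop_sd(xs), pop_sd(ys)
    r = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) * sx * sy)
    slope = r * sy / sx
    return slope, my - slope * mx, r


slope, intercept, r = fit_regression_line(x, y)
fitted = [intercept + slope * a for a in x]
residuals = [obs - fit for obs, fit in zip(y, fitted)]  # observed minus predicted
r_squared = r * r

print(f"r^2 = {r_squared:.3f}")
for fit, res in zip(fitted, residuals):
    print(f"fitted {fit:6.1f}   residual {res:+5.1f}")
```

Here r² is above 0.9, yet the residuals are positive at both ends and negative in the middle. That U-shape is the "curve" pattern described above: the straight line is the wrong model even though r² looks impressive.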
Exam move
Lock down the two slopes first: the SD line is SDᵧ/SDₓ, the regression line is r·SDᵧ/SDₓ (flatter), and both pass through the point of averages, so any prediction is ŷ = ȳ + slope×(x − x̄). Practise the regression-to-the-mean conversion in SD units (+k SD on x → +r·k SD on y) and rehearse the regression-fallacy answer, because the exam loves to dress it up as a real effect. Read r² as the share of variance explained, and always pair it with a residual-plot check: band good, funnel or curve bad. Remember throughout that r measures linear association only, and association is not causation.