University of Sydney · S1 2026 · FACULTY OF SCIENCE

DATA1001 · Foundations Of Data Science

50% final exam · hurdle · 14 Chapters · 5-page Bible
Our own words - no uploaded lecturer files
Built to mirror S1 2026 · updated this semester
Chapter 4 of 7 · DATA1001

Regression

Regression fits a straight line to a scatter of two quantitative variables and turns the relationship into a prediction. It starts with the correlation r, a unit-free number in [−1, 1] measuring the strength and direction of a linear relationship. DATA1001's signature distinction is between the SD line (slope ±SDᵧ/SDₓ, through the point of averages) and the regression line, which is flatter: its slope is r·SDᵧ/SDₓ. That flattening is regression to the mean: a point that is +2 SD on x is predicted only +r×2 SD on y, not +2. Misreading this as a real effect is the regression fallacy, one of the most examined traps in the subject. The fit's quality is summarised by r², the fraction of the variance in y explained by x, and checked with a residual plot (a good fit shows a structureless band; a funnel or a curve signals trouble).
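As a rough illustration of the two slopes described above, the sketch below computes r, the SD-line slope and the regression slope for a tiny made-up dataset. The data and the helper names (correlation, slopes) are assumptions for illustration only, and population SDs are used on both variables.

```python
from statistics import fmean, pstdev

def correlation(xs, ys):
    """r as the average product of standard units (population SDs)."""
    mx, my = fmean(xs), fmean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return fmean(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))

def slopes(xs, ys):
    """Return (r, SD-line slope, regression-line slope)."""
    r = correlation(xs, ys)
    sx, sy = pstdev(xs), pstdev(ys)
    # The SD line takes the sign of r: +SDy/SDx or -SDy/SDx.
    sd_slope = sy / sx if r >= 0 else -sy / sx
    # The regression line shrinks that slope by the factor |r|.
    reg_slope = r * sy / sx
    return r, sd_slope, reg_slope

# Made-up data, purely illustrative.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

r, sd_slope, reg_slope = slopes(xs, ys)
print(f"r = {r:.2f}, SD-line slope = {sd_slope:.2f}, regression slope = {reg_slope:.2f}")
```

For this data r is about 0.8, so the regression slope comes out at about 0.8 times the SD-line slope.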

In this chapter

What this chapter covers

  • 01 Correlation r: strength and direction of a linear relationship
  • 02 The SD line vs the regression line (slope r·SDᵧ/SDₓ)
  • 03 Regression to the mean and the regression fallacy
  • 04 r²: the fraction of variance explained
  • 05 Residuals, homoscedasticity, and predicting with the line
Worked example · free

Worked example: regression and regression to the mean

Q [6 marks]. Midterm (x) and final (y) scores have x̄ = 60, SDₓ = 12, ȳ = 65, SDᵧ = 10 and r = 0.6. (a) Find the regression slope and compare it to the SD-line slope. (b) A student scores 72 on the midterm; predict their final. (c) A student is +2 SD on the midterm: how many SD above average is their predicted final, and what is this phenomenon called?
  • +2 (a) Regression slope = r·SDᵧ/SDₓ = 0.6×10/12 = 0.5; the SD-line slope = SDᵧ/SDₓ = 10/12 ≈ 0.83, so the regression line is flatter.
  • +2 (b) The line passes through (60, 65): ŷ = 65 + 0.5×(72 − 60) = 65 + 6 = 71.
  • +1 (c) +2 SD on x → predicted r×2 = 0.6×2 = +1.2 SD on y, not +2.
  • +1 (c) This pull-back toward the mean is regression to the mean; treating it as a real effect is the regression fallacy.
The regression slope is 0.5 (flatter than the SD-line slope 0.83); the predicted final at midterm 72 is 71; and a student +2 SD on the midterm is predicted only +1.2 SD on the final — regression to the mean, which is not a causal effect.
Sia tip — The flattening is built into the line, not a finding. The classic exam trap is to read a high scorer's lower follow-up (or a coached student's improvement) as a real change when it is just regression to the mean — always ask whether the pull-back alone explains the result.
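The arithmetic in the worked example can be checked with a few lines of Python. This is only a sketch using the summary statistics given in the question; the function names predict and predicted_sd_units are made up for illustration.

```python
# Summary statistics from the worked example.
x_bar, sd_x = 60, 12   # midterm mean and SD
y_bar, sd_y = 65, 10   # final mean and SD
r = 0.6

reg_slope = r * sd_y / sd_x   # regression-line slope
sd_slope = sd_y / sd_x        # SD-line slope (r is positive here)

def predict(x):
    """Predicted final score: the line passes through the point of averages."""
    return y_bar + reg_slope * (x - x_bar)

def predicted_sd_units(k):
    """A student k SDs from average on x is predicted r*k SDs from average on y."""
    return r * k

print(f"(a) regression slope {reg_slope:.2f} vs SD-line slope {sd_slope:.2f}")
print(f"(b) predicted final at midterm 72: {predict(72):.1f}")
print(f"(c) +2 SD on the midterm -> {predicted_sd_units(2):+.1f} SD on the final")
```

Predicting in SD units and then converting back gives the same answer for part (c): 65 + 1.2×10 = 77, which matches predict(84) for a midterm score two SDs above 60.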
Glossary

Key terms

Correlation (r)
A unit-free number in [−1, 1] measuring the strength and direction of a linear relationship between two quantitative variables. r near ±1 means a tight line; r near 0 means no linear relationship (though a curved one may still exist). r measures linear association only, and association is not causation.
SD line
The line through the point of averages with slope ±SDᵧ/SDₓ (the sign matches the sign of r), running along the long axis of the cloud. Whenever |r| < 1 it is steeper than the regression line; the difference between the two is exactly regression to the mean.
Regression line
The least-squares line of best fit, through the point of averages with slope r·SDᵧ/SDₓ. Because |r| ≤ 1 it is never steeper than the SD line, and it is strictly flatter whenever |r| < 1, which builds regression to the mean into every prediction.
Regression to the mean
The tendency for a value extreme on x to be predicted less extreme on y: +k SD on x maps to only +r·k SD on y. It is a mathematical consequence of |r| < 1, not a causal effect; mistaking it for one is the regression fallacy.
r-squared (r²)
The fraction of the variance in y explained by the regression on x; r = 0.6 gives r² = 0.36, i.e. 36% of the variation explained. It summarises fit quality but says nothing about whether the linear model is appropriate — check the residual plot for that.
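Two points from the Correlation entry above can be seen numerically: r is unit-free, and r near 0 rules out only a linear relationship. The sketch below uses made-up data and a hypothetical helper named correlation; it is an illustration, not a prescribed method.

```python
from statistics import fmean, pstdev

def correlation(xs, ys):
    """r as the average product of standard units (population SDs)."""
    mx, my = fmean(xs), fmean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return fmean(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))

# Unit-free: made-up heights in metres, then the same heights in centimetres.
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]
weights_kg = [50, 58, 61, 70, 74]
heights_cm = [100 * h for h in heights_m]
r_m = correlation(heights_m, weights_kg)
r_cm = correlation(heights_cm, weights_kg)

# A perfect curved relationship (y = x squared) on x values symmetric about 0.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]
r_curved = correlation(xs, ys)

print(f"r in metres {r_m:.4f}, r in centimetres {r_cm:.4f}")
print(f"r for y = x^2: {r_curved:.4f}")
```

Changing the units of x leaves r unchanged, while the perfectly determined but curved relationship gives r of 0.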
FAQ

Regression FAQ

What's the difference between the SD line and the regression line?

Both pass through the point of averages, but the SD line has slope ±SDᵧ/SDₓ while the regression line has slope r·SDᵧ/SDₓ, which is flatter by the factor |r|. The regression line is the one you use to predict y from x; the gap between the two lines is exactly regression to the mean. DATA1001 treats this distinction as a signature idea, so be ready to compute both slopes.

What is regression to the mean, and why is it a trap?

A point that is k SD above average on x is predicted only r×k SD above average on y, because |r| < 1 flattens the line. So extreme cases tend to be followed by less extreme ones, with no cause at all. The regression fallacy is reading this built-in pull-back as a real effect — for example, crediting a coaching course when high (or low) scorers naturally drift back toward the mean on retest.
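A small simulation can make the fallacy concrete. The sketch below assumes a simple made-up model (each score is a fixed ability plus independent noise) with no coaching or any other intervention between the two tests; every number in it is an illustrative assumption.

```python
import random
from statistics import fmean

random.seed(0)  # fixed seed so the run is reproducible
n = 10_000

# Assumed model: score = stable ability + independent noise on each sitting.
ability = [random.gauss(60, 10) for _ in range(n)]
test1 = [a + random.gauss(0, 8) for a in ability]
test2 = [a + random.gauss(0, 8) for a in ability]

# Select roughly the top 10% of students on the first test.
cutoff = sorted(test1)[int(0.9 * n)]
top = [(t1, t2) for t1, t2 in zip(test1, test2) if t1 >= cutoff]

top_mean_1 = fmean(t1 for t1, _ in top)
top_mean_2 = fmean(t2 for _, t2 in top)
overall_mean_2 = fmean(test2)

print(f"top group, test 1 mean: {top_mean_1:.1f}")
print(f"top group, test 2 mean: {top_mean_2:.1f}")
print(f"everyone,  test 2 mean: {overall_mean_2:.1f}")
```

The top group still beats the overall average on the retest, but by less than before, even though nothing happened to them in between. Selecting the bottom group instead would show the mirror image: an apparent improvement with no cause.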

What does r-squared tell me, and what doesn't it?

r² is the fraction of the variance in y explained by x; r = 0.6 means 36% explained, leaving 64% to other factors and chance. It is a measure of how much the line accounts for, but a high r² does not prove the linear model is right, does not prove causation, and can be inflated by outliers. Always read it alongside the residual plot.
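To see why r² can be read as the fraction of variance explained, the sketch below fits the regression line to a small made-up dataset and compares r² with 1 − SSE/SST, where SSE is the sum of squared residuals and SST is the total sum of squared deviations of y from its mean. The data and variable names are assumptions for illustration.

```python
from statistics import fmean, pstdev

# Made-up data, purely illustrative.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

mx, my = fmean(xs), fmean(ys)
sx, sy = pstdev(xs), pstdev(ys)
r = fmean(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))

# Regression line through the point of averages with slope r * SDy / SDx.
slope = r * sy / sx
fitted = [my + slope * (x - mx) for x in xs]

sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained variation
sst = sum((y - my) ** 2 for y in ys)                 # total variation in y
explained_fraction = 1 - sse / sst

print(f"r^2 = {r * r:.2f}, 1 - SSE/SST = {explained_fraction:.2f}")
```

The two quantities agree for the least-squares line, which is what the "fraction explained" reading relies on.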

How do I check whether the line is a good fit?

Plot the residuals (observed minus predicted) against the fitted values. A good fit shows a structureless horizontal band of roughly constant width (homoscedastic). A funnel that widens signals non-constant variance; a curve signals that a straight line is the wrong model. The residual plot, not r² alone, tells you whether the linear model is appropriate.
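The sketch below shows the kind of pattern a residual check is meant to catch. It fits the regression line to made-up data that actually follow a curve (y = x squared) and lists the residuals; the data are an assumption chosen so the arithmetic is easy to follow.

```python
from statistics import fmean, pstdev

# Made-up curved data: y = x squared.
xs = [0, 1, 2, 3, 4]
ys = [x * x for x in xs]

mx, my = fmean(xs), fmean(ys)
sx, sy = pstdev(xs), pstdev(ys)
r = fmean(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))

# Regression line through the point of averages.
slope = r * sy / sx
fitted = [my + slope * (x - mx) for x in xs]

# Residual = observed minus predicted.
residuals = [y - f for y, f in zip(ys, fitted)]

print(f"r^2 = {r * r:.2f}")
for f, e in zip(fitted, residuals):
    print(f"fitted {f:6.2f}   residual {e:+.2f}")
```

Here r² is above 0.9, yet the residuals are positive at both ends and negative in the middle. That U-shape is the curve signal described above: a high r² alone did not show that a straight line is the wrong model.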

Study strategy

Exam move

Lock down the two slopes first: the SD line is ±SDᵧ/SDₓ, the regression line is r·SDᵧ/SDₓ (flatter), and both pass through the point of averages, so any prediction is ŷ = ȳ + slope×(x − x̄). Practise the regression-to-the-mean conversion in SD units (+k SD on x → +r·k SD on y) and rehearse the regression-fallacy answer, because the exam loves to dress it up as a real effect. Read r² as the share of variance explained, and always pair it with a residual-plot check: band good, funnel or curve bad. Remember throughout that r measures linear association only, and association is not causation.
