ECON1012 · Data Analytics
Correlation & the Regression Model
Correlation & the Regression Model (Module 9, Week 9) is where ECON 1012 starts studying the relationship between two numerical variables. You start with the scatter diagram and its pattern gallery, then measure association two ways: covariance s_xy, whose sign gives the direction but whose size is hard to judge, and the coefficient of correlation r = s_xy/(s_x·s_y), a unit-free number always between −1 and +1. The module then builds the simple linear regression model Yᵢ = β₀ + β₁Xᵢ + εᵢ — an explanatory X, a response Y and an error term — and introduces the OLS estimator, which chooses β̂₀ and β̂₁ to minimise the sum of squared residuals, giving a fitted line Ŷ = β̂₀ + β̂₁X to interpret and predict with. 'Correlation is not causation' is itself examinable wording; Week 10 carries the same model into inference and model fit.
What this chapter covers
- 01 Scatter diagrams: positive linear, negative linear, nonlinear, or no relationship
- 02 Sample covariance s_xy = Σ(xᵢ − x̄)(yᵢ − ȳ)/(n − 1): sign gives direction, but size is 'all relative'
- 03 Correlation r = s_xy/(s_x·s_y): unit-free, always between −1 and +1
- 04 Correlation ≠ causation: a huge r never proves X drives Y
- 05 Roles: explanatory X (independent) on the x-axis, response Y (dependent) on the y-axis
- 06 Population model Yᵢ = β₀ + β₁Xᵢ + εᵢ: slope β₁ = change in Y associated with a unit change in X
- 07 OLS picks β̂₀ and β̂₁ to minimise the sum of squared residuals Σûᵢ²
- 08 Prediction: plug X into Ŷ = β̂₀ + β̂₁X; residual ûᵢ = Yᵢ − Ŷᵢ is not the error term εᵢ
Covariance, correlation and a first fitted line
- 2 marks (a) Shortcut formula: s_xy = [Σxy − (Σx)(Σy)/n]/(n − 1) = [1380 − (30 × 215)/5]/4 = (1380 − 1290)/4 = 90/4 = 22.5.
- 2 marks (b) Standard deviations first: s_x² = [Σx² − (Σx)²/n]/(n − 1) = (220 − 180)/4 = 10, so s_x = √10 ≈ 3.162; s_y² = (9519 − 9245)/4 = 68.5, so s_y = √68.5 ≈ 8.276.
- 2 marks (b) r = s_xy/(s_x·s_y) = 22.5/(3.162 × 8.276) = 22.5/26.17 ≈ 0.86, a strong positive linear relationship: weeks with more classes tend to sell more passes.
- 1 mark (c) The number of classes run is what the gym changes (the driver), so X = classes; passes sold is the outcome, so Y = passes. Assign roles from the story, never from which column is listed first.
- 2 marks (d) Slope: each additional class per week is, on average, associated with about 2.25 more casual passes sold. State size and direction, and say 'associated with', not causal wording.
- 1 mark (d) Prediction at X = 7: Ŷ = 29.5 + 2.25 × 7 = 29.5 + 15.75 = 45.25, i.e. about 45 passes.
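The arithmetic in the worked solution above can be checked with a short Python sketch. It uses only the summary sums quoted in the solution (n = 5, Σx = 30, Σy = 215, Σxy = 1380, Σx² = 220, Σy² = 9519). The slope and intercept lines are an assumption: they use the standard textbook OLS formulas β̂₁ = s_xy/s_x² and β̂₀ = ȳ − β̂₁x̄, which the solution does not spell out but which reproduce its fitted line.

```python
import math

# Summary sums quoted in the worked solution (n = 5 weeks).
n = 5
sum_x, sum_y = 30, 215
sum_xy, sum_x2, sum_y2 = 1380, 220, 9519

# Shortcut formulas for the sample covariance and the two sample variances.
s_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
var_x = (sum_x2 - sum_x ** 2 / n) / (n - 1)
var_y = (sum_y2 - sum_y ** 2 / n) / (n - 1)

# Coefficient of correlation: covariance over the product of standard deviations.
r = s_xy / (math.sqrt(var_x) * math.sqrt(var_y))

# Assumption: standard OLS formulas, not stated in the worked solution itself.
b1 = s_xy / var_x                   # slope
b0 = sum_y / n - b1 * sum_x / n     # intercept = y-bar minus slope times x-bar

# Prediction at X = 7 classes.
y_hat_at_7 = b0 + b1 * 7

print(f"s_xy = {s_xy:.2f}, r = {r:.2f}")
print(f"fitted line: Y-hat = {b0:.2f} + {b1:.2f} X")
print(f"prediction at X = 7: {y_hat_at_7:.2f}")
```

This is a checking aid only; the final exam is hand-calculation, so the shortcut sums still need practising on a calculator.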
Key terms
- Covariance
- A measure of how two variables move together: for a sample, s_xy = Σ(xᵢ − x̄)(yᵢ − ȳ)/(n − 1). Positive means same-direction movement, negative means opposite — but its magnitude depends on the units, so it cannot grade strength.
- Coefficient of correlation
- The covariance divided by the product of the two variables' standard deviations: sample r = s_xy/(s_x·s_y), population ρ = σ_xy/(σ_x·σ_y). It always lies between −1 and +1; ±1 is a perfect linear relationship and 0 is no linear relationship.
- Explanatory and response variables
- The explanatory (independent) variable X is the driver, plotted on the x-axis; the response (dependent) variable Y is the outcome, plotted on the y-axis. Deciding the roles from the question's story is itself an examinable step.
- Simple linear regression model
- The population relationship Yᵢ = β₀ + β₁Xᵢ + εᵢ: the line β₀ + β₁X holds on average over the population, β₁ is the change in Y associated with a unit change in X, β₀ is the average value of Y when X = 0, and the error term ε captures everything else. ECON 1012 uses one X variable only.
- OLS estimator
- Ordinary least squares: the rule that chooses β̂₀ and β̂₁ so the fitted regression line is as close as possible to the observed data, by minimising the sum of squared residuals Σûᵢ².
- Residual
- The gap between an observed value and the fitted line, ûᵢ = Yᵢ − Ŷᵢ. It estimates — but is not the same thing as — the unobservable error term εᵢ = Yᵢ − β₀ − β₁Xᵢ, which is measured from the population line.
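The OLS and residual definitions above can be illustrated with a minimal Python sketch. The data are made up purely for illustration, the helper names `ols_fit` and `ssr` are hypothetical, and the slope and intercept use the standard formulas β̂₁ = s_xy/s_x² and β̂₀ = ȳ − β̂₁x̄, which this module does not derive. The sketch fits the line, computes the residuals ûᵢ = Yᵢ − Ŷᵢ, and compares the sum of squared residuals with that of two nudged lines.

```python
def ols_fit(x, y):
    """Return (b0, b1) from the standard simple-regression OLS formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    b1 = s_xy / var_x
    b0 = y_bar - b1 * x_bar
    return b0, b1


def ssr(x, y, b0, b1):
    """Sum of squared residuals for the line Y-hat = b0 + b1 * X."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up illustrative data (not from the course).
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

b0, b1 = ols_fit(x, y)

# Residuals are measured from the fitted line, not from the unknown population line.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

ssr_ols = ssr(x, y, b0, b1)
ssr_steeper = ssr(x, y, b0, b1 + 0.1)   # same intercept, nudged slope
ssr_shifted = ssr(x, y, b0 + 0.5, b1)   # nudged intercept, same slope

print(f"OLS line: Y-hat = {b0:.3f} + {b1:.3f} X")
print(f"SSR at OLS line: {ssr_ols:.4f}")
print(f"SSR with nudged slope: {ssr_steeper:.4f}")
print(f"SSR with nudged intercept: {ssr_shifted:.4f}")
print(f"sum of residuals: {sum(residuals):.6f}")
```

Any other line gives a larger sum of squared residuals than the OLS line on the same data, which is what 'as close as possible' means in the definition.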
Correlation & the Regression Model FAQ
What is the difference between covariance and correlation in ECON 1012?
Both measure how two variables move together. Covariance gives the direction — positive, negative, or roughly zero — but its magnitude depends on the units, so judging whether a particular covariance is large is, as the module puts it, 'all relative'. Correlation divides the covariance by both standard deviations, producing a unit-free number between −1 and +1 that grades strength as well as direction. MCQs regularly contrast exactly these two facts.
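A small Python sketch can make the 'all relative' point concrete. The advertising and sales figures are made up for illustration, and the helper names `sample_cov` and `sample_corr` are hypothetical. Measuring the same X in cents instead of dollars multiplies the covariance by 100 but leaves r unchanged.

```python
import math


def sample_cov(x, y):
    """Sample covariance with the n - 1 divisor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)


def sample_corr(x, y):
    """r = s_xy / (s_x * s_y); the variance of x is sample_cov(x, x)."""
    return sample_cov(x, y) / math.sqrt(sample_cov(x, x) * sample_cov(y, y))


# Made-up illustrative data (not from the course).
ad_dollars = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [25.0, 31.0, 38.0, 41.0, 52.0]

# Same spending, expressed in cents instead of dollars.
ad_cents = [100 * a for a in ad_dollars]

cov_dollars = sample_cov(ad_dollars, sales)
cov_cents = sample_cov(ad_cents, sales)
r_dollars = sample_corr(ad_dollars, sales)
r_cents = sample_corr(ad_cents, sales)

print(f"covariance (dollars): {cov_dollars:.1f}")
print(f"covariance (cents):   {cov_cents:.1f}")
print(f"r (dollars): {r_dollars:.4f}")
print(f"r (cents):   {r_cents:.4f}")
```

The relationship between the variables is identical in both versions, so only r is a fair guide to its strength.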
Does a strong correlation prove that X causes Y?
No. 'Correlation is not causation' is taught explicitly, using spurious pairs with r above 0.95 between absurd, unrelated variables. A high r means a strong linear association, which can come from coincidence or a third variable driving both. When interpreting a slope, write 'associated with' rather than causal language — that wording is exactly what markers look for.
Do I have to compute the OLS coefficients by hand in Week 9?
Week 9 introduces the model and what OLS does — choose the line that minimises the sum of squared residuals — and the workshop's Excel activity computes the covariance and correlation coefficient. The estimation mechanics, slope significance test and R² arrive in Week 10. The final exam is hand-calculation with a calculator and provided tables, so practise the shortcut sums for s_xy and r now; check myLearning for exactly what the Module 9 quiz covers.
How do I decide which variable is X and which is Y?
X is the explanatory (independent) variable — the one you change or observe as the driver — and Y is the response (dependent) variable, the outcome you want to explain or predict. Read the question's story: 'the effect of advertising on sales' makes advertising X and sales Y, regardless of which column the table lists first.
Exam move
Three r facts are near-guaranteed MCQ material: r lives in [−1, +1]; points exactly on a straight line mean r = +1 or −1, sign set by the slope; and r = 0.80 squares to r² = 0.64, read as '64% of the variation in Y is explained by X' — watch percent-versus-proportion distractors. In case-study answers, interpret a slope with size, direction and 'associated with'; causal wording is a marked error, as is calling a covariance 'strong' when its size is relative to the units. Keep the residual ûᵢ = Yᵢ − Ŷᵢ distinct from the error term εᵢ. Put the shortcut formulas for s_xy, s_x², s_y² and r on your A4 note sheet, and re-attempt the randomised Module 9 quiz on myLearning until they run without thinking.
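The three r facts can be checked with a minimal Python sketch. The two lines Y = 3 + 2X and Y = 3 − 2X are arbitrary made-up examples, and `sample_corr` is a hypothetical helper name. Points lying exactly on a line give r of +1 or −1 depending on the slope's sign, and r = 0.80 gives r² = 0.64 as a proportion, which reads as 64% as a percentage.

```python
import math


def sample_corr(x, y):
    """Sample coefficient of correlation r = s_xy / (s_x * s_y)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - x_bar) ** 2 for a in x) / (n - 1)
    var_y = sum((b - y_bar) ** 2 for b in y) / (n - 1)
    return s_xy / math.sqrt(var_x * var_y)


# Made-up points lying exactly on a straight line.
x = [1, 2, 3, 4, 5]
y_up = [3 + 2 * xi for xi in x]     # positive slope
y_down = [3 - 2 * xi for xi in x]   # negative slope

r_up = sample_corr(x, y_up)
r_down = sample_corr(x, y_down)

# Squaring r: the same number as a proportion and as a percentage.
r = 0.80
r_squared = r ** 2

print(f"r for upward line:   {r_up:.4f}")
print(f"r for downward line: {r_down:.4f}")
print(f"r^2 as a proportion: {r_squared:.2f}; as a percentage: {r_squared:.0%}")
```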