BUSS6002 · Data Science In Business
Matrices & Linear Regression
Week 6 turns the unit's linear-algebra notation into its single most-examined model. A matrix stacks the data into rows (observations) and columns (variables), and once the dataset is written that way, fitting a straight line through a cloud of points collapses to one closed-form expression: the OLS estimator β̂ = (XᵀX)⁻¹Xᵀy. This chapter covers the matrix operations you must do by hand (transpose, the row-times-column product and its dimension rule), how to read and interpret regression coefficients, the goodness-of-fit measure R², and the residual-plot diagnostics that decide whether the model is even correctly specified. It is examined across all three question types — MCQ, short-answer derivations, and hand-written Python.
What this chapter covers
- 1. The data matrix: rows = observations, columns = variables; bold UPPER-case notation
- 2. Matrix operations: transpose, identity, inverse, determinant (plus element-wise addition and scalar multiplication)
- 3. The matrix product: row·column, the inner-dimension rule, and why AB ≠ BA
- 4. The regression model: simple y = β₀ + β₁x + ε and matrix form y = Xβ + ε
- 5. OLS as the closed-form RSS minimiser: β̂ = (XᵀX)⁻¹Xᵀy
- 6. Coefficient interpretation: an associated average change, never a cause
- 7. Error assumptions: zero mean / constant variance for estimation; Normality only for inference
- 8. Goodness of fit and diagnostics: R² = 1 − RSS/TSS, and reading the residual plot
Fit an OLS line from summary statistics and find R²
Given (the values used in the steps below): Sₓᵧ = 900, Sₓₓ = 200, x̄ = 10, ȳ = 50, TSS = 5000.
- (+1) Slope. β̂₁ = Sₓᵧ / Sₓₓ = 900 / 200 = 4.5.
- (+1) Intercept. β̂₀ = ȳ − β̂₁x̄ = 50 − 4.5×10 = 50 − 45 = 5.
- (+1) Equation. The fitted line is ŷ = 5 + 4.5x.
- (+1) Explained variation. The regression sum of squares is SSR = β̂₁·Sₓᵧ = 4.5×900 = 4050.
- (+1) R². R² = SSR / TSS = 4050 / 5000 = 0.81 (equivalently 1 − RSS/TSS with RSS = 5000 − 4050 = 950).
- (+1) Interpret. A one-unit increase in x is associated with an average increase of 4.5 in y; about 81% of the variation in y is explained by x.
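The arithmetic in these steps can be checked with a few lines of Python. This is only a sketch: the variable names (s_xy, s_xx, x_bar, y_bar, tss) are illustrative, and the input values are the ones that appear in the worked steps above.

```python
# Summary statistics taken from the worked steps above
# (variable names are illustrative, not from the unit materials).
s_xy, s_xx = 900.0, 200.0
x_bar, y_bar = 10.0, 50.0
tss = 5000.0

slope = s_xy / s_xx                 # beta1_hat = Sxy / Sxx
intercept = y_bar - slope * x_bar   # beta0_hat = y_bar - beta1_hat * x_bar
ssr = slope * s_xy                  # regression (explained) sum of squares
rss = tss - ssr                     # residual sum of squares
r2 = ssr / tss                      # same value as 1 - rss / tss

print(f"y_hat = {intercept:g} + {slope:g}x, R^2 = {r2:.2f}")
# prints: y_hat = 5 + 4.5x, R^2 = 0.81
```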
Key terms
- Matrix
- A rectangular table of numbers A ∈ ℝᵐˣⁿ (m rows, n columns), written bold UPPER-case. When it holds data the convention is fixed: rows are observations and columns are variables.
- Transpose (Aᵀ)
- The matrix obtained by swapping rows and columns, so (Aᵀ)ᵢⱼ = Aⱼᵢ. It satisfies (Aᵀ)ᵀ = A and (A+B)ᵀ = Aᵀ + Bᵀ, and appears throughout OLS as XᵀX and Xᵀy.
- Matrix product (AB)
- Defined only when the columns of A equal the rows of B; entry cᵢⱼ = Σₖ aᵢₖbₖⱼ is row i of A dotted with column j of B. It is NOT element-wise and NOT commutative — in general AB ≠ BA.
- OLS estimator
- Ordinary least squares: the coefficient vector that minimises the residual sum of squares ‖y − Xβ‖². For the linear model it has the closed form β̂ = (XᵀX)⁻¹Xᵀy — an exact, one-shot solution requiring no iteration.
- Residual
- The vertical gap between an observed and fitted value, eᵢ = yᵢ − ŷᵢ. Residuals are the raw material for diagnostics and for the residual sum of squares RSS = Σeᵢ².
- R² (coefficient of determination)
- R² = 1 − RSS/TSS = (TSS − RSS)/TSS, the proportion of variation in y explained by the model; for an OLS fit with an intercept it lies in [0, 1]. It never falls when predictors are added, even useless ones, which is why model selection uses adjusted R².
- Standard error of a coefficient
- SE(β̂₁) measures the sampling variability of a slope estimate; SE(β̂₁)² = σ²/Σ(xᵢ − x̄)². It drives the t-statistic t = β̂₁/SE(β̂₁) and the approximate 95% confidence interval β̂₁ ± 2·SE.
- Heteroscedasticity
- Non-constant error variance — var(ε|x) changes with x. It shows up as a funnel / fan-out in the residual-vs-fitted plot and breaks one of the OLS estimation assumptions; a response transform such as log y is the usual remedy.
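Several of the terms above (transpose, matrix product, OLS estimator, residual, R², standard error) can be seen working together in one short NumPy sketch. The toy data are hypothetical. Estimating σ² by RSS/(n − 2) is the usual choice for simple regression, but it is an assumption here, not something stated in the definitions above.

```python
import numpy as np

# Hypothetical toy data: 5 observations of one predictor x and a response y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Design matrix: a column of ones (intercept) beside x, shape (5, 2).
X = np.column_stack([np.ones_like(x), x])

XtX = X.T @ X   # (2x5)(5x2) -> 2x2
Xty = X.T @ y   # (2x5)(5,)  -> length-2 vector

# Solve (X^T X) beta = X^T y. This gives the same beta_hat as
# inv(XtX) @ Xty, but avoids forming the inverse explicitly.
beta_hat = np.linalg.solve(XtX, Xty)

y_hat = X @ beta_hat
residuals = y - y_hat
rss = np.sum(residuals ** 2)
tss = np.sum((y - y.mean()) ** 2)
r2 = 1 - rss / tss

# Standard error of the slope, with sigma^2 estimated by RSS / (n - 2)
# (an assumption of this sketch).
n = len(y)
sigma2_hat = rss / (n - 2)
se_slope = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))
t_slope = beta_hat[1] / se_slope

print(beta_hat, r2)   # approximately [2.2 0.6] and 0.6
```

By hand, Sₓₓ = 10 and Sₓᵧ = 6 for these data, so the slope is 0.6 and the intercept is 4 − 0.6×3 = 2.2, which agrees with the matrix calculation.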
Matrices & Linear Regression FAQ
Is matrix multiplication commutative?
No. In general AB ≠ BA — order matters, and assuming otherwise is a guaranteed wrong MCQ answer. The product is also only defined when the inner dimensions match (columns of A = rows of B); otherwise it is undefined. And it is never element-wise: each entry is a row of A dotted with a column of B.
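A small NumPy check makes all three points concrete. The matrices are arbitrary illustrative choices; `@` is NumPy's matrix product and `*` is the element-wise product.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

AB = A @ B            # swaps the columns of A: [[2, 1], [4, 3]]
BA = B @ A            # swaps the rows of A:    [[3, 4], [1, 2]]
elementwise = A * B   # NOT the matrix product: [[0, 2], [3, 0]]

# Inner-dimension rule: (2x3)(2x3) is undefined, (2x3)(3x2) is fine.
C = np.ones((2, 3))
D = np.ones((2, 3))
try:
    C @ D
    defined = True
except ValueError:
    defined = False   # NumPy refuses mismatched inner dimensions

CDt = C @ D.T         # (2x3)(3x2) -> 2x2

print(np.array_equal(AB, BA), defined, CDt.shape)
# prints: False False (2, 2)
```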
Does OLS require the errors to be Normally distributed?
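Not for estimation. The closed form β̂ = (XᵀX)⁻¹Xᵀy can be computed without any distributional assumption. The estimation assumptions listed for this unit are zero-mean errors E(ε) = 0, constant variance var(ε) = σ², and errors uncorrelated with x. Strictly, unbiasedness E(β̂) = β follows from the zero-conditional-mean condition E(ε|x) = 0 alone; constant variance (together with mutually uncorrelated errors) is what justifies the standard-error formula and makes OLS the most efficient linear unbiased estimator. You only add ε ~ N(0, σ²) when you want exact inference: t-tests and confidence intervals. Requiring Normality for estimation is a classic exam trap.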
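A quick simulation can illustrate the point. In this sketch the true coefficients, the x grid, the uniform error distribution and the number of replications are all arbitrary assumptions. The errors have zero mean and constant variance but are clearly not Normal, and the average of the OLS estimates still lands very close to the true β.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed for reproducibility

beta_true = np.array([5.0, 4.5])        # assumed "true" intercept and slope
x = np.linspace(0, 20, 50)
X = np.column_stack([np.ones_like(x), x])

estimates = []
for _ in range(2000):
    # Uniform errors on [-3, 3]: zero mean, constant variance, not Normal.
    eps = rng.uniform(-3.0, 3.0, size=x.size)
    y = X @ beta_true + eps
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

# Averaging over many simulated samples approximates E(beta_hat).
mean_estimate = np.mean(estimates, axis=0)
print(mean_estimate)   # close to [5.0, 4.5]
```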
What does the slope coefficient actually mean?
β̂₁ is the average change in y associated with a one-unit increase in x, holding other predictors fixed. Phrase it as 'associated with', not 'causes' — regression on observational data shows association, not causation, and the causal wording loses marks.
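As a short worked check of this reading, take the fitted line ŷ = 5 + 4.5x from the worked example earlier in this chapter (an illustrative choice). The difference between two predictions one unit apart is exactly the slope:

```latex
\hat{y}(x+1) - \hat{y}(x)
  = \bigl[\hat\beta_0 + \hat\beta_1 (x+1)\bigr] - \bigl[\hat\beta_0 + \hat\beta_1 x\bigr]
  = \hat\beta_1,
\qquad \text{e.g. } \hat{y}(11) - \hat{y}(10) = 54.5 - 50 = 4.5 .
```

This is a statement about fitted averages, which is why the wording stays at 'associated with'.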
How do I diagnose a residual plot in the exam?
Plot residuals against the fitted values and look for structure. A flat, formless band of constant spread indicates that the assumptions look reasonable. A curve or U-shape means the zero-conditional-mean assumption E(ε|x) = 0 has failed (a missed nonlinearity; add an x² term). A funnel or fan-out means the constant-variance assumption var(ε|x) = σ² has failed (heteroscedasticity; transform y, e.g. log y). Name the specific assumption, not just 'the model is bad'.
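The two failure patterns can also be seen numerically, without drawing the plot. The sketch below simulates one curved and one funnel-shaped dataset, with all settings chosen arbitrarily, and then summarises the residuals. The helper `fit_residuals` and the two summary measures are hypothetical conveniences, not standard diagnostics from the unit.

```python
import numpy as np

def fit_residuals(x, y):
    """Fit y = b0 + b1*x by OLS and return (fitted values, residuals)."""
    X = np.column_stack([np.ones_like(x), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_hat
    return fitted, y - fitted

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)             # sorted, so array order = x order

# Case 1: the truth is curved, but a straight line is fitted.
y_curved = 2 + 0.5 * x**2 + rng.normal(0, 1, x.size)
_, e_curved = fit_residuals(x, y_curved)
# Mean residual in the low / middle / high thirds of x.
curve_signal = [part.mean() for part in np.array_split(e_curved, 3)]

# Case 2: the error spread grows with x (heteroscedasticity).
y_funnel = 2 + 3 * x + rng.normal(0, 0.5 * x)
_, e_funnel = fit_residuals(x, y_funnel)
low, high = np.array_split(e_funnel, 2)
spread_ratio = high.std() / low.std()

print(curve_signal)   # U-shape: positive, negative, positive
print(spread_ratio)   # well above 1 for a funnel
```

A positive, negative, positive pattern in the residual means is the numerical counterpart of a U-shape. A spread ratio well above 1 is the counterpart of a fan-out.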
Why isn't a high R² enough to choose a model?
Because R² can only increase (or stay flat) when you add a predictor, even a completely useless one. So a bigger model almost always has a higher R² without being better. That is why the unit introduces adjusted R², which penalises each extra term and can fall — and motivates the bias–variance model-selection ideas in a later chapter.
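The effect is easy to reproduce. In this sketch the data are simulated, and the helper `r2_and_adjusted` is a hypothetical name. The adjusted R² formula used, 1 − [RSS/(n − k)] / [TSS/(n − 1)] with k counting the intercept, is the standard textbook definition and is assumed to match the unit's.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit.

    X must already contain the intercept column, so k = X.shape[1]
    counts the intercept as one of the fitted coefficients.
    """
    n, k = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    adj_r2 = 1 - (rss / (n - k)) / (tss / (n - 1))
    return r2, adj_r2

rng = np.random.default_rng(2)
n = 40
x = rng.uniform(0, 10, n)
y = 5 + 4.5 * x + rng.normal(0, 3, n)
junk = rng.normal(size=n)               # generated independently of y

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, junk])

r2_small, adj_small = r2_and_adjusted(X_small, y)
r2_big, adj_big = r2_and_adjusted(X_big, y)

print(r2_big >= r2_small)               # True: R^2 cannot fall
print(adj_small, adj_big)               # adjusted R^2 is free to fall
```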
Exam move
Treat Week 6 as the quantitative core of both exams and drill it for speed by hand. Get matrix multiplication automatic — check inner dimensions, then row-dot-column — because the 1-mark MCQ and every regression quantity (XᵀX, Xᵀy, Xβ̂) depend on it. Memorise the OLS pipeline as a fixed routine: β̂₁ = Sₓᵧ/Sₓₓ, β̂₀ = ȳ − β̂₁x̄, then verify a fitted point, then R² = 1 − RSS/TSS. Practise interpreting a slope as an associated (not causal) average change, and rehearse the residual-plot short-answer until you can instantly map curvature → zero-mean failure and funnel → constant-variance failure with a one-line remedy. Finally, keep the estimation-vs-inference distinction sharp (Normality is only for inference) and be ready to write the equivalent NumPy/statsmodels code from memory.
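The hand routine can be rehearsed against NumPy as a self-check. The data below are hypothetical. `np.polyfit(x, y, 1)` returns the slope first and then the intercept, so it serves as an independent cross-check of the Sₓᵧ/Sₓₓ calculation.

```python
import numpy as np

# Hypothetical practice data.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([3.0, 7.0, 8.0, 12.0])

x_bar, y_bar = x.mean(), y.mean()
s_xx = np.sum((x - x_bar) ** 2)
s_xy = np.sum((x - x_bar) * (y - y_bar))

b1 = s_xy / s_xx                 # slope
b0 = y_bar - b1 * x_bar          # intercept

# Verify a fitted point: the OLS line passes through (x_bar, y_bar).
passes_through_means = np.isclose(b0 + b1 * x_bar, y_bar)

rss = np.sum((y - (b0 + b1 * x)) ** 2)
tss = np.sum((y - y_bar) ** 2)
r2 = 1 - rss / tss

# Cross-check against NumPy's degree-1 polynomial fit.
slope_np, intercept_np = np.polyfit(x, y, 1)

print(b0, b1, r2)
```

By hand: Sₓₓ = 20 and Sₓᵧ = 28, so β̂₁ = 1.4 and β̂₀ = 7.5 − 1.4×5 = 0.5; RSS = 1.8 and TSS = 41, so R² = 1 − 1.8/41 ≈ 0.956. If statsmodels is available, `sm.OLS(y, sm.add_constant(x)).fit()` should report the same coefficients.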