BUSS6002 · Data Science In Business
Classification & Logistic Regression
This is the Week 8 block where BUSS6002 stops predicting numbers and starts predicting categories: will a customer churn, is a transaction fraudulent, will a lead convert. The chapter has two halves you must keep separate. The first is how you evaluate a classifier: the confusion matrix and the rates that come off it (accuracy, recall, precision, specificity, FPR and F1). The second is logistic regression as a generalised linear model that outputs a probability through the logit link and sigmoid activation. The most-tested ideas are that the linear predictor xᵀβ lives on the log-odds scale, not the probability scale; that logistic regression is fit by maximum likelihood, not least squares; and that accuracy is misleading under class imbalance, so you switch to precision, recall and F1. The topic is double-weighted: the 30% assignment is itself a logistic-regression task in Python, and the 45% final examines it with an MCQ, a short-answer confusion matrix and a concept question.
What this chapter covers
- 1. Classifier vs probabilistic classifier: label f(x) vs a predicted probability P̂(y|x)
- 2. The decision threshold τ: default 0.5, equal-cost assumption (BUSS6002 does not optimise it)
- 3. The confusion matrix: TP, FN, FP, TN as a 2×2 table of prediction vs truth
- 4. Metrics off the matrix: accuracy, recall (TPR), specificity (TNR), FPR, precision, F1
- 5. The GLM template: g(E(y|x)) = xᵀβ, identity link = linear regression
- 6. Logistic regression: logit link log(p/(1−p)) = xᵀβ and the sigmoid activation
- 7. Prediction and odds: push the log-odds through the sigmoid; exp(βⱼ) is the odds ratio
- 8. Class imbalance, ROC and marketing models: propensity vs campaign-response vs uplift
Build a confusion matrix and compute the metrics
- (+1) Build the counts by comparing each (actual, predicted) pair. Positives (y = 1) are at positions 1, 2, 4, 7, 9: predicted positive at 1, 4, 7, 9 → TP = 4, and position 2 is missed → FN = 1. Negatives (y = 0) are at positions 3, 5, 6, 8, 10: wrongly flagged at 5 and 10 → FP = 2, correctly cleared at 3, 6, 8 → TN = 3.
- (+1) Accuracy = (TP + TN) / n = (4 + 3) / 10 = 0.70, the overall fraction correct across all four cells.
- (+1) Recall (TPR) = TP / (TP + FN) = 4 / (4 + 1) = 4/5 = 0.80: of the 5 real positives, four were caught.
- (+1) Precision = TP / (TP + FP) = 4 / (4 + 2) = 4/6 ≈ 0.667: of the 6 cases flagged positive, four were genuinely positive.
- (+1) F1 = 2·TP / (2·TP + FP + FN) = 8 / (8 + 2 + 1) = 8/11 ≈ 0.727, the harmonic mean balancing the 0.80 recall against the 0.667 precision.
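The five steps above can be checked in a few lines of plain Python. This is a minimal sketch, not the unit's own code: the two label vectors are reconstructed from the positions quoted in the first step (the original vectors are not reproduced here), and the variable names are illustrative. It also reports specificity and FPR, which the worked steps skip.

```python
# Label vectors reconstructed from the positions quoted in the worked steps
# (assumption: the original vectors are not shown in this section).
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]  # positives at positions 1, 2, 4, 7, 9
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]  # flagged at positions 1, 4, 5, 7, 9, 10

# Count the four cells by comparing each (actual, predicted) pair.
pairs = list(zip(y_true, y_pred))
tp = sum(1 for y, p in pairs if y == 1 and p == 1)
fn = sum(1 for y, p in pairs if y == 1 and p == 0)
fp = sum(1 for y, p in pairs if y == 0 and p == 1)
tn = sum(1 for y, p in pairs if y == 0 and p == 0)

# Every rate is a ratio of these counts.
accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn)            # true positive rate
precision = tp / (tp + fp)
specificity = tn / (tn + fp)       # true negative rate
fpr = fp / (fp + tn)               # equals 1 - specificity
f1 = 2 * tp / (2 * tp + fp + fn)

print(tp, fn, fp, tn)
print(round(accuracy, 3), round(recall, 3), round(precision, 3), round(f1, 3))
```

Counting the cells first and dividing second mirrors the advice to draw the table before computing any rate.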
Key terms
- Classifier vs probabilistic classifier
- A classifier ŷ = f(x) maps predictors straight to a class label. A probabilistic classifier g(x) = P̂(y|x) returns a probability over the classes; the hard rule then picks the most probable one. The probabilistic version keeps the confidence, which is strictly more informative.
- Decision threshold τ
- The cut-off that turns a predicted probability into a label: ŷ = 1 if P̂(y=1|x) > τ, else 0. The default τ = 0.5 assumes a false positive and a false negative are equally costly. BUSS6002 fixes τ = 0.5 and does not ask you to optimise it.
- Confusion matrix
- The 2×2 table cross-tabulating predicted class against actual class, with cells TP (true positive), FP (false positive / Type I error), FN (false negative / Type II error) and TN (true negative). Every classification metric in this chapter is a ratio built from these four counts; accuracy, for example, uses all four.
- Recall, precision and F1
- Recall (TPR) = TP/(TP+FN) ≈ P(ŷ=1|y=1) asks how many real positives you caught; precision = TP/(TP+FP) ≈ P(y=1|ŷ=1) asks how trustworthy your flags are. F1 = 2·TP/(2·TP+FP+FN) is their harmonic mean — one number when both matter, and it deliberately ignores TN.
- GLM (link and activation)
- A generalised linear model keeps the linear predictor xᵀβ but passes the conditional mean through a link function g: g(E(y|x)) = xᵀβ. Its inverse g⁻¹ is the activation. Linear regression is the GLM with the identity link; logistic regression is the GLM with a Bernoulli response and the logit link.
- Logit link and sigmoid activation
- The logit (link) sends a probability to the log-odds: log(p/(1−p)) = xᵀβ. The sigmoid (activation) is its inverse, mapping the log-odds back to a probability: p = 1/(1+exp(−xᵀβ)). Keeping these two straight — and never reading xᵀβ as a probability — is the chapter's signature exam skill.
- Odds ratio
- A one-unit increase in xⱼ adds βⱼ to the log-odds, which multiplies the odds by exp(βⱼ) — the odds ratio. A positive coefficient therefore raises the odds and the predicted probability; this is the natural way to interpret a logistic coefficient.
- Uplift model
- A predictive-marketing model that estimates the causal, incremental effect of a campaign by comparing treatment and control groups — finding the 'persuadables'. It differs from a propensity model (natural likelihood of an action) and a campaign-response model (likelihood given exposure), neither of which isolates the causal lift.
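The odds-ratio entry above is easy to confirm numerically. The sketch below uses hypothetical coefficients (beta0 = -2.0 and beta1 = 0.5 are invented for illustration, not taken from the unit) and a single predictor. It shows that a one-unit increase in x multiplies the odds by exp(beta1) wherever you start, while the change in probability depends on the starting point.

```python
import math

def sigmoid(eta):
    """Map log-odds eta to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -2.0, 0.5

def prob(x):
    return sigmoid(beta0 + beta1 * x)

def odds(x):
    p = prob(x)
    return p / (1.0 - p)

# One-unit increases from two different starting points.
ratio_low = odds(1.0) / odds(0.0)
ratio_high = odds(3.0) / odds(2.0)

# The probability change is not constant, unlike the odds ratio.
step_low = prob(1.0) - prob(0.0)
step_high = prob(3.0) - prob(2.0)

print(round(ratio_low, 4), round(ratio_high, 4), round(math.exp(beta1), 4))
print(round(step_low, 4), round(step_high, 4))
```

This is why a logistic coefficient is read as a multiplicative effect on the odds rather than as a fixed change in probability.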
Classification & Logistic Regression FAQ
Is the linear predictor xᵀβ a probability?
No — this is the most punished mistake in the chapter. xᵀβ is on the log-odds scale. To get a probability you must push it through the sigmoid: P̂ = 1/(1+exp(−xᵀβ)). Sign check: a positive log-odds always gives P̂ > 0.5, and the more positive it is, the closer the probability gets to 1.
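The prediction routine in this answer can be sketched as a small function. The coefficients and input rows below are hypothetical, and `predict` is an illustrative helper name rather than any library's API. It returns the log-odds, the probability and the label at the default threshold of 0.5.

```python
import math

def predict(x, beta, tau=0.5):
    """Return (log-odds, probability, label) for one observation.

    x and beta are equal-length lists; x starts with 1.0 for the intercept.
    """
    eta = sum(b * xi for b, xi in zip(beta, x))   # linear predictor = log-odds
    p_hat = 1.0 / (1.0 + math.exp(-eta))          # sigmoid: log-odds -> probability
    label = 1 if p_hat > tau else 0               # threshold at tau
    return eta, p_hat, label

beta = [-1.0, 0.5]                 # hypothetical [intercept, slope]

# Positive log-odds: probability above 0.5, label 1.
eta_pos, p_pos, label_pos = predict([1.0, 5.0], beta)

# Negative log-odds: probability below 0.5, label 0.
eta_neg, p_neg, label_neg = predict([1.0, -2.0], beta)

print(eta_pos, round(p_pos, 4), label_pos)
print(eta_neg, round(p_neg, 4), label_neg)
```

With these made-up numbers the first case has log-odds 1.5, so the probability is 1/(1 + exp(−1.5)) ≈ 1/1.22 ≈ 0.82, which is the kind of arithmetic the memorised value exp(−1.5) ≈ 0.22 is meant to support.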
What is the difference between the logit and the sigmoid?
They are inverses. The logit is the link, taking a probability to the log-odds: log(p/(1−p)) = xᵀβ. The sigmoid is the activation, taking the log-odds back to a probability. In one direction you compress (0,1) onto the whole real line; in the other you squash the real line back into (0,1). Swapping their names is a classic exam trap.
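The inverse relationship can be verified by solving the logit equation for p. Writing η = xᵀβ for the log-odds, a short derivation is:

```latex
\begin{aligned}
\log\frac{p}{1-p} &= \eta, \qquad \eta = x^{\top}\beta \\
\frac{p}{1-p} &= e^{\eta} \\
p &= e^{\eta}\,(1-p) \\
p\,(1+e^{\eta}) &= e^{\eta} \\
p &= \frac{e^{\eta}}{1+e^{\eta}} = \frac{1}{1+e^{-\eta}}
\end{aligned}
```

Reading the lines from top to bottom applies the sigmoid (log-odds to probability); reading them from bottom to top applies the logit (probability to log-odds).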
Why is logistic regression fit by maximum likelihood and not least squares?
Because the response is Bernoulli, not Gaussian, least squares is not the right criterion: it coincides with maximum likelihood only under Gaussian errors. You instead maximise the Bernoulli log-likelihood (equivalently, minimise cross-entropy / log-loss). Unlike OLS, this has no closed-form solution, so it is solved iteratively, for example by gradient descent. This links forward to the maximum-likelihood and optimisation chapter.
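To make the iterative fit concrete, here is a minimal sketch in plain Python with one predictor. Everything in it is an assumption for illustration: the toy data are invented, the learning rate and step count are arbitrary, and `fit` is a hypothetical helper. It performs gradient ascent on the average log-likelihood, which is the same thing as gradient descent on the average log-loss; library solvers generally use more sophisticated optimisers.

```python
import math

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def log_likelihood(beta, xs, ys):
    """Bernoulli log-likelihood of beta = [intercept, slope]."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(beta[0] + beta[1] * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit(xs, ys, lr=0.1, steps=2000):
    """Gradient ascent on the average log-likelihood."""
    beta = [0.0, 0.0]
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            resid = y - sigmoid(beta[0] + beta[1] * x)  # y minus fitted probability
            g0 += resid          # gradient with respect to the intercept
            g1 += resid * x      # gradient with respect to the slope
        beta[0] += lr * g0 / n
        beta[1] += lr * g1 / n
    return beta

# Hypothetical toy data: larger x tends to go with y = 1, with some overlap.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

beta_start = [0.0, 0.0]
beta_hat = fit(xs, ys)

print(beta_hat)
print(log_likelihood(beta_start, xs, ys), log_likelihood(beta_hat, xs, ys))
```

The classes overlap on purpose: with perfectly separated data the likelihood keeps improving as the slope grows without bound, so no finite maximum exists.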
Why is accuracy a bad metric under class imbalance?
When the positive class is rare — say 2% fraud — a do-nothing classifier that predicts 'never positive' scores 98% accuracy and 100% specificity while catching zero positives. Accuracy rewards ignoring the cases you actually care about, so you report precision, recall and F1 instead, which expose the missed positives.
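The do-nothing example above can be reproduced directly. The sketch assumes a hypothetical data set of 1,000 transactions with exactly 2% fraud; the counts are invented to match the percentages in the text.

```python
# Hypothetical data: 1,000 transactions, 20 of them fraud (y = 1).
y_true = [1] * 20 + [0] * 980
y_pred = [0] * len(y_true)          # the do-nothing rule: never flag anything

pairs = list(zip(y_true, y_pred))
tp = sum(y == 1 and p == 1 for y, p in pairs)
fn = sum(y == 1 and p == 0 for y, p in pairs)
fp = sum(y == 0 and p == 1 for y, p in pairs)
tn = sum(y == 0 and p == 0 for y, p in pairs)

accuracy = (tp + tn) / len(pairs)
specificity = tn / (tn + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
# Precision is 0/0 here because nothing was flagged, so it is left undefined.
precision = tp / (tp + fp) if (tp + fp) > 0 else None

print(accuracy, specificity, recall, f1, precision)
```

Accuracy and specificity look excellent while recall and F1 are both zero, which is exactly the failure the headline accuracy figure hides.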
How do I tell recall and precision apart?
Condition on different things. Recall conditions on the truth (of the real positives, how many did I catch?) and uses the actual-positive column TP/(TP+FN). Precision conditions on the prediction (of what I flagged, how many were real?) and uses the predicted-positive row TP/(TP+FP). Flag everything and recall hits 1 while precision collapses — F1 stops you cheating either one.
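Both ways of gaming a single metric can be shown side by side. In this sketch the labels are hypothetical (5 positives among 50 cases) and `precision_recall_f1` is an illustrative helper, not a library function. One rule flags everything; the other flags only a single case that happens to be a true positive.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 from two 0/1 label lists.

    Assumes at least one actual positive, so recall and F1 are defined.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(y == 1 and p == 1 for y, p in pairs)
    fp = sum(y == 0 and p == 1 for y, p in pairs)
    fn = sum(y == 1 and p == 0 for y, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

# Hypothetical labels: 5 real positives followed by 45 negatives.
y_true = [1] * 5 + [0] * 45

flag_all = [1] * 50              # maximises recall
flag_one = [1] + [0] * 49        # maximises precision

p_all, r_all, f1_all = precision_recall_f1(y_true, flag_all)
p_one, r_one, f1_one = precision_recall_f1(y_true, flag_one)

print(round(p_all, 3), round(r_all, 3), round(f1_all, 3))
print(round(p_one, 3), round(r_one, 3), round(f1_one, 3))
```

Each rule scores a perfect 1.0 on the metric it targets, yet F1 stays low in both cases because the harmonic mean is dragged down by the weaker of the two.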
Is this guide official or affiliated with the University of Sydney?
No. This is an independent AskSia study resource for BUSS6002, written to mirror the examinable ideas in our own words. It is not produced, endorsed by or affiliated with the University of Sydney — always confirm definitions, notation and assessment details against your official Canvas unit materials.
Exam move
Treat this chapter as two separate skills and drill each to reflex. First, the confusion matrix: practise turning a pair of label vectors into TP/FN/FP/TN and then reading off accuracy, recall, precision, specificity, FPR and F1. Always draw the table before computing, since every rate is just a ratio of cell counts (usually one cell over a row or column total), and keep the FPR-vs-false-discovery and recall-vs-precision distinctions crisp. Second, logistic regression: rehearse the prediction routine (compute the log-odds η = xᵀβ, push it through the sigmoid to a probability, then threshold at 0.5) until the 1-mark MCQ is automatic, and memorise exp(−1.5) ≈ 0.22 and exp(−2) ≈ 0.14 for the calculator-restricted exam. Lock the three conceptual anchors that examiners reward: xᵀβ is log-odds not probability, logit is the link while sigmoid is the activation, and logistic regression is fit by maximum likelihood not OLS. Be ready to explain why accuracy fails under imbalance and how an uplift model differs from a campaign-response model. Because the 30% assignment is a coded logistic-regression task, get comfortable writing the same logic in Python/sklearn from memory, not just by hand.
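For the coded side, a minimal scikit-learn sketch of the same workflow might look like the following. The toy data are hypothetical and this is not the assignment's specification. Note one assumption that matters: `LogisticRegression` applies an L2 penalty by default (`C=1.0`), so its coefficients are penalised maximum-likelihood estimates rather than the plain MLE.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Hypothetical toy data: one predictor, overlapping classes.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 1, 0, 1, 0, 1, 1]

model = LogisticRegression()              # default settings include an L2 penalty
model.fit(X, y)

# Column 1 of predict_proba is the estimated P(y = 1 | x).
p_hat = model.predict_proba(X)[:, 1]

# Apply the default threshold of 0.5 by hand.
y_hat = (p_hat > 0.5).astype(int)

# For 0/1 labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y, y_hat).ravel()

print(model.intercept_, model.coef_)
print(tn, fp, fn, tp)
```

Thresholding `predict_proba` by hand keeps the log-odds, probability and label steps visible, and on this data it gives the same labels as `model.predict`.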