MAST90139 · Statistical Modelling For Data Science
Poisson Regression
When the response is a count (emails per hour, claims per policy, cases per region) the linear model breaks: counts are non-negative integers whose variance grows with their mean. The fix is a GLM with the Poisson random component and the log link, log(μ) = Xβ. Two skills carry almost every mark: reading e^β as a rate ratio (the multiplicative effect on the expected count), and knowing when to add an offset log(exposure) so you model a rate (events per person-year, per 1000 policies) rather than a raw count. The chapter builds the Poisson pmf and its mean = variance signature, the multiplicative log-link model, offsets, deviance and Pearson goodness-of-fit, nested-model comparison by ΔD, and the all-important overdispersion fix: quasi-Poisson, where coefficients stay fixed but standard errors inflate by √φ̂.
What this chapter covers
- 01 The Poisson distribution: mean = variance
- 02 The Poisson regression model and the log link
- 03 Reading e^β as a rate ratio (★ the examined skill)
- 04 Offsets: modelling rates instead of counts (★)
- 05 Deviance and Pearson X² goodness-of-fit
- 06 Comparing nested models by ΔD vs χ²
- 07 Overdispersion and the quasi-Poisson fix
Worked example: rate ratio, offset and goodness-of-fit from a Poisson fit
- +2 (a) Rate ratio: e^{0.405} ≈ 1.50. The coastal incidence rate is about 1.5 times the inland rate, and this is per head of population because of the offset.
- +2 (b) The offset: regions have unequal populations, so raw counts aren't comparable. offset = log(pop) pins a coefficient of 1 on log(pop), converting the model from counts to a rate per head; without it, 'coastal' would partly soak up population size.
- +2 (c) Goodness-of-fit: D/df = 58.1/48 = 1.21, and χ²_{0.95}(48) ≈ 65.2. Since 58.1 < 65.2, there is no evidence of lack of fit, and the dispersion is close to 1, so the Poisson assumption is reasonable.
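The three answers above can be checked numerically. This is a minimal Python sketch using only the standard library. The chi-square cut-off comes from the Wilson-Hilferty approximation rather than an exact quantile function, and the function and variable names are illustrative, not part of the course material.

```python
import math
from statistics import NormalDist

# Values quoted in the worked example
beta_coastal = 0.405      # coefficient on the coastal indicator
deviance, df = 58.1, 48   # residual deviance and its degrees of freedom

def chi2_quantile_wh(p, k):
    """Approximate chi-square quantile (Wilson-Hilferty cube-root formula)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * k)
    return k * (1.0 - c + z * math.sqrt(c)) ** 3

rate_ratio = math.exp(beta_coastal)      # (a) multiplicative effect on the rate
dispersion_check = deviance / df         # (c) should sit near 1 under Poisson
critical = chi2_quantile_wh(0.95, df)    # (c) approximate 5% cut-off

print(f"rate ratio   = {rate_ratio:.2f}")
print(f"D / df       = {dispersion_check:.2f}")
print(f"chi2 cut-off = {critical:.1f}")
print("lack of fit" if deviance > critical else "no evidence of lack of fit")
```

In R the exact cut-off would come from qchisq(0.95, 48); the approximation here agrees with it to about one decimal place.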
Key terms
- Log link: g(μ) = log(μ), the canonical link for the Poisson. Modelling log(μ) linearly keeps the fitted mean positive and makes covariate effects multiplicative on the count scale: μ = e^{Xβ}.
- Rate ratio: e raised to a Poisson coefficient, e^β. It is the factor by which the expected count (or rate, with an offset) multiplies for a one-unit rise in the predictor, and it is the count analogue of the logistic odds ratio.
- Offset: a term log(exposure) added to the linear predictor with its coefficient fixed at 1, so the model describes a rate (events per unit exposure) rather than a raw count. Essential when rows have different exposures (population, time, area).
- Poisson deviance: D = 2Σ[y log(y/μ̂) − (y − μ̂)], with y log(y/μ̂) taken as 0 when y = 0. This is the goodness-of-fit statistic that under a correct model is approximately χ²(n−q), where q is the number of fitted parameters. D well above its df signals lack of fit or overdispersion.
- Quasi-Poisson: a fit that keeps the log-link mean model but estimates a dispersion φ, so Var = φμ. The usual estimate is φ̂ = X²/(n−q), the Pearson statistic over its residual df. Coefficients (and rate ratios) are identical to the Poisson fit; standard errors inflate by √φ̂, and nested comparisons use an F-test. AIC is undefined for quasi-families.
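The deviance formula in the key terms, and the Pearson X² it is usually paired with, can be written out directly. The sketch below is plain Python with hypothetical counts and fitted means chosen only to exercise the formulas; they do not come from any fitted model.

```python
import math

def poisson_deviance(y, mu):
    """D = 2 * sum[y*log(y/mu) - (y - mu)], taking y*log(y/mu) = 0 when y = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total

def pearson_x2(y, mu):
    """X^2 = sum[(y - mu)^2 / mu], using Var(Y) = mu under the Poisson."""
    return sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))

# Hypothetical observed counts and fitted means, for illustration only
y = [2, 0, 5]
mu_hat = [2.0, 1.0, 4.0]

D = poisson_deviance(y, mu_hat)
X2 = pearson_x2(y, mu_hat)
print(f"deviance = {D:.4f}, Pearson X2 = {X2:.4f}")
```

Both statistics are compared with the same residual df, and they usually tell the same story when the fitted means are not too small.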
Poisson Regression FAQ
Why use Poisson regression instead of a linear model on the counts?
Three failures of the linear model on counts: the fitted mean can go negative (impossible for a count); the variance is not constant but equals the mean, so large counts are noisier; and the integer, right-skewed shape is far from normal. The Poisson GLM fixes all three at once by modelling log(μ) linearly and letting Var = μ.
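The mean = variance signature can be confirmed straight from the pmf, P(Y = k) = e^{−μ} μ^k / k!. The following is a small Python sketch; the mean of 3.5 is a hypothetical value, and the sum is truncated at 200, where the remaining tail is negligible for a mean this small.

```python
import math

def poisson_pmf(k, mu):
    """P(Y = k) = exp(-mu) * mu^k / k!, computed on the log scale for stability."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

mu = 3.5                  # hypothetical mean, chosen for illustration
support = range(0, 200)   # the tail beyond 200 is negligible for this mu

probs = [poisson_pmf(k, mu) for k in support]
mean = sum(k * p for k, p in zip(support, probs))
variance = sum((k - mean) ** 2 * p for k, p in zip(support, probs))

print(f"total probability = {sum(probs):.6f}")
print(f"mean = {mean:.4f}, variance = {variance:.4f}")
```

Changing mu moves the mean and the variance together, which is exactly the pattern a constant-variance linear model cannot represent.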
How do I interpret a Poisson coefficient?
Exponentiate it: e^β is a rate ratio, the multiplicative change in the expected count (or rate, if there is an offset) per one-unit rise in the predictor. Say it as 'the expected count multiplies by e^β.' It is the count-data twin of the logistic odds ratio.
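Why the effect is multiplicative follows from the log link: log μ(x+1) − log μ(x) = β, so μ(x+1)/μ(x) = e^β wherever x starts. The Python sketch below illustrates this with hypothetical coefficients b0 and b1 that are not taken from any fitted model.

```python
import math

# Hypothetical coefficients, not taken from any fitted model
b0, b1 = 0.2, 0.3

def expected_count(x):
    """Mean under the log link: mu = exp(b0 + b1 * x)."""
    return math.exp(b0 + b1 * x)

# The ratio for a one-unit rise is the same wherever you start
for x in (0, 3, 10):
    ratio = expected_count(x + 1) / expected_count(x)
    print(f"x = {x:>2}: mu(x+1) / mu(x) = {ratio:.4f}")

rate_ratio = math.exp(b1)
percent_change = 100 * (rate_ratio - 1)
print(f"e^b1 = {rate_ratio:.4f}, about a {percent_change:.0f}% rise per unit of x")
```

For a rise of k units the factor is e^{kβ}, so effects compound rather than add.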
When do I need an offset?
Whenever rows have different exposures (populations, time at risk, areas, numbers of trials) and you want to model a rate rather than a raw count. Add offset = log(exposure); its coefficient is fixed at 1, so the fitted mean is exposure × rate and e^β becomes a ratio of rates per unit exposure. Omitting it lets your covariates partly measure how big each unit is.
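To see what the offset buys you, take the simplest case: one binary predictor plus a log(population) offset. In that special case the maximum-likelihood rate in each group is total events over total exposure, so the fit can be written down without an iterative solver. The regions below are hypothetical numbers invented for illustration, and the helper names are not from any library.

```python
import math

# Hypothetical regions: (events, population, coastal indicator)
regions = [
    (30, 10_000, 0),
    (90, 30_000, 0),
    (45, 10_000, 1),
    (90, 20_000, 1),
]

def pooled_rate(rows):
    """Events per head: the ML rate for a group under a log(pop) offset."""
    return sum(y for y, _, _ in rows) / sum(pop for _, pop, _ in rows)

def mean_count(rows):
    """Average raw count per region, ignoring population."""
    return sum(y for y, _, _ in rows) / len(rows)

inland = [r for r in regions if r[2] == 0]
coastal = [r for r in regions if r[2] == 1]

beta0 = math.log(pooled_rate(inland))                         # log inland rate
beta1 = math.log(pooled_rate(coastal) / pooled_rate(inland))  # log rate ratio

# Fitted mean per region: log(mu) = log(pop) + beta0 + beta1 * coastal
for y, pop, c in regions:
    mu = math.exp(math.log(pop) + beta0 + beta1 * c)
    print(f"pop = {pop:>6}, coastal = {c}, observed = {y:>3}, fitted = {mu:.1f}")

print(f"rate ratio with offset   = {math.exp(beta1):.3f}")
print(f"ratio of raw mean counts = {mean_count(coastal) / mean_count(inland):.3f}")
```

The two printed ratios differ because the inland regions here are larger on average; the raw-count ratio mixes the coastal effect with population size, which is the distortion the offset removes. In R the same model would typically be written with offset(log(pop)) inside the glm formula.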
What is overdispersion in a Poisson model and how do I fix it?
Overdispersion means the counts vary more than the mean = variance assumption allows; the symptom is a residual deviance far above its df. Fix it with quasi-Poisson (family = quasipoisson): the coefficients and rate ratios are unchanged, but standard errors inflate by √φ̂, giving honest p-values. For severe overdispersion, or when you need a true likelihood, switch to a negative-binomial model.
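The quasi-Poisson adjustment is simple arithmetic once the dispersion is estimated. The sketch below assumes the usual Pearson-based estimate φ̂ = X²/(residual df); every number in it is hypothetical and chosen only to show the mechanics.

```python
import math

# Hypothetical summary numbers, for illustration only
pearson_x2 = 144.0   # Pearson X^2 from the Poisson fit
resid_df = 48        # residual degrees of freedom (n - q)
beta_hat = 0.405     # coefficient: identical under Poisson and quasi-Poisson
se_poisson = 0.10    # standard error reported by the Poisson fit

phi_hat = pearson_x2 / resid_df              # estimated dispersion
se_quasi = se_poisson * math.sqrt(phi_hat)   # inflated standard error

z_poisson = beta_hat / se_poisson   # test statistic trusting Var = mu
t_quasi = beta_hat / se_quasi       # test statistic allowing Var = phi * mu

print(f"phi_hat = {phi_hat:.2f}")
print(f"SE: {se_poisson:.3f} (Poisson) -> {se_quasi:.3f} (quasi-Poisson)")
print(f"statistic: {z_poisson:.2f} (Poisson) -> {t_quasi:.2f} (quasi-Poisson)")
print(f"rate ratio unchanged: {math.exp(beta_hat):.2f}")
```

The point estimate and rate ratio do not move; only the uncertainty around them grows, so an effect that looked overwhelming under the Poisson fit can become borderline.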
Exam move
Make two moves automatic. First, the rate-ratio sentence: exponentiate any Poisson coefficient and state 'the expected count/rate multiplies by e^β per unit of x'. Second, the offset decision: whenever exposures differ, add log(exposure) as an offset so you model a rate, and never enter exposure as a free covariate. Know that the Poisson's signature is mean = variance, that residual deviance is a valid goodness-of-fit test for counts (compare D to df), and the overdispersion fix: quasi-Poisson leaves coefficients alone, scales standard errors by √φ̂, and switches comparisons to the F-test (with AIC undefined). The exam reads R output, so rehearse turning a printed summary() into the rate ratio, the offset justification, and the fit verdict.
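The switch from ΔD against χ² to an F-test is worth rehearsing as arithmetic. The sketch below uses hypothetical deviances for two nested fits and a hypothetical dispersion estimate; it assumes the common form F = (ΔD/Δdf)/φ̂, with φ̂ taken from the larger model.

```python
# Hypothetical deviances for two nested fits, for illustration only
dev_small, df_small = 70.3, 49   # model without the extra term
dev_big, df_big = 58.1, 48       # model with the extra term

delta_d = dev_small - dev_big    # drop in deviance
delta_df = df_small - df_big     # number of extra parameters

# Poisson: compare delta_d with chi-square on delta_df degrees of freedom.
# 3.84 is the 0.95 quantile of chi-square with 1 df.
CHI2_95_1DF = 3.84
poisson_keeps_term = delta_d > CHI2_95_1DF

# Quasi-Poisson: scale by the dispersion estimate and refer the result
# to an F distribution on (delta_df, df_big) degrees of freedom.
phi_hat = 1.8                    # hypothetical dispersion estimate
f_stat = (delta_d / delta_df) / phi_hat

print(f"delta D = {delta_d:.1f} on {delta_df} df")
print(f"Poisson test keeps the term: {poisson_keeps_term}")
print(f"quasi-Poisson F = {f_stat:.2f} on ({delta_df}, {df_big}) df")
```

Dividing by φ̂ shrinks the evidence by the same factor that inflates the variances, which is why the two tests can disagree when overdispersion is present.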