MAST90139 · Statistical Modelling For Data Science
Binomial Models
When binary outcomes are grouped (r successes out of n trials at each covariate setting), the response is binomial and you fit grouped logistic regression. The big payoff over ungrouped data is that the residual deviance becomes a genuine goodness-of-fit statistic: under a correct model D is approximately χ² on n−q degrees of freedom (in this df formula n counts the covariate settings and q the fitted parameters), so you can test whether the model fits. The chapter covers binomial counts and the deviance / Pearson X² goodness-of-fit test, the dose–response setting (and LD50, the dose giving a 50% response), and the crucial complication of overdispersion, where the data vary more than the binomial allows. You detect it from D ≫ df, handle it by estimating a dispersion φ and refitting quasi-binomial, and then compare models with an F-test rather than χ². A full worked dose–response example ties it together end to end.
What this chapter covers
- 01 Grouped (binomial) logistic regression: r successes out of n
- 02 Residual deviance and Pearson X² as goodness-of-fit statistics
- 03 The χ² goodness-of-fit test on n−q df
- 04 Dose–response models and LD50
- 05 Overdispersion: detecting D ≫ df
- 06 Estimating the dispersion φ and refitting quasi-binomial
- 07 The F-test for quasi-binomial model comparison
Worked example: goodness-of-fit and overdispersion for a grouped binomial
- +2 (a) GoF test: compare D = 28.0 with the 0.95 quantile of χ²(10), which is about 18.3. Since 28.0 > 18.3, reject the model at the 5% level: there is significant lack of fit.
- +2 (b) Estimate φ: φ̂ ≈ D/df = 28.0/10 = 2.8. Pearson X²/df is the preferred estimate, but the deviance ratio is a quick check that gives the same signal. A value this far above 1 points to overdispersion once structural faults (a missing term, a wrong link) have been ruled out.
- +2 (c) Quasi-binomial refit: the coefficients are unchanged; only the standard errors inflate, by a factor of √φ̂ = √2.8 ≈ 1.67, which widens CIs and shrinks the test statistics by the same factor.
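The arithmetic in parts (a) to (c) can be checked with a few lines of plain Python. This is a minimal sketch, not the course's own code: the helper `chi2_sf_even_df` is a hypothetical name, and it relies on the closed-form upper-tail probability of a χ² distribution that holds only when the degrees of freedom are even (true here, df = 10). In practice a statistics package would supply the p-value.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(chi2_df > x), using the closed form
    exp(-x/2) * sum_{j<df/2} (x/2)^j / j!, which is valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0          # j = 0 term of the sum
    for j in range(1, df // 2):
        term *= half / j            # builds (x/2)^j / j! incrementally
        total += term
    return math.exp(-half) * total

# Values taken from the worked example
D, df = 28.0, 10

p_value = chi2_sf_even_df(D, df)    # (a) goodness-of-fit p-value
phi_hat = D / df                    # (b) deviance-based dispersion estimate
se_inflation = math.sqrt(phi_hat)   # (c) factor applied to every standard error

print(f"p-value = {p_value:.4f}")
print(f"phi_hat = {phi_hat:.2f}")
print(f"SE inflation factor = {se_inflation:.3f}")
```

The p-value falls well below 0.05, which matches the rejection in part (a); the same function evaluated at 18.3 returns roughly 0.05, confirming the critical value.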
Key terms
- Grouped logistic regression
- Logistic regression where each row is r successes out of n trials at a covariate setting, so the response is binomial(n, π) rather than a single 0/1. Grouping makes the residual deviance a usable goodness-of-fit statistic.
- Goodness-of-fit deviance
- For grouped data, the residual deviance D measures how far the fitted model is from the saturated model; under a correct model D is approximately χ²(n−q). D much larger than its df signals lack of fit or overdispersion. The χ² comparison is not valid for ungrouped binary data.
- LD50
- The dose at which the modelled probability of response is 0.5 — the lethal/effective dose for half the population. From logit(π) = β₀ + β₁(dose), LD50 = −β₀/β₁, the dose where the linear predictor is zero.
- Overdispersion
- Binomial (or Poisson) data showing more variability than the model allows, revealed by residual deviance far above its degrees of freedom. The true standard errors are larger than the ones the ordinary fit reports, so ignoring overdispersion makes p-values too small. Handled by estimating a dispersion φ and refitting a quasi-model.
- Quasi-binomial model
- A fit that keeps the binomial mean structure but estimates a dispersion parameter φ, so Var = φnπ(1−π). The coefficients are unchanged from the ordinary fit; standard errors scale by √φ̂, and model comparison uses an F-test rather than χ².
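The F-test mentioned in the last entry can be written out explicitly. The notation here is an assumption for illustration: D₀ and df₀ are the residual deviance and degrees of freedom of the smaller (nested) model, D₁ and df₁ those of the larger model, and φ̂ is the dispersion estimated from the larger model.

```latex
F \;=\; \frac{(D_0 - D_1)\,/\,(df_0 - df_1)}{\hat{\phi}}
\;\overset{\text{approx.}}{\sim}\; F_{\,df_0 - df_1,\; df_1}
\qquad \text{under the smaller model.}
```

When φ is known to be 1 (the ordinary binomial case), the numerator difference D₀ − D₁ is instead compared directly with χ² on df₀ − df₁ degrees of freedom.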
Binomial Models FAQ
What is the difference between binomial and ordinary logistic regression?
They are the same model with the data organised differently. Ungrouped logistic regression has one 0/1 row per individual; binomial (grouped) logistic regression collapses individuals with identical covariates into r successes out of n trials. The coefficients and odds ratios are the same, but grouping makes the residual deviance a valid goodness-of-fit test.
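As a small illustration of the collapsing step, the sketch below groups individual 0/1 rows by covariate value into (r, n) pairs. The function name `group_binary` and the toy rows are hypothetical, chosen only to show the bookkeeping.

```python
from collections import defaultdict

def group_binary(rows):
    """Collapse (covariate, outcome) pairs, outcome in {0, 1}, into
    {covariate: (r, n)} with r = successes and n = trials."""
    counts = defaultdict(lambda: [0, 0])   # covariate -> [successes, trials]
    for x, y in rows:
        if y not in (0, 1):
            raise ValueError("outcomes must be 0 or 1")
        counts[x][0] += y
        counts[x][1] += 1
    return {x: (r, n) for x, (r, n) in sorted(counts.items())}

# Toy ungrouped data: one row per individual (made up for illustration)
rows = [(1.0, 0), (1.0, 1), (1.0, 0),
        (2.0, 1), (2.0, 1), (2.0, 0), (2.0, 1)]

grouped = group_binary(rows)
print(grouped)   # seven 0/1 rows become two binomial rows
```

Both layouts carry the same likelihood information about the coefficients; only the grouped layout gives a residual deviance that can be compared with χ².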
When can I use the residual deviance as a goodness-of-fit test?
For grouped binomial data (and Poisson counts) with reasonable cell counts, where D ~ χ²(n−q) under a correct model. You cannot use it for ungrouped binary 0/1 data — the chi-square approximation fails there, so a small or large residual deviance says nothing about fit. This distinction is a frequent exam point.
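For reference, the two goodness-of-fit statistics for grouped binomial data can be written as follows. The indexing is an assumption for illustration: group i has r_i successes out of n_i trials and fitted probability π̂_i, and the sums run over the covariate settings (the count that appears as n in the df formula n−q).

```latex
D \;=\; 2 \sum_{i} \left[
      r_i \log\!\frac{r_i}{n_i \hat{\pi}_i}
    + (n_i - r_i) \log\!\frac{n_i - r_i}{n_i - n_i \hat{\pi}_i}
  \right],
\qquad
X^2 \;=\; \sum_{i} \frac{(r_i - n_i \hat{\pi}_i)^2}{n_i \hat{\pi}_i (1 - \hat{\pi}_i)} .
```

A term with r_i = 0 or r_i = n_i contributes zero through the corresponding logarithm, using the convention 0 · log 0 = 0.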
How do I spot and handle overdispersion?
Spot it when the residual deviance is much larger than its degrees of freedom (D/df well above 1) with no obvious structural fault like a missing term or wrong link. Handle it by estimating the dispersion φ (Pearson X²/df) and refitting with family = quasibinomial; the coefficients stay the same, the standard errors inflate by √φ̂, and you compare models with an F-test.
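The Pearson-based dispersion estimate can be sketched in plain Python. The function name `pearson_dispersion`, the counts, and the fitted probabilities below are all hypothetical (the probabilities are not the output of an actual fit); in R the same quantity is what a `family = quasibinomial` fit reports as its dispersion.

```python
def pearson_dispersion(r, n, pi_hat, q):
    """Return (X2, phi_hat) for grouped binomial data.

    r, n    : successes and trials per group
    pi_hat  : fitted success probabilities per group
    q       : number of fitted coefficients
    """
    if not (len(r) == len(n) == len(pi_hat)):
        raise ValueError("r, n and pi_hat must have the same length")
    df = len(r) - q
    if df <= 0:
        raise ValueError("need more groups than fitted parameters")
    x2 = sum((ri - ni * p) ** 2 / (ni * p * (1.0 - p))
             for ri, ni, p in zip(r, n, pi_hat))
    return x2, x2 / df

# Made-up grouped data and made-up fitted probabilities
r = [3, 9, 12, 18]
n = [20, 20, 20, 20]
pi_hat = [0.2, 0.4, 0.6, 0.8]

x2, phi_hat = pearson_dispersion(r, n, pi_hat, q=2)
print(f"Pearson X2 = {x2:.3f}, phi_hat = {phi_hat:.3f}")
```

With these toy numbers φ̂ comes out below 1, so there would be no sign of overdispersion; a value well above 1 is what triggers the quasi-binomial refit.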
What is LD50 and how do I get it from the fit?
LD50 is the dose at which half the subjects respond — the dose where the modelled probability equals 0.5, i.e. where the linear predictor is zero. From logit(π) = β₀ + β₁·dose, set the right side to 0 and solve: LD50 = −β₀/β₁. It is the standard summary of a dose–response curve.
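A minimal sketch of the calculation, with hypothetical coefficient values standing in for a real fit:

```python
import math

def ld50(b0, b1):
    """Dose at which the linear predictor b0 + b1 * dose equals zero."""
    if b1 == 0:
        raise ValueError("LD50 is undefined when the slope is zero")
    return -b0 / b1

def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical fitted coefficients (not taken from any dataset)
b0, b1 = -3.0, 1.5

dose50 = ld50(b0, b1)
prob_at_dose50 = inv_logit(b0 + b1 * dose50)

print(f"LD50 = {dose50}")
print(f"fitted probability at LD50 = {prob_at_dose50}")
```

If the model was fitted on a transformed dose such as log(dose), −β₀/β₁ is the LD50 on that transformed scale and needs to be back-transformed (exponentiated, in the log case) to report it in original units.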
Exam move
Drill the goodness-of-fit reflex: for grouped data, compare residual deviance D to its df = n−q against χ², and know that this test is invalid for ungrouped 0/1 data. Practise the overdispersion workflow end to end — detect D ≫ df, estimate φ̂ from Pearson X²/df, refit quasi-binomial, and remember that coefficients stay put while standard errors scale by √φ̂ and comparisons switch to the F-test. For dose–response, be able to compute LD50 = −β₀/β₁ and read a fitted curve. The signature exam item is a full dose–response read: fit → goodness-of-fit → spot overdispersion → refit → F-test → odds/effect, so rehearse that whole chain.