MAST90139
May 29, 2026
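- Sure. Working only from what your exam cheatsheet and bible state explicitly, I'll break MAST90139 (Statistical Modelling for Data Science) into a revision map for the final: work through it in order and you will lock in the easiest marks first.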
- Important: your material does not state the final-exam weight, the exam length, or whether the exam is open- or closed-book; it repeatedly says to check the official MAST90139info.pdf / Handbook rather than guess. [3][8][2]

Source [3], asksia-bible-mast90139-bilingual.pdf, "Three assignments, one final": Your mark comes from three written assignments (submitted to Gradescope via Canvas) and a final examination. The assignment due dates are confirmed below; the exact weights, exam format and open/closed-book status are NOT in the supplied subject material. They live in the official MAST90139info.pdf. Do not assume numbers: check the Handbook / subject guide and treat every weight as "confirm against the subject guide". The table below shows structure, not weights.

| Component | What it drills (evidenced) | Due / format |
|---|---|---|
| Assignment 1: binary / grouped logistic | Logistic regression study (domestic-violence predictors); odds-ratio interpretation | 11:59pm Thu 2 Apr 2026, Gradescope |
| Assignment 2: binomial dose-response | Beetle-mortality dose-response; binomial / grouped logistic, link comparison | 11:59pm Fri 1 May 2026, Gradescope |
| Assignment 3: ordinal / multinomial | Coal-miner pneumoconiosis; 3-category ordinal response (`polr`) | 11:59pm Sun 31 May 2026, Gradescope |
| Final examination | Standard MAST graduate stats paper; read-the-R-output across the whole GLM arc | Exam period; format: confirm |

A plagiarism-declaration quiz is a gate, not a grade.

Source [8], asksia-cheatsheet-mast90139.pdf:
- Reading R output: ... LR for each single term; `confint()` gives profile-likelihood CIs; `multinom` prints a coefficient matrix (one row per category); `polr` prints slopes (sign is $-\gamma$!) plus thresholds.
- Reflex: deviance + df → $\chi^2$ adequacy (grouped data); $\Delta D$ + $\Delta df$ → nested LR test; `exp(coef)` → the interpretation.
- High-yield trap: the Wald test can fail (Hauck-Donner), so prefer the LR test with sparse data.
- Assessment: 3 assignments + final exam. Due dates confirmed (Gradescope): A1 logistic 2 Apr, A2 binomial dose-response 1 May, A3 ordinal/multinomial 31 May 2026. Exact weights, exam format and book status are NOT confirmed; check the MAST90139 subject guide / Handbook.
- SIA tip: half the marks are "show the deviance subtraction / show $e^{\beta}$". Method marks survive even if the final figure slips, so always write the working and the one-line interpretation.
- Formula belt: Poisson $\log\mu = x^T\beta$, with $e^{\beta}$ a rate ratio; offset $\log\mu = \log t + x^T\beta$; two-way table $\log\lambda_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$ with cross-product ratio $p_{11}p_{22}/(p_{12}p_{21})$; proportional odds $\log\frac{P(Y\le r)}{P(Y>r)} = \theta_r + x^T\gamma$; $\mathrm{AIC} = D + 2q$; $\mathrm{BIC} = D + q\log n$; $\Delta D \sim \chi^2(\Delta q)$.

Source [2], asksia-bible-mast90139-bilingual.pdf, "A GLM course taught in R: method and interpretation, not memorised slides":
- The bible is not a transcript of the lecture decks. It is a self-contained tour of every model MAST90139 examines (logistic, binomial, Poisson, log-linear, multinomial, proportional-odds), built on the one engine that unifies them, the generalised linear model. Each concept is an "AHA-unit": a displayed equation and/or diagram, a plain-English explainer, the method as numbered steps, a worked R-output reading, and the trap that catches people in the exam.
- 1. Learn (before the lecture): read a chapter top to bottom. The GLM framework chapter is the spine (random / systematic / link, the exponential family, deviance and IRLS); every later model is that template with a new distribution and link.
- 2. Drill (after the applied lab): cover the worked interpretation and re-derive it. Name the family + link, compute $D$ or $\Delta D$, compare with $\chi^2$, then say what $\exp(\beta)$ means in words. That triple is the exam.
- 3. Exam (SWOTVAC): walk the chapters as a checklist of model families. The blueprint maps the three Gradescope assignments onto the chapter arc; they are your rehearsal for the paper.
- The single most important thing: logistic, Poisson, log-linear and ordinal regression are all the same three-part GLM (a random component from an exponential family, a linear predictor $\eta = X\beta$, and a link $g(\mu) = \eta$), so the course collapses to one workflow: fit → assess fit by deviance → compare nested models by $\Delta D$ → interpret $\exp(\beta)$. Don't memorise six models; memorise one and swap the distribution and link.
- How the book was built: standard GLM theory (the exponential family, IRLS, deviance, the canonical link functions) is stated plainly, as found in Faraway, McCullagh & Nelder, and Agresti. The subject's own framing and the lecturer's example numbers are paraphrased and re-numbered, never copied from slides or past papers. The course runs on A/Prof Qian's lecture notes with Faraway, Extending the Linear Model with R, and the `faraway` package. Verify all dates and weights against the official MAST90139info.pdf subject guide / University Handbook, as details shift between cohorts.
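The "memorise one model and swap the distribution and link" idea can be sketched in R as below. This is a minimal illustration, not course code: the data are simulated, and the names `x`, `y_bin`, `y_cnt` and the coefficient values are invented for the example.

```r
# Illustrative only: simulated data, not a course dataset.
set.seed(1)
n <- 200
x <- rnorm(n)

# Binary response generated from a logistic model.
y_bin <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.8 * x))
# Count response generated from a log-linear (Poisson) model.
y_cnt <- rpois(n, lambda = exp(0.3 + 0.4 * x))

# Same engine, different random component and link.
fit_logit <- glm(y_bin ~ x, family = binomial(link = "logit"))
fit_pois  <- glm(y_cnt ~ x, family = poisson(link = "log"))

# Step 1: name the family and link straight from the fitted object.
family(fit_logit)$family; family(fit_logit)$link
family(fit_pois)$family;  family(fit_pois)$link

# Step 2: compare nested models by the drop in deviance (LR test).
fit_logit0 <- update(fit_logit, . ~ 1)   # intercept-only model
anova(fit_logit0, fit_logit, test = "Chi")

# Step 3: exponentiate the slope: odds ratio (logit), rate ratio (log).
exp(coef(fit_logit)["x"])
exp(coef(fit_pois)["x"])
```

Only the `family` argument changes between the two fits; the deviance comparison and the `exp(coef(...))` step are identical, which is the point of the single workflow.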
0) The one-line lifeline: what does this subject ultimately examine?
- The final (and the reflex the whole subject trains) boils down to three moves:
- (1) On seeing R output, first "name the model": say which model / distribution it is and which link (e.g. binomial + logit? Poisson + log? ordinal cumulative + logit?). [1][2]

Source [1], asksia-bible-mast90139-bilingual.pdf:
- A compulsory academic-integrity declaration (you must "Agree"; unlimited attempts; no due date) gates submission. It carries no graded weight: treat it as a checkbox and complete it early. There is no evidence of a mid-semester test in the supplied source; only the three assignments and the final.
- The exam hook, a read-the-R-output gauntlet: whatever the weighting, the final trains one reflex. Given a printed `summary(glm())` / `anova(., test="Chi")` / `polr()` table you must (1) name the model + link (binomial + logit? Poisson + log? cumulative + logit?); (2) do the deviance / $\Delta D$-vs-$\chi^2$ arithmetic by hand to test fit or compare nested models; and (3) translate a coefficient into a real-world odds ratio / rate ratio / cumulative-odds statement.
- The three recurring moves. Name: response type → distribution → canonical link. Test: $\Delta D = D_0 - D_1 \sim \chi^2(df_0 - df_1)$ for nested models; $D \sim \chi^2(n-q)$ for grouped goodness of fit. Interpret: $e^{\beta_j}$ is an odds ratio (logit), a rate ratio (log), a cross-product ratio (log-linear), or a cumulative OR (proportional-odds). Get those three automatic and the paper is procedural.
- Key facts: Gradescope assignments; a final examination; R is the only software (`faraway`); $\exp(\beta)$ is the #1 examined skill.
- Contents, "Framework first, then the families": learn the one GLM engine, then watch it bend to every response type. Part 1, the framework (the conceptual core): Ch 1 Linear model review (matrix form, OLS, LINE assumptions, the springboard); Ch 2 The GLM framework ...
- (2) Do the deviance / $\Delta D$ test by hand:
- Nested-model comparison: $\Delta D = D_0 - D_1 \sim \chi^2(\Delta df)$
- Goodness of fit for grouped data: $D \sim \chi^2(n-q)$ (the material stresses "grouped GoF"). [1][22]

Source [22], asksia-cheatsheet-mast90139.pdf:
- Two jobs of deviance. (1) Goodness of fit (grouped data only): under a correct model $D/\phi \approx \chi^2_{n-q}$; reject the fit if $D > \chi^2_{0.95}(n-q)$. (2) Nested comparison (always valid): $\Delta D = D_0 - D_1 \approx \chi^2_{q_1 - q_0}$, the likelihood-ratio test.
- Pearson chi-squared (the GoF twin): $X^2 = \sum_i (y_i - \hat\mu_i)^2 / \widehat{\mathrm{Var}}(y_i) = \sum (O - E)^2 / E$. It is asymptotically equivalent to $D$ for GoF. For a two-way table the deviance / Pearson independence test coincides with the classical $\chi^2$ test.
- Residuals. Pearson: $r_i^P = (y_i - \hat\mu_i)/\sqrt{V(\hat\mu_i)}$, with $\sum_i (r_i^P)^2 = X^2$. Deviance: $r_i^D = \mathrm{sign}(y_i - \hat\mu_i)\sqrt{d_i}$, with $\sum_i (r_i^D)^2 = D$. Deviance residuals are closer to normal; both should mostly lie in $[-2, 2]$ for a good model. Plot them against the fitted linear predictor $\hat\eta$ and each covariate.
- SIA tip: for ungrouped / binary data ($m_i = 1$) the residual deviance is NOT a GoF measure; it is a function of the $\hat\pi_i$ alone. Use Hosmer-Lemeshow. But $\Delta D$ between nested models is always a valid LR test.
- Worked goodness of fit: a grouped model has $D = 28.1$ on 23 df, and $\chi^2_{0.95}(23) = 35.2$. Since $28.1 < 35.2$, there is no evidence of lack of fit ($p = 0.21$). If $D$ had been much larger than its df with no structural cause, suspect overdispersion.
- Worked nested comparison: Model 0 (no $x$) has $D_0 = 41.6$ on 24 df; Model 1 (adding $x$) has $D_1 = 28.1$ on 23 df. $\Delta D = 41.6 - 28.1 = 13.5$ on 1 df, and $\chi^2_{0.95}(1) = 3.84$. Since $13.5 > 3.84$, reject $M_0$: $x$ is significant ($p = 0.0002$). This $\Delta D$ is the LR test, the same conclusion `anova(m0, m1, test="Chi")` prints.
- Scaled vs unscaled deviance: R prints the unscaled $D$ (the $\phi = 1$ quantity). For Poisson / binomial $\phi = 1$, so scaled = unscaled and $D$ is compared directly with $\chi^2$. For quasi / normal models $\phi \neq 1$: $D/\phi$ is $\chi^2$-distributed and nested tests use an F-test. Confirm the dispersion line first.
- Logistic regression (binary / binomial): $Y_i \sim \mathrm{Bin}(1, \pi_i)$ (Bernoulli) or grouped $\mathrm{Bin}(m_i, \pi_i)$. Logit link: $\log[\pi_i/(1-\pi_i)] = \beta_0 + \beta_1 x_{i1} + \cdots = x_i^T\beta = \eta_i$, so $\pi_i = e^{\eta_i}/(1 + e^{\eta_i})$. Binary log-likelihood: $\ell(\beta) = \sum_i [\,y_i\eta_i - \log(1 + e^{\eta_i})\,]$, with score $u(\beta) = X^T(y - \pi)$ and information $\mathcal{I}(\beta) = X^T \mathrm{diag}\{\pi_i(1-\pi_i)\} X$. Fitted probabilities $\hat\pi_i = \mathrm{ilogit}(\hat\eta_i)$ come from `predict(type="response")`; $\hat\eta_i$ from `predict(mod)`.
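The worked goodness-of-fit and nested-comparison figures quoted from the cheatsheet can be reproduced with base R's chi-squared functions. This is a small check using only the deviances and degrees of freedom given in the excerpt; the variable names are chosen for illustration.

```r
# Goodness of fit for a grouped model: D = 28.1 on 23 df.
D  <- 28.1
df <- 23
crit_gof <- qchisq(0.95, df)                    # 5% critical value
p_gof    <- pchisq(D, df, lower.tail = FALSE)   # upper-tail p-value

# Nested comparison: Model 0 (D0 = 41.6, 24 df) vs Model 1 (D1 = 28.1, 23 df).
D0 <- 41.6; df0 <- 24
D1 <- 28.1; df1 <- 23
delta_D  <- D0 - D1      # drop in deviance = LR statistic
delta_df <- df0 - df1    # number of extra parameters in Model 1
crit_lr  <- qchisq(0.95, delta_df)
p_lr     <- pchisq(delta_D, delta_df, lower.tail = FALSE)

round(c(crit_gof = crit_gof, p_gof = p_gof), 3)
round(c(delta_D = delta_D, crit_lr = crit_lr), 2)
signif(p_lr, 2)
```

In the exam the critical values come from a table or from memory; `qchisq` and `pchisq` are only a way to check hand arithmetic while revising.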
- (3) Translate a coefficient into one plain sentence: almost always exponentiate first to get $e^{\beta}$, then phrase it as an odds ratio / rate ratio / cumulative odds ratio. [1][6]

Source [6], asksia-bible-mast90139-bilingual.pdf:
- Glossary fragment: ... R package; `(1 | cluster)` = a random intercept per cluster; REML for variance components. Marginal vs conditional: population-averaged (GEE) vs cluster-specific effects; they differ for nonlinear links.
- The terms that carry the most marks: three interpretation skills recur every year, the odds ratio $e^{\beta}$ (logistic), the rate ratio $e^{\beta}$ (Poisson, with an offset), and the cumulative odds ratio $e^{\gamma}$ (proportional-odds). The recurring procedure is the $\Delta D$ vs $\chi^2$ nested test. Memorise the one-line phrasing for each: the marks are in the wording, not the arithmetic.
- Chapter 11, Practice bank & worked solutions ("Read the R output. Name the model. Interpret."): fourteen fresh exam-style problems covering logistic, Poisson, log-linear, ordinal, deviance, overdispersion and AIC, one fully worked problem per skill.
- Fresh numbers: these are AskSia-authored questions in the MAST90139 style; they are not real exam / assignment stems. The GLM procedures are canonical. Closed-book reasoning: every odds / rate ratio is computed from the printed estimate by hand. Recall $e^{0.7} \approx 2.01$, $e^{1} \approx 2.72$, $e^{-0.69} \approx 0.50$, $\chi^2_{0.95}(1) = 3.84$.
- Q1-Q3, logistic regression, odds ratios. R call: `summary(glm(default ~ income + late, family=binomial))`

| Term | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -1.8400 | 0.4100 | -4.49 | 7e-06 |
| income | -0.0230 | 0.0090 | -2.56 | 0.0105 |
| late | 0.6931 | 0.1500 | 4.62 | 4e-06 |

Null deviance: 410.2 on 299 df. Residual deviance: 351.6 on 297 df. AIC: 357.6.
- Q1 (3 marks, logistic, odds ratio): `late` is the number of late payments last year. Give the odds ratio for one extra late payment and interpret it.
- Q2 (3 marks, logistic): OR for c units ...
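For the Q1-Q3 logistic output quoted from the practice bank, the arithmetic behind the odds ratio for `late` can be sketched as follows. The estimates are copied from the printed table; the Wald interval is an added illustration (the cheatsheet notes that `confint()` gives profile-likelihood intervals instead), and this is not the bank's own worked solution.

```r
# Printed estimates for `late` from the Q1-Q3 coefficient table.
b_late  <- 0.6931
se_late <- 0.1500

# Odds ratio for one extra late payment, holding income fixed.
or_late <- exp(b_late)

# Approximate Wald 95% interval: build it on the log-odds scale, then exponentiate.
ci_late <- exp(b_late + c(-1, 1) * qnorm(0.975) * se_late)

# Overall LR test of the two covariates: null minus residual deviance.
delta_D   <- 410.2 - 351.6
delta_df  <- 299 - 297
p_overall <- pchisq(delta_D, delta_df, lower.tail = FALSE)

round(or_late, 2)
round(ci_late, 2)
c(delta_D = delta_D, delta_df = delta_df)
p_overall
```

In words: each additional late payment multiplies the odds of default by about 2, holding income fixed. The deviance drop on 2 df is far beyond the 5% critical value, so the two covariates together clearly improve on the null model.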
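1) The highest-yield scoring points (drill these until they are reflexes)
1.1 The fixed routine for reading output (write every answer to this template)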
- Faced with any GLM / multinomial / ordinal-regression output (`summary(glm())`, `anova(test="Chi")`, `polr()`, `multinom()`, etc.), write your answer in this order:
- Step A: Name
- Write first: response type → distribution family (which member of the exponential family) → link. [1][2]
- Step B: Test
- Read the Residual deviance and its df (only where a GoF test is valid), or compute the nested-model $\Delta D$, and compare with $\chi^2$. [22][15]

Source [15], asksia-bible-mast90139-bilingual.pdf:
- Q12, the `polr` sign trap: R parameterises $P(Y \le r) = F(\theta_r - x^T\beta)$, so the lecture-convention slope is $-0.82$. The cumulative OR per unit income is $e^{+0.82} = 2.27$ for higher satisfaction: each unit of income multiplies the odds of being in a higher satisfaction category by 2.3, the same at every threshold (proportional odds). `income` is significant ($t = 4.10$).
- Q13: $e^{-0.060} = 0.94$. Each one-unit rise in cost multiplies the odds of choosing Train rather than Car (the baseline) by 0.94, about a 6% drop per unit cost. Each non-baseline category has its own slope; Bus's is $e^{-0.030} = 0.97$.
- Q14, latent-scale ICC for a logistic GLMM: $\rho = \sigma^2/(\sigma^2 + \pi^2/3) = 1.2/(1.2 + 3.29) = 1.2/4.49 = 0.27$. About 27% of the latent variation is between clinics. Patients within a clinic are correlated, so a plain `glm` would understate the standard errors; the random clinic intercept is needed for honest inference.
- Three phrasings that bank the marks. Logistic: "odds of [event] multiplied by $e^{\beta}$ per unit x." Poisson + offset: "rate per [exposure] multiplied by $e^{\beta}$." Proportional-odds: "odds of a higher category multiplied by $e^{\gamma}$, same at every cut", after negating the `polr` slope. Always add "holding other covariates fixed."
- The framework, compressed to one reflex. Faced with any GLM printout: (1) name family + link from the response type; (2) read Residual deviance vs its df (GoF) or take $\Delta D$ between nested models (LR test) and compare with $\chi^2$; (3) exponentiate a coefficient and state the odds / rate / cumulative-odds meaning. Chapters 3-7 only change the distribution and link.
- Logistic regression chapter: the flagship GLM for a yes/no response ...
- Step C: Interpret
- Pick a coefficient $\beta_j$, compute $e^{\beta_j}$, and write the interpretation with the template sentence ("multiplied by a factor of ..." / "a ...% change"). [6][14]
  - [6] Source: asksia-bible-mast90139-bilingual.pdf
    - Glossary fragment (mixed effects): `(1 | cluster)` = a random intercept per cluster; REML for variance components. Marginal vs conditional: population-averaged (GEE) vs cluster-specific effects; they differ for nonlinear links.
    - The terms that carry the most marks. Three interpretation skills recur every year: the odds ratio $e^{\beta}$ (logistic), the rate ratio $e^{\beta}$ (Poisson, with an offset), and the cumulative odds ratio (proportional odds). The recurring procedure is the $\Delta D$ vs $\chi^2$ nested test. Memorise the one-line phrasing for each: the marks are in the wording, not the arithmetic.
    - Chapter 11, practice bank and worked solutions: "Read the R output. Name the model. Interpret." Fourteen fresh exam-style problems covering logistic, Poisson, log-linear, ordinal, deviance, overdispersion and AIC.
    - The one-line takeaway: the MAST90139 final is a read-the-R-output gauntlet. Given a `summary(glm)` / `anova(test="Chi")` / `polr` table, you must (a) name the model and link, (b) do the $\Delta D$-vs-$\chi^2$ or deviance arithmetic by hand, and (c) translate a coefficient into an odds ratio / rate ratio / cumulative-odds statement. The bank gives one fresh problem per skill, fully worked.
    - Fresh numbers that mirror the template, not the stems: these are AskSia-authored questions in the MAST90139 style, not real exam or assignment stems. The GLM procedures are canonical. Closed-book reasoning: every odds/rate ratio is computed by hand from the printed estimate. Recall: $e^{0.7} \approx 2.01$, $e^{1} \approx 2.72$, $e^{-0.69} \approx 0.50$, $\chi^2_{0.95}(1) = 3.84$.
    - Q1-Q3, logistic regression and odds ratios. R: `summary(glm(default ~ income + late, family=binomial))`. Coefficients (Estimate, Std. Error, z value, Pr(>|z|)):
      - (Intercept): -1.8400, 0.4100, -4.49, 7e-06
      - income: -0.0230, 0.0090, -2.56, 0.0105
      - late: 0.6931, 0.1500, 4.62, 4e-06
      - Null deviance: 410.2 on 299 df. Residual deviance: 351.6 on 297 df. AIC: 357.6.
    - Q1 (odds ratio, 3 marks): `late` is the number of late payments last year. Give the odds ratio for one extra late payment and interpret it. Q2 (OR for $c$ units, 3 marks): the stem is cut off in the excerpt.
  - [14] Source: asksia-bible-mast90139-bilingual.pdf
    - $e^{\beta_j}$ = the odds ratio (OR): the factor the odds are multiplied by for that one-unit increase.
    - The phrasing template (memorise it; filling in the brackets is the mark): "The estimated odds of [event] are multiplied by $e^{\beta_j}$ (a $(e^{\beta_j}-1)\times 100\%$ change) for each one-unit increase in $x_j$, holding all else fixed."
    - For a $c$-unit change the OR is $e^{c\beta_j}$. For a binary $x$, $e^{\beta}$ compares the two groups directly.
    - Odds ratio from a coefficient: $\mathrm{OR} = \dfrac{\mathrm{odds}(x_j+1)}{\mathrm{odds}(x_j)} = \dfrac{e^{\beta_0+\beta_j(x_j+1)}}{e^{\beta_0+\beta_j x_j}} = e^{\beta_j}$; % change in odds $= (e^{\beta_j}-1)\times 100\%$.
    - The sign tells you the direction instantly: $\beta > 0 \Rightarrow e^{\beta} > 1 \Rightarrow$ odds rise (risk factor); $\beta < 0 \Rightarrow e^{\beta} < 1 \Rightarrow$ odds fall (protective); $\beta = 0 \Rightarrow e^{\beta} = 1 \Rightarrow$ no effect. You can state the direction before doing any arithmetic.
    - Worked odds-ratio statements (course canon). (a) Heart disease, cigarettes/day: $\hat\beta = 0.0231$, so $e^{0.0231} = 1.023$. Each extra cigarette per day multiplies the odds of CHD by 1.023, a 2.3% rise in the odds per cigarette, all else equal. Over 20 cigarettes: $e^{20\times 0.0231} = e^{0.462} \approx 1.59$, a 59% higher odds. (b) Voting, one extra unit of the predictor: $e^{\hat\beta} = 1.646$, so the odds of voting are multiplied by 1.646, a 64.6% increase in the odds per unit. State the multiplier and the percent.
    - Fitted probabilities and LD50: $\hat\pi_i = e^{\hat\eta_i}/(1+e^{\hat\eta_i})$ via `predict(mod, type="response")`; $\hat\eta_i = x_i^T\hat\beta$ via `predict(mod)`. LD50 (dose at $\pi = 0.5$): $\eta = 0 \Rightarrow x = -\hat\beta_0/\hat\beta_1$.
- Step A: Name
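As a hedged illustration of Steps B and C, the sketch below redoes the hand arithmetic in R using only the numbers printed in the practice-bank `default ~ income + late` summary (null deviance 410.2 on 299 df, residual deviance 351.6 on 297 df, `late` estimate 0.6931). No model is refitted, and the variable names are invented for this sketch.

```r
# Step B: nested comparison (null model vs fitted model), done "by hand".
# All numbers are copied from the printed summary; names are illustrative.
null_dev  <- 410.2; null_df  <- 299
resid_dev <- 351.6; resid_df <- 297

delta_D  <- null_dev - resid_dev          # drop in deviance
delta_df <- null_df - resid_df            # number of added parameters
crit     <- qchisq(0.95, df = delta_df)   # 5% critical value
p_value  <- pchisq(delta_D, df = delta_df, lower.tail = FALSE)

# Step C: turn the printed coefficient for `late` into an odds ratio.
beta_late <- 0.6931
or_late   <- exp(beta_late)               # multiplier on the odds
pct_late  <- (or_late - 1) * 100          # percent change in the odds

cat("Delta D =", delta_D, "on", delta_df, "df; critical value", round(crit, 2), "\n")
cat("OR for one extra late payment =", round(or_late, 2), "\n")
```

Under these numbers $\Delta D$ is far above the critical value, and the odds of default roughly double for each extra late payment, holding income fixed.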
2) The subject's "skeleton definitions" (memorise them and know how to use them)

2.1 The three parts of a GLM

Your material defines a GLM as three components:
- Random component: the distribution of $Y_i$ (a member of the exponential family), with mean $\mu_i=E(Y_i)$. [20][11]
  - [20] Source: asksia-cheatsheet-mast90139.pdf (MAST90139 Statistical Modelling for Data Science, University of Melbourne, School of Mathematics & Statistics; exam revision, Sem 1 2026, side 1 of 2: GLM framework + logistic)
    - Side 1 topics: GLM 3 components, links, exponential family, IRLS, deviance, residuals, logistic regression, odds ratios, Wald/LR, ROC, overdispersion.
    - 0. Exam blueprint (read first). The whole subject trains one reflex: the exam is a read-the-R-output gauntlet. Given a `glm` / `multinom` / `polr` summary or `anova(test="Chi")` table, you must (a) name the model + link, (b) do the deviance / $\Delta D$-vs-$\chi^2$ arithmetic by hand, (c) translate a coefficient into a real-world odds ratio / rate ratio / cumulative-odds sentence. Side 1 = the GLM machine + logistic regression. Side 2 = Poisson / log-linear / multinomial / ordinal + model checking. Everything is one engine (exponential family → link → IRLS → deviance) bolted onto different response types.
    - The #1 skill: turn a printed $\hat\beta$ into an interpretation. $e^{\hat\beta}$ is multiplicative on the odds (logit), the count (log), or the cumulative odds (ordinal). Always exponentiate, and say "per one-unit increase, all else equal".
    - Two-line discipline: name the link before you interpret (logit → odds, log → rate), and show the deviance subtraction ($\Delta D = D_0 - D_1$ on $\Delta$df); method marks survive even if the final number slips.
    - 1. The GLM, 3 components (Ch 3, Nelder-Wedderburn). Every GLM is exactly three choices: (1) Random: $Y_i$ independent from a one-parameter exponential family, mean $\mu_i = E(Y_i)$. (2) Systematic: a linear predictor $\eta_i = x_i^T\beta$. (3) Link $g(\cdot)$, which joins them: $g(\mu_i) = \eta_i = x_i^T\beta$. The normal LM is the special case (identity link, normal random part). The GLM frees the response to be binary, count or categorical; the variance is then a function of the mean, not constant.
    - Exponential family (canonical form): $\ln f(y\mid\theta,\phi) = [\,y\theta - b(\theta)\,]/a(\phi) + c(y,\phi)$, where $\theta$ is the canonical (natural) parameter and $\phi$ the dispersion. Differentiating $b$ gives the mean and variance: $\mu = b'(\theta)$, $\mathrm{Var}(Y) = a(\phi)\,b''(\theta)$. The canonical link is the $g$ with $g(\mu) = \theta$.
    - 1b. The three workhorses (memorise these rows): a table with columns Family, $\theta$, canonical link, $V(\mu)$; the excerpt breaks off at the Normal row.
  - [11] Source: asksia-bible-mast90139-bilingual.pdf (Chapter 10, glossary: the GLM vocabulary; every examinable term, one line each)
    - A fast reference for the language MAST90139 actually examines: about 55 terms across the whole arc (framework, exponential family, links, logistic, Poisson & rates, log-linear, multicategory, model checking, random / mixed effects), each with a one-line meaning and key formula. The middle column is reserved for the Chinese gloss in the bilingual edition.
    - A. The GLM framework
      - Generalized linear model (GLM): random component + linear predictor + link; extends the LM to non-normal responses.
      - Random component: the response distribution (an exponential-family member) with mean $\mu_i$.
      - Systematic component: the linear predictor $\eta_i = x_i^T\beta$ of covariate effects.
      - Link function $g$: connects mean to predictor, $g(\mu_i) = \eta_i$; monotone, differentiable.
      - Canonical (natural) link: the link for which $g(\mu) = \theta$ (logit/log/identity); simplest score.
      - Linear predictor $\eta$: weighted sum of covariates $x_i^T\beta$ on the link scale.
    - B. Exponential family
      - Exponential family: densities of the form $\exp\{(y\theta-b(\theta))/a(\phi)+c(y,\phi)\}$.
      - Canonical parameter $\theta$
- Systematic component: the linear predictor
- $$\eta_i = x_i^T\beta$$
That is, a weighted sum of the covariates. [2][11]
  - [2] Source: asksia-bible-mast90139-bilingual.pdf
    - A GLM course taught in R: method and interpretation, not memorised slides. This is not a transcript of the lecture decks. It is a self-contained tour of every model MAST90139 examines (logistic, binomial, Poisson, log-linear, multinomial, proportional-odds), built on the one engine that unifies them, the generalised linear model. Each concept is presented as an AHA-unit: a displayed equation and/or a diagram, a plain-English explainer, the method as numbered steps, a worked R-output reading, and the trap that catches people in the exam.
    - 1. Learn (before the lecture). Read a chapter top to bottom. The GLM framework chapter is the spine (random / systematic / link, the exponential family, deviance and IRLS); every later model is just that template with a new distribution and link. Learn it cold.
    - 2. Drill (after the applied lab). Cover the worked interpretation and re-derive it: name the family + link, compute $D$ or $\Delta D$, compare with $\chi^2$, then say what $\exp(\beta)$ means in words. That triple is the exam.
    - 3. Exam (SWOTVAC). Walk the chapters as a checklist of model families. The blueprint maps the three Gradescope assignments onto the chapter arc; they are your rehearsal for the paper.
    - The single most important thing about MAST90139: it looks like one model with many faces, and it is. Once you see that logistic, Poisson, log-linear and ordinal regression are all the same three-part GLM (a random component from an exponential family, a linear predictor $\eta = X\beta$, and a link $g(\mu) = \eta$), the course collapses to one workflow: fit → assess fit by deviance → compare nested models by $\Delta D$ → interpret $\exp(\beta)$. Don't memorise six models; memorise one and swap the distribution and link.
    - How this book was built, and the two-layer rule. Standard GLM theory (the exponential family, IRLS, deviance, the canon of link functions) is stated plainly; it is universal and lives in Faraway, McCullagh & Nelder, and Agresti. The subject's own framing and the lecturer's specific example numbers are paraphrased and re-numbered, never copied from slides or past papers. The course runs on A/Prof Qian's lecture notes with Faraway, Extending the Linear Model with R, and the `faraway` package. Verify all dates and weights against the official MAST90139info.pdf subject guide / University Handbook, as details shift between cohorts.
  - [11] Source: asksia-bible-mast90139-bilingual.pdf (Chapter 10 glossary): "Systematic component: the linear predictor $\eta_i = x_i^T\beta$ of covariate effects."
- Link function: connects the mean to the linear predictor
- $$g(\mu_i)=\eta_i$$
The link must be monotone and differentiable (the material calls it the GLM's "steering wheel"). [11][17]
  - [11] Source: asksia-bible-mast90139-bilingual.pdf (Chapter 10 glossary): "Link function $g$: connects mean to predictor, $g(\mu_i) = \eta_i$; monotone, differentiable."
  - [17] Source: asksia-bible-mast90139-bilingual.pdf (logistic regression chapter: the flagship GLM for a yes/no response)
    - The logit link, and why a straight line will not do. When the response is binary (disease / no disease, vote / abstain, pass / fail) the thing we model is a probability $\pi \in (0,1)$. A plain linear model $\pi = x^T\beta$ is wrong on two counts: fitted values escape $[0,1]$, and the variance of a Bernoulli is $\pi(1-\pi)$, not constant. Logistic regression fixes both by modelling the log-odds as linear.
    - Why this is the chapter that pays the rent. Logistic regression is the single most examined model in MAST90139, and the one skill the examiner returns to every year is turning a coefficient into an odds ratio and saying what it means in words. If you master $e^{\beta}$ = odds ratio and can read it off the R output, you have banked the marks.
    - 1. The binary logistic model. Let $Y_i \sim \mathrm{Bernoulli}(\pi_i)$ (or grouped $\mathrm{Bin}(m_i, \pi_i)$). The systematic part is the usual linear predictor $\eta_i = x_i^T\beta$; the logit link ties it to the mean: $\mathrm{logit}(\pi_i) = \log\dfrac{\pi_i}{1-\pi_i} = \beta_0 + \beta_1 x_{i1} + \dots + \beta_q x_{iq} = x_i^T\beta$, and $\pi_i = \dfrac{e^{\eta_i}}{1+e^{\eta_i}} = \dfrac{1}{1+e^{-\eta_i}} \in (0,1)$.
    - The link is the GLM's steering wheel: model on the $\eta$ scale (a line, $-\infty$ to $+\infty$), then the inverse-logit squashes it back to a valid probability. The logit is the canonical link for the binomial family, which is why R's `family=binomial` uses it by default.
    - [Figure: inverse-logit curve, probability $\pi$ against $\eta$; $\pi = 0.5$ at $\eta = 0$.]
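To make the three components concrete, here is a minimal sketch that inspects the link and variance function R attaches to each family object. It relies only on base R's `family` objects (`link`, `linkfun`, `linkinv`, `variance`); the choice of $\mu = 0.8$ is arbitrary and purely illustrative.

```r
# Inspect the default (canonical) link that R uses for each family.
fams <- list(normal = gaussian(), binomial = binomial(), poisson = poisson())
for (nm in names(fams)) {
  cat(nm, ": default link =", fams[[nm]]$link, "\n")
}

# For the binomial family: link, inverse link and variance function.
logit_fam <- binomial()
eta <- logit_fam$linkfun(0.8)    # g(mu)       = log(0.8 / 0.2)
mu  <- logit_fam$linkinv(eta)    # g^{-1}(eta) = back to 0.8
v   <- logit_fam$variance(0.8)   # V(mu)       = mu * (1 - mu) = 0.16
```

Swapping `binomial()` for `poisson()` changes only the link (log) and the variance function ($V(\mu)=\mu$), which is the "one engine, different response types" point in code form.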
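2.2 The exponential family and the mean-variance relationship (you should at least know why a GLM is reasonable)

Your material gives the typical exponential-family form and stresses the "mean determines variance" spine: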
- Exponential-family form (remember it conceptually):
$$\exp\{(y\theta-b(\theta))/a(\phi)+c(y,\phi)\}$$
The derivatives of $b(\theta)$ give the mean and the variance. [11][20]
  - [11] Source: asksia-bible-mast90139-bilingual.pdf (Chapter 10 glossary): "Exponential family: densities of the form $\exp\{(y\theta-b(\theta))/a(\phi)+c(y,\phi)\}$."
  - [20] Source: asksia-cheatsheet-mast90139.pdf (side 1, exponential family in canonical form): $\ln f(y\mid\theta,\phi) = [\,y\theta - b(\theta)\,]/a(\phi) + c(y,\phi)$, with $\theta$ the canonical (natural) parameter and $\phi$ the dispersion; $\mu = b'(\theta)$ and $\mathrm{Var}(Y) = a(\phi)\,b''(\theta)$. The canonical link is the $g$ with $g(\mu) = \theta$.
- Variance function:
$$\mathrm{Var}(Y)=\phi\,V(\mu)$$
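A short worked instance may help fix the notation. The derivation below writes the Poisson distribution in the exponential-family form above and reads off the mean and variance from $b(\theta)$. It is a standard textbook calculation added here as an illustration, not a passage from the source material.

$$
\begin{aligned}
f(y;\mu) &= \frac{e^{-\mu}\mu^{y}}{y!} = \exp\{\,y\log\mu - \mu - \log y!\,\},\\
\theta &= \log\mu,\qquad b(\theta)=e^{\theta},\qquad a(\phi)=1,\qquad c(y,\phi)=-\log y!,\\
E(Y) &= b'(\theta)=e^{\theta}=\mu,\qquad \mathrm{Var}(Y)=a(\phi)\,b''(\theta)=e^{\theta}=\mu .
\end{aligned}
$$

So for the Poisson family $V(\mu)=\mu$ with $\phi=1$, and the canonical link $g(\mu)=\theta$ is the log link.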
3) The most-examined model families: which "formula + one-sentence interpretation" pairs to memorise

3.1 Logistic regression (the chapter that "pays the rent")

Model and link:
- $$Y_i\sim \text{Bernoulli}(\pi_i)\ \ \text{(or grouped } \mathrm{Bin}(m_i,\pi_i))$$
- $$\text{logit}(\pi_i)=\log\frac{\pi_i}{1-\pi_i}=\eta_i=x_i^T\beta$$
- The inverse function gives the probability:
$$\pi_i=\frac{e^{\eta_i}}{1+e^{\eta_i}}\in(0,1)$$
All of these are stated directly in the material. [17] Source: asksia-bible-mast90139-bilingual.pdf (logistic regression chapter, "The logit link, and why a straight line will not do").
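The logit and its inverse can be checked numerically. The sketch below uses base R's `qlogis()` (the logit) and `plogis()` (the inverse logit) next to hand-written versions of the two displayed formulas; the probabilities chosen are arbitrary.

```r
# qlogis() is the logit, plogis() is its inverse.
p    <- c(0.1, 0.5, 0.9)
eta  <- qlogis(p)     # log(p / (1 - p))
back <- plogis(eta)   # exp(eta) / (1 + exp(eta))

# Hand-written versions, matching the displayed formulas term by term.
logit_by_hand     <- log(p / (1 - p))
inv_logit_by_hand <- exp(eta) / (1 + exp(eta))

# Symmetry used on the cheatsheet: logit(1 - p) = -logit(p).
sym_check <- qlogis(1 - p) + qlogis(p)
```

Note that `eta` is zero exactly where `p` is 0.5, negative below it and positive above it.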
Core exam point: interpreting $e^\beta$ (the odds ratio, OR)
- A coefficient lives on the log-odds scale, which is not intuitive, so exponentiate it: $e^{\beta_j}$ is the OR. [7]
  - [7] Source: asksia-bible-mast90139-bilingual.pdf (the #1 exam skill: odds ratios)
    - Coefficients are log-odds; $e^{\beta}$ is an odds ratio. Turn a number into a sentence the marker can tick. A logistic coefficient lives on the log-odds scale, which no human reads directly. Exponentiate it and it becomes an odds ratio, a multiplier that everyone understands. Getting from $\hat\beta$ to a clean English sentence is the move the exam rewards most.
    - [Figure: from probability to log-odds. $p \in (0,1)$ → odds $= p/(1-p) \in (0,\infty)$ → logit = log-odds on a centred line (negative below $p = 0.5$, zero at $p = 0.5$, positive above). Two groups give an odds ratio OR = odds1/odds2; log OR $= b_1$, so one unit of $x$ multiplies the odds by $\exp(b_1)$.]
    - 2. What each coefficient means: $\beta_j$ = the change in the log-odds of success for a one-unit increase in $x_j$, holding all other covariates fixed.
- The sentence template "the marker can tick" (memorise it): "The estimated odds of [event] are multiplied by $e^{\beta_j}$ (a $(e^{\beta_j}-1)\times 100\%$ change) for each one-unit increase in $x_j$, holding all else fixed." [14]
- For a $c$-unit change: OR $=e^{c\beta_j}$. [14][21]
  - [14] Source: asksia-bible-mast90139-bilingual.pdf (odds-ratio phrasing template, sign rule, and the worked heart-disease and voting statements: $e^{0.0231} = 1.023$ per cigarette, $e^{20\times 0.0231} \approx 1.59$ over 20 cigarettes, $e^{\hat\beta} = 1.646$ for voting).
  - [21] Source: asksia-cheatsheet-mast90139.pdf
    - 5b. Odds ratios (the #1 exam skill). Odds $= \pi/(1-\pi)$; log-odds $= \mathrm{logit}(\pi)$. $\mathrm{logit}(0.5) = 0$; $\mathrm{logit}(1-\pi) = -\mathrm{logit}(\pi)$. A $c$-unit change gives OR $= e^{c\beta_j}$. Phrasing template: "The odds of [event] are multiplied by $e^{\beta_j}$ (change by $(e^{\beta_j}-1)\times 100\%$) for each one-unit increase in $x_j$, all else equal." Worked (heart disease): $\hat\beta_{\text{cigs}} = 0.0231 \Rightarrow e^{0.0231} = 1.023$, so the odds of CHD rise about 2.3% per extra cigarette/day. A height coefficient of 0.025 gives OR 1.026 per inch. LD50 (dose giving $\pi = 0.5$): solve $\eta = 0 \Rightarrow x = -\hat\beta_0/\hat\beta_1$. Empirical logit (for plotting grouped data): $\log[(y+0.5)/(n-y+0.5)]$.
    - 5c. Grouped = ungrouped (don't double-count). Fit grouped data via `glm(y/n ~ x, family=binomial, weights=n)`. The estimates $\hat\beta$ are identical grouped or ungrouped; only the deviance and its df differ (so GoF differs, but LR model-comparison tests agree).
    - 5d. The logistic S-curve. On the probability scale $\pi = e^{\eta}/(1+e^{\eta})$ is a sigmoid in $\eta$: flat near 0 and 1, steepest at $\pi = 0.5$ ($\eta = 0$). So a one-unit increase in $x$ shifts the log-odds by a constant $\beta$, but the change in probability depends on where you start. Marginal effect: $\partial\pi/\partial x_j = \beta_j\,\pi(1-\pi)$, maximal at $\pi = 0.5$ and approaching 0 in the tails. Don't read $\hat\beta$ as a probability change. Convert to the odds ($e^{\beta}$) for a clean constant statement, or evaluate $\pi$ at specific $x$ for a probability.
    - 5e. A categorical predictor. A factor enters as dummy (indicator) columns against a reference level. Each coefficient's $e^{\beta}$ is the OR of that level vs the reference. A 3-level factor = 2 dummies, so test the factor as a block with $\Delta D$ on 2 df, not by individual z-values. To compare two non-reference levels, re-level or take the difference of their log-odds.
    - 6. Inference, three tests. Test $H_0: C\beta = \xi$. Three asymptotically $\chi^2_s$ tests ($s = \operatorname{rank} C$): the excerpt breaks off here.
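The $c$-unit rule is easy to script. The helper below is hypothetical (the name `odds_ratio` and its arguments are made up for this sketch); it reproduces the heart-disease numbers quoted in the material from the coefficient 0.0231.

```r
# Hypothetical helper: odds ratio and percent change for a `units`-unit increase.
odds_ratio <- function(beta, units = 1) {
  or <- exp(units * beta)
  c(OR = or, pct_change = (or - 1) * 100)
}

one_cig    <- odds_ratio(0.0231)              # per extra cigarette per day
twenty_cig <- odds_ratio(0.0231, units = 20)  # per 20 extra cigarettes per day

round(one_cig, 3)
round(twenty_cig, 2)
```

The percent change is not additive across units: twenty cigarettes give roughly a 59% rise in the odds, not $20 \times 2.3\% = 46\%$, because the multiplier compounds.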
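Frequent trap: do not read $\beta$ as a "change in probability"
- A one-unit increase in $x$ shifts the log-odds by a constant $\beta$, but the change in probability depends on where you start: the marginal effect is $\partial\pi/\partial x_j = \beta_j\,\pi(1-\pi)$, largest at $\pi = 0.5$ and near zero in the tails. Convert to the odds scale ($e^{\beta}$) for a constant statement, or evaluate $\pi$ at specific $x$ values if a probability is asked for. [21]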
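A small numerical sketch of this trap: the same slope produces the same odds ratio everywhere but very different probability changes. The slope 0.6931 and the three starting probabilities are illustrative choices, not values from the material.

```r
beta <- 0.6931                      # illustrative slope (odds ratio of about 2)
p0   <- c(0.05, 0.50, 0.95)         # three different starting probabilities
p1   <- plogis(qlogis(p0) + beta)   # probability after a one-unit increase in x

prob_change <- p1 - p0                                # depends on the start
odds_ratio  <- (p1 / (1 - p1)) / (p0 / (1 - p0))      # constant: exp(beta)
slope_at_p0 <- beta * p0 * (1 - p0)                   # d(pi)/dx = beta*pi*(1-pi)

round(cbind(p0, p1, prob_change, odds_ratio, slope_at_p0), 4)
```

The `odds_ratio` column is identical in every row, while `prob_change` is largest in the middle row and small in both tails.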
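A key reminder about deviance (very easy to misuse)
- For ungrouped/binary data ($m_i = 1$) the residual deviance is not a goodness-of-fit measure; the material recommends Hosmer-Lemeshow instead. $\Delta D$ between nested models remains a valid LR test in every case. [22]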
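The grouped-versus-ungrouped point can be seen directly in R. The sketch below fits the same made-up dose-response data twice, once as five binomial groups and once as 100 individual 0/1 rows; the data and object names are invented for illustration.

```r
# Made-up grouped data: y successes out of n trials at each dose x.
x <- 1:5
n <- rep(20, 5)
y <- c(2, 5, 9, 14, 18)
grouped <- glm(cbind(y, n - y) ~ x, family = binomial)

# The same data as 100 individual 0/1 responses.
x_long <- rep(x, times = n)
z_long <- unlist(lapply(seq_along(y),
                        function(i) rep(c(1, 0), times = c(y[i], n[i] - y[i]))))
ungrouped <- glm(z_long ~ x_long, family = binomial)

# Same estimates, but different residual deviance and df.
rbind(grouped = coef(grouped), ungrouped = coef(ungrouped))
c(deviance = deviance(grouped),   df = df.residual(grouped))
c(deviance = deviance(ungrouped), df = df.residual(ungrouped))

# The drop in deviance from the null model (the LR statistic) agrees.
lr_grouped   <- grouped$null.deviance - deviance(grouped)
lr_ungrouped <- ungrouped$null.deviance - deviance(ungrouped)
```

Only the grouped residual deviance (on 3 df here) is a candidate for a $\chi^2$ goodness-of-fit comparison; the ungrouped one (on 98 df) is not, even though both fits give the same coefficients and the same LR statistic.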
3.2 Poisson regression (counts and rates: the offset is a gift of marks)

Model and link: Poisson + log link
- $$Y_i\sim \mathrm{Poisson}(\mu_i)$$
- $$\log(\mu_i)=\eta_i=x_i^T\beta$$
- Interpretation template (rate ratio, RR): $e^{\beta_j}$ is the rate ratio. For counts $Y_i \sim \mathrm{Poisson}(\mu_i)$ with the canonical log link, $\log \mu_i = x_i^T\beta$, the expected count is multiplied by $e^{\beta_j}$ for each one-unit increase in $x_j$, holding the other covariates fixed. Worked case from the cheatsheet: $\beta = 0.405 \Rightarrow e^{0.405} = 1.50$, so the rate is 50% higher per unit. [13][26]
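The rate-ratio template can be checked numerically. The sketch below, in base R, exponentiates a printed log-link coefficient and builds a Wald 95% interval on the log scale. The helper name `rate_ratio` is made up for illustration; the inputs are the cheatsheet's worked values ($\beta = 0.693$, se $0.10$).

```r
# Hypothetical helper (not from the course material): turn a printed
# log-link coefficient and its standard error into a rate ratio with a
# Wald confidence interval, computed on the log scale and then exponentiated.
rate_ratio <- function(beta, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
  c(RR    = exp(beta),
    lower = exp(beta - z * se),
    upper = exp(beta + z * se))
}

# Cheatsheet worked values: group coefficient 0.693 with se 0.10
rr <- rate_ratio(0.693, 0.10)
print(round(rr, 2))
```

Rounded to two decimals this reproduces the cheatsheet's RR of 2.00 with interval (1.64, 2.43). The interval is built on the coefficient scale first because that is where the Wald approximation applies.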
-
Offset (the most common question type: the denominator is stated explicitly)
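A minimal sketch of the offset idea, assuming the cheatsheet's form $\log \mu_i = \log t_i + x_i^T\beta$ and its R call `glm(y ~ x, poisson, offset = log(t))`. The data are simulated and every variable name (`pop`, `group`, `regions`) is hypothetical; the point is only to contrast a rate model with a raw count model.

```r
# Simulated data (hypothetical): disease counts in 40 regions whose
# populations differ, with group B having twice the rate of group A.
set.seed(1)
n     <- 40
pop   <- round(runif(n, 5000, 50000))          # exposure t_i
group <- factor(rep(c("A", "B"), each = n / 2))
rate  <- 0.001 * ifelse(group == "B", 2, 1)    # true rate per person
y     <- rpois(n, lambda = pop * rate)
regions <- data.frame(y, pop, group)

# Rate model: log(mu) = log(pop) + b0 + b1 * groupB.
# The offset enters the linear predictor with its coefficient fixed at 1.
fit_rate <- glm(y ~ group, family = poisson, offset = log(pop), data = regions)

# Count model without the offset, for contrast: same formula, different meaning.
fit_count <- glm(y ~ group, family = poisson, data = regions)

rr_rate  <- unname(exp(coef(fit_rate)["groupB"]))   # rate ratio per unit of exposure
rr_count <- unname(exp(coef(fit_count)["groupB"]))  # ratio of raw expected counts
print(c(rate_model = rr_rate, count_model = rr_count))
```

With the offset, the estimate should land near the simulated rate ratio of 2; without it, the coefficient also absorbs any difference in population size between the groups, which is why the interpretation changes.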
-
Overdispersion and quasi-Poisson
- The material states that the quasi-Poisson point estimates $\hat\beta$ (and hence every rate ratio) are identical to the Poisson fit. Only the standard errors change: they inflate by $\sqrt{\hat\phi}$, which widens the CIs and shrinks the test statistics. In R, fit with `family = quasipoisson`; the output then prints an estimated dispersion parameter instead of "taken to be 1" and reports t values rather than z values. [5]
- Practical move: check $D/df$, the residual deviance over its degrees of freedom. If it is far above 1, suspect overdispersion and switch to quasi-Poisson or a negative binomial model (the material describes this diagnostic path). [26]
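The diagnostic path above can be sketched in base R. The data are simulated from a negative binomial so that the variance exceeds the mean; all names are hypothetical, and the sketch only illustrates that quasi-Poisson keeps the coefficients and rescales the standard errors.

```r
# Simulated overdispersed counts (hypothetical data).
set.seed(2)
n  <- 200
x  <- rnorm(n)
mu <- exp(0.5 + 0.3 * x)
y  <- rnbinom(n, size = 1.5, mu = mu)   # variance = mu + mu^2 / 1.5 > mu

fit_p <- glm(y ~ x, family = poisson)
fit_q <- glm(y ~ x, family = quasipoisson)

# Step 1: the quick check, residual deviance over its df.
d_over_df <- deviance(fit_p) / df.residual(fit_p)

# Step 2: the estimated dispersion used by the quasi-Poisson fit.
phi_hat <- summary(fit_q)$dispersion

# Step 3: same point estimates, standard errors scaled by sqrt(phi_hat).
se_p <- coef(summary(fit_p))[, "Std. Error"]
se_q <- coef(summary(fit_q))[, "Std. Error"]
print(round(c(D_over_df = d_over_df, phi_hat = phi_hat), 2))
print(round(cbind(poisson = se_p, quasi = se_q, ratio = se_q / se_p), 3))
```

The `ratio` column equals $\sqrt{\hat\phi}$ for every coefficient, which is the inflation the material describes.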
-
3.3 Ordinal categories: the proportional-odds model (`polr()`) and the sign trap
- Model core: proportional odds means every cut (threshold) shares the same slope vector. The $k-1$ cumulative-logit curves differ only by the threshold $\theta_r$, so they are parallel horizontal shifts of one logistic S-curve, and the cumulative odds ratio between two covariate values is the same at every threshold $r$. [16]
- The biggest pitfall the material stresses: `polr()` uses the opposite sign convention to the lecture notes.
  - Lecture form: $P(Y \le r \mid x) = F(\theta_r + x^T\gamma)$
  - `polr` form: $P(Y \le r \mid x) = F(\theta_r - x^T\beta)$
  - So $\beta = -\gamma$: multiply every slope that `polr` prints by $-1$ to obtain the lecture-notation $\gamma$ before stating a direction or computing an OR in the lecture's terms. Skip this and the effect is reported backwards, a guaranteed lost mark. In the `polr` summary, `Coefficients` holds the slopes (with se and t value) and `Intercepts` holds the thresholds $\theta_r$. [16][25]
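To see the sign convention in action, here is a small simulation sketch. It assumes the latent-variable view of the proportional-odds model and uses `MASS::polr`; the data, the cut points and all variable names are hypothetical. The covariate is built to push the response toward higher categories, so the sign of the printed slope shows which direction `polr` reports.

```r
library(MASS)   # polr() lives in MASS, which ships with standard R installs

set.seed(3)
n <- 2000
x <- rnorm(n)
# Latent severity U = 0.8 * x + logistic noise: larger x pushes U upward.
u <- 0.8 * x + rlogis(n)
# Slice the latent scale into three ordered categories at -1 and 1.
y <- cut(u, breaks = c(-Inf, -1, 1, Inf),
         labels = c("Normal", "Mild", "Severe"), ordered_result = TRUE)

fit <- polr(y ~ x, method = "logistic", Hess = TRUE)

beta_polr <- unname(coef(fit)["x"])  # slope as polr prints it
gamma_lec <- -beta_polr              # lecture-notation slope

or_higher <- exp(beta_polr)  # multiplier on the odds of a HIGHER category, per unit x
or_lower  <- exp(gamma_lec)  # multiplier on the odds of Y <= r, per unit x
print(c(beta_polr = beta_polr, gamma_lec = gamma_lec,
        or_higher = or_higher, or_lower = or_lower))
print(fit$zeta)              # thresholds, shown as Intercepts in summary(fit)
```

The printed `polr` slope comes out positive even though larger $x$ lowers $P(Y \le r)$. That is the trap: the two odds ratios are reciprocals, so always say which event (lower-or-equal versus higher) the multiplier refers to.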
- One-sentence interpretation template (the material's "phrasing that banks the marks")
  - With the lecture-notation slope $\gamma$ (the `polr` slope with its sign flipped): "each one-unit increase in $x$ multiplies the odds of being in a lower-or-equal category ($Y \le r$ versus higher) by $e^{\gamma}$, the same at every threshold (proportional odds), holding other covariates fixed." Equivalently, the odds of a higher category are multiplied by $e^{-\gamma} = e^{\beta}$, where $\beta$ is the slope `polr` prints. Practice-bank check (Q12): `polr` prints $+0.82$ for income, so $\gamma = -0.82$, and each unit of income multiplies the odds of a higher satisfaction category by $e^{0.82} = 2.27$. [15][25]
-
4) Deviance / $\Delta D$: the two uses you must master (in the exam, half of a question is often just this)
-
4.1 Two jobs (the material explicitly says "Two jobs of deviance")
-
(1) Goodness of fit (GoF), mainly for grouped data: compare $D$ with $\chi^2_{n-q}$
-
(2) Nested-model comparison (always valid): $\Delta D = D_0 - D_1 \sim \chi^2_{q_1 - q_0}$
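Both jobs reduce to comparing a deviance with a chi-squared distribution, which base R's `pchisq` and `qchisq` can check. The nested comparison below uses the printed logistic output quoted in the source's practice bank (null deviance 410.2 on 299 df, residual deviance 351.6 on 297 df). The goodness-of-fit numbers are hypothetical placeholders, since the material gives no grouped example at this point.

```r
# Job 2: nested comparison, intercept-only model versus income + late.
D0 <- 410.2; df0 <- 299   # smaller model
D1 <- 351.6; df1 <- 297   # larger model

delta_D  <- D0 - D1                  # drop in deviance
delta_df <- df0 - df1                # number of extra parameters
crit     <- qchisq(0.95, delta_df)   # 5% critical value
p_nested <- pchisq(delta_D, delta_df, lower.tail = FALSE)
reject_smaller <- delta_D > crit     # TRUE means the smaller model is inadequate

# Job 1: goodness of fit for GROUPED data.
# Hypothetical placeholder values, not taken from the material.
D_grouped  <- 11.2
df_grouped <- 6
p_gof <- pchisq(D_grouped, df_grouped, lower.tail = FALSE)

print(c(delta_D = delta_D, delta_df = delta_df, crit = crit))
print(c(p_nested = p_nested, p_gof = p_gof))
```

In an exam answer, the lines that earn method marks are the subtraction, the degrees of freedom and the comparison with the critical value; the p-value is optional.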
-
4.2 Exam strategy (the material's strong reminder)
-
"Half the marks come from showing the deviance subtraction and showing $e^\beta$ with its interpretation. Method marks survive even if the final number slips." Practise to that marking logic: always write the working and the one-line interpretation. [8][20]
-
5) Model selection and information criteria: AIC / BIC (be able to compute and compare)
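- Your cheatsheet gives the formulas explicitly: $\mathrm{AIC} = D + 2q$ and $\mathrm{BIC} = D + q\log n$, where $D$ is the deviance, $q$ the number of parameters and $n$ the sample size. Smaller is better.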
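A quick check of the hand formulas, as a sketch in base R. The helper names `aic_hand` and `bic_hand` are made up; the inputs come from the logistic printout in the source's practice bank (residual deviance 351.6 with 3 coefficients), and $n = 300$ is an inference from the null deviance having 299 df.

```r
# Hand formulas as the cheatsheet states them: AIC = D + 2q, BIC = D + q*log(n).
aic_hand <- function(D, q)    D + 2 * q
bic_hand <- function(D, q, n) D + q * log(n)

# Residual deviance 351.6, q = 3 (intercept, income, late), n = 300 (assumed).
aic <- aic_hand(351.6, 3)
bic <- bic_hand(351.6, 3, 300)
print(round(c(AIC = aic, BIC = bic), 2))
```

The AIC value matches the 357.6 on that printout. R's own `AIC()` is computed from the log-likelihood, so it agrees with $D + 2q$ only when the saturated log-likelihood is zero, as for ungrouped binary data; for comparing models fitted to the same data the ranking is unaffected.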
-
6) The most economical revision plan for when the final is close (following the material's exam approach)
-
6.1 First drill the templated, must-know skills (highest priority)
-
Drill this at least 6 times a day (10 minutes per round is enough):
- From a single printout, write: family + link (1 line) → $\Delta D$, or $D$ vs df (2 to 3 lines of working) → a one-sentence interpretation of $e^\beta$ (1 line). [1][6]
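One round of the drill, sketched in base R on the logistic printout from the source's practice bank (`late` 0.6931 with se 0.1500, `income` -0.0230). The 10-unit change for `income` is a hypothetical choice made only to illustrate the $c$-unit rule $\mathrm{OR} = e^{c\beta}$.

```r
# Step 1 (name it): binary response, binomial family, logit link.

# Step 2 (test it): drop in deviance from the null model, on 2 df.
delta_D <- 410.2 - 351.6

# Step 3 (interpret it): exponentiate the printed estimates.
b_late   <- 0.6931; se_late <- 0.1500
b_income <- -0.0230

or_late      <- exp(b_late)                              # OR per extra late payment
ci_late      <- exp(b_late + c(-1, 1) * 1.96 * se_late)  # Wald 95% CI
or_income_10 <- exp(10 * b_income)                       # OR for a 10-unit rise (hypothetical c)

print(round(c(OR_late = or_late, lower = ci_late[1], upper = ci_late[2],
              OR_income_10 = or_income_10), 3))
```

The matching sentence for `late`: each extra late payment multiplies the odds of default by about 2, all else equal.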
-
6.2 Review each model family against its assignment (the material implies the assignments are rehearsals)
-
Your materials confirm the topics of the 3 Gradescope assignments (which read very much like slices of the exam scope):
- A1: binary / grouped logistic (odds-ratio interpretation) [3]
- A2: binomial dose-response (grouped data, link comparison, LD50) [3][21]
- A3: ordinal / multinomial (`polr`/`multinom`, especially the `polr` sign trap) [3] Source: asksia-bible-mast90139-bilingual.pdf [25] Source: asksia-cheatsheet-mast90139.pdf
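A quick way to rehearse the `polr` sign trap is to do the conversion numerically. The sketch below is plain Python rather than R, and it assumes the two parameterisations given in the cited bible: the lecture form $P(Y \le r \mid x) = F(\theta_r + x^\top\gamma)$ versus the `polr` form $P(Y \le r \mid x) = F(\theta_r - x^\top\beta)$, with a logistic $F$. The helper name `polr_to_lecture` is hypothetical, and the slope $-0.096$ is only the illustrative value quoted in the cited cheatsheet's worked example, not a fitted result.

```python
import math


def polr_to_lecture(beta_polr):
    """Convert a polr-style slope into the lecture-style slope and odds multipliers.

    Assumes polr fits F(theta_r - x*beta) and the lecture uses F(theta_r + x*gamma),
    so gamma = -beta (logistic link).
    """
    gamma = -beta_polr
    return {
        "gamma": gamma,
        # multiplier of the odds of Y <= r (lower-or-equal category) per unit of x
        "odds_mult_lower_or_equal": math.exp(gamma),
        # multiplier of the odds of Y > r (higher category) per unit of x
        "odds_mult_higher": math.exp(beta_polr),
    }


# Illustrative slope only (value quoted in the cited worked example).
result = polr_to_lecture(-0.096)
print(result)
```

Whichever slope you quote, name the event the odds refer to (lower-or-equal versus higher category): the two multipliers are reciprocals of each other.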
-
6.3 "Don't-lose-marks trap list" (recite it once before the exam)
-
I have turned the high-frequency pitfalls that your materials explicitly call out into a checklist:
- Wald can fail (Hauck–Donner): with sparse data, prefer the LR test ($\Delta D$) over relying on z/Wald statistics alone. [8] Source: asksia-cheatsheet-mast90139.pdf
- `polr()` sign flip: multiply the slope by $-1$ first, then interpret the direction and the OR (the lecture form is $P(Y \le r \mid x) = F(\theta_r + x^\top\gamma)$, while `polr` fits $F(\theta_r - x^\top\beta)$, so $\beta = -\gamma$). [16] Source: asksia-bible-mast90139-bilingual.pdf [25] Source: asksia-cheatsheet-mast90139.pdf
- Add the offset whenever it belongs: if the question gives an exposure/denominator, an offset is almost certainly intended (`glm(y ~ x, poisson, offset = log(t))`); without it the coefficients are interpreted on the count scale instead of the rate scale. [26] Source: asksia-cheatsheet-mast90139.pdf
- Don't misuse the residual deviance as a GoF test for ungrouped binary data; the nested comparison $\Delta D$, however, is always valid. [22] Source: asksia-cheatsheet-mast90139.pdf
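The nested-comparison item can be rehearsed by hand. As a minimal sketch in plain Python (not an R fit), the snippet below repeats the nested comparison from the cited cheatsheet's worked example, $D_0 = 41.6$ on 24 df against $D_1 = 28.1$ on 23 df, and computes the 1-df chi-squared tail probability via the identity $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$. The function name `lr_test_1df` is hypothetical and it only handles a difference of one parameter.

```python
import math


def lr_test_1df(dev_small, dev_big):
    """Likelihood-ratio (deviance) test for nested models differing by ONE parameter.

    dev_small: residual deviance of the smaller model (M0)
    dev_big:   residual deviance of the larger model (M1)
    Returns (delta_D, p_value) using the chi-squared distribution with 1 df.
    """
    delta_d = dev_small - dev_big
    # For 1 df, chi-squared is the square of a standard normal,
    # so the upper tail is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(delta_d / 2.0))
    return delta_d, p_value


# Worked numbers quoted in the cited cheatsheet example.
delta_d, p_value = lr_test_1df(41.6, 28.1)
critical_value = 3.84  # chi-squared 0.95 quantile on 1 df

print(f"Delta D = {delta_d:.1f}, p = {p_value:.4f}, reject M0: {delta_d > critical_value}")
```

In an exam answer the same steps appear as written working: the deviance subtraction, the df difference, the comparison with the critical value, and a one-line conclusion.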
-
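For the offset item in the checklist, the arithmetic behind a rate-ratio statement can be sketched in the same way. This assumes a Poisson log-link model with an offset, $\log \mu_i = \log t_i + x_i^\top\beta$, as in the cited cheatsheet, and reuses its worked numbers ($\hat\beta = 0.693$, se $0.10$). The helper names `expected_count` and `rate_ratio_ci` are hypothetical, and the interval is the usual Wald interval built on the log scale.

```python
import math


def expected_count(exposure, linear_predictor):
    """Mean count under the offset model: mu = t * exp(eta) = exp(log t + eta)."""
    return math.exp(math.log(exposure) + linear_predictor)


def rate_ratio_ci(beta, se, z=1.96):
    """Rate ratio exp(beta) with a Wald interval exp(beta -/+ z * se)."""
    rr = math.exp(beta)
    lower = math.exp(beta - z * se)
    upper = math.exp(beta + z * se)
    return rr, lower, upper


# Worked numbers quoted in the cited cheatsheet example.
rr, lower, upper = rate_ratio_ci(0.693, 0.10)
print(f"RR = {rr:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")

# Illustration of the offset: a rate of 0.002 per person over 1000 people.
mu = expected_count(1000, math.log(0.002))
print(f"expected count = {mu:.1f}")
```

Because the offset fixes the coefficient of $\log t_i$ at 1, $e^{\beta}$ compares rates per unit of exposure rather than raw counts.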
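7) Give me 2 pieces of information next, and I can turn the "revision map" into your personal sprint plan
- Tell me:
- What day is your Final (the exact date)?
- Which part worries you most (pick 1–2):
- A Logistic: OR interpretation / ROC-AUC / grouped vs ungrouped
- B Poisson: offset + RR / overdispersion
- C Log-linear tables (contingency tables, independence, OR / cross-product ratio)
- D Multinomial: how to read the baseline-logit output matrix
- E Ordinal: `polr` output + sign trap + cumulative OR
- F Model selection: the logic of $\Delta D$, AIC/BIC, drop1/step
- For the modules you pick, I will then build out each one with:
- Must-memorise definitions (one sentence each)
- Must-know formulas + when to use them
- Fill-in-the-blank templates for reading output (follow the pattern and you pick up the marks)
- The 3 most common ways to lose marks & how to avoid them
- (Optional) If you like, I can also remember your "two most-feared modules + the traps you hit most often (e.g. polr sign / offset / GoF misuse)", and prioritise revising them with you each time you come back.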