MAST90139

MAST90139 · Statistical Modelling for Data Science

Graduate statistics: linear & generalized linear models, mixed models, and modelling dependent data, assessed by assignments plus a final exam.

MAST90139 is a graduate subject in the School of Mathematics and Statistics that builds the modern statistical-modelling toolkit data scientists rely on: linear models, generalized linear models (GLMs), mixed models and non-parametric regression, applied to time series, longitudinal and spatial data, with an introduction to causal inference and handling missing data. In practice it runs as a GLM-and-categorical-data course taught in R (built on Julian Faraway's Extending the Linear Model with R), where the most-emphasised skill is reading R glm()/multinom()/polr() output and turning a coefficient into an odds ratio, rate ratio or cumulative-odds statement. It is assessed by written assignments across the semester plus a final written examination. This page is built from 100 real MAST90139 course materials in the AskSia Library.

Built from 100 real MAST90139 course materials in the AskSia Library — including the full Ch1–Ch7 GLM lecture decks, weekly R practicals (prac1–prac11 with answer keys) and the three 2026 assignments.

Faculty: School of Mathematics and Statistics, Faculty of Science
Level: Graduate / Postgraduate
Credit: 12.5 pts
Semester: 2026 Semester 1
Prereq: MAST90104 A First Course in Statistical Learning and MAST90105 Methods of Mathematical Statistics
Campus: Parkville
📚 AskSia Library data · 100 AskSia Library resources · 11 topics
Three written R assignments across the semester (binary/grouped logistic, binomial dose-response, ordinal/multinomial) plus a final written examination; the exact weight split is set in the official handbook each year. No past-exam paper was in the mined set, so no exam-frequency claims are made.
Overview

What MAST90139 is about

MAST90139 Statistical Modelling for Data Science is a 12.5-point graduate subject offered by the University of Melbourne's School of Mathematics and Statistics. It develops the core statistical models underlying data-science practice — linear models, generalized linear models, mixed (random-effects) models and non-parametric regression — and shows how to apply them to dependent data such as time series, longitudinal/repeated-measures data and spatially correlated data. The subject also introduces methods for causal inference and for handling missing or incomplete data. Students learn both the modelling framework and its limitations, and how to fit and interpret these models on real datasets, typically using R.

Topic map

The MAST90139 syllabus, topic by topic

1

Review of linear models & statistical inference

Multiple linear regression, least squares, the Gauss-Markov framework, and inference for model parameters as the foundation for everything that follows.

2

Generalized linear models (GLMs)

The course's spine. A GLM has three parts: an exponential-family random component, a linear predictor η = xᵀβ, and a link function g(μ) = η. Writing the log-density as (yθ − b(θ))/a(φ) + c(y, φ), the function b(θ) gives the mean μ = b'(θ) and the variance Var(Y) = a(φ)·b''(θ); the canonical links are log (Poisson), logit (binomial) and identity (normal). Outside the normal-identity case the MLE has no closed form, so β is found by Fisher scoring, which can be written as iteratively re-weighted least squares (IRWLS), the algorithm R's glm() runs. In logistic regression (logit link) e^β is an odds ratio; in Poisson regression (log link) e^β is a rate ratio, with offset = log(exposure) added to the linear predictor when modelling rates per person-year.
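
As a rough illustration of the Fisher scoring / IRWLS idea, the sketch below fits a one-predictor logistic regression in plain Python. It is not a description of how R's glm() is implemented internally; the function name `irwls_logistic` and the toy data are assumptions made up for this example.

```python
import math

def irwls_logistic(x, y, n_iter=25, tol=1e-10):
    """Fit logit(p_i) = b0 + b1 * x_i by Fisher scoring written as IRWLS."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Current linear predictor and fitted means
        eta = [b0 + b1 * xi for xi in x]
        mu = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        # Working weights W = mu(1 - mu); working response z = eta + (y - mu) / W
        w = [m * (1.0 - m) for m in mu]
        z = [e + (yi - m) / wi for e, yi, m, wi in zip(eta, y, mu, w)]
        # Solve the 2x2 weighted normal equations X'WX b = X'Wz
        s_w = sum(w)
        s_wx = sum(wi * xi for wi, xi in zip(w, x))
        s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        s_wz = sum(wi * zi for wi, zi in zip(w, z))
        s_wxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = s_w * s_wxx - s_wx ** 2
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

# Hypothetical toy data (not from the course materials)
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = irwls_logistic(x, y)
print("slope:", b1, "odds ratio per unit of x:", math.exp(b1))
```

At convergence the score equations Σ(yᵢ − μᵢ) = 0 and Σ xᵢ(yᵢ − μᵢ) = 0 hold to numerical precision, which is a quick way to check that the weighted least-squares iterations have reached the maximum-likelihood estimate.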

3

Model selection & diagnostics

Deviance D = 2[ℓ(saturated) − ℓ(model)] is the GLM analogue of the residual sum of squares and the basis of model comparison: for nested models with q₀ < q₁ parameters, ΔD = D₀ − D₁ ~ χ²(q₁−q₀) (asymptotically, under the smaller model) is the likelihood-ratio test (anova(m0, m1, test="Chi") in R). Three asymptotically equivalent tests of the same hypothesis, likelihood-ratio (ΔD), Wald (the z value in summary()) and score, must not be confused. Model choice also uses AIC = D + 2q and BIC = D + q·log n, which match the usual −2ℓ-based definitions up to a constant shared by all models fitted to the same data, with step()/drop1() for backward elimination. A key exam trap: for ungrouped/binary data the residual deviance is NOT a goodness-of-fit statistic (use Hosmer–Lemeshow instead), though its change between nested models is still a valid test. Diagnostics: deviance vs Pearson residuals, QQ/half-normal plots, leverage (hatvalues) and Cook's distance.
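
To make the deviance definition concrete, the sketch below evaluates the grouped-binomial deviance D = 2 Σ [y log(y/μ̂) + (n − y) log((n − y)/(n − μ̂))], where μ̂ is the fitted success count. The helper name `binomial_deviance` and the three-group data are hypothetical, chosen only for illustration.

```python
import math

def binomial_deviance(y, n, mu):
    """Grouped-binomial deviance; mu holds fitted success counts n_i * pi_hat_i."""
    def term(obs, fit):
        # Convention: 0 * log(0) = 0 for empty cells
        return obs * math.log(obs / fit) if obs > 0 else 0.0
    return 2.0 * sum(
        term(yi, mi) + term(ni - yi, ni - mi)
        for yi, ni, mi in zip(y, n, mu)
    )

# Hypothetical grouped data: successes out of 10 trials in three groups
y = [2, 5, 9]
n = [10, 10, 10]

# Null (intercept-only) model: one common success probability
pi_null = sum(y) / sum(n)
mu_null = [ni * pi_null for ni in n]

d_null = binomial_deviance(y, n, mu_null)
d_saturated = binomial_deviance(y, n, [float(v) for v in y])

print("null deviance:", d_null, "saturated deviance:", d_saturated)
```

The saturated model reproduces the data, so its deviance is zero; the difference d_null − d_saturated is a change in deviance on 3 − 1 = 2 degrees of freedom, the quantity that a nested-model comparison refers to a χ² distribution.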

4

Mixed models & random effects

Linear and generalized linear mixed models for grouped/clustered data, separating fixed effects from random variation between units.

5

Longitudinal & correlated data

Modelling repeated measurements on the same subjects over time, including within-subject correlation structures.

6

Time series modelling

Applying statistical models to time-indexed data and accounting for temporal dependence.

7

Spatial data

Models for spatially correlated observations where nearby locations are statistically dependent.

8

Non-parametric regression & smoothing

Flexible regression methods (e.g. smoothing/splines) that relax the linear functional-form assumption.

9

Missing & incomplete data

Missingness mechanisms and principled approaches to estimation and imputation when data are incomplete.

10

Causal inference

An introduction to drawing causal conclusions from data, distinguishing association from causation.

11

Computational implementation in R

Fitting, interpreting and validating the above models on real datasets using statistical software, primarily R.

Assessment

How MAST90139 is assessed

Final exam: Yes
Component | Weight | Note
Written assignments (across the semester) | Coursework component | Up to approximately 20 pages of written assignments completed during the teaching period (the handbook estimates around 20 hours of work). These typically involve fitting and interpreting models on data in R.
Final written examination (end of semester) | Examination component | An end-of-semester written examination held during the official exam period. Past exam papers (e.g. 2022 Semester 1) confirm that a final exam is part of the assessment.

Assessed by written assignments during the semester plus a final written examination at the end of semester. Confirm the exact assignment/exam weight split on the official handbook and your subject LMS each year.

Assessment timeline

When each MAST90139 task is due

Plagiarism Declaration (academic-integrity gate)
Start of semester (no due date; you must agree before submitting work)
Gate, not graded weight
Assignment 1: binary/grouped logistic regression (domestic-violence predictors)
Due 11:59 pm Thu 2 Apr 2026 (submit to Gradescope via Canvas)
Coursework (exact % per official handbook)
Assignment 2: binomial dose-response & link comparison (beetle mortality)
Due 11:59 pm Fri 1 May 2026
Coursework (exact % per official handbook)
Assignment 3: ordinal/multinomial response (coal-miner pneumoconiosis)
Due 11:59 pm Sun 31 May 2026
Coursework (exact % per official handbook)
Final written examination
End-of-semester exam period
Examination (exact % per official handbook)
Self-test

Test yourself: MAST90139 practice questions

Question 1
In a logistic regression fitted with glm(..., family=binomial), the summary() table reports the coefficient for a continuous predictor x as Estimate = 0.41 on the log-odds scale. What is the correct interpretation of e^0.41 ≈ 1.51?
  A. The probability of success increases by 1.51 for each one-unit increase in x.
  B. The odds of success are multiplied by about 1.51 for each one-unit increase in x, holding other covariates fixed.
  C. x explains 51% of the variation in the response.
  D. The response increases by 0.41 units for each one-unit increase in x.
Answer: B. The odds of success are multiplied by about 1.51 for each one-unit increase in x, holding other covariates fixed. e^β is an odds ratio, not a probability or a change in probability. β is on the log-odds scale, so exponentiating gives the multiplicative effect on the odds (odds × 1.51, i.e. +51%) per one-unit increase in x, all else equal. Probabilities change non-linearly along the logistic S-curve, so e^β is never a direct probability statement.
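
The non-linearity on the probability scale can be checked numerically. The sketch below applies the same log-odds slope of 0.41 at three starting probabilities; the helper name `shift_probability` and the starting probabilities are assumptions chosen for illustration.

```python
import math

def shift_probability(p, beta, dx=1.0):
    """Success probability after x rises by dx, given a log-odds slope beta."""
    odds = p / (1.0 - p)
    new_odds = odds * math.exp(beta * dx)   # odds are multiplied by e^(beta * dx)
    return new_odds / (1.0 + new_odds)

beta = 0.41
for p in (0.10, 0.50, 0.90):
    p_new = shift_probability(p, beta)
    # The odds ratio is the same in every row; the probability change is not.
    print(p, round(p_new, 3), round(p_new - p, 3))
```

The odds ratio is e^0.41 in every row, but the change in probability depends on where on the S-curve the starting probability sits, which is why options A and D are wrong.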
Question 2
You fit two nested GLMs and run anova(m0, m1, test="Chi"). The change in deviance is ΔD = D0 − D1 = 9.6 on 2 degrees of freedom (χ²₀.₉₅(2) ≈ 5.99). What do you conclude, and what is this test called?
  A. ΔD > 5.99, so the extra terms in m1 are significant; this is the likelihood-ratio (LR) test.
  B. ΔD > 5.99, so the smaller model m0 is preferred; this is the Wald test.
  C. ΔD < 5.99, so the models are equivalent; this is the score test.
  D. Deviance cannot be compared between models; you must use the raw AIC values only.
Answer: A. ΔD > 5.99, so the extra terms in m1 are significant; this is the likelihood-ratio (LR) test. For nested models, ΔD = D0 − D1 ~ χ²(q1 − q0) under H0 that the smaller model is adequate; since 9.6 > 5.99 we reject H0, so the extra terms matter. This change-in-deviance comparison via anova(..., test="Chi") is the likelihood-ratio test. The Wald test is the z value in summary(); the score test is a third asymptotically equivalent option. All three test the same hypothesis but are computed differently.
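
The arithmetic behind this answer can be reproduced without tables. For even degrees of freedom the χ² upper-tail probability has a closed form, which the sketch below uses; the function name `chi2_sf_even_df` is hypothetical, and for odd degrees of freedom a statistics library would be needed instead.

```python
import math

def chi2_sf_even_df(x, df):
    """P(chi-square with df degrees of freedom > x), valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    # For df = 2m the tail is exp(-x/2) * sum_{k < m} (x/2)^k / k!
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

delta_d = 9.6
p_value = chi2_sf_even_df(delta_d, df=2)
critical_5pct = -2.0 * math.log(0.05)   # solves exp(-x/2) = 0.05 when df = 2

print("p-value:", p_value)
print("5% critical value for 2 df:", critical_5pct)
```

With 2 degrees of freedom the tail probability is simply exp(−ΔD/2), so the quoted critical value 5.99 is −2·log(0.05), and ΔD = 9.6 lies well beyond it.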
Question 3
A proportional-odds (cumulative logit) model is fitted with MASS::polr for an ordinal 3-category response. Which statement about this model is TRUE?
  A. It estimates a separate slope vector for each response category, like a baseline-category logit.
  B. It uses one common slope vector across all cut-points (only the thresholds differ), so e^γ is a cumulative odds ratio that is the same at every threshold.
  C. It requires the response to be modelled as Poisson counts with a log link.
  D. The slopes printed by polr can be interpreted directly with no sign adjustment.
Answer: B. It uses one common slope vector across all cut-points (only the thresholds differ), so e^γ is a cumulative odds ratio that is the same at every threshold. The proportional-odds model uses a single slope vector γ at every cut-point (only the thresholds θ_r differ), hence the name 'proportional odds', and e^γ is a cumulative OR that is constant across thresholds. That is far fewer parameters than the baseline-category (nominal) logit, which gives each category its own coefficients (option A). It is a cumulative-logit model, not a Poisson model (option C). Note: MASS::polr parameterises P(Y≤r) = F(θ_r − xᵀβ), so its printed slopes are the negative of the lecture's γ; multiply by −1 before stating the odds-ratio direction (so option D is false).
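
A small numerical check of the 'same odds ratio at every threshold' property, written against the polr-style form logit P(Y ≤ r) = θ_r − βx. The thresholds and slope below are hypothetical values, not estimates from any course dataset.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def odds(p):
    return p / (1.0 - p)

def cumulative_probs(thresholds, beta, x):
    """P(Y <= r | x) for each cut-point, using logit P(Y <= r) = theta_r - beta * x."""
    return [expit(theta - beta * x) for theta in thresholds]

thresholds = [-0.5, 1.2]   # hypothetical cut-points for a 3-category response
beta = 0.8                 # hypothetical slope, in the sign convention polr prints

low = cumulative_probs(thresholds, beta, x=0.0)
high = cumulative_probs(thresholds, beta, x=1.0)

# Cumulative odds ratio for a one-unit increase in x, at each cut-point
ratios = [odds(h) / odds(l) for h, l in zip(high, low)]
print(ratios, math.exp(-beta))
```

Both ratios equal e^(−β): under this parameterisation a positive printed slope lowers the odds of falling at or below each cut-point, which is the sign flip to account for before stating the direction of an effect.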
Exam questions

High-value exam questions in MAST90139

Logistic regression: odds-ratio interpretation
Given an R summary() table from glm(..., family=binomial), read a coefficient off the log-odds scale, exponentiate to the odds ratio, state the multiplicative/percentage effect on the odds holding other covariates fixed, and give a 95% CI for the OR.
Exam-style on the course's single most-emphasised skill (Ch2 wcgs heart-disease worked example). No frequency claim; framed as exam-style on a key topic.
Deviance & nested-model comparison (LR test)
Compute ΔD = D0 − D1 for two nested GLMs by hand, compare to χ²(q1−q0), and decide whether the dropped/added terms are significant; identify which R output (anova test="Chi" vs Wald z vs score) corresponds to which test.
Exam-style on Ch3 model-comparison machinery; the LR-vs-Wald-vs-score distinction is a flagged trap in the slides.
Goodness of fit & the ungrouped-deviance trap
Decide whether the residual deviance is a valid goodness-of-fit statistic for a given dataset (grouped binomial vs ungrouped binary), and choose the right tool: χ² on D for grouped data, Hosmer–Lemeshow for ungrouped/binary.
Exam-style on a trap stated explicitly in Ch2/Ch7 of the slides: ungrouped residual deviance is NOT a GoF statistic (though its change between nested models still gives a valid test).
Poisson / log-linear regression: rate ratio & offset
Interpret e^β from a Poisson log-link model as a rate ratio, and recognise when an offset = log(exposure) is needed to model a rate (events per person-time) rather than a raw count.
Exam-style on Ch3/Ch5 count modelling; forgetting the offset (turning a rate model into a count model) is a flagged trap.
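
A minimal sketch of what the offset buys, for the simplest case of one binary group indicator with aggregated counts. In that case the Poisson model with offset log(exposure) is saturated, so its fitted rates are the crude rates and e^β̂ is their ratio. The counts and person-years below are invented for illustration.

```python
import math

def rate_ratio(events_exposed, time_exposed, events_ref, time_ref):
    """Return (beta_hat, rate ratio) for a two-group Poisson model with a log(exposure) offset."""
    rate_exposed = events_exposed / time_exposed
    rate_ref = events_ref / time_ref
    beta_hat = math.log(rate_exposed) - math.log(rate_ref)
    return beta_hat, math.exp(beta_hat)

# Hypothetical data: 30 events in 1000 person-years vs 20 events in 2000 person-years
beta_hat, rr = rate_ratio(30, 1000.0, 20, 2000.0)

# Dropping the offset compares raw counts instead of rates
count_ratio = 30 / 20

print("rate ratio with offset:", rr)
print("count ratio without offset:", count_ratio)
```

The two numbers answer different questions: the first compares events per person-year, the second only compares how many events happened to be recorded in each group.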
Log-linear models for contingency tables: independence
Fit an independence log-linear model glm(count ~ factor(row) + factor(col), poisson), test the row–column interaction's deviance as a test of independence, and relate the cross-product (odds) ratio to the interaction term.
Exam-style on Ch5; the asymptotic equivalence of the log-linear (deviance) independence test and the classical Pearson χ² test of independence is a core idea.
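
The relationship between the two independence tests can be seen on a small table. Under the independence log-linear model the fitted counts are (row total × column total)/n, so the deviance is G² = 2 Σ O·log(O/E) and the Pearson statistic is X² = Σ (O − E)²/E. The 2×2 table below is hypothetical.

```python
import math

def independence_fit(table):
    """Fitted counts under independence: row total * column total / n."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    return [[r * c / n for c in col_tot] for r in row_tot]

def g2_and_x2(table):
    """Deviance (G^2) and Pearson (X^2) statistics for the independence model."""
    fitted = independence_fit(table)
    g2, x2 = 0.0, 0.0
    for obs_row, fit_row in zip(table, fitted):
        for o, e in zip(obs_row, fit_row):
            if o > 0:
                g2 += 2.0 * o * math.log(o / e)
            x2 += (o - e) ** 2 / e
    return g2, x2

table = [[30, 20], [20, 30]]   # hypothetical 2x2 counts
g2, x2 = g2_and_x2(table)
cross_product_ratio = (table[0][0] * table[1][1]) / (table[0][1] * table[1][0])

print("G2:", g2, "X2:", x2, "odds ratio:", cross_product_ratio)
```

Both statistics are referred to χ² with (rows − 1)(columns − 1) degrees of freedom and are close but not identical in finite samples; the log of the cross-product ratio is the interaction coefficient that the saturated log-linear model would add.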
Ordinal response: proportional-odds (polr)
Fit and interpret a proportional-odds cumulative-logit model with MASS::polr for an ordinal response: read the thresholds and the single common slope, exponentiate to a cumulative OR, and handle polr's sign convention.
Exam-style on Ch6 (the pneumo coal-miner ordinal example drilled in Assignment 3); the polr sign flip is a flagged trap.
Nominal response: baseline-category multinomial logit
Interpret a nnet::multinom coefficient matrix where each non-baseline category has its own slope set: state e^β as the odds ratio of category r versus the baseline, and compare nested models by Δdeviance.
Exam-style on Ch6 (Caesarian / nes96-style examples); the contrast with proportional-odds parameter counts is a flagged trap.
Overdispersion: detection & quasi-likelihood
Detect overdispersion (D or X² ≫ df), estimate φ̂ = X²/df, refit with quasibinomial/quasipoisson, and compare nested models with an F-test (not χ²) since the dispersion is estimated.
Exam-style on the overdispersion topic; using χ² instead of F under a quasi-model, and double-adjusting already-inflated SEs, are flagged traps.
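
A short sketch of the dispersion estimate φ̂ = X²/df for a Poisson fit and of how it rescales a standard error. The counts are invented, the fit is an intercept-only model (fitted mean equal to the sample mean), and the helper names are hypothetical.

```python
import math

def pearson_dispersion(y, mu, n_params):
    """phi_hat = X^2 / df for a Poisson fit, with X^2 = sum (y - mu)^2 / mu."""
    x2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    df = len(y) - n_params
    return x2 / df

def quasi_se(poisson_se, phi_hat):
    """Quasi-Poisson standard error: the Poisson SE inflated by sqrt(phi_hat)."""
    return poisson_se * math.sqrt(phi_hat)

# Hypothetical counts with an intercept-only Poisson fit
y = [2, 9, 1, 12, 3, 15]
mu = [sum(y) / len(y)] * len(y)

phi_hat = pearson_dispersion(y, mu, n_params=1)
print("phi_hat:", phi_hat)

# Apply the inflation once, to SEs from the plain Poisson fit only;
# SEs reported by a quasi-family fit already include it.
print("adjusted SE:", quasi_se(0.10, phi_hat))
```

A value of φ̂ well above 1 signals overdispersion; the comment on applying the inflation only once corresponds to the double-adjustment trap noted above.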
Mixed / random-effects models for clustered data
Recognise when data are clustered/correlated (repeated measures, grouped units) and require a random effect; fit a (generalized) linear mixed model with lme4 (lmer/glmer) and separate fixed effects from between-unit random variation.
Exam-style on Ch7 (prac11 mixed-models material). No frequency claim; framed as exam-style on a key topic.
Worked example

A worked MAST90139 problem

Reading a logistic-regression coefficient as an odds ratio (the most-emphasised skill)

Problem

In the Western Collaborative Group Study (the faraway `wcgs` heart-disease data used in Ch2), a binary logistic regression models coronary heart disease (chd) on cigarettes smoked per day and height. The R summary() reports the cigarettes coefficient as Estimate β̂ = 0.0231 (log-odds scale), with a small Std. Error and significant Pr(>|z|). What does this coefficient mean, and how would you state the effect of smoking on heart-disease risk?

Approach

The logistic model is logit(π) = log[π/(1−π)] = β₀ + β₁·cigs + β₂·height, so β̂ is on the log-odds scale and is not directly interpretable. Exponentiate to get the odds ratio: e^β̂ = e^0.0231 ≈ 1.023. Interpret multiplicatively: each extra cigarette per day multiplies the estimated odds of CHD by about 1.023, i.e. raises the odds by (e^0.0231 − 1)×100% ≈ 2.3%, holding height fixed. A 95% CI for the OR is (e^(β̂−1.96·se), e^(β̂+1.96·se)); if it excludes 1 the effect is significant at the 5% level. Watch the standard exam trap: a model can show high overall accuracy yet near-zero sensitivity for a rare event (the wcgs CHD model has specificity ≈ 0.999 but sensitivity ≈ 0.008), so always check sensitivity and the ROC/AUC (here AUC ≈ 0.737), not just accuracy.
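
The interval calculation can be sketched as follows. The problem statement gives β̂ = 0.0231 but no numeric standard error, so the value se = 0.004 used here is an assumed placeholder for illustration only; substitute the Std. Error printed by summary().

```python
import math

def odds_ratio_ci(beta_hat, se, z=1.96):
    """Odds ratio with a Wald confidence interval, built on the log-odds scale."""
    return (
        math.exp(beta_hat),
        math.exp(beta_hat - z * se),
        math.exp(beta_hat + z * se),
    )

beta_hat = 0.0231
se_assumed = 0.004   # ASSUMPTION: placeholder, not a value given in the problem

or_hat, lower, upper = odds_ratio_ci(beta_hat, se_assumed)
print("OR:", or_hat, "95% CI:", (lower, upper))

# Effects multiply on the odds scale, so 20 extra cigarettes per day
# corresponds to e^(20 * beta_hat), not 20 * e^beta_hat.
or_per_20 = math.exp(20 * beta_hat)
print("OR per 20 cigarettes:", or_per_20)
```

The interval is computed on the log-odds scale and then exponentiated, so it is not symmetric around the odds ratio. Rescaling the predictor (per cigarette versus per 20 cigarettes) changes the reported odds ratio but not the fitted model.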

Key terms

MAST90139 glossary

Generalized linear model (GLM)
A regression framework that extends linear models to exponential-family responses (e.g. binary, count) via a link function.
Link function
The function relating the linear predictor to the mean of the response in a GLM (e.g. logit, log).
Mixed model / random effects
A model containing both fixed effects and random effects to capture variation between groups or clusters.
Longitudinal data
Repeated measurements taken on the same subjects over time, requiring within-subject correlation to be modelled.
Deviance
A measure of lack of fit for a GLM, generalizing the residual sum of squares and used in model comparison.
Non-parametric regression
Regression that does not assume a fixed functional form, letting the data determine the shape of the relationship (e.g. smoothing, splines).
Missing data mechanism (MCAR / MAR / MNAR)
Classification of why data are missing, which determines whether estimates can be unbiased and which method applies.
Imputation
Replacing missing values with estimated ones (e.g. multiple imputation) so standard analyses can proceed.
Causal inference
Methods for estimating the effect of an intervention or treatment, going beyond mere statistical association.
Spatial correlation
Statistical dependence between observations based on their geographic proximity.
Odds ratio (OR) / rate ratio (RR)
The exponentiated GLM coefficient e^β: a multiplicative effect on the odds of success (odds ratio, logistic/logit) or on the expected count or rate (rate ratio, Poisson/log). Interpreting these is the most-emphasised skill in the MAST90139 materials.
Exponential family
The class of distributions (normal, binomial, Poisson, etc.) whose log-density can be written in the form (yθ − b(θ))/a(φ) + c(y,φ); it underpins every GLM, with mean b'(θ) and variance a(φ)·b''(θ).
IRWLS (iteratively re-weighted least squares)
The algorithm, equivalent to Fisher scoring, that R's glm() uses to find the maximum-likelihood estimate β̂ when no closed form exists, repeatedly solving the weighted least-squares system XᵀWX·β = XᵀWz.
Proportional-odds (cumulative logit) model
An ordinal-response model where a single slope vector γ applies at every cut-point (only the thresholds θ_r differ), so e^γ is a cumulative odds ratio that is the same at all thresholds. Fitted in R with MASS::polr; note that polr enters the linear predictor with a minus sign, P(Y≤r) = F(θ_r − xᵀβ), so its printed slopes have the opposite sign to γ.
FAQ

MAST90139: common questions

How is MAST90139 assessed?
MAST90139 is assessed by written assignments completed during the semester (up to around 20 pages of work) plus a final written examination held in the end-of-semester exam period. There IS a final exam. The exact weighting split between assignments and exam is set on the official University of Melbourne handbook and your subject LMS each year, so confirm it there.
What are the prerequisites for MAST90139?
You must have completed MAST90104 A First Course in Statistical Learning and MAST90105 Methods of Mathematical Statistics. It is a graduate subject within the Master of Data Science / statistics stream, so a solid background in mathematical statistics and regression is assumed.
What does MAST90139 actually cover?
Core statistical models for data science: linear models, generalized linear models, mixed (random-effects) models and non-parametric regression, applied to dependent data (time series, longitudinal and spatial data), plus an introduction to causal inference and handling missing data. Models are fitted and interpreted in practice, typically using R.
How heavy is the R / computing workload in MAST90139?
Substantial. The written assignments centre on fitting, diagnosing and interpreting models on real datasets, which in practice means regular hands-on work in statistical software (primarily R). Comfort with R and reading model output is important for both the assignments and the exam.
Is it okay to use AskSia for MAST90139 under Melbourne's academic integrity policy?
Use it as a study aid. Sia helps you understand GLMs, mixed models, R output and worked steps; that aligns with the University of Melbourne's policy on AI-assisted study. Submitting AI-generated work as your own, or using it on assessment where it isn't permitted, is misconduct. Treat it like a tutor: learn from it, don't substitute it for your own work.

AskSia is an independent study aid and is not affiliated with, endorsed by, or sponsored by The University of Melbourne. Course details may change; always confirm against the official handbook.