MAST90139 · Statistical Modelling for Data Science
Graduate statistics: linear and generalized linear models, mixed models, and modelling dependent data, assessed by assignments plus a final exam.
MAST90139 is a graduate subject in the School of Mathematics and Statistics that builds the modern statistical-modelling toolkit data scientists rely on: linear models, generalized linear models (GLMs), mixed models and non-parametric regression, applied to time series, longitudinal and spatial data, with an introduction to causal inference and handling missing data. In practice it runs as a GLM-and-categorical-data course taught in R (built on Julian Faraway's Extending the Linear Model with R), where the single most-examined skill is reading R glm()/multinom()/polr() output and turning a coefficient into an odds ratio, rate ratio or cumulative-odds statement. It is assessed by written assignments across the semester plus a final written examination.
Built from 100 real MAST90139 course materials in the AskSia Library — including the full Ch1–Ch7 GLM lecture decks, weekly R practicals (prac1–prac11 with answer keys) and the three 2026 assignments.
What MAST90139 is about
MAST90139 Statistical Modelling for Data Science is a 12.5-point graduate subject offered by the University of Melbourne's School of Mathematics and Statistics. It develops the core statistical models underlying data-science practice — linear models, generalized linear models, mixed (random-effects) models and non-parametric regression — and shows how to apply them to dependent data such as time series, longitudinal/repeated-measures data and spatially correlated data. The subject also introduces methods for causal inference and for handling missing or incomplete data. Students learn both the modelling framework and its limitations, and how to fit and interpret these models on real datasets, typically using R.
The MAST90139 syllabus, topic by topic
Review of linear models & statistical inference
Multiple linear regression, least squares, the Gauss-Markov framework, and inference for model parameters as the foundation for everything that follows.
Generalized linear models (GLMs)
The course's spine. A GLM has three parts: an exponential-family random component, a linear predictor η = xᵀβ, and a link function g(μ) = η. From b(θ) you get the mean μ = b'(θ) and variance Var(Y) = a(φ)·b''(θ); the canonical links are log (Poisson), logit (binomial) and identity (normal). Outside the normal/identity case the MLE generally has no closed form, so β is found by Fisher scoring, which can be written as iteratively re-weighted least squares (IRWLS), the algorithm R's glm() runs. Logistic regression (logit link) gives e^β as an odds ratio; Poisson regression (log link) gives e^β as a rate ratio, with offset = log(exposure) added to the linear predictor when modelling rates such as events per person-year.
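The Fisher-scoring/IRWLS update described above can be sketched for the simplest logistic case (an intercept plus one covariate). This is a minimal illustration in Python rather than R, assuming a tiny made-up dataset; the function name `irwls_logistic` and the data are hypothetical and are not taken from the course materials, and R's glm() adds safeguards that this sketch omits.

```python
import math


def irwls_logistic(x, y, max_iter=25, tol=1e-10):
    """Fit logit(pi_i) = b0 + b1 * x_i by Fisher scoring written as IRWLS.

    Each pass solves the 2x2 weighted least-squares system
    (X'WX) beta = X'Wz, with weights W and working response z
    recomputed from the current estimate.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Running sums that make up X'WX and X'Wz.
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi                    # linear predictor
            mu = 1.0 / (1.0 + math.exp(-eta))     # inverse logit link
            w = mu * (1.0 - mu)                   # weight for the logit link
            z = eta + (yi - mu) / w               # working response
            s_w += w
            s_wx += w * xi
            s_wxx += w * xi * xi
            s_wz += w * z
            s_wxz += w * xi * z
        # Solve the 2x2 system by Cramer's rule.
        det = s_w * s_wxx - s_wx ** 2
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


# Hypothetical, non-separable toy data (not from the course).
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = irwls_logistic(x, y)
print("intercept:", b0, "slope:", b1, "odds ratio per unit x:", math.exp(b1))
```

On real data you would call glm(y ~ x, family = binomial) in R; the point here is only that each iteration is one weighted least-squares solve of XᵀWX·β = XᵀWz.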
Model selection & diagnostics
Deviance D = 2[ℓ(saturated) − ℓ(model)] is the GLM analogue of the residual sum of squares and the basis of model comparison: for nested models m0 ⊂ m1 with q₀ < q₁ parameters, ΔD = D₀ − D₁ is approximately χ²(q₁−q₀) under the smaller model, which is the likelihood-ratio test (anova(m0, m1, test="Chi") in R). Three asymptotically equivalent tests of the same hypothesis must not be confused: likelihood-ratio (ΔD), Wald (the z value in summary()), and score. Model choice also uses AIC = D + 2q and BIC = D + q·log n (each defined up to an additive constant that is the same for every model fitted to the same data), with step()/drop1() for backward elimination. A key exam trap: for ungrouped/binary data the residual deviance is NOT a goodness-of-fit statistic (use Hosmer–Lemeshow instead), though its change between nested models is still a valid test. Diagnostics: deviance vs Pearson residuals, QQ/half-normal plots, leverage (hatvalues) and Cook's distance.
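The deviance comparison above can be checked by hand when the larger model adds exactly two parameters, because the χ²(2) upper-tail probability has the closed form exp(−x/2); that is also where the familiar 5% critical value 5.99 comes from. The sketch below is in Python rather than R, and the deviances, parameter counts and sample size are hypothetical numbers chosen only for illustration.

```python
import math


def lr_test_2df(dev_small, dev_big):
    """Likelihood-ratio test when the bigger model adds exactly 2 parameters.

    For 2 degrees of freedom the chi-square upper-tail probability is
    exp(-x / 2), so no statistics library is needed.
    """
    delta_d = dev_small - dev_big
    p_value = math.exp(-delta_d / 2.0)
    return delta_d, p_value


def aic(deviance, q):
    """AIC on the deviance scale: D + 2q."""
    return deviance + 2 * q


def bic(deviance, q, n):
    """BIC on the deviance scale: D + q * log(n)."""
    return deviance + q * math.log(n)


# 5% critical value of chi-square(2): solve exp(-x / 2) = 0.05.
crit_95 = -2.0 * math.log(0.05)

# Hypothetical nested fits: m0 has q0 = 2 parameters, m1 has q1 = 4.
d0, q0 = 32.0, 2
d1, q1 = 24.0, 4
n_obs = 50

delta_d, p_value = lr_test_2df(d0, d1)
print("critical value:", round(crit_95, 2))
print("delta D:", delta_d, "p-value:", round(p_value, 4))
print("AIC m0, m1:", aic(d0, q0), aic(d1, q1))
print("BIC m0, m1:", round(bic(d0, q0, n_obs), 2), round(bic(d1, q1, n_obs), 2))
```

With these made-up numbers ΔD exceeds the critical value, so the two extra terms in m1 would be judged significant at the 5% level; in R the same comparison is anova(m0, m1, test="Chi").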
Mixed models & random effects
Linear and generalized linear mixed models for grouped/clustered data, separating fixed effects from random variation between units.
Longitudinal & correlated data
Modelling repeated measurements on the same subjects over time, including within-subject correlation structures.
Time series modelling
Applying statistical models to time-indexed data and accounting for temporal dependence.
Spatial data
Models for spatially correlated observations where nearby locations are statistically dependent.
Non-parametric regression & smoothing
Flexible regression methods (e.g. smoothing/splines) that relax the linear functional-form assumption.
Missing & incomplete data
Missingness mechanisms and principled approaches to estimation and imputation when data are incomplete.
Causal inference
An introduction to drawing causal conclusions from data, distinguishing association from causation.
Computational implementation in R
Fitting, interpreting and validating the above models on real datasets using statistical software, primarily R.
How MAST90139 is assessed
Final exam: Yes

| Component | Weight | Note |
|---|---|---|
| Written assignments (across the semester) | Coursework component | Up to approximately 20 pages of written assignments completed during the teaching period (handbook estimates around 20 hours of work). These typically involve fitting and interpreting models on data in R. |
| Final written examination (end of semester) | Examination component | An end-of-semester written examination held during the official exam period. Past exam papers (e.g. 2022 Semester 1) confirm a final exam is part of the assessment. |

Assessed by written assignments during the semester plus a final written examination at the end of semester. Confirm the exact assignment/exam weight split on the official handbook and your subject LMS each year.
When each MAST90139 task is due
Test yourself: MAST90139 practice questions
- The probability of success increases by 1.51 for each one-unit increase in x.
- The odds of success are multiplied by about 1.51 for each one-unit increase in x, holding other covariates fixed.
- x explains 51% of the variation in the response.
- The response increases by 0.41 units for each one-unit increase in x.
- ΔD > 5.99, so the extra terms in m1 are significant; this is the likelihood-ratio (LR) test.
- ΔD > 5.99, so the smaller model m0 is preferred; this is the Wald test.
- ΔD < 5.99, so the models are equivalent; this is the score test.
- Deviance cannot be compared between models; you must use the raw AIC values only.
- It estimates a separate slope vector for each response category, like a baseline-category logit.
- It uses one common slope vector across all cut-points (only the thresholds differ), so e^γ is a cumulative odds ratio that is the same at every threshold.
- It requires the response to be modelled as Poisson counts with a log link.
- The slopes printed by polr can be interpreted directly with no sign adjustment.
High-value exam questions in MAST90139
A worked MAST90139 problem
Reading a logistic-regression coefficient as an odds ratio (the #1 examined skill)
In the Western Collaborative Group Study (the faraway `wcgs` heart-disease data used in Ch2), a binary logistic regression models coronary heart disease (chd) on cigarettes smoked per day and height. The R summary() reports the cigarettes coefficient as Estimate β̂ = 0.0231 (log-odds scale), with a small Std. Error and significant Pr(>|z|). What does this coefficient mean, and how would you state the effect of smoking on heart-disease risk?
The logistic model is logit(π) = log[π/(1−π)] = β₀ + β₁·cigs + β₂·height, so β̂ is on the log-odds scale and is not directly interpretable as a change in probability. Exponentiate to get the odds ratio: e^β̂ = e^0.0231 ≈ 1.023. Interpret multiplicatively: holding height fixed, each extra cigarette per day multiplies the estimated odds of CHD by about 1.023, i.e. raises the odds by (e^0.0231 − 1)×100% ≈ 2.3%. A 95% Wald CI for the OR is (e^(β̂−1.96·se), e^(β̂+1.96·se)); if it excludes 1 the effect is significant at the 5% level. Watch the standard exam trap: a model can show high overall accuracy yet near-zero sensitivity for a rare event (the wcgs CHD model has specificity ≈ 0.999 but sensitivity ≈ 0.008), so always check sensitivity and the ROC/AUC (here AUC ≈ 0.737), not just accuracy.
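The arithmetic in this worked problem can be reproduced in a few lines. The sketch below uses Python rather than R and takes β̂ = 0.0231 from the text; the standard error `se_assumed` is a hypothetical placeholder (the text only says it is small), so the interval it produces is illustrative and is not the course's reported result.

```python
import math

beta_hat = 0.0231     # cigarettes coefficient on the log-odds scale (from the text)
se_assumed = 0.004    # hypothetical standard error, for illustration only

# Odds ratio for one extra cigarette per day, holding height fixed.
odds_ratio = math.exp(beta_hat)
pct_increase = (odds_ratio - 1.0) * 100.0

# 95% Wald interval: build it on the log-odds scale, then exponentiate.
ci_low = math.exp(beta_hat - 1.96 * se_assumed)
ci_high = math.exp(beta_hat + 1.96 * se_assumed)

# Effects multiply on the odds scale, so 20 extra cigarettes per day
# scale the odds by exp(20 * beta_hat), not by 20 * odds_ratio.
odds_ratio_20 = math.exp(20 * beta_hat)

print("OR per cigarette:", round(odds_ratio, 3))
print("percent increase in odds:", round(pct_increase, 1))
print("illustrative 95% CI:", round(ci_low, 3), round(ci_high, 3))
print("OR for 20 extra cigarettes per day:", round(odds_ratio_20, 2))
```

In R the same quantities come from exp(coef(fit)) and exp(confint.default(fit)), where `fit` is a hypothetical name for the fitted glm object; confint.default gives the Wald interval used here.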
MAST90139 glossary
- Generalized linear model (GLM)
- A regression framework that extends linear models to exponential-family responses (e.g. binary, count) via a link function.
- Link function
- The function relating the linear predictor to the mean of the response in a GLM (e.g. logit, log).
- Mixed model / random effects
- A model containing both fixed effects and random effects to capture variation between groups or clusters.
- Longitudinal data
- Repeated measurements taken on the same subjects over time, requiring within-subject correlation to be modelled.
- Deviance
- A measure of lack of fit for a GLM, generalizing the residual sum of squares and used in model comparison.
- Non-parametric regression
- Regression that does not assume a fixed functional form, letting the data determine the shape of the relationship (e.g. smoothing, splines).
- Missing data mechanism (MCAR / MAR / MNAR)
- Classification of why data are missing, which determines whether estimates can be unbiased and which method applies.
- Imputation
- Replacing missing values with estimated ones (e.g. multiple imputation) so standard analyses can proceed.
- Causal inference
- Methods for estimating the effect of an intervention or treatment, going beyond mere statistical association.
- Spatial correlation
- Statistical dependence between observations based on their geographic proximity.
- Odds ratio (OR) / rate ratio (RR)
- The exponentiated GLM coefficient e^β: a multiplicative effect on the odds of success (odds ratio, logistic/logit) or on the expected count or rate (rate ratio, Poisson/log). Interpreting these is the single most-examined skill in MAST90139.
- Exponential family
- The class of distributions (normal, binomial, Poisson, etc.) whose density can be written in the canonical form f(y; θ, φ) = exp{(yθ − b(θ))/a(φ) + c(y, φ)}; it underpins every GLM, with mean b'(θ) and variance a(φ)·b''(θ).
- IRWLS (iteratively re-weighted least squares)
- The algorithm, equivalent to Fisher scoring, that R's glm() uses to find the maximum-likelihood estimate β̂ when no closed form exists, repeatedly solving the weighted least-squares equations XᵀWX·β = XᵀWz with the weights W and working response z recomputed at each step.
- Proportional-odds (cumulative logit) model
- An ordinal-response model where a single slope vector γ applies at every cut-point (only the thresholds θ_r differ), so e^γ is a cumulative odds ratio that is the same at all thresholds. Fitted in R with MASS::polr; note that polr parameterises the model as logit P(Y ≤ r) = θ_r − xᵀγ, so its printed slopes have the opposite sign to a model written with + xᵀγ.
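The proportional-odds entry above can be made concrete with a small numerical check: under logit P(Y ≤ r) = θ_r − γ·x (the parameterisation polr uses), a one-unit increase in x changes the cumulative odds by the same factor at every cut-point. The sketch below is in Python rather than R, and the thresholds and slope are hypothetical values chosen only to illustrate the property.

```python
import math


def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))


def cumulative_probs(thresholds, gamma, x):
    """P(Y <= r | x) for each cut-point r, using logit P(Y <= r) = theta_r - gamma * x."""
    return [expit(theta - gamma * x) for theta in thresholds]


def odds(p):
    return p / (1.0 - p)


# Hypothetical parameters for a 4-category ordinal response (3 cut-points).
thresholds = [-1.0, 0.5, 2.0]
gamma = 0.7

p_at_0 = cumulative_probs(thresholds, gamma, 0.0)
p_at_1 = cumulative_probs(thresholds, gamma, 1.0)

# Ratio of the odds of Y <= r at x = 1 versus x = 0, one value per cut-point.
ratios = [odds(p1) / odds(p0) for p0, p1 in zip(p_at_0, p_at_1)]

# Category probabilities at x = 0 are successive differences of the cumulative ones.
cum_with_ends = [0.0] + p_at_0 + [1.0]
category_probs = [b - a for a, b in zip(cum_with_ends, cum_with_ends[1:])]

print("cumulative odds ratios:", [round(r, 4) for r in ratios])
print("exp(-gamma):", round(math.exp(-gamma), 4))
print("category probabilities at x = 0:", [round(p, 3) for p in category_probs])
```

Every ratio equals exp(−γ): with the minus sign in front of γ, a positive slope lowers the odds of falling at or below each cut-point, which is the same as multiplying the odds of a higher category by e^γ. That is the sign adjustment to keep in mind when reading polr output.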
MAST90139: common questions
How is MAST90139 assessed?
What are the prerequisites for MAST90139?
What does MAST90139 actually cover?
How heavy is the R / computing workload in MAST90139?
Is it okay to use AskSia for MAST90139 under Melbourne's academic integrity policy?
AskSia is an independent study aid and is not affiliated with, endorsed by, or sponsored by The University of Melbourne. Course details may change, so always confirm against the official handbook.