UniMelb · MAST90139 · Statistical Modelling for Data Science

MAST90139: pass the exams, not just read the notes

Your complete guide to University of Melbourne's statistical modelling for data science unit. See where the marks are, work real practice questions, and study with an AI tutor that knows MAST90139.

12.5 credit points · Graduate (Master of Data Science) · Offered S1 · ~55% exams · School of Mathematics and Statistics

Sia generates MAST90139 practice questions, walks through binary response (logistic regression) and the generalised linear model framework step by step, and quizzes you on the material the exam weights most heavily.

Try a real exam-style question

Worked example

Multiple choice · solution revealed after you answer

You fit a binary logistic regression in R, glm(chd ~ smoker, family = binomial), where chd is coronary heart disease (1 = yes) and smoker is 1 for smokers and 0 for non-smokers. Alongside the intercept, the R summary reports the smoker coefficient (Estimate) as 0.693. With no other predictors in the model, what is the estimated effect of being a smoker on the ODDS of heart disease?

Worked solution

In logistic regression the coefficient is on the LOG-ODDS scale: the smoker estimate beta = 0.693 is the change in log-odds of heart disease for smokers versus non-smokers, not the odds itself.

Convert to the odds ratio by exponentiating: OR = e^beta = e^0.693. Since 0.693 is approximately ln(2), e^0.693 is approximately 2.0.
Interpretation: the odds of heart disease for smokers are about 2.0 times the odds for non-smokers, i.e. they roughly double, an increase of about (2.0 − 1) × 100% = 100%.
So the answer is option 1: the odds roughly double (OR approximately 2.0).

The trap: reading the coefficient 0.693 straight off the printout as if it were the odds ratio (option 0) or as a percentage-point change in probability (option 3). The R Estimate is on the log-odds scale, so you must exponentiate to get the odds ratio. A positive coefficient gives an OR above 1 (the odds increase), so e^(-0.693), approximately 0.5 (option 2), has the direction backwards. Only e^(+0.693), approximately 2.0, is correct.
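
The exponentiation step in the worked solution can be checked numerically. The sketch below uses Python purely as a calculator (in R, exp(0.693) returns the same number); the variable names are illustrative and do not come from the unit materials.

```python
import math

beta_smoker = 0.693  # R Estimate for smoker, on the log-odds scale

# Exponentiate to move from the log-odds scale to the odds-ratio scale
odds_ratio = math.exp(beta_smoker)
pct_change_in_odds = (odds_ratio - 1) * 100

# Flipping the sign gives the reciprocal ratio (non-smokers versus smokers),
# which is the wrong direction for the question as asked
reversed_ratio = math.exp(-beta_smoker)

print(f"OR = {odds_ratio:.2f}, change in odds = {pct_change_in_odds:.0f}%")
print(f"exp(-beta) = {reversed_ratio:.2f}")
```

The two ratios multiply to 1, which is a quick way to catch a sign slip under exam conditions.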

Where your grade comes from

Exams 55% · Assignments 45%

Assignment 1 · 15% | Assignment 2 · 15% | Assignment 3 · 15% | Final · 55%

One exam decides 55% of your grade. It covers the full GLM arc, Ch1 to Ch7, weighted toward reading R output and toward deviance and odds-ratio interpretation. This whole page is built around that.

Overview

What MAST90139 is, and where it sits

MAST90139 Statistical Modelling for Data Science is the University of Melbourne's graduate course on the generalised linear model (GLM) and its families, a core subject in the Master of Data Science taught out of the School of Mathematics and Statistics. It starts from a review of the normal linear model and then systematically extends it in four directions, so that one framework handles binary, count, categorical and clustered responses. The whole subject is taught and assessed in R, built around Julian Faraway's Extending the Linear Model with R and its faraway package datasets (gavote, wcgs, pneumo and the like).

The arc runs through seven chapters: a linear-model review (Ch1), binary logistic regression with odds ratios, diagnostics and ROC and AUC (Ch2), the general GLM machinery of the exponential family, link functions and IRLS estimation with deviance-based inference (Ch3), binomial and grouped logistic regression for categorical data (Ch4), log-linear models for contingency tables (Ch5), multicategory nominal and ordinal responses by baseline-category and proportional-odds logit (Ch6), and random and mixed-effects models for correlated data (Ch7). One reflex is trained throughout: given a printed R coefficient table, name the right model and link, do the deviance arithmetic to test fit or compare nested models, and translate a coefficient into a real-world odds ratio, rate ratio or cumulative-odds statement.

Assessment is three substantial R assignments submitted to Gradescope (binary and grouped logistic, binomial dose-response, and ordinal and multinomial coal-miner data) plus a closed-book final exam worth 55% of the mark. The subject assumes a solid grounding in linear models, probability and matrix algebra, and it is mathematically and computationally demanding: you are expected to both derive the theory by hand and run and interpret the models in R.

How it differs from related units. COMP90038 Algorithms and Complexity is the closest same-university graduate quantitative sibling: it teaches the algorithmic and computational-cost side of data science, where MAST90139 teaches the statistical-modelling side (how to fit, check and interpret models of real data). Together they cover the algorithm and statistics halves of a data-science toolkit. ECON10004 Introductory Microeconomics is an applied first-year unit that uses introductory statistical reasoning, with a much lighter quantitative load than this graduate GLM subject.

Difficulty & time commitment

Is MAST90139 hard, and how much time does it take?

MAST90139 is manageable if you keep a weekly rhythm and treat the back half as the main event. Across student reviews the pattern is consistent: it starts gently and steepens, and the heaviest assessment is the part that separates grades.

Difficulty

3.7 / 5

Hard. Gentle early, demanding back half. Hard to fail with steady work; an HD takes consistent practice.

Exam load

55%

The final exam decides most of the grade: it is the heaviest single component at 55%.

Weekly time

~11 hrs

The standard load for a 12.5-credit-point unit, a little under 1 hour per credit point per week including class.

Ch1 to Ch2 (linear-model review, binary logistic) · building

Ch3 to Ch5 (GLM framework, categorical and log-linear) · steep

Ch6 to Ch7 (multicategory and mixed-effects models) · most abstract

The difficulty curve and the assessment weighting point the same way: the back half is harder and worth more. Front-loading effort there is the highest-return decision in the unit.

Is this unit for you

Who tends to do well, and who tends to struggle

You will likely do well if

You are fluent with linear regression in matrix form, probability and basic matrix algebra, so the GLM extension builds on solid foundations rather than shaky ones.
You keep up with the weekly R practicals and can read a glm, multinom or polr summary fluently: Estimate, Std. Error, z value, null and residual deviance and AIC.
You drill the two core moves until they are automatic: exponentiating a coefficient into an odds ratio or rate ratio, and computing the change in deviance between nested models as a chi-squared likelihood-ratio test.
You build the one permitted A4 notes page early and rehearse the exponential-family, link-function and deviance results on it rather than discovering it the night before.

You may struggle if

You are rusty on linear models, probability or matrix algebra; the exponential-family and IRLS material assumes them and moves fast.
You can run glm in R but cannot interpret the output: stating an odds ratio without exponentiating, or quoting a small residual deviance as goodness of fit for ungrouped binary data.
You leave the abstract middle and back (Ch3 GLM theory, Ch5 log-linear models, Ch6 multinomial and ordinal) to cram, even though they carry the heaviest exam weight.
You rely on having software and notes in the exam; it is closed-book with only one A4 page, so the deviance and odds-ratio arithmetic has to be in your hands.

What HD students do differently

Master the interpretation reflex: for any fitted coefficient, state e to the beta as an odds ratio (logistic), rate ratio (Poisson and log-linear) or cumulative odds ratio (proportional-odds), with the correct one-unit-change wording and a 95% confidence interval where asked.
Know the deviance machinery cold: residual deviance versus chi-squared for goodness of fit on grouped data, change in deviance versus chi-squared with the right degrees of freedom for nested models, and why deviance is not a goodness-of-fit statistic for ungrouped binary data (use Hosmer-Lemeshow).
Keep the three families straight: logistic and binomial (logit link), Poisson and log-linear (log link, offsets for rates), and the multicategory split between baseline-category logit (nominal) and proportional-odds logit (ordinal, watch the polr sign convention).
Work the past R printouts by hand: name the model and link, do the deviance arithmetic, then write the plain-English interpretation, exactly the move the three assignments and the final exam reward.
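
The deviance arithmetic in the list above can be sketched as follows. This is a minimal illustration, not the unit's own code: the deviances and degrees of freedom are hypothetical numbers of the kind printed in a glm summary, and chi2_sf is a helper written here so the example runs with the standard library only (in R, pchisq(x, df, lower.tail = FALSE) does the same job).

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for chi-squared with integer df >= 1."""
    if x <= 0:
        return 1.0
    half_x = x / 2.0
    if df % 2 == 1:
        tail, a = math.erfc(math.sqrt(half_x)), 0.5  # exact tail for df = 1
    else:
        tail, a = math.exp(-half_x), 1.0             # exact tail for df = 2
    # Add two degrees of freedom per step via the incomplete-gamma recurrence
    while a < df / 2.0:
        tail += half_x ** a * math.exp(-half_x) / math.gamma(a + 1.0)
        a += 1.0
    return tail

# Hypothetical residual deviances from two nested models on grouped data
dev_small, df_small = 32.4, 18   # smaller model
dev_large, df_large = 24.1, 16   # larger model (two extra parameters)

# Likelihood-ratio test: change in deviance against chi-squared on the df difference
delta_dev = dev_small - dev_large
delta_df = df_small - df_large
lr_p_value = chi2_sf(delta_dev, delta_df)

# Goodness of fit (grouped data only): residual deviance against its own df
gof_p_value = chi2_sf(dev_large, df_large)

print(f"LR test: {delta_dev:.1f} on {delta_df} df, p = {lr_p_value:.4f}")
print(f"Goodness of fit: {dev_large} on {df_large} df, p = {gof_p_value:.3f}")
```

In a closed-book exam the same comparison is made against a tabulated critical value rather than a computed p-value; the two common 5% values, 3.84 on 1 df and 5.99 on 2 df, are worth having on the notes page.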

Syllabus

The 7 topics, chapter by chapter

The exam-weight marker on each topic shows where the marks concentrate: topics marked "High exam weight" carry the most.

Ch1

Ch1 · Linear model review and the road to GLM

Faraway, Extending the Linear Model with R; gavote case study

The normal linear model in matrix form, OLS equals MLE, the hat matrix and leverage, the LINE assumptions and diagnostics, weighted least squares, transformation and variable selection, set up as the four extensions that lead to the GLM.

Lower exam weight

Ch2

Ch2 · Binary response (logistic regression)

wcgs heart-disease data; Faraway Ch2

Binary logistic regression, the logit link and inverse logit, odds and odds ratios as e to the beta, logistic inference, deviance and Pearson residuals, model selection by AIC and BIC, goodness of fit, and classification by sensitivity, specificity, the ROC curve and AUC.

High exam weight · Quiz me on binary response (logistic regression) →

Ch3

Ch3 · The generalised linear model framework

Exponential family and IRLS; Faraway Ch6 to Ch8

The three components of a GLM, the exponential family with canonical parameter and dispersion, mean and variance from b'(theta) and b''(theta), link functions, MLE by Fisher scoring and iteratively reweighted least squares, deviance, and the likelihood-ratio, Wald and score tests.

High exam weight · Quiz me on the generalised linear model framework →

Ch4

Ch4 · Categorical data analysis by logistic regression

Binomial and grouped logistic; Faraway Ch4

Binomial and grouped logistic regression, testing fit adequacy and term significance, residuals, dose-response and LD50, overdispersion and the quasi-binomial fix, with worked examples on car preference and voting behaviour.

High exam weight · Quiz me on categorical data analysis by logistic regression →

Ch5

Ch5 · Contingency tables by log-linear models

Poisson log-linear models; Faraway Ch5

Log-linear (Poisson) models for two-way and higher contingency tables, independence as no interaction, the cross-product (odds) ratio, the logistic-to-log-linear relation, hierarchical models, collapsibility, uniform association and model selection.

High exam weight · Quiz me on contingency tables by log-linear models →

Ch6

Ch6 · GLMs for multicategorical responses

Multinomial and ordinal models; Faraway Ch7, Agresti

Nominal responses by baseline-category (multinomial) logit with the random-utility motivation, and ordinal responses by cumulative threshold models, the proportional-odds model and its constant cumulative odds ratio, plus sequential and adjacent-categories logit, fitted with multinom and polr.

High exam weight · Quiz me on GLMs for multicategorical responses →

Ch7

Ch7 · Random and mixed-effects models

lme4; Faraway, mixed-models chapters

Random and mixed-effects models for clustered and correlated normal and non-normal data, two-stage linear random-effects, random effects in GLMs, posterior-mode and integration-based estimation, and marginal versus conditional models, fitted with lmer and glmer.

Lower exam weight

How it's assessed

Assessment structure

Component	Weight	Format & timing
Assignment 1	15%	R-based logistic-regression study (binary and grouped logistic on domestic-violence predictors), submitted to Gradescope via Canvas. Due 11:59pm Thu 2 Apr 2026 (date confirmed in source; weight allocation across the three assignments per the official MAST90139info.pdf). Covers the linear-model review and binary and grouped logistic regression (Ch1, Ch2 and Ch4).
Assignment 2	15%	R-based binomial dose-response study (beetle-mortality data, comparison of logit, probit and complementary log-log links), submitted to Gradescope. Due 11:59pm Fri 1 May 2026 (date confirmed in source). Covers binomial and grouped logistic regression and link comparison (Ch4).
Assignment 3	15%	R-based ordinal and multinomial study (coal-miner pneumoconiosis, three-category ordinal response), submitted to Gradescope. Due 11:59pm Sun 31 May 2026 (date confirmed in source). Covers multicategory nominal and ordinal responses (Ch6).
Final exam	55%	Closed-book written exam, 3 hours writing plus 15 minutes reading: 8 questions worth 110 marks in total. Non-programmable calculator and one double-sided A4 page of student notes permitted. Held in the Semester 1 examination period (the source cover sheet is for Semester 1 2026; confirm the date against the official exam timetable). Covers the full GLM arc, Ch1 to Ch7, weighted toward reading R output and the deviance and odds-ratio interpretation skills.

Pass on a weighted average of at least 50%. A compulsory plagiarism declaration quiz must be completed (a gate, not weighted). No further single-component hurdle is stated in the materials reviewed.
Final exam: 8 questions worth 110 marks across the GLM arc. Expect to read a glm, multinom or polr summary or an anova(..., test = "Chi") table, do the deviance or change-in-deviance-versus-chi-squared arithmetic by hand, and translate coefficients into odds ratios, rate ratios or cumulative-odds statements. The exam is worth 55% of the final mark.
Calculator policy: a non-programmable calculator is permitted in the final exam, along with one double-sided A4 page of handwritten student notes.

If you read nothing else

This is an exam-heavy unit. With the final exam alone at 55% of the grade, your result is largely decided by how well you perform under time pressure. The exam covers the full GLM arc, Ch1 to Ch7, weighted toward reading R output and the deviance and odds-ratio interpretation skills.

Final exam timing: Semester 1 2026 examination period (the unit is offered in S1 and the assignment due dates run from April to May 2026). Confirm the exact date and venue on the official exam timetable.

How to actually pass it

A weekly rhythm, two checklists, and the traps to avoid

The unit rewards consistency over cramming, and practice over re-reading. Here is the loop that works, then what to have nailed by mid-semester and before the final exam.

The weekly loop

Before lecture

Skim the relevant Faraway chapter so the lecture confirms rather than introduces, and note the dataset (gavote, wcgs, pneumo and so on) the chapter is built on.

During lecture

Write out each model, link function and key formula by hand (exponential-family form, link, deviance) rather than only following slides; mark which R function fits it.

In the computer-lab tutorial

Run that week's R practical yourself in R, then read every line of the glm, multinom or polr output and translate each coefficient into an odds ratio, rate ratio or cumulative-odds statement.

End of each chapter

Add the chapter's model, link, deviance result and interpretation rule to your single A4 notes page, and re-derive the key formula from scratch once.

Mid-semester checklist

Be fluent with the linear-model review (matrix form, hat matrix, leverage, LINE assumptions) so the GLM extension has a foundation.
Drill binary logistic regression: the logit link and inverse logit, exponentiating coefficients into odds ratios, and reading the z value, deviances and AIC in the R summary.
Practise the classification block: confusion matrix, sensitivity and specificity, the ROC curve and AUC, and why high accuracy can hide near-zero sensitivity for rare events.
Lock down the exponential-family and link-function definitions and the deviance formula for the binomial and Poisson families.
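
The rare-event warning in the classification item above is easy to see with numbers. The confusion-matrix counts below are hypothetical (1000 subjects, 50 true events, a classifier that almost never predicts the event); they are not taken from wcgs or any other course dataset.

```python
# Hypothetical confusion-matrix counts for a rare outcome
true_pos = 2     # events correctly flagged
false_neg = 48   # events missed
false_pos = 1    # non-events wrongly flagged
true_neg = 949   # non-events correctly passed

total = true_pos + false_neg + false_pos + true_neg

accuracy = (true_pos + true_neg) / total
sensitivity = true_pos / (true_pos + false_neg)   # true-positive rate
specificity = true_neg / (true_neg + false_pos)   # true-negative rate

print(f"accuracy    = {accuracy:.3f}")
print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
```

Accuracy is above 95% here even though only 2 of the 50 events are detected, which is why sensitivity and specificity are reported separately and why the ROC curve looks across all thresholds rather than one.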

Before the final: heaviest topics

Prioritise the heaviest chapters: binary logistic (Ch2), the GLM framework (Ch3), and the multicategory models (Ch6), then log-linear models (Ch5) and binomial and grouped logistic (Ch4).
Rehearse the interpretation reflex on real R printouts: odds ratio for logistic, rate ratio for Poisson and log-linear, cumulative odds ratio for proportional-odds, each with the correct wording.
Drill the deviance arithmetic: residual deviance versus chi-squared for grouped goodness of fit, and change in deviance versus chi-squared with correct degrees of freedom for nested-model likelihood-ratio tests.
Build and memorise the one permitted double-sided A4 notes page: link functions, deviance formulas, the three tests (LR, Wald, score), the proportional-odds and baseline-category logit forms and the polr sign warning.
Time yourself across 8-question, 110-mark practice so the closed-book derivation-plus-interpretation pace is familiar.

The mistakes that cost marks

Reading a coefficient as the odds ratio. The R Estimate is on the log-odds (or log-rate) scale. You must exponentiate: e to the beta is the odds ratio or rate ratio. Quoting the raw coefficient as the OR, or forgetting that a negative coefficient means an OR below 1, is the single most common interpretation error and it loses easy marks throughout the exam.
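
A negative coefficient and its Wald interval can be worked through as a sketch. The estimate and standard error below are hypothetical values standing in for one row of an R coefficient table, and 1.96 is the usual normal quantile for a 95% interval.

```python
import math

estimate = -0.405    # hypothetical Estimate on the log-odds scale
std_error = 0.200    # hypothetical Std. Error
z_crit = 1.96        # normal quantile for a 95% Wald interval

# Build the interval on the log-odds scale first, then exponentiate the end points
lower_log = estimate - z_crit * std_error
upper_log = estimate + z_crit * std_error

odds_ratio = math.exp(estimate)
ci_lower = math.exp(lower_log)
ci_upper = math.exp(upper_log)

print(f"OR = {odds_ratio:.3f}, 95% CI ({ci_lower:.3f}, {ci_upper:.3f})")
```

Because the estimate is negative the odds ratio falls below 1, and in this example the whole interval stays below 1. The interval is not symmetric around the odds ratio, since the symmetry holds on the log-odds scale only.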

Quoting residual deviance as goodness of fit for binary data. For ungrouped or binary data (group size 1) the residual deviance is a function of the fitted probabilities alone and is not a valid goodness-of-fit statistic. Use Hosmer-Lemeshow for binary goodness of fit. The change in deviance between nested models is still a valid likelihood-ratio test in all cases.

Confusing nominal and ordinal multinomial models. Baseline-category (multinomial) logit has its own coefficient set for each non-baseline category; proportional-odds logit has one slope vector plus the threshold intercepts. They are different models with different parameter counts. Also remember the polr sign convention: multiply polr slopes by -1 before stating the odds-ratio direction in the lecture's notation.
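
The polr sign convention can be illustrated with a small sketch. MASS::polr parametrises the model as logit P(Y <= j) = threshold_j - beta * x; the thresholds and slope below are hypothetical, and the function names are made up for this example.

```python
import math

def inv_logit(t):
    """Inverse logit: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-t))

def category_probs(x, thresholds, beta):
    """Category probabilities under logit P(Y <= j) = threshold_j - beta * x."""
    cumulative = [inv_logit(t - beta * x) for t in thresholds] + [1.0]
    probs = [cumulative[0]]
    probs += [cumulative[j] - cumulative[j - 1] for j in range(1, len(cumulative))]
    return probs

thresholds = [-1.0, 1.0]   # hypothetical intercepts for a three-category response
beta = 0.8                 # hypothetical slope as printed by polr

probs_x0 = category_probs(0.0, thresholds, beta)
probs_x1 = category_probs(1.0, thresholds, beta)

def odds(p):
    return p / (1.0 - p)

# Cumulative odds ratio for Y <= lowest category, per one-unit increase in x
cumulative_or = odds(probs_x1[0]) / odds(probs_x0[0])

print([round(p, 3) for p in probs_x0])
print([round(p, 3) for p in probs_x1])
print(f"odds ratio for Y <= 1: {cumulative_or:.3f}")
```

Under this parametrisation a positive printed slope lowers the odds of being at or below any category (by the factor e^(-beta)), so higher categories become more likely. That is why the slope has to be negated before it is read in a notation that writes the linear predictor with a plus sign.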

Ignoring overdispersion and offsets in count models. Poisson counts often show overdispersion (residual deviance much greater than its degrees of freedom); switch to quasi-Poisson and compare nested models with F rather than chi-squared, and do not double-adjust standard errors. To model a rate rather than a count, include log(exposure) as an offset, or the coefficient interpretation changes completely.
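
The overdispersion and offset points can be sketched with hypothetical numbers. The Pearson statistic, degrees of freedom, standard error and exposure below are invented for illustration; the dispersion estimate shown (Pearson chi-squared divided by residual degrees of freedom) is the one R reports for quasi families.

```python
import math

# Hypothetical quantities from a Poisson fit
pearson_chi2 = 61.2   # sum of squared Pearson residuals
residual_df = 20      # residual degrees of freedom

# Estimated dispersion: values well above 1 suggest overdispersion
phi_hat = pearson_chi2 / residual_df

# Quasi-Poisson standard errors are the Poisson ones scaled by sqrt(phi_hat),
# applied once only
se_poisson = 0.10
se_quasi = se_poisson * math.sqrt(phi_hat)

# Nested-model comparison: scaled change in deviance, referred to an F distribution
delta_deviance = 18.36   # hypothetical change in deviance
delta_df = 2             # extra parameters in the larger model
f_stat = (delta_deviance / delta_df) / phi_hat

# Offset: log(mu) = log(exposure) + eta, so exp(eta) is a rate per unit of exposure
exposure = 250.0
eta = -3.0
expected_count = exposure * math.exp(eta)
rate = expected_count / exposure

print(f"phi_hat = {phi_hat:.2f}, quasi SE = {se_quasi:.3f}, F = {f_stat:.2f}")
print(f"expected count = {expected_count:.2f}, rate = {rate:.4f}")
```

Here the F statistic would be compared with an F distribution on 2 and 20 degrees of freedom. Dropping the offset would turn exp(eta) into an expected count rather than a rate, which is the change in interpretation the paragraph above warns about.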

Teaching team

Who teaches MAST90139

The bios below are factual. The star ratings are not ours: they are impressions from students who have taken the unit, so you can hear from people who sat in the lectures.

Subject coordinator and lecturer

A/Prof. Guoqi Qian

Associate Professor in the School of Mathematics and Statistics, University of Melbourne, who coordinates and lectures MAST90139 Statistical Modelling for Data Science.

Student rating: no student ratings yet

Subject assistant and tutor

Luyi Shen

Subject assistant for MAST90139 in the School of Mathematics and Statistics, supporting the R computer-lab tutorials.

Student rating: no student ratings yet

Teaching team as listed in the unit materials reviewed. AskSia does not rate lecturers; star ratings are submitted by students who have taken MAST90139.

Where it fits

Prerequisites, related units & why it matters

Graduate subject in the Master of Data Science. Assumes a solid background in linear regression, probability and matrix algebra; the subject builds the GLM directly on top of the normal linear model. Taught and assessed entirely in R, so prior R exposure helps. Confirm the exact prerequisite list against the University Handbook entry.

COMP90038 · Algorithms and Complexity · Closest same-university graduate sibling
ECON10004 · Introductory Microeconomics · A lighter applied unit that uses introductory statistical re

Why it matters beyond the grade. MAST90139 installs the core applied-statistics toolkit for data science: fitting, checking and interpreting generalised linear models (logistic, Poisson and log-linear, multinomial and mixed-effects) on real categorical and count data in R. The skills, especially translating a fitted coefficient into an odds ratio or rate ratio and judging model fit by deviance, underpin work in data science, biostatistics, analytics and quantitative research.

Your MAST90139 study toolkit

Study the unit with Sia, not just read about it

Each tool already knows MAST90139: your syllabus, your texts, and where the marks are. Grouped by how you study, from first contact to exam week.

1 · Learn it: understand the material

AI tutor: Ask anything about MAST90139 and get step-by-step answers.
Explain my notes: Upload your slides or lecture and Sia breaks them down.
Topic summariser: Condense a week into the essentials you actually need.

2 · Practise it: test yourself

Practice quiz generator: Unlimited exam-style MCQs and short-answer on any topic.
Past paper analysis: See what the exams actually test, topic by topic.
Assignment & problem help: Work through problem sets one step at a time.

3 · Revise & cram: lock it in before the exam

Flashcards: Key concepts and formulas as a spaced-repetition deck.
Cheatsheet maker: Auto-build a one-page exam cheatsheet for the unit.
Mindmap generator: See how the topics connect on one visual map.

4 · Discuss it: compare notes

Community Q&A: Ask other MAST90139 students and share what worked.

FAQ

Frequently asked questions

Is MAST90139 hard?

It is a hard graduate subject. The content is mathematically dense (exponential-family theory, IRLS, deviance and likelihood-ratio inference) and you must both derive results by hand and fit and interpret the models in R. The 55% closed-book final, with only one A4 page of notes allowed, concentrates the stakes. It is very manageable if you are comfortable with linear models, probability and matrix algebra and keep up with the weekly R practicals, but it punishes falling behind.

How is MAST90139 assessed?

Three R-based assignments submitted to Gradescope (binary and grouped logistic, binomial dose-response, and ordinal and multinomial data) plus a closed-book final exam worth 55% of the mark. The assignment due dates are confirmed in the subject materials (2 April, 1 May and 31 May 2026); the exact weight split across the three assignments is set in the official MAST90139info.pdf, which should be confirmed there.

What is the final exam like?

Based on the Semester 1 2026 exam cover sheet, it is a closed-book written exam of 3 hours writing time plus 15 minutes reading time, with 8 questions worth 110 marks, and it is worth 55% of your final mark. A non-programmable calculator and one double-sided A4 page of handwritten notes are permitted. Expect to read R output, do deviance arithmetic by hand and interpret coefficients as odds ratios, rate ratios or cumulative odds.

What software does the subject use?

R throughout, in computer-lab tutorials. The core package is faraway (it supplies the datasets such as gavote, wcgs and pneumo), and you also use MASS (for polr ordinal models), nnet (for multinom multinomial models) and lme4 (for lmer and glmer mixed-effects models). The standard glm function fits the logistic, Poisson and log-linear models.

What is the single most examined skill?

Turning a printed R coefficient into a real-world interpretation: e to the beta as an odds ratio (logistic), a rate ratio (Poisson and log-linear) or a cumulative odds ratio (proportional-odds). Closely behind it is the deviance arithmetic, computing the residual deviance for goodness of fit and the change in deviance between nested models as a likelihood-ratio (chi-squared) test. Practise both on real R printouts.

Which textbook does it follow?

Julian Faraway's Extending the Linear Model with R is the core reference; the faraway R package supplies the datasets used in lectures and practicals. Agresti's Categorical Data Analysis is cited for the multicategory material (for example adjacent-categories logit). Lecture slides and weekly R practical sheets are provided on Canvas.

Study MAST90139 with Sia

Work through binary response (logistic regression), the generalised linear model framework, categorical data analysis by logistic regression and the rest of the unit with a tutor that knows it and quizzes you on the topics the assessments weight most heavily.
