University of Sydney · S1 2026 · FACULTY OF BUSINESS & ECONOMICS

BUSS6002 · Data Science In Business

- one subject, every graph, every model, every mark
50% final exam · hurdle · 14 Chapters · 9-page Bible
Our own words - no uploaded lecturer files
Built to mirror S1 2026 · updated this semester
Chapter 10 of 11 · BUSS6002

Maximum Likelihood & Optimisation

Week 10 pulls back the curtain on how a model's parameters are actually estimated. The big idea is that every estimator in the unit is the solution of an optimisation problem — the parameter value that makes an objective function as large or as small as possible. The unit uses two principles: least squares (minimise the residual sum of squares) and maximum likelihood (choose the θ that makes the observed data most plausible, θ̂ = argmax L(θ)). The optimisation is solved either analytically (set the derivative to zero and solve, as for OLS) or iteratively with gradient descent when there is no closed form, as for logistic regression.

In this chapter

What this chapter covers

  • 1. Estimation is optimisation: argmin/argmax is the estimate; argmin (location) vs min (value)
  • 2. Likelihood: L(θ) = p(data|θ) = ∏ p(yᵢ|θ) for an independent sample
  • 3. Maximum likelihood: θ̂ = argmax L(θ), the value that makes the data most plausible
  • 4. Log-likelihood: ℓ(θ) = Σ log p(yᵢ|θ); why we log (underflow, product→sum, monotonic)
  • 5. The MLE recipe: log → differentiate → set = 0 → solve → check 2nd derivative < 0
  • 6. Worked distributions: Bernoulli L = θ^h(1−θ)^t; Poisson MLE = sample mean; exponential
  • 7. Analytic vs iterative: closed-form OLS β̂ = (XᵀX)⁻¹Xᵀy vs gradient descent
  • 8. Gradient descent & logistic regression: θ ← θ − α∇f on f = −ℓ; logistic has no closed form
Worked example

Derive the maximum likelihood estimator for a Poisson sample

Q [5 marks]. Let y₁,…,yₙ be an i.i.d. sample from a Poisson(λ) distribution with pmf p(y|λ) = λ^y e^(−λ) / y!. Show that the maximum likelihood estimate of λ is the sample mean ȳ, and evaluate it for the sample y = (2, 4, 3, 5, 1).
  • [+1] Likelihood (product over the independent sample). L(λ) = ∏ᵢ λ^(yᵢ) e^(−λ) / yᵢ!.
  • [+1] Log-likelihood (turns the product into a sum). ℓ(λ) = (Σᵢ yᵢ) log λ − nλ − Σᵢ log(yᵢ!); drop the last term, which has no λ.
  • [+1] Differentiate. dℓ/dλ = (Σᵢ yᵢ)/λ − n.
  • [+1] Set to zero and solve. (Σᵢ yᵢ)/λ − n = 0 ⟹ λ = (Σᵢ yᵢ)/n = ȳ; the second derivative d²ℓ/dλ² = −(Σᵢ yᵢ)/λ² < 0 confirms a maximum.
  • [+1] Evaluate. For y = (2,4,3,5,1), Σyᵢ = 15 and n = 5, so λ̂ = 15/5 = 3.
λ̂_MLE = ȳ = (Σᵢ yᵢ)/n, and for the given sample λ̂ = 15/5 = 3.
Sia tip: the MLE recipe is mechanical (log-likelihood → differentiate → set = 0 → solve → check the second derivative is negative). Drop additive constants like Σ log(yᵢ!) early; they vanish on differentiation, and dropping them saves time. The same skeleton derives the exponential MLE (θ̂ = 1/ȳ for the rate parametrisation p(y|θ) = θe^(−θy)) and the Bernoulli MLE (θ̂ = sample proportion).
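
A numerical cross-check of the Poisson derivation, as a minimal sketch: evaluate ℓ(λ) on a grid of candidate values and take the location of the maximum. The grid range and spacing, and the helper name poisson_loglik, are illustrative assumptions rather than anything prescribed by the unit.

```python
import math
import numpy as np

y = np.array([2, 4, 3, 5, 1])            # sample from the worked example

def poisson_loglik(lam, y):
    """Log-likelihood of an i.i.d. Poisson sample at rate lam."""
    log_fact = sum(math.lgamma(v + 1) for v in y)    # sum of log(y_i!)
    return y.sum() * np.log(lam) - len(y) * lam - log_fact

grid = np.linspace(0.5, 6.0, 551)        # assumed candidate lambdas, step 0.01
loglik = np.array([poisson_loglik(lam, y) for lam in grid])

lam_hat = grid[np.argmax(loglik)]        # argmax: the location of the peak
print(lam_hat, y.mean())                 # grid estimate next to the sample mean
```

The grid answer should agree with ȳ up to the grid spacing, which is a useful sanity check on any hand derivation.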
Glossary

Key terms

Objective function
A function that scores how good a parameter value is. Estimation maximises or minimises it; least squares minimises the residual sum of squares, maximum likelihood maximises the likelihood. The estimate is the location of the extreme point.
Likelihood L(θ)
The joint probability of the observed data read as a function of the parameter, L(θ) = p(data|θ) = ∏ᵢ p(yᵢ|θ) for an independent sample. It answers 'if θ were true, how plausible is the data we actually saw?' It is a function of θ (the data is fixed) and need not integrate to 1.
Maximum likelihood estimate (MLE)
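θ̂ = argmax_θ L(θ): the parameter value that makes the observed data most plausible. If the model is correctly specified (and standard regularity conditions hold), the MLE is consistent and asymptotically efficient: as the sample grows it converges to the true θ, and its variance approaches the smallest attainable (the Cramér-Rao bound).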
Log-likelihood ℓ(θ)
ℓ(θ) = log L(θ) = Σᵢ log p(yᵢ|θ). We maximise the log because it avoids numerical underflow, turns a product into an easier-to-differentiate sum, and is monotonic so the argmax is unchanged.
argmin vs min
argmin is the LOCATION (the x that is best) and is the parameter estimate; min is the VALUE (the best f). For f(x) = 3 + (x−2)², argmin f = 2 but min f = 3 — two different numbers.
Analytic / closed-form solution
Solving dℓ/dθ = 0 algebraically for an exact, one-shot answer. OLS has the closed form β̂ = (XᵀX)⁻¹Xᵀy; the Poisson and exponential MLEs are also closed form. Many models (logistic regression) have none.
Gradient descent
An iterative optimiser that minimises an objective by repeatedly stepping downhill: θ_{k+1} = θ_k − α∇f(θ_k). Because it minimises, we apply it to the NEGATIVE log-likelihood f = −ℓ when we want to maximise the likelihood.
Learning rate (step size) α
The size of each gradient-descent step. Too large overshoots or diverges; too small converges very slowly. Gradient descent can also settle in a local (not global) optimum when the objective is non-convex.
Cross-entropy / log-loss
The objective minimised when fitting logistic regression, equal to the negative Bernoulli log-likelihood f = −Σᵢ[yᵢ log pᵢ + (1−yᵢ) log(1−pᵢ)] with pᵢ = σ(xᵢᵀβ). Its gradient is ∇f(β) = Xᵀ(p − y); it has no closed-form minimiser, so an iterative optimiser such as gradient descent is required.
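
The learning-rate trade-off from the glossary can be seen on the toy objective f(x) = 3 + (x−2)² used in the argmin entry. This is a minimal sketch; the three step sizes, the start point x0 = 0 and the 25-step budget are assumptions chosen only to make the behaviour visible.

```python
def gradient_descent(alpha, x0=0.0, steps=25):
    """Minimise f(x) = 3 + (x - 2)**2 with a fixed step size alpha."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * (x - 2.0)       # f'(x)
        x = x - alpha * grad         # step downhill
    return x

# Assumed step sizes: too small, reasonable, too large
for alpha in (0.001, 0.1, 1.1):
    print(alpha, gradient_descent(alpha))
```

With α = 0.001 the iterate has barely left the start, with α = 0.1 it sits close to the argmin x = 2, and with α = 1.1 each step overshoots by more than the last, so the iterates move away from 2.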
FAQ

Maximum Likelihood & Optimisation FAQ

Why do we maximise the log-likelihood instead of the likelihood?

Three reasons, all examinable. (1) Numerical underflow: a product of thousands of probabilities (each < 1) collapses to ~0 in floating point, while summing their logs does not. (2) The log turns the product ∏ᵢ p(yᵢ|θ) into a sum Σᵢ log p(yᵢ|θ), which is far easier to differentiate. (3) The log is strictly increasing (monotonic), so it does not move the optimum — argmax L = argmax ℓ. As a bonus, for exponential-family densities the log cancels the exponential.
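
Reason (1) is easy to see in NumPy. The sketch below uses made-up probabilities (drawn uniformly between 0.1 and 0.9, purely as an assumption) in place of real model output.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.1, 0.9, size=5000)    # hypothetical per-observation probabilities

likelihood = np.prod(p)                 # product of 5000 numbers below 1
log_likelihood = np.sum(np.log(p))      # same information on the log scale

print(likelihood)         # underflows to 0.0 in double precision
print(log_likelihood)     # a finite negative number
```

Once the product has collapsed to zero, every θ looks equally bad, so the optimiser has nothing to compare. The sum of logs keeps the comparison intact.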

What is the difference between argmin and min?

argmin is the location — the x (parameter value) that optimises the objective — and that is your estimate. min is the value of the objective at its best point. For f(x) = 5 + 3(x−4)², the argmin is 4 and the min is 5. A completed-square form c + a(x−h)² with a > 0 has minimum value c at x = h, so you can read both off without calculus.
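
In NumPy the same distinction shows up as np.argmin versus np.min. A minimal sketch on the function above, with an assumed grid from 0 to 8 in steps of 0.01:

```python
import numpy as np

x = np.linspace(0.0, 8.0, 801)      # assumed grid, step 0.01
f = 5 + 3 * (x - 4) ** 2            # objective from the text

best_index = np.argmin(f)           # index of the smallest value of f
print(x[best_index])                # argmin: the location (the estimate)
print(f.min())                      # min: the value of the objective there
```

Note that np.argmin returns an index into the grid, so you still have to look up x[best_index] to get the estimate itself.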

Why does logistic regression need gradient descent when OLS does not?

OLS minimises a quadratic loss whose derivative is linear, so setting it to zero gives the closed-form β̂ = (XᵀX)⁻¹Xᵀy in one shot. For logistic regression, setting the gradient of the cross-entropy f = −ℓ to zero gives Xᵀ(p − y) = 0 with pᵢ = σ(xᵢᵀβ), which is nonlinear in β and has no algebraic solution. So logistic regression is fit by maximum likelihood iteratively, in this unit by gradient descent on f.
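
A minimal gradient-descent fit for logistic regression might look like the sketch below. The synthetic data, the true_beta used to generate it, the learning rate and the step count are all assumptions for illustration. The gradient is divided by n (the mean cross-entropy) so that one step size works across sample sizes; this rescaling does not move the argmin.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.5, steps=5000):
    """Gradient descent on the mean cross-entropy; X includes an intercept column."""
    beta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(steps):
        p = sigmoid(X @ beta)            # current fitted probabilities
        grad = X.T @ (p - y) / n         # X^T (p - y), averaged over n
        beta = beta - alpha * grad       # step downhill on f = -l
    return beta

# Hypothetical synthetic data: intercept plus one feature
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.5, 2.0])
y = (rng.uniform(size=n) < sigmoid(X @ true_beta)).astype(float)

beta_hat = fit_logistic(X, y)
grad_at_hat = X.T @ (sigmoid(X @ beta_hat) - y)   # should be close to zero
print(beta_hat, grad_at_hat)
```

The final check is the score equation itself: at the fitted β̂ the vector Xᵀ(p − y) should be numerically zero, even though no formula produced β̂.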

Gradient descent maximises or minimises — and what does that mean for MLE?

Gradient descent MINIMISES. Maximum likelihood wants to maximise ℓ, so you minimise the negative log-likelihood f = −ℓ instead (maximising ℓ is the same as minimising −ℓ). Forgetting this minus sign is the classic mistake: the update is θ_{k+1} = θ_k − α∇f(θ_k) applied to f = −ℓ.
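
To see the sign convention in action, here is a sketch of gradient descent on the negative Poisson log-likelihood f(λ) = nλ − (Σyᵢ) log λ (constant dropped), using the sample y = (2, 4, 3, 5, 1). The start value, learning rate and number of steps are assumptions.

```python
import numpy as np

y = np.array([2, 4, 3, 5, 1])
n, total = len(y), y.sum()

lam = 1.0          # assumed starting guess
alpha = 0.1        # assumed learning rate

for _ in range(200):
    grad = n - total / lam       # f'(lambda), where f = -l
    lam = lam - alpha * grad     # descend on the NEGATIVE log-likelihood

print(lam, y.mean())             # iterate next to the closed-form MLE
```

The loop should settle at the sample mean. Using the gradient of ℓ in the same update (without the minus sign) would walk away from the maximum instead of towards it.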

Do maximum likelihood and least squares ever give the same answer?

Yes. For a linear regression with Normally distributed errors, ε ~ N(0, σ²), maximising the likelihood yields exactly the least-squares estimator β̂ = (XᵀX)⁻¹Xᵀy. So under Gaussian noise OLS is MLE — a favourite true/false MCQ. The two principles diverge for other models (e.g. logistic regression, which has no least-squares closed form).
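
The equivalence can be checked numerically. The sketch below computes the closed-form OLS estimate and, separately, runs gradient descent on the Gaussian negative log-likelihood with σ treated as known. The simulated data, σ = 0.5, the learning rate and the step count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 200, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one feature
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=sigma, size=n)

# Least squares: closed form via the normal equations
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Maximum likelihood: gradient descent on the mean Gaussian negative log-likelihood
beta_mle = np.zeros(2)
for _ in range(500):
    resid = y - X @ beta_mle
    grad = -X.T @ resid / (n * sigma**2)    # derivative of sum(resid^2) / (2 n sigma^2)
    beta_mle = beta_mle - 0.1 * grad

print(beta_ols, beta_mle)
```

The two estimates should match to many decimal places, because the Gaussian negative log-likelihood is the residual sum of squares scaled by 1/(2σ²) plus a constant that does not involve β.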

What goes wrong if the sample is not independent?

The likelihood only factorises into a product ∏ᵢ p(yᵢ|θ) — and the log-likelihood into a sum Σᵢ log p(yᵢ|θ) — when the observations are independent. If they are dependent you cannot split the joint probability that way, so the standard product/sum derivation does not apply. The exam likes to test that you know the independence assumption is what makes the recipe work.

Study strategy

Exam move

Treat Week 10 as a single repeatable move: estimation is optimisation, so every question is really argmin or argmax. Drill the MLE recipe until it is automatic — write the likelihood as a product, log it into a sum, differentiate, set to zero, solve, then confirm the second derivative is negative — using the Poisson derivation as your template and rehearsing the exponential and Bernoulli versions too. Keep the vocabulary razor-sharp: argmin (location) vs min (value), likelihood as a function of θ not the data, and the three reasons we log. Memorise the analytic-vs-iterative split: OLS has the closed form β̂ = (XᵀX)⁻¹Xᵀy, logistic regression has none and needs gradient descent on the cross-entropy f = −ℓ with update θ ← θ − α∇f. Practise one gradient-descent step by hand and be ready to state the learning-rate trade-off (too big diverges, too small crawls) and that OLS equals MLE under Gaussian errors. Finally, rehearse writing a short NumPy grid search or gradient-descent loop from memory, since the final exam includes hand-written Python.
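
One way to rehearse a single gradient-descent step by hand is on the Bernoulli negative log-likelihood f(θ) = −[h log θ + t log(1−θ)]. The counts (7 heads, 3 tails), the start value and the learning rate below are made-up numbers for practice, not taken from any past paper.

```python
h, t = 7, 3            # hypothetical counts of heads and tails
alpha = 0.01           # assumed learning rate
theta0 = 0.5           # assumed starting guess

def grad_f(theta):
    """Derivative of f(theta) = -[h log(theta) + t log(1 - theta)]."""
    return -h / theta + t / (1 - theta)

g0 = grad_f(theta0)                  # -14 + 6 = -8
theta1 = theta0 - alpha * g0         # 0.5 - 0.01 * (-8) = 0.58
theta_mle = h / (h + t)              # closed form: the sample proportion, 0.7
print(g0, theta1, theta_mle)
```

The step moves θ from 0.5 towards the closed-form answer 0.7, which is the direction check worth writing down in an exam answer.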
