University of Sydney · S1 2026 · FACULTY OF BUSINESS & ECONOMICS

BUSS6002 · Data Science In Business

- one subject, every graph, every model, every mark
50% final exam · hurdle14 Chapters9-page Bible
Our own words - no uploaded lecturer files
Built to mirror S1 2026 · updated this semester
Chapter 11 of 11 · BUSS6002

Big Data Solutions

Week 11 reframes big data as a computation problem rather than a storage one: when n or p reaches the millions, the same estimators (OLS, logistic regression) become too slow or impossible to fit directly. The unit asks you to classify a dataset by shapewide (large p, many predictors) versus tall (large n, many observations) — and reach for the matching fix, to reason about an algorithm's cost with Big-O notation, and to explain why stochastic gradient descent (SGD) scales to huge n where full-batch gradient descent does not. It is examined in the final only (Week 11 falls after the mid-semester cut-off), and it is conceptual, so marks come from precise definitions in the course's notation.

In this chapter

What this chapter covers

  • 011. Big data as a compute problem — the 4 V's (Volume, Variety, Velocity, Veracity) resurfacing as a bottleneck
  • 022. Wide data — large p (many predictors, few observations); fix = reduce predictors / less-flexible model
  • 033. Tall data — large n (many observations, few predictors); fix = row-blocks / stochastic gradient descent
  • 044. Why wide breaks OLS — p > n overfits and XᵀX becomes singular (no unique inverse)
  • 055. Big-O notation — an asymptotic upper bound on worst-case running time vs input size n
  • 066. The growth ladder — O(1) < O(log n) < O(n) < O(n log n) < O(n²) < O(2ⁿ)
  • 077. Full-batch gradient descent — gradient over all n rows costs O(n) per step
  • 088. Stochastic gradient descent — one-observation gradient, O(1) per step, noisy but scalable
Worked example · free

Full-batch gradient descent versus SGD on two million rows

Q [4 marks]. You fit a linear model on n = 2,000,000 observations by minimising the least-squares loss with gradient descent. (a) State the per-step time complexity of full-batch gradient descent in big-O of n, and explain why. (b) What does stochastic gradient descent (SGD) compute instead, and at what per-step cost? (c) Why is SGD preferred for this tall data?
  • +1Full-batch cost. The least-squares gradient ∇f(β | D) = −2Xᵀ(y − Xβ) is formed over the whole sample. Because X has n rows, computing Xβ and Xᵀ(·) touches all n = 2,000,000 observations, so one step is O(n) — cost grows linearly with the number of observations.
  • +1What SGD computes. SGD replaces the exact gradient with a noisy estimate from a single random observation, ∇f(β | yᵢ, xᵢ) = −2xᵢᵀ(yᵢ − xᵢβ), which costs O(1) per step — independent of n.
  • +1More updates per unit compute. For the same budget as one full-batch step, SGD can take up to two million cheap steps, so the parameters update far more frequently.
  • +1Why it wins. Each SGD step is noisier, but the frequent updates make real progress toward the minimum in a fraction of the wall-clock time. SGD trades gradient accuracy for per-step speed, which is the right trade when n is enormous.
(a) O(n) per step, because the full gradient reads all n rows of X. (b) A one-observation gradient −2xᵢᵀ(yᵢ − xᵢβ), costing O(1) per step in n. (c) For tall data each cheap O(1) step updates the parameters far more often than one expensive O(n) step, so SGD converges much faster in wall-clock time — it trades gradient accuracy for per-step speed.
Sia tip — Anchor the two costs: full-batch = O(n) (touches every row); SGD = O(1) in n (touches one row). The O(1) refers only to the dependence on sample size n — SGD still needs more steps because its gradient is noisy, so it is scalable, not 'free'.
Glossary

Key terms

Big data (4 V's)
Data characterised by Volume (size), Variety (multiple formats), Velocity (high-speed generation) and Veracity (trustworthiness). 'Big in bytes' alone is not big data — small data can be large yet fail the 4-V test. Volume and velocity are what make estimation a computational challenge.
Wide data
A dataset with a large number of predictors p but few observations n (p > n, often p ≫ n) — a short, fat matrix. It overfits easily and makes XᵀX singular; the remedy is to reduce predictors (dimensionality reduction) and/or use a less-flexible model.
Tall data
A dataset with few predictors p but a very large number of observations n (n ≫ p) — a long, narrow matrix. Each pass over the data is expensive, so the remedy is to divide it into row-blocks or use stochastic gradient descent.
Big-O notation
An asymptotic upper bound on an algorithm's worst-case running time as a function of input size n. T(n) = O(f(n)) means there are constants k and n₀ with T(n) ≤ k·f(n) for all n ≥ n₀. It ignores constants and lower-order terms to compare how algorithms scale.
Algorithmic complexity ladder
The ordering of common growth rates from cheapest to most expensive: O(1) < O(log n) < O(n) < O(n log n) < O(n²) < O(2ⁿ). Exponential dominates polynomial dominates linear dominates logarithmic; big-data methods aim to stay polynomial or better.
Gradient descent (full batch)
An iterative optimiser that steps against the gradient by a learning rate α: θ_{j+1} = θ_j − α·∇f(θ_j | D). For the OLS loss the gradient −2Xᵀ(y − Xβ) is computed over all n observations, so each step is O(n). Used when a model (e.g. logistic regression) has no closed form.
Stochastic gradient descent (SGD)
Gradient descent that approximates the full gradient using a single random observation, ∇f(β | yᵢ, xᵢ) = −2xᵢᵀ(yᵢ − xᵢβ). Each step costs O(1) in n — cheap and frequent but noisy — which makes it the standard scalable fix for tall (large-n) data.
Dimensionality reduction
The wide-data remedy of cutting the number of predictors — by projection, feature selection or regularisation — so a model with p ≫ n stops fitting noise. It deliberately adds a little bias to buy a large drop in variance, the same trade-off used in model selection.
FAQ

Big Data Solutions FAQ

What is the difference between wide and tall data?

Wide data has a large number of predictors p but few observations n (p > n) — a short, fat matrix, like genomics with tens of thousands of gene columns for a handful of patients. Tall data has few predictors but a huge number of observations n (n ≫ p) — a long, narrow matrix, like every transaction a platform logs. The two need opposite fixes, so identifying which dimension is big is the whole task.

Which fix goes with which shape?

Wide (large p): reduce the number of predictors — dimensionality reduction, feature selection or a less-flexible model — because p > n overfits and XᵀX becomes singular. Tall (large n): divide the data into row-blocks or use stochastic gradient descent, because each full pass over the rows is expensive. The classic exam error is swapping them: 'divide into blocks' is the tall remedy, not the wide one.

What exactly does Big-O notation tell me?

It is an asymptotic upper bound on the worst-case running time as the input grows. O(n) means run-time grows at most linearly for large n — not exactly n operations — and constants drop out, so O(3n) and O(n) are the same class. It lets you compare how algorithms scale rather than count seconds, and the goal at scale is to stay polynomial or better, never exponential.

Why does full-batch gradient descent become too slow on big data?

Each step computes the gradient over the entire sample. For the least-squares loss that gradient is −2Xᵀ(y − Xβ), and because X has n rows the step touches all n observations — O(n) per iteration. When n runs into the millions, one step is expensive and you need many, so the iterative fit becomes impractical.

How does SGD make gradient descent scalable?

Instead of the full gradient it uses a noisy estimate from one randomly chosen observation, −2xᵢᵀ(yᵢ − xᵢβ), which costs O(1) per step regardless of n. For the same compute as one full-batch step you can take up to n cheap steps, so the parameters update far more often and converge much faster in wall-clock time. The trade is gradient accuracy for per-step speed, which wins when n is enormous.

Is this chapter on the mid-semester exam?

No. Week 11 falls after the mid-semester cut-off, which covers Weeks 1–6 only, so Big Data Solutions is examined in the final exam (which covers all weeks). Expect a 1-mark MCQ on wide-vs-tall or ordering Big-O growth rates and a short-answer on gradient-descent complexity and SGD.

Study strategy

Exam move

Treat Week 11 as easy, high-yield final-exam marks that reward crisp definitions rather than arithmetic. Lock the central mapping cold: wide = large p = cut the columns (dimensionality reduction / simpler model); tall = large n = block the rows or use SGD — and rehearse the trap that 'divide into blocks' is the tall fix, not the wide one. Memorise the Big-O ladder O(1) < O(log n) < O(n) < O(n log n) < O(n²) < O(2ⁿ) and sketch it on your A4 note sheet so any ordering MCQ is a read-off; remember big-O is an upper bound on the worst case, not an exact count. Be able to state, in one breath, that full-batch gradient descent is O(n) per step because the gradient −2Xᵀ(y − Xβ) reads every row, while SGD is O(1) in n because it uses one observation — noisier but scalable. Finally, connect the wide-data remedy back to the bias–variance and model-selection ideas of the model-evaluation chapter, and always write your answers in the course's notation.

A+Everything unlocked
Unlocks this Bible + all 203 of your University of Sydney subjects - and 1,000+ Bibles across every Australian university.
Sia - your BUSS6002 tutor, unlimited, worked the way the exam marks it
The full 9-page Bible + practice bank with worked solutions
Chrome extension - sync your LMS so Sia knows your deadlines
Bilingual EN / Chinese on every Bible and every Sia answer
$25/ month
30-day money-back · cancel in one tap · how it works
Unlock the full BUSS6002 Bible + 203 University of Sydney subjects解锁完整 BUSS6002 Bible + University of Sydney 203 门科目
$25/mo