University of Sydney · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

DATA3404 · Scalable Data Management

60% final exam (hurdle) · 7 chapters · 65-page Bible
Our own words - no uploaded lecturer files
Built to mirror S1 2026 · updated this semester
The Complete Exam Bible · S1 2026

Scalable Data Management

— one engine, every cost model, every page of I/O, every mark

Scalable Data Management teaches how a database engine actually works underneath SQL — how data is laid out on disk pages, how indexes turn a full scan into a lookup, how external sorting and the five join algorithms run, how the query optimiser picks a plan, and how the whole thing scales out across a cluster with parallel databases and Spark. The exam does not test SQL syntax; it tests whether you can predict the engine's cost by hand — page I/Os for a join, passes of external sort, the height of a B+-tree, the optimiser's size estimates. The final exam is 60% of your grade and double-hurdled (you need ≥40% on the exam and ≥50% overall to pass), so this guide teaches each topic to exam standard: the mechanism, the cost formula examiners expect, and exactly where the marks are won.

DATA3404 · University of Sydney
Assessment

How DATA3404 is assessed

Component | Weight | Format
Final examination (hurdle) | 60% | Paper-based, 2 hours, semi-open book. Requires ≥40% on the exam and ≥50% overall to pass (double hurdle)
Practical assignments (×2) | ~24% | DB programming & testing, then SQL + Spark tuning; group work across the semester
Quizzes, participation & a concept presentation | ~16% | Weekly quizzes and a group video presentation; confirm the exact split and the ‘semi-open book’ rules in your unit outline
Worked example

Block nested-loop join — the I/O cost, step by step

Q [5 marks]. You must join Student (bR = 100 pages) with Enrollment (bS = 400 pages) using a block nested-loop join with a buffer of M = 10 pages. Take Student as the outer relation. Compute the total I/O cost in page transfers (the output is not counted).
  • +1 Recall the BNLJ cost formula: cost = bR + ⌈bR / (M − 2)⌉ × bS. One buffer frame is reserved for reading the inner relation and one for output, so each outer block is (M − 2) pages.
  • +1 Outer-block size: M − 2 = 10 − 2 = 8 pages per block of the outer relation.
  • +1 Number of outer blocks: ⌈bR / (M − 2)⌉ = ⌈100 / 8⌉ = 13 blocks.
  • +1 Plug in: cost = 100 + 13 × 400 = 100 + 5,200.
  • +1 State the answer: total cost = 5,300 page I/Os. Student is read once, and Enrollment is re-scanned once per outer block (13 times).
Total I/O = bR + ⌈bR/(M−2)⌉ × bS = 100 + ⌈100/8⌉ × 400 = 100 + 13 × 400 = 5,300 page transfers.
Sia tip — Always state which relation you made the outer one — it changes the block count and therefore the cost. Making the smaller relation the outer one minimises the re-scans of the inner. And the output is never counted in DATA3404's I/O cost model.
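The arithmetic above can be checked with a minimal Python sketch. The function name bnlj_cost and its parameter names are illustrative, not part of the unit's materials; it assumes the same model as the worked example (two frames reserved, output not counted).

```python
def bnlj_cost(outer_pages: int, inner_pages: int, buffer_pages: int) -> int:
    """Page I/Os for a block nested-loop join: bR + ceil(bR / (M - 2)) * bS."""
    if buffer_pages < 3:
        raise ValueError("need at least 3 frames: outer block, inner page, output")
    block_size = buffer_pages - 2                              # pages per outer block
    outer_blocks = (outer_pages + block_size - 1) // block_size  # integer ceiling
    # The outer relation is read once; the inner is scanned once per outer block.
    return outer_pages + outer_blocks * inner_pages


print(bnlj_cost(100, 400, 10))  # Student as outer: 100 + 13 * 400 = 5300
print(bnlj_cost(400, 100, 10))  # Enrollment as outer: 400 + 50 * 100 = 5400
```

Swapping the roles shows why the smaller relation is the better outer: with Enrollment outside, the cost rises from 5,300 to 5,400 page I/Os.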
Glossary

Key terms

Page I/O
The transfer of one fixed-size page between disk and memory — the single unit DATA3404 measures every cost in. The whole exam is about counting page I/Os; CPU time, sequential-vs-random access and the output write are all ignored in the cost model.
B+-tree
The default database index: a balanced, high-fan-out tree whose leaves hold the data entries in sorted order. A search costs ⌈log_F(B)⌉ + 1 page I/Os, where F is the fan-out, so even a huge table is reached in a handful of page reads instead of a full scan.
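A small Python sketch of the search-cost formula follows. It assumes B is read as the number of leaf pages (the glossary entry does not define B, so treat that reading as an assumption), and the function names are illustrative. The logarithm is computed with an integer loop to avoid floating-point rounding at exact powers.

```python
def ceil_log(base: int, value: int) -> int:
    """Smallest integer h such that base**h >= value."""
    height, reach = 0, 1
    while reach < value:
        reach *= base
        height += 1
    return height


def btree_search_cost(leaf_pages: int, fan_out: int) -> int:
    """ceil(log_F(B)) + 1 page I/Os, with B assumed to be the leaf-page count."""
    return ceil_log(fan_out, leaf_pages) + 1


# Illustrative numbers: 1,000,000 leaf pages with fan-out 100.
print(btree_search_cost(1_000_000, 100))  # ceil(log_100(10^6)) + 1 = 3 + 1 = 4
```

Under these assumed numbers, four page reads replace a scan of a million pages.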
External merge sort
The algorithm for sorting data that does not fit in memory. With N pages and B buffer frames it runs in 1 + ⌈log_(B−1) ⌈N/B⌉⌉ passes, each pass reading and writing all N pages, so the total cost is 2N × (number of passes).
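The pass count and total cost can be sketched in Python as below. Function names and the sample numbers (N = 108 pages, B = 5 frames) are illustrative choices, not taken from the unit.

```python
def ceil_log(base: int, value: int) -> int:
    """Smallest integer h such that base**h >= value."""
    height, reach = 0, 1
    while reach < value:
        reach *= base
        height += 1
    return height


def sort_passes(n_pages: int, frames: int) -> int:
    """1 + ceil(log_(B-1)(ceil(N / B))): one run-building pass, then merges."""
    if frames < 3:
        raise ValueError("need at least 3 frames to merge")
    initial_runs = (n_pages + frames - 1) // frames  # sorted runs after pass 0
    return 1 + ceil_log(frames - 1, initial_runs)    # (B - 1)-way merge passes


def sort_cost(n_pages: int, frames: int) -> int:
    """Each pass reads and writes every page: 2N per pass."""
    return 2 * n_pages * sort_passes(n_pages, frames)


print(sort_passes(108, 5))  # 22 initial runs, 3 merge passes, 4 passes in total
print(sort_cost(108, 5))    # 2 * 108 * 4 = 864 page I/Os
```

When the data fits in the buffer (N ≤ B) there is a single run, the merge term is zero, and the sort takes one pass.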
Selectivity (reduction factor)
The fraction of rows a predicate keeps. An equality predicate on an attribute with V distinct values has selectivity 1/V. The optimiser multiplies these reduction factors to estimate the size of each intermediate result, which then drives every join cost estimate.
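As a rough illustration, the size estimate is the input cardinality multiplied by each predicate's reduction factor. The sketch below assumes the predicates are independent (the usual simplification when factors are multiplied); the table size and distinct-value counts are made-up numbers, and Fraction keeps the arithmetic exact.

```python
from fractions import Fraction


def equality_selectivity(distinct_values: int) -> Fraction:
    """Reduction factor 1/V for an equality predicate on an attribute."""
    return Fraction(1, distinct_values)


def estimate_rows(input_rows: int, reduction_factors: list[Fraction]) -> Fraction:
    """Multiply the input size by every reduction factor in turn."""
    size = Fraction(input_rows)
    for factor in reduction_factors:
        size *= factor
    return size


# Hypothetical table: 10,000 rows, one attribute with 50 distinct values
# and another with 4 distinct values, both filtered by equality.
factors = [equality_selectivity(50), equality_selectivity(4)]
print(estimate_rows(10_000, factors))  # 10000 * 1/50 * 1/4 = 50
```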
Data partitioning
Splitting a relation across the nodes of a cluster — round-robin (even load, no pruning), range (prunes range queries), or hash (prunes equality and enables co-located joins). The chosen scheme decides which queries can skip nodes and which distributed join is cheapest.
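The three schemes can be compared with a toy Python sketch. It assumes integer keys, uses key mod n as a stand-in hash function, and picks arbitrary range boundaries; none of these names come from a real system.

```python
from bisect import bisect_right


def round_robin(keys: list[int], n_nodes: int) -> list[list[int]]:
    """Row i goes to node i mod n: even load, but no query can skip a node."""
    return [keys[i::n_nodes] for i in range(n_nodes)]


def hash_partition(keys: list[int], n_nodes: int) -> list[list[int]]:
    """Toy hash (key mod n): an equality lookup touches exactly one node."""
    nodes: list[list[int]] = [[] for _ in range(n_nodes)]
    for key in keys:
        nodes[key % n_nodes].append(key)
    return nodes


def range_node(boundaries: list[int], key: int) -> int:
    """Node index for a key; node i holds boundaries[i-1] <= key < boundaries[i]."""
    return bisect_right(boundaries, key)


def nodes_for_range(boundaries: list[int], low: int, high: int) -> list[int]:
    """Nodes a range query low..high must visit under range partitioning."""
    return list(range(range_node(boundaries, low), range_node(boundaries, high) + 1))


keys = list(range(1, 13))
print(round_robin(keys, 3))         # [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
print(hash_partition(keys, 3))      # [[3, 6, 9, 12], [1, 4, 7, 10], [2, 5, 8, 11]]
print(nodes_for_range([4, 8], 5, 6))  # [1]: the other two nodes are pruned
```

With boundaries [4, 8], a query for keys 5 to 6 visits only node 1, whereas the same query under round-robin has to ask every node.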
FAQ

DATA3404 FAQ

Is DATA3404 hard?

It is calculation-dense rather than conceptually deep. Most marks come from costing the engine by hand — join I/O, sort passes, B+-tree height, optimiser size estimates — so the difficulty is precision and speed under exam conditions, not memorising SQL. Once you can run the cost chains automatically, the numbers stop surprising you.

How is DATA3404 assessed?

The final exam is 60% and double-hurdled: you need at least 40% on the exam and at least 50% overall to pass the unit. The remaining 40% is continuous — two practical assignments (database programming & testing, then SQL + Spark tuning), weekly quizzes and participation, and a group concept video. Confirm this year's exact split in your unit outline.

What is on the DATA3404 final exam?

Storage and buffer management, indexing (B+-trees and hashing), external sorting and query processing, the five join algorithms and their I/O cost, query optimisation (selectivity and plan costing), and the scale-out stack — parallel and distributed databases, HDFS/MapReduce and Spark. The headline repeated tasks are computing a join's I/O cost and the number of passes of external sort.

Is the DATA3404 exam open book?

The unit overview describes the exam as semi-open book, but the exact materials you may bring are set on the current unit outline and exam instructions — do not assume you can bring this guide or unlimited notes. Study as if it were closed book and treat any permitted sheet as a bonus; confirm the rules on Canvas before the exam.

Do I need to be good at SQL or coding to pass DATA3404?

The practical assignments train your SQL and Spark, but the 60% exam tests the engine's internals — it asks you to predict cost, not to write code. You can write perfect SQL and still lose the exam if you can't cost a join or count sort passes, so prioritise the cost models when revising.

Is using AskSia for DATA3404 cheating?

No. AskSia is a study reference written in our own words — we host none of your lecturer's files, and Sia teaches you the method to earn the marks; it does not complete or sit your assessments.

Study strategy

How to study for the exam

Treat DATA3404 as a set of cost chains you can run on autopilot, because every high-value exam question is one of them:
  • relation sizes + buffer → pick the join → compute I/O
  • N pages + B frames → sort passes → 2N × passes
  • records + fan-out → B+-tree height → search cost
  • query → push selections down → estimate sizes → cost the plan
Drill each chain until you can write the formula and every page count from memory. Method marks are real, so show the formula and the arithmetic even if the final number slips. The join and external-sort cost calculations are the two most repeated tasks in the whole unit, and getting them right goes a long way towards clearing the 40% exam hurdle; the Spark and NoSQL material is lighter and conceptual, so bank the cost-formula marks first.
