University of Sydney · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

DATA3404 · Scalable Data Management

60% final exam · hurdle · 7 chapters · 65-page Bible

Our own words - no uploaded lecturer files

Built to mirror S1 2026 · updated this semester

The Complete Exam Bible · S1 2026

Scalable Data Management

— one engine, every cost model, every page of I/O, every mark

Scalable Data Management teaches how a database engine actually works underneath SQL — how data is laid out on disk pages, how indexes turn a full scan into a lookup, how external sorting and the five join algorithms run, how the query optimiser picks a plan, and how the whole thing scales out across a cluster with parallel databases and Spark. The exam does not test SQL syntax; it tests whether you can predict the engine's cost by hand — page I/Os for a join, passes of external sort, the height of a B+-tree, the optimiser's size estimates. The final exam is 60% of your grade and double-hurdled (you need ≥40% on the exam and ≥50% overall to pass), so this guide teaches each topic to exam standard: the mechanism, the cost formula examiners expect, and exactly where the marks are won.

Contents · the whole subject, one map

What DATA3404 covers

Seven exam topics → one cost-model map. Each links to its free chapter guide.

01 Data Storage and Files: Pages & records · slotted pages · heap vs sorted · row vs column · the buffer pool
02 Indexing: B+-trees · fan-out & height · extendible hashing · clustered vs unclustered
03 Sorting and Query Processing: The operator pipeline · access paths · external merge sort & its pass count
04 Join Algorithms: Nested-loop · block · index · sort-merge · grace hash (the five I/O cost formulas)
05 Query Optimisation: RA equivalences · selectivity & reduction factors · costing plans · System-R DP
06 Parallel Databases: Partitioning · speed-up vs scale-up · distributed joins · HDFS · MapReduce
07 Big Data with Spark: RDDs · the lazy DAG · narrow vs wide dependencies · NoSQL & CAP · streaming

Assessment

How DATA3404 is assessed

Component	Weight	Format
Final examination · hurdle	60%	Paper-based, 2 hours, semi-open book · ≥40% hurdle — and you also need ≥50% overall to pass (double hurdle)
Practical assignments (×2)	~24%	DB programming & testing, then SQL + Spark tuning — group work across the semester
Quizzes, participation & a concept presentation	~16%	Weekly quizzes and a group video presentation — confirm the exact split and the ‘semi-open book’ rules in your unit outline

Worked example · free

Block nested-loop join — the I/O cost, step by step

Q [5 marks]. You must join Student (b_R = 100 pages) with Enrollment (b_S = 400 pages) using a block nested-loop join with a buffer of M = 10 pages. Take Student as the outer relation. Compute the total I/O cost in page transfers (the output is not counted).

+1 Recall the BNLJ cost formula: cost = b_R + ⌈b_R / (M − 2)⌉ × b_S. One buffer frame is reserved for reading the inner relation and one for output, so each outer block is (M − 2) pages.
+1 Outer-block size: M − 2 = 10 − 2 = 8 pages per block of the outer relation.
+1 Number of outer blocks: ⌈b_R / (M − 2)⌉ = ⌈100 / 8⌉ = 13 blocks.
+1 Plug in: cost = 100 + 13 × 400 = 100 + 5,200.
+1 State the answer: total cost = 5,300 page I/Os. Student is read once, and Enrollment is re-scanned once per outer block (13 times).

Total I/O = b_R + ⌈b_R/(M−2)⌉ × b_S = 100 + ⌈100/8⌉ × 400 = 100 + 13 × 400 = 5,300 page transfers.

Sia tip — Always state which relation you made the outer one — it changes the block count and therefore the cost. Making the smaller relation the outer one minimises the re-scans of the inner. And the output is never counted in DATA3404's I/O cost model.
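
The worked example and the tip above can be checked with a short Python sketch. The function name `bnlj_cost` and its parameter names are hypothetical, chosen only for illustration; the formula is the one stated in the worked example, and the output is not counted.

```python
def bnlj_cost(b_outer: int, b_inner: int, m: int) -> int:
    """Block nested-loop join cost in page I/Os (output not counted).

    cost = b_outer + ceil(b_outer / (M - 2)) * b_inner
    One frame is reserved for the inner relation and one for output.
    """
    if m < 3:
        raise ValueError("need at least 3 buffer pages (1 inner, 1 output, 1+ outer)")
    block = m - 2                                  # pages per outer block
    outer_blocks = (b_outer + block - 1) // block  # integer ceiling
    return b_outer + outer_blocks * b_inner


# Student (100 pages) as outer, Enrollment (400 pages) as inner, M = 10
print(bnlj_cost(100, 400, 10))  # 5300

# Swapping the roles: the larger relation as outer costs more here
print(bnlj_cost(400, 100, 10))  # 5400
```

With these numbers, taking the smaller relation as the outer one saves 100 page I/Os, which matches the tip.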

Glossary

Key terms

Page I/O: The transfer of one fixed-size page between disk and memory — the single unit DATA3404 measures every cost in. The whole exam is about counting page I/Os; CPU time, sequential-vs-random access and the output write are all ignored in the cost model.
B+-tree: The default database index: a balanced, high-fan-out tree whose leaves hold the data entries in sorted order. A search costs ⌈log_F B⌉ + 1 page I/Os, where F is the fan-out and B is the number of leaf pages, so even a huge table is reached in a handful of page reads instead of a full scan.
External merge sort: The algorithm for sorting data that does not fit in memory. With N pages and B buffer frames it runs in 1 + ⌈log_(B−1) ⌈N/B⌉⌉ passes, each pass reading and writing all N pages, so the total cost is 2N × (number of passes).
Selectivity (reduction factor): The fraction of rows a predicate keeps. For example, an equality predicate on a column with V distinct values has selectivity 1/V, assuming the values are uniformly distributed. The optimiser multiplies these reduction factors to estimate the size of each intermediate result, which then drives every join cost estimate.
Data partitioning: Splitting a relation across the nodes of a cluster — round-robin (even load, no pruning), range (prunes range queries), or hash (prunes equality and enables co-located joins). The chosen scheme decides which queries can skip nodes and which distributed join is cheapest.
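
Two of the glossary formulas can be sketched in Python as a way to check hand calculations. This is a minimal sketch, not a definitive implementation: the function names are hypothetical, the pass count is obtained by simulating the merge passes rather than by taking a floating-point logarithm, and the B+-tree helper assumes the quantity inside the logarithm is the number of leaf pages. The sample inputs (108 pages with 5 frames, one million leaf pages with fan-out 100) are illustrative and do not come from the unit.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b, avoiding floating-point rounding."""
    return -(-a // b)


def external_sort_passes(n_pages: int, buffers: int) -> int:
    """Passes = 1 + ceil(log_{B-1}(ceil(N / B))), counted by simulation."""
    if n_pages < 1 or buffers < 3:
        raise ValueError("need at least 1 page and at least 3 buffer frames")
    runs = ceil_div(n_pages, buffers)        # pass 0: sorted runs of B pages each
    passes = 1
    while runs > 1:                          # each merge pass combines B - 1 runs
        runs = ceil_div(runs, buffers - 1)
        passes += 1
    return passes


def external_sort_cost(n_pages: int, buffers: int) -> int:
    """Each pass reads and writes all N pages: 2N x passes."""
    return 2 * n_pages * external_sort_passes(n_pages, buffers)


def btree_search_cost(leaf_pages: int, fan_out: int) -> int:
    """ceil(log_F(leaf_pages)) + 1 page I/Os (assumed reading of the formula)."""
    if leaf_pages < 1 or fan_out < 2:
        raise ValueError("need at least 1 leaf page and fan-out of at least 2")
    levels, reach = 0, 1
    while reach < leaf_pages:                # smallest k with F**k >= leaf_pages
        reach *= fan_out
        levels += 1
    return levels + 1


print(external_sort_passes(108, 5))       # 4
print(external_sort_cost(108, 5))         # 864
print(btree_search_cost(1_000_000, 100))  # 4
```

Counting passes with integer arithmetic sidesteps rounding errors that a floating-point logarithm can introduce when the run count is an exact power of B − 1.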

FAQ

DATA3404 FAQ

Is DATA3404 hard?

It is calculation-dense rather than conceptually deep. Most marks come from costing the engine by hand — join I/O, sort passes, B+-tree height, optimiser size estimates — so the difficulty is precision and speed under exam conditions, not memorising SQL. Once you can run the cost chains automatically, the numbers stop surprising you.

How is DATA3404 assessed?

The final exam is 60% and double-hurdled: you need at least 40% on the exam and at least 50% overall to pass the unit. The remaining 40% is continuous — two practical assignments (database programming & testing, then SQL + Spark tuning), weekly quizzes and participation, and a group concept video. Confirm this year's exact split in your unit outline.

What is on the DATA3404 final exam?

Storage and buffer management, indexing (B+-trees and hashing), external sorting and query processing, the five join algorithms and their I/O cost, query optimisation (selectivity and plan costing), and the scale-out stack — parallel and distributed databases, HDFS/MapReduce and Spark. The headline repeated tasks are computing a join's I/O cost and the number of passes of external sort.

Is the DATA3404 exam open book?

The unit overview describes the exam as semi-open book, but the exact materials you may bring are set on the current unit outline and exam instructions — do not assume you can bring this guide or unlimited notes. Study as if it were closed book and treat any permitted sheet as a bonus; confirm the rules on Canvas before the exam.

Do I need to be good at SQL or coding to pass DATA3404?

The practical assignments train your SQL and Spark, but the 60% exam tests the engine's internals — it asks you to predict cost, not to write code. You can write perfect SQL and still lose the exam if you can't cost a join or count sort passes, so prioritise the cost models when revising.

Is using AskSia for DATA3404 cheating?

No. AskSia is a study reference written in our own words — we host none of your lecturer's files, and Sia teaches you the method to earn the marks; it does not complete or sit your assessments.

Study strategy

How to study for the exam

Treat DATA3404 as a set of cost chains you can run on autopilot, because every high-value exam question is one of them: relation sizes + buffer → pick the join → compute I/O; N pages + B frames → sort passes → 2N × passes; records + fan-out → B+-tree height → search cost; query → push selections down → estimate sizes → cost the plan. Drill each chain until you can write the formula and every page count from memory. Method marks are real, so show the formula and the arithmetic even if the final number slips. The join and external-sort cost calculations are the two most repeated tasks in the whole unit, and mastering them goes a long way toward clearing the 40% exam hurdle. The Spark and NoSQL material is lighter and conceptual, so bank the cost-formula marks first.
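
The "estimate sizes" step of the last chain can be sketched as follows. This is a hedged illustration only: the function names and the sample numbers (10,000 rows, columns with 50 and 10 distinct values) are hypothetical, and multiplying the reduction factors assumes the predicates are independent and the values uniformly distributed.

```python
from fractions import Fraction
from typing import Iterable


def equality_selectivity(distinct_values: int) -> Fraction:
    """Reduction factor 1/V for an equality predicate on a column with V distinct values."""
    if distinct_values < 1:
        raise ValueError("a column needs at least one distinct value")
    return Fraction(1, distinct_values)


def estimated_rows(n_rows: int, reduction_factors: Iterable[Fraction]) -> Fraction:
    """Estimated result size: input rows times the product of the reduction factors."""
    estimate = Fraction(n_rows)
    for rf in reduction_factors:
        estimate *= rf                     # independence assumption
    return estimate


# Hypothetical table of 10,000 rows with two equality predicates
factors = [equality_selectivity(50), equality_selectivity(10)]
print(estimated_rows(10_000, factors))  # 20
```

Using `Fraction` keeps the arithmetic exact, which mirrors how the estimate is written out by hand.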
