DATA3404 · Scalable Data Management
Scalable Data Management teaches how a database engine actually works underneath SQL — how data is laid out on disk pages, how indexes turn a full scan into a lookup, how external sorting and the five join algorithms run, how the query optimiser picks a plan, and how the whole thing scales out across a cluster with parallel databases and Spark. The exam does not test SQL syntax; it tests whether you can predict the engine's cost by hand — page I/Os for a join, passes of external sort, the height of a B+-tree, the optimiser's size estimates. The final exam is 60% of your grade and double-hurdled (you need ≥40% on the exam and ≥50% overall to pass), so this guide teaches each topic to exam standard: the mechanism, the cost formula examiners expect, and exactly where the marks are won.
What DATA3404 covers
Seven exam topics → one cost-model map. Each links to its free chapter guide.
How DATA3404 is assessed
| Component | Weight | Format |
|---|---|---|
| Final examination · hurdle | 60% | Paper-based, 2 hours, semi-open book · ≥40% hurdle — and you also need ≥50% overall to pass (double hurdle) |
| Practical assignments (×2) | ~24% | DB programming & testing, then SQL + Spark tuning — group work across the semester |
| Quizzes, participation & a concept presentation | ~16% | Weekly quizzes and a group video presentation — confirm the exact split and the ‘semi-open book’ rules in your unit outline |
Block nested-loop join — the I/O cost, step by step
- +1 Recall the BNLJ cost formula: cost = bR + ⌈bR / (M − 2)⌉ × bS, where bR is the page count of the outer relation (Student, 100 pages), bS is the page count of the inner relation (Enrollment, 400 pages) and M is the number of buffer frames (10). One buffer frame is reserved for reading the inner relation and one for output, so each outer block is (M − 2) pages.
- +1 Outer-block size: M − 2 = 10 − 2 = 8 pages per block of the outer relation.
- +1 Number of outer blocks: ⌈bR / (M − 2)⌉ = ⌈100 / 8⌉ = ⌈12.5⌉ = 13 blocks.
- +1 Plug in: cost = 100 + 13 × 400 = 100 + 5,200.
- +1 State the answer: total cost = 5,300 page I/Os. Student is read once, and Enrollment is re-scanned once per outer block (13 times).
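The steps above can be checked with a few lines of Python. This is a minimal sketch, not part of the unit materials: the function name `bnlj_cost` and its parameter names are illustrative, and it assumes the same cost model as the worked example (one frame for the inner relation, one for output, output write ignored).

```python
def bnlj_cost(b_outer: int, b_inner: int, frames: int) -> int:
    """Page I/Os for a block nested-loop join: bR + ceil(bR / (M - 2)) * bS."""
    if frames < 3:
        raise ValueError("need at least 3 buffer frames (1 inner, 1 output, 1+ outer)")
    block_size = frames - 2                    # pages of the outer relation per block
    outer_blocks = -(-b_outer // block_size)   # integer ceiling of b_outer / block_size
    return b_outer + outer_blocks * b_inner    # read outer once, rescan inner per block


# Worked example: Student (100 pages) as outer, Enrollment (400 pages) as inner, M = 10.
print(bnlj_cost(100, 400, 10))  # 5300
```

Calling it with the relations swapped, `bnlj_cost(400, 100, 10)`, gives 400 + 50 × 100 = 5,400 under the same formula, which is why the smaller relation is usually chosen as the outer.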
Key terms
- Page I/O
- The transfer of one fixed-size page between disk and memory — the single unit DATA3404 measures every cost in. The whole exam is about counting page I/Os; CPU time, sequential-vs-random access and the output write are all ignored in the cost model.
- B+-tree
The default database index: a balanced, high-fan-out tree whose leaves hold the data entries in sorted order. A search costs ⌈log_F B⌉ + 1 page I/Os, where F is the fan-out, so even a huge table is reached in a handful of page reads instead of a full scan.
- External merge sort
The algorithm for sorting data that does not fit in memory. With N pages and B buffer frames it runs in 1 + ⌈log_(B−1) ⌈N/B⌉⌉ passes, each pass reading and writing all N pages, so the total cost is 2N × (number of passes).
- Selectivity (reduction factor)
The fraction of rows a predicate keeps. For example, an equality predicate on an attribute with V distinct values has selectivity 1/V, assuming the values are uniformly distributed. The optimiser multiplies these reduction factors to estimate the size of each intermediate result, which then drives every join cost estimate.
- Data partitioning
- Splitting a relation across the nodes of a cluster — round-robin (even load, no pruning), range (prunes range queries), or hash (prunes equality and enables co-located joins). The chosen scheme decides which queries can skip nodes and which distributed join is cheapest.
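Two of the formulas in the key terms above, the external merge sort pass count and the B+-tree search cost, can be sketched in Python as follows. The function names are illustrative only, and one reading is assumed: the guide does not define B in the B+-tree formula, so it is treated here as the number of leaf pages. Integer loops are used instead of floating-point logarithms to avoid rounding errors at exact powers.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return -(-a // b)


def sort_passes(n_pages: int, frames: int) -> int:
    """Passes of external merge sort: 1 + ceil(log_{B-1}(ceil(N / B)))."""
    if n_pages < 1 or frames < 3:
        raise ValueError("need at least 1 page and 3 buffer frames")
    runs = ceil_div(n_pages, frames)        # pass 0 produces sorted runs of B pages each
    passes = 1
    while runs > 1:
        runs = ceil_div(runs, frames - 1)   # each later pass is a (B-1)-way merge
        passes += 1
    return passes


def sort_cost(n_pages: int, frames: int) -> int:
    """Total page I/Os: every pass reads and writes all N pages."""
    return 2 * n_pages * sort_passes(n_pages, frames)


def btree_search_cost(b: int, fanout: int) -> int:
    """ceil(log_F(B)) + 1, with B assumed to be the number of leaf pages."""
    if b < 1 or fanout < 2:
        raise ValueError("need at least 1 page and a fan-out of 2 or more")
    levels, reach = 0, 1
    while reach < b:                        # smallest levels with fanout**levels >= b
        reach *= fanout
        levels += 1
    return levels + 1


print(sort_passes(108, 5))             # 4
print(sort_cost(108, 5))               # 864
print(btree_search_cost(10_000, 100))  # 3
```

For N = 108 pages and B = 5 frames, pass 0 leaves 22 runs, and three 4-way merge passes reduce them 22 → 6 → 2 → 1, giving 4 passes and 2 × 108 × 4 = 864 page I/Os.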
DATA3404 FAQ
Is DATA3404 hard?
It is calculation-dense rather than conceptually deep. Most marks come from costing the engine by hand — join I/O, sort passes, B+-tree height, optimiser size estimates — so the difficulty is precision and speed under exam conditions, not memorising SQL. Once you can run the cost chains automatically, the numbers stop surprising you.
How is DATA3404 assessed?
The final exam is 60% and double-hurdled: you need at least 40% on the exam and at least 50% overall to pass the unit. The remaining 40% is continuous — two practical assignments (database programming & testing, then SQL + Spark tuning), weekly quizzes and participation, and a group concept video. Confirm this year's exact split in your unit outline.
What is on the DATA3404 final exam?
Storage and buffer management, indexing (B+-trees and hashing), external sorting and query processing, the five join algorithms and their I/O cost, query optimisation (selectivity and plan costing), and the scale-out stack — parallel and distributed databases, HDFS/MapReduce and Spark. The headline repeated tasks are computing a join's I/O cost and the number of passes of external sort.
Is the DATA3404 exam open book?
The unit overview describes the exam as semi-open book, but the exact materials you may bring are set on the current unit outline and exam instructions — do not assume you can bring this guide or unlimited notes. Study as if it were closed book and treat any permitted sheet as a bonus; confirm the rules on Canvas before the exam.
Do I need to be good at SQL or coding to pass DATA3404?
The practical assignments train your SQL and Spark, but the 60% exam tests the engine's internals — it asks you to predict cost, not to write code. You can write perfect SQL and still lose the exam if you can't cost a join or count sort passes, so prioritise the cost models when revising.
Is using AskSia for DATA3404 cheating?
No. AskSia is a study reference written in our own words — we host none of your lecturer's files, and Sia teaches you the method to earn the marks; it does not complete or sit your assessments.
How to study for the exam
Treat DATA3404 as a set of cost chains you can run on autopilot, because every high-value exam question is one of them: relation sizes + buffer → pick the join → compute I/O; N pages + B frames → sort passes → 2N × passes; records + fan-out → B+-tree height → search cost; query → push selections down → estimate sizes → cost the plan. Drill each chain until you can write the formula and every page count from memory. Method marks are real, so show the formula and the arithmetic even if the final number slips. The join and external-sort cost calculations are the two most repeated tasks in the whole unit, and getting them right goes a long way towards clearing the 40% exam hurdle. The Spark and NoSQL material is lighter and conceptual, so bank the cost-formula marks first.
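The "estimate sizes" step of the last chain can be sketched as below. This is a hedged illustration: `estimate_rows` is a hypothetical helper, it covers only equality predicates with selectivity 1/V, and it assumes the predicates are independent so that their reduction factors can simply be multiplied. The table size and distinct-value counts are made-up example numbers.

```python
from fractions import Fraction
from math import prod
from typing import Sequence


def estimate_rows(input_rows: int, distinct_counts: Sequence[int]) -> Fraction:
    """Estimated result size after a conjunction of equality predicates.

    Each predicate on an attribute with V distinct values contributes a
    reduction factor of 1/V; the factors are multiplied (independence assumed).
    """
    factors = [Fraction(1, v) for v in distinct_counts]
    return input_rows * prod(factors, start=Fraction(1))


# Hypothetical example: 10,000 rows, equality predicates on attributes
# with 50 and 10 distinct values, so 10,000 x (1/50) x (1/10).
print(estimate_rows(10_000, [50, 10]))  # 20
```

`Fraction` keeps the arithmetic exact, which mirrors how the estimate is written out by hand; convert with `float(...)` if a decimal is preferred.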