DATA3404: pass the exams, not just read the notes
Your complete guide to University of Sydney's scalable data management unit. See where the marks are, work real practice questions, and study with an AI tutor that knows DATA3404.
Sia generates DATA3404 practice questions, walks through storage engines and tree-based indexing (b+-trees) step by step, and quizzes you on the material the exam weights most heavily.
Find what is wrong
A team runs DATA3404 Assignment 2 in PySpark. A small lookup DataFrame `regions` (a few KB) is joined to a huge `trips` DataFrame, then the count is printed. The job triggers a full shuffle of `trips` across the cluster and runs far slower than expected. Which single change fixes the performance bug?
regions = spark.read.parquet('regions') # tiny
trips = spark.read.parquet('trips') # billions of rows
joined = trips.join(regions, 'region_id')
result = joined.filter(joined.fare > 100)
print(result.count())
By default Spark uses a shuffle sort-merge join, which shuffles BOTH inputs across the cluster on the join key. Shuffling a billions-row table is the expensive step here.
Wrapping the small DataFrame in broadcast(regions) forces this strategy: trips.join(broadcast(regions), 'region_id') removes the shuffle of trips.
cache() only stores regions in memory, it does not stop the shuffle of trips. repartition adds more shuffling, not less. Pushing the filter down is a real optimisation but the fare>100 filter is on trips after the join and Catalyst already pushes such filters, so it does not remove the join shuffle; the join strategy is the bug.
The trap: Reaching for cache() or repartition() because the table is reused or skewed. Neither removes the shuffle of the huge table. The fix is choosing the broadcast join strategy so only the tiny table is moved. classic slip!
One exam decides 60% of your grade. Hurdle task: at least 40% required on this exam. This whole page is built around that.
Overview
What DATA3404 is, and where it sits
DATA3404 is a third-year database-systems-internals and big-data unit where you stop using databases and start understanding how they are built. You look under the hood of the engines that store, index, and query large data collections. The single-node half covers storage layout, buffer management, B+-tree, hash and bitmap indexing, external sorting, the three join algorithms, and cost-based query optimisation. The distributed half covers partitioning, replication, the CAP theorem, HDFS and MapReduce, Apache Spark (RDD, DataFrame and DAG execution), data-stream processing, and NoSQL.
The practical assignments use PostgreSQL, the SimpleDB educational engine, and Spark/Databricks, but the heavily weighted final exam does not assess your code. It tests hand-worked I/O cost models: join cost counting, external-sort pass formulas, B+-tree height and insertion mechanics, extendible-hashing splits, buffer-replacement traces, and selectivity-based query-optimisation costing. The lecturer motto is to learn the principles, not the software.
It is a 6-credit-point Level 3 unit in the School of Computer Science. It builds on an intermediate databases unit (DATA2001, DATA2901, ISYS2120, INFO2120 or INFO2820), is prohibited against INFO3504 and INFO3404, and feeds data-engineering, data-science and systems pathways.
Official outline: sydney.edu.au · DATA3404 outline. Always treat the official outline and the exam timetable as authoritative.
Difficulty & time commitment
Is DATA3404 hard, and how much time does it take?
DATA3404 is manageable if you keep a weekly rhythm and treat the back half as the main event. Across student reviews the pattern is consistent: it starts gently and steepens, and the heaviest assessment is the part that separates grades.
A read across student reviews and course feedback. See what students say ↓
The difficulty curve and the assessment weighting point the same way: the back half is harder and worth more. Front-loading effort there is the highest-return decision in the unit.
Is this unit for you
Who tends to do well, and who tends to struggle
You will likely do well if
- You enjoy working cost formulas by hand and stay precise with ceilings, floors, page counts and buffer sizes under exam time pressure
- You build a one-page cost-formula reference early (join formulas, external-sort passes, B+-tree height, extendible-hashing rules) and drill the worked examples until plugging in is automatic
- You treat the weekly quizzes and worksheets as exam rehearsal rather than box-ticking
- You can coordinate a group of three across two practical assignments without letting the SQL and Spark work crowd out exam-theory revision
You may struggle if
- You rely on coding ability to carry you, because the 60% exam explicitly does not assess code, it assesses the maths behind the engine
- You leave the cost models until swotvac, because they compound (joins reuse sorting, distributed joins reuse single-node join cost) and cannot be crammed cold
- You coast on group assignments and skip the individual quizzes and worksheets that rehearse the exam
- You are shaky on the prerequisite SQL, relational algebra or basic programming the unit assumes
- Memorise and can re-derive every cost formula: NLJ b_R + |R|*b_S, BNLJ b_R + ceil(b_R/(M-2))*b_S, INLJ b_R + |R|*(1+h+matches), sort-merge, hash 3*(b_R+b_S), and external-sort passes = 1 + ceil(log_(B-1) ceil(N/B))
- Always pick the smaller relation as the outer in a nested-loops join and explain why
- Practise the full chain without notes: a B+-tree insertion trace, an extendible-hashing build, a CLOCK or GCLOCK buffer trace, and a Car/Trip-style optimisation costing
- Carry single-node join reasoning into the distributed setting (per-node cost, then divide by the number of nodes) and pick the right distributed-join strategy
- Confirm the exact semi-open-book permitted materials with the coordinator and build your allowed notes accordingly
Syllabus
The 13 topics, week by week
The exam-weight marker on each topic shows where the marks concentrate. The amber topics carry the highest exam weight.
T1 · Organisation, admin and introduction
Hellerstein, Stonebraker & Hamilton (2007), Architecture of a Database SystemUnit-in-brief and the learn-the-principles-not-the-software motto; storage hierarchy from registers to disk and the access gap; pages as the unit of disk I/O; MB (decimal) versus MiB (binary).
T2 · Storage engines and physical data organisation
Ramakrishnan & Gehrke (storage); Petrov, Database InternalsHeap versus sorted file cost table; fixed and variable-length records; slotted page and TID; PostgreSQL TOAST; row versus column store and PAX; the buffer manager (pin count, dirty bit); replacement policies FIFO, LRU, MRU, LFU, CLOCK and GCLOCK.
T3 · Tree-based indexing (B+-trees)
Ramakrishnan & Gehrke (tree indexing)Index as an access path via a search key; ISAM versus B+-tree; entries only in leaves with sibling pointers; search cost = height + 1; insertion (copy up) and deletion (redistribute or merge); index classification (primary/secondary, clustered/unclustered, dense/sparse).
T4 · Hash-based and bitmap indexing
Ramakrishnan & Gehrke (hash indexing)Static hashing and overflow chains; extendible hashing with a directory of size 2^(global depth), local versus global depth, and the directory-doubling rule; bitmap indexes for low-cardinality attributes; bitmap versus Bloom filter; the EGSR index-choice heuristic and covering indexes.
T5 · Query processing, relational algebra and external sorting
Ramakrishnan & Gehrke (external sorting, query evaluation)SQL to relational algebra to physical plan; the six basic RA operators; access paths and CNF matching; materialisation versus pipelining; external merge sort with pass-0 runs = ceil(N/B), merging B-1 runs at a time, passes = 1 + ceil(log_(B-1) ceil(N/B)) and total I/O = 2N times passes.
T6 · Query execution and join algorithms
Ramakrishnan & Gehrke (join algorithms)The exam gold. Join types (theta, equi, natural, outer); simple, page and block nested-loops join; index nested-loops join; sort-merge join; grace/hash join at 3*(b_R+b_S); the canonical Student-Enrolled cost chain; choosing the smaller relation as outer.
T7 · Query optimisation and cost estimation
Ramakrishnan & Gehrke (query optimisation)Cost-based optimisation; RA equivalence rules (cascade, commute, push selections and projections down); selectivity = 1/V on V distinct values; left-deep join trees and System-R dynamic programming; statistics and the uniform-distribution assumption; the Car/Trip selectivity chain. Assignment 1 due.
T8 · Distributed data management: partitioning and replication
Lecture notes on distributed data managementParallel architectures (shared-memory, shared-disk, shared-nothing); speed-up versus scale-up; the CAP theorem; horizontal (sharding) versus vertical partitioning; round-robin, hash and range placement; co-partitioning on the join key; synchronous versus asynchronous replication; cluster MTBF scaling.
T9 · Distributed query and data processing; MapReduce
Lecture notes; Apache Spark / MapReduce materialDistributed join approaches (local reference, broadcast, distributed-shuffle, fragment-and-replicate); distributed parallel hash join; the MapReduce model (map, shuffle and sort, reduce) and WordCount; data locality and shared-nothing execution; the 9-node R-join-S cost comparison.
T10 · Distributed dataflow platforms: Spark and Flink
Apache Spark / Databricks documentationRDD (immutable, partitioned, in-memory, fault-tolerant via lineage); lazy transformations versus actions and the DAG; the DataFrame API and .explain(); Spark join strategies (broadcast, shuffle sort-merge, shuffle hash); the Catalyst optimiser and Adaptive Query Execution; HDFS with NameNode and DataNodes, 64 MB blocks and 3x replication.
T11 · Data stream management systems
Lecture notes on data stream processingStream versus batch processing; Spark Streaming; the pipelined dataflow runtime (Flink) for streams; continuous, unbounded data and the motivation for windowing.
T12 · NoSQL stores
Lecture notes on NoSQL storesNoSQL as non-relational or Not-Only-SQL; the drivers of scalability and schema flexibility; families (key-value, wide-column, document, graph); decoupled lakehouse architecture; MongoDB sharding and asynchronous primary-copy replication. Assignment 2 due Week 13.
T13 · Unit review and Assignment 2 due
Revision across all lecturesSynthesis across storage, indexing, query processing, optimisation and the distributed/Spark half; exam-style cost-model revision. Assignment 2 due.
How it's assessed
Assessment structure
| Component | Weight | Format & timing |
|---|---|---|
| Weekly homework quizzes | 6% | Individual Canvas quizzes on database-engine concepts; once started, a single timed submission with no resubmission. Weekly from Week 2. Closed-knowledge concept tests; together with the exam, about 66% of the unit is individually examined under time pressure. |
| Weekly participation | 5% | Submit a completed (unmarked) worksheet, or take part in the Week 12 or 13 tutorial peer activity. Mere attendance does not count. Marked as a percentage of the possible weekly participations. Weekly from Week 2. |
| Practical Assignment 1: Database Programming and Testing | 12% | Groupwork in threes: extend the SimpleDB educational engine in Python by implementing one chosen component such as a buffer replacer, external sort, a join algorithm, or an index structure. Due Week 7. |
| Practical Assignment 2: Complex Querying and Performance Tuning | 12% | Groupwork in threes: complex querying and performance tuning in SQL and Apache Spark (PostgreSQL versus Databricks/Spark) with query-plan analysis and a report. Due Week 13. |
| Presentation of DB Concepts | 5% | Groupwork in threes: a roughly 6-minute video explaining a data-processing concept, marked by tutors and peers. Generative AI is allowed for research and design but not to generate the video or audio. Across multiple weeks. |
| Final Exam | 60% | Paper-based, 2 hours, semi-open book; mostly short answer with possibly some MCQ; does not assess coding. Covers lectures, labs, homework and assignments. Formal exam period. Hurdle task: at least 40% required on this exam. |
- You must achieve at least 40% in the final exam AND an overall final mark of 50 or more. A student who misses either may be given a maximum final mark of no more than 45 regardless of their average. The final exam is flagged a hurdle task.
- Paper-based, 2 hours, semi-open book, mostly short answer with possibly some MCQ, testing hand-worked engine internals and cost models rather than coding.
- Calculator policy: Not specified in the available course pages; confirm with the unit coordinator.
This is an exam-cram unit. With the exams at 60% of the grade and the final exam alone at 60%, your result is overwhelmingly decided by how well you perform under time pressure. Hurdle task: at least 40% required on this exam.
Final exam timing: Formal examination period, Semester 1 2026 (exact date set by the USyd exams timetable). Confirm the exact date and venue on the official exam timetable.
How to actually pass it
A weekly rhythm, two checklists, and the traps to avoid
The unit rewards consistency over cramming, and practice over re-reading. Here is the loop that works, then what to have nailed before each exam.
The weekly loop
Before the mid-semester checklist
- Do each weekly worksheet by hand and submit it for the 5% participation
- Sit each weekly quiz as timed exam rehearsal (6%)
- After each lecture add its cost formula and one worked example to a master one-pager
- Lock in your SimpleDB component for Assignment 1 (Week 7) well before the deadline
- Drill the single-node mechanics early: B+-tree height and insertion, extendible-hashing splits, external-sort passes, and the storage cost table
Before the final heaviest topics
- Re-derive every join cost formula from memory and know which join works for equi versus non-equi conditions
- Drill B+-tree insertion and deletion and height estimation; extendible-hashing splits and the directory-doubling rule; LRU, CLOCK and GCLOCK traces; external-sort passes and total I/O
- Work a Car/Trip-style optimisation chain: selectivity = 1/V, push selections down, pick the index pair, compare plan costs
- Practise distributed joins: compute per-node cost for local-reference, broadcast, shuffle-hash and fragment-and-replicate and compare to centralised
- Confirm the exact semi-open-book permitted materials with the coordinator and prepare your allowed notes accordingly
- Remember the hurdle: target well above 40% on the exam and 50% overall, since failing either caps you at 45
The mistakes that cost marks
Counting on your code to carry you. The 60% exam does not assess your code. Strong assignment marks will not save you if you cannot hand-compute cost models. Skew revision to cost estimation, join I/O counting, indexing maths, external-sort passes and query-optimisation rewriting.
Assuming the exam is fully open book. It is described as semi-open book, but the available course pages do not specify which materials are permitted. Do not assume you can bring everything; confirm the allowed materials with the coordinator first.
Using the wrong block size in BNLJ. This course's worksheets use M-2 (one input page and one output page reserved), not M-1. Mixing them up changes every block nested-loops answer.
Forgetting to put the smaller relation as the outer. In nested-loops joins the outer relation should be the smaller one. Forgetting this gives a much larger and wrong cost.
Trusting the published week numbering. The Unit Schedule page's week numbers can disagree with the quiz and worksheet ordering. Follow the lecture and worksheet sequence: sorting, then joins, then optimisation, then distributed.
Leaving the cost models until swotvac. They compound. Joins reuse sorting, distributed joins reuse single-node join cost, so they cannot be crammed cold. Build the master one-pager from Week 2.
Teaching team
Who teaches DATA3404
The bios below are factual. The star ratings are not ours: they are impressions from students who have taken the unit, so you can hear from people who sat in the lectures.
Uwe Roehm
Associate Professor in the School of Computer Science. Research in database systems and data management, scalable data science and big data, cloud computing, machine learning with database systems, data management on multicore hardware, and replication and caching. Organises the School's Database Research Group. Staff profile
Teaching team as listed in the unit materials reviewed. AskSia does not rate lecturers; star ratings are submitted by students who have taken DATA3404.
Formula & concept sheet
The vocabulary and formulas you must own
- Slotted page
- A page layout for variable-length records: a header plus a slot directory of (offset, length) entries growing from one end and records from the other. A record can move within the page without changing its TID.
- TID (Tuple ID)
- A row identifier of the form (page id, slot number); the slot-directory indirection lets a record move within its page without the TID changing.
- Buffer manager
- The component that brings disk pages into memory frames, tracks a pin count and dirty bit per frame, and evicts unpinned frames by a replacement policy (writing back dirty pages first).
- CLOCK (second chance)
- A buffer-replacement policy using a circular frame list and a reference bit: on eviction a set bit is cleared and the hand advances; a clear bit, if unpinned, is evicted.
- B+-tree
- A dynamic, balanced, multi-level index with entries only in the leaves and sibling pointers between leaves for range scans; search cost is about height + 1.
- Clustered index
- An index whose data records are stored in the same order as the index entries; at most one per table. It makes range scans cheap because matches sit on consecutive pages.
- Extendible hashing
- A dynamic hash index using a directory of size 2^(global depth); a bucket overflow splits the bucket and doubles the directory only when the bucket's local depth equals the global depth.
- External merge sort
- Sorting data larger than memory: a pass-0 sort phase producing ceil(N/B) runs, then merge passes combining B-1 runs at a time.
- Selectivity
- The fraction of input tuples a predicate keeps; for an equality on an attribute with V distinct values it is estimated as 1/V, assuming a uniform distribution.
- Left-deep join tree
- A join order in which the right input of every join is a base relation, enabling fully pipelined plans; System-R's dynamic-programming optimiser considers only these.
- Shared-nothing
- A parallel architecture where each node has its own CPU, memory and disk and communicates only by network; it scales to thousands of nodes and is the practice standard for big data.
- CAP theorem
- In a distributed data system you can guarantee at most two of Consistency, Availability and Partition tolerance.
- RDD (Resilient Distributed Dataset)
- Spark's immutable, partitioned, in-memory collection that is fault-tolerant via lineage: a lost partition is rebuilt from its derivation.
- Lazy evaluation
- In Spark, transformations only record a DAG of operations; nothing executes until an action is called.
Common acronyms: B+-tree search cost = ceil(log_F B) + 1 = index height + 1 · External sort passes = 1 + ceil(log_(B-1) ceil(N/B)); total I/O = 2N * passes · Simple nested-loops join = b_R + |R| * b_S (smaller relation as outer) · Page nested-loops join = b_R + b_R * b_S · Block nested-loops join = b_R + ceil(b_R/(M-2)) * b_S · Index nested-loops join = b_R + |R| * c, with c = 1 + h + matches · Sort-merge join = b_R + b_S (sorted), else sort(R) + sort(S) + b_R + b_S · Grace/hash join = 3 * (b_R + b_S) · Equality selectivity on V distinct values = 1/V; matches = |R|/V · Cluster MTBF for k identical nodes = MTBF_1 / k.
What students say
What students actually say about DATA3404
Recurring themes from student reviews, paraphrased in our own words.
- No first-hand student difficulty rating is currently available for this unit, so the difficulty read is editorial: a 60% semi-open-book final, a double hurdle, and hand-worked cost models place it in the harder quant/CS cluster
- A substantial library of student-shared materials exists (lecture notes, topic summaries, tutorial and worksheet solutions, and final-exam study notes), pointing to heavy demand for revision support in this unit
- Organised, cost-model-focused revision notes are what students reach for, with at least one widely viewed compiled note set described by its author as having achieved a high distinction
- The most-shared materials cluster around the exam-heavy mechanics: buffer management and page structures, the storage layer, query processing, and distributed data management
Recurring student opinions, paraphrased and aggregated, not official course information.
Set texts
The prescribed reading
The syllabus references map straight onto these.
Architecture of a Database System
Hellerstein, J. M., Stonebraker, M. & Hamilton, J. (2007), Foundations and Trends in Databases. Publisher page
Database Management Systems (3rd ed.)
Ramakrishnan, R. & Gehrke, J. (2003), McGraw-Hill.
Database Internals
Petrov, A. (2019), O'Reilly.
Database Systems: The Complete Book (2nd ed.)
Garcia-Molina, H., Ullman, J. D. & Widom, J.
Where it fits
Prerequisites, related units & why it matters
Formal prerequisite: DATA2001 or DATA2901 or ISYS2120 or INFO2120 or INFO2820. The unit assumes you already know SQL, schema design, relational algebra, and can program in Python or Java. It is prohibited against INFO3504 and INFO3404.
Your DATA3404 study toolkit
Study the unit with Sia, not just read about it
Each tool already knows DATA3404: your syllabus, your texts, and where the marks are. Grouped by how you study, from first contact to exam week.
FAQ
Frequently asked questions
How is DATA3404 assessed?
A 60% paper-based, 2-hour semi-open-book final exam, plus two 12% group practical assignments (a SimpleDB engine extension due Week 7 and SQL plus Apache Spark performance tuning due Week 13), a 5% group video presentation, 6% weekly Canvas quizzes, and 5% weekly participation.
What is the hurdle to pass DATA3404?
You must score at least 40% on the final exam AND at least 50% overall. If you miss either, you may be capped at a maximum final mark of 45 regardless of your average. The exam is officially flagged a hurdle task.
Is the DATA3404 exam open book?
It is described as semi-open book and is paper-based, 2 hours, mostly short answer. However, the available course pages do not specify which materials are permitted, so confirm the exact allowed materials with the unit coordinator before relying on bringing notes.
Does the DATA3404 exam test coding?
No. The final exam explicitly does not assess coding. Coding is assessed in the two group practical assignments. The exam tests hand-worked engine internals: cost models, indexing, sorting and join calculations, and query-optimisation reasoning.
What are the most important things to drill for the DATA3404 exam?
The join I/O cost formulas (nested-loops, block nested-loops, index nested-loops, sort-merge and hash), external-merge-sort pass counting, B+-tree mechanics, extendible-hashing splits and directory doubling, buffer-replacement traces (LRU, CLOCK and GCLOCK), and selectivity-based query-optimisation costing.
What background do I need for DATA3404?
The formal prerequisite is DATA2001 or DATA2901 or ISYS2120 or INFO2120 or INFO2820. The unit assumes you already know SQL, schema design and relational algebra, and can program in Python or Java. It is prohibited against INFO3504 and INFO3404.
Is there a required textbook for DATA3404?
There is no single required textbook. The course leans on the paper Architecture of a Database System (Hellerstein, Stonebraker and Hamilton, 2007) and recommends references including Ramakrishnan and Gehrke's Database Management Systems and Petrov's Database Internals.
Study DATA3404 with Sia
Work through storage engines, tree-based indexing (b+-trees), hash-based and the rest of the unit with a tutor that knows it and quizzes you on the topics the assessments weight most heavily.
Start studying with Sia