USyd · DATA3404 · Scalable Data Management

DATA3404: pass the exams, not just read the notes

Your complete guide to University of Sydney's scalable data management unit. See where the marks are, work real practice questions, and study with an AI tutor that knows DATA3404.

6 credit points · Level 3 undergrad · Offered S1 · ~60% exams · School of Computer Science

Sia generates DATA3404 practice questions, walks through storage engines and tree-based indexing (B+-trees) step by step, and quizzes you on the material the exam weights most heavily.

Spot the bug

Find what is wrong

Multiple choice · the fix is revealed after you answer

A team runs DATA3404 Assignment 2 in PySpark. A small lookup DataFrame `regions` (a few KB) is joined to a huge `trips` DataFrame, then the count is printed. The job triggers a full shuffle of `trips` across the cluster and runs far slower than expected. Which single change fixes the performance bug?

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    regions = spark.read.parquet('regions')   # tiny
    trips   = spark.read.parquet('trips')     # billions of rows
    joined  = trips.join(regions, 'region_id')
    result  = joined.filter(joined.fare > 100)
    print(result.count())

The fix

When Spark does not plan a broadcast, its default equi-join strategy is a shuffle sort-merge join, which shuffles BOTH inputs across the cluster on the join key. Shuffling a billions-row table is the expensive step here.

When one side is tiny (below spark.sql.autoBroadcastJoinThreshold), the right strategy is a broadcast join: send the small table to every node and do a local hash join, so the huge table never moves.
Wrapping the small DataFrame in broadcast(regions) forces this strategy: trips.join(broadcast(regions), 'region_id') removes the shuffle of trips.
cache() only stores regions in memory; it does not stop the shuffle of trips. repartition() adds more shuffling, not less. Pushing the filter down is a real optimisation, but the fare > 100 filter sits on trips after the join and Catalyst already pushes such filters down, so it does not remove the join shuffle; the join strategy is the bug.

The trap: Reaching for cache() or repartition() because the table is reused or skewed. Neither removes the shuffle of the huge table. The fix is choosing the broadcast join strategy so only the tiny table is moved.
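
The difference between the two strategies can be sketched without a cluster. The toy model below is plain Python, not Spark: `trips_partitions` stands in for partitions of `trips` that already live on different nodes, and the function names and sample rows are made up for illustration. It builds one hash table on the small side, probes each big partition where it sits, and counts the rows that would cross the network.

```python
from collections import defaultdict

def broadcast_hash_join(big_partitions, small_rows, key):
    """Join every partition of the big table locally against a copy of the small table."""
    # Build one hash table on the small side; each node receives a full copy.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)
    rows_moved = len(small_rows) * len(big_partitions)

    joined = []
    for partition in big_partitions:      # the big table is probed in place
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
    return joined, rows_moved

def shuffle_join_rows_moved(big_partitions, small_rows):
    """Rough upper bound for a shuffle join: every row of both inputs may be re-sent."""
    return sum(len(p) for p in big_partitions) + len(small_rows)

# Hypothetical sample data: two partitions of trips, one tiny regions table.
regions = [{"region_id": 1, "name": "North"}, {"region_id": 2, "name": "South"}]
trips_partitions = [
    [{"trip_id": 10, "region_id": 1, "fare": 120.0},
     {"trip_id": 11, "region_id": 2, "fare": 80.0}],
    [{"trip_id": 12, "region_id": 1, "fare": 150.0},
     {"trip_id": 13, "region_id": 3, "fare": 200.0}],
]

joined, moved = broadcast_hash_join(trips_partitions, regions, "region_id")
expensive = [row for row in joined if row["fare"] > 100]
```

In this model the broadcast cost depends only on the size of the small table and the number of nodes, while the shuffle cost grows with the big table. In PySpark itself the fix is the one named above: `trips.join(broadcast(regions), 'region_id')`.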

Where your grade comes from

Exams 60% · Projects 24% · Quizzes 6% · Participation 5% · Presentations 5%

Final exam · 60%

One exam decides 60% of your grade. Hurdle task: at least 40% required on this exam. This whole page is built around that.

Overview

What DATA3404 is, and where it sits

DATA3404 is a third-year database-systems-internals and big-data unit where you stop using databases and start understanding how they are built. You look under the hood of the engines that store, index, and query large data collections. The single-node half covers storage layout, buffer management, B+-tree, hash and bitmap indexing, external sorting, the three join algorithms, and cost-based query optimisation. The distributed half covers partitioning, replication, the CAP theorem, HDFS and MapReduce, Apache Spark (RDD, DataFrame and DAG execution), data-stream processing, and NoSQL.

The practical assignments use PostgreSQL, the SimpleDB educational engine, and Spark/Databricks, but the heavily weighted final exam does not assess your code. It tests hand-worked I/O cost models: join cost counting, external-sort pass formulas, B+-tree height and insertion mechanics, extendible-hashing splits, buffer-replacement traces, and selectivity-based query-optimisation costing. The lecturer's motto is to learn the principles, not the software.

It is a 6-credit-point Level 3 unit in the School of Computer Science. It builds on an intermediate databases unit (DATA2001, DATA2901, ISYS2120, INFO2120 or INFO2820), is prohibited against INFO3504 and INFO3404, and feeds data-engineering, data-science and systems pathways.

How it differs from adjacent data units. Where QBUS5001 and BUSS6002 use data tools to answer business questions, DATA3404 is about the machinery underneath: it asks why a query is slow and how the engine executes it, costed by hand in page I/Os, rather than how to model or interpret the data.

Official outline: sydney.edu.au · DATA3404 outline. Always treat the official outline and the exam timetable as authoritative.

Difficulty & time commitment

Is DATA3404 hard, and how much time does it take?

DATA3404 is manageable if you keep a weekly rhythm and treat the back half as the main event. Across student reviews the pattern is consistent: it starts gently and steepens, and the heaviest assessment is the part that separates grades.

Difficulty

3.5 / 5

Hard. Gentle early, demanding back half. Hard to fail with steady work; an HD takes consistent practice.

Exam load

60%

The exams decide most of the grade. The heaviest single component is 60%.

Weekly time

~10 hrs

The standard load for a 6-credit-point unit, around 1.5 hours per credit point per week including class.

A read across student reviews and course feedback.

Weeks 1 to 4 (storage, indexing): building

Weeks 5 to 7 (sorting, joins, optimisation): steep

Weeks 8 to 13 (distributed, Spark, NoSQL): reuses join cost reasoning per node

The difficulty curve and the assessment weighting point the same way: the back half is harder and worth more. Front-loading effort there is the highest-return decision in the unit.

Is this unit for you

Who tends to do well, and who tends to struggle

You will likely do well if

You enjoy working cost formulas by hand and stay precise with ceilings, floors, page counts and buffer sizes under exam time pressure
You build a one-page cost-formula reference early (join formulas, external-sort passes, B+-tree height, extendible-hashing rules) and drill the worked examples until plugging in is automatic
You treat the weekly quizzes and worksheets as exam rehearsal rather than box-ticking
You can coordinate a group of three across two practical assignments without letting the SQL and Spark work crowd out exam-theory revision

You may struggle if

You rely on coding ability to carry you, because the 60% exam explicitly does not assess code; it assesses the maths behind the engine
You leave the cost models until swotvac, because they compound (joins reuse sorting, distributed joins reuse single-node join cost) and cannot be crammed cold
You coast on group assignments and skip the individual quizzes and worksheets that rehearse the exam
You are shaky on the prerequisite SQL, relational algebra or basic programming the unit assumes

What HD students do differently

Memorise and re-derive every cost formula: NLJ b_R + |R|*b_S, BNLJ b_R + ceil(b_R/(M-2))*b_S, INLJ b_R + |R|*(1+h+matches), sort-merge b_R + b_S on sorted inputs (plus the sort costs otherwise), hash 3*(b_R+b_S), and external-sort passes = 1 + ceil(log_(B-1) ceil(N/B))
Always pick the smaller relation as the outer in a nested-loops join and explain why
Practise the full chain without notes: a B+-tree insertion trace, an extendible-hashing build, a CLOCK or GCLOCK buffer trace, and a Car/Trip-style optimisation costing
Carry single-node join reasoning into the distributed setting (per-node cost, then divide by the number of nodes) and pick the right distributed-join strategy
Confirm the exact semi-open-book permitted materials with the coordinator and build your allowed notes accordingly
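
The join formulas in the list above can be checked mechanically. A minimal sketch follows, using the formulas exactly as written on this page; the function names and the sample statistics (100 pages and 1,000 tuples for R, 500 pages for S, 12 buffer pages, index height 2, one match per probe) are made up for illustration and are not the unit's Student-Enrolled figures.

```python
from math import ceil

def simple_nlj(b_r, n_r, b_s):
    """Simple nested-loops join: scan S once per tuple of R."""
    return b_r + n_r * b_s

def page_nlj(b_r, b_s):
    """Page nested-loops join: scan S once per page of R."""
    return b_r + b_r * b_s

def block_nlj(b_r, b_s, m):
    """Block nested-loops join with M buffer pages, M-2 of them holding the outer block."""
    return b_r + ceil(b_r / (m - 2)) * b_s

def index_nlj(b_r, n_r, h, matches):
    """Index nested-loops join: one index probe of cost 1 + h + matches per tuple of R."""
    return b_r + n_r * (1 + h + matches)

def grace_hash(b_r, b_s):
    """Grace hash join: partition both inputs (read + write), then read again to join."""
    return 3 * (b_r + b_s)

# Hypothetical statistics, chosen only to make the arithmetic easy to follow.
B_R, N_R, B_S, M, H, MATCHES = 100, 1_000, 500, 12, 2, 1

costs = {
    "simple NLJ": simple_nlj(B_R, N_R, B_S),
    "page NLJ": page_nlj(B_R, B_S),
    "page NLJ, larger relation as outer": page_nlj(B_S, B_R),
    "block NLJ": block_nlj(B_R, B_S, M),
    "index NLJ": index_nlj(B_R, N_R, H, MATCHES),
    "grace hash": grace_hash(B_R, B_S),
}
```

With these numbers the two page nested-loops entries differ only in which relation is the outer, which makes the smaller-relation-as-outer rule visible in page I/Os.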

Syllabus

The 13 topics, week by week

The exam-weight marker on each topic shows where the marks concentrate. Topics marked High exam weight carry the most exam marks.

T1 · Organisation, admin and introduction

Hellerstein, Stonebraker & Hamilton (2007), Architecture of a Database System

Unit-in-brief and the learn-the-principles-not-the-software motto; storage hierarchy from registers to disk and the access gap; pages as the unit of disk I/O; MB (decimal) versus MiB (binary).

Lower exam weight
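
The MB versus MiB distinction changes page counts, which is why it matters for I/O costing. A minimal sketch, assuming an 8 KiB page; the page size and the helper name `pages_needed` are illustrative choices, not taken from the unit materials.

```python
MB = 10**6     # decimal megabyte
MIB = 2**20    # binary mebibyte
PAGE_BYTES = 8 * 1024  # assumed page size (8 KiB), for illustration only

def pages_needed(file_bytes, page_bytes=PAGE_BYTES):
    """Number of whole pages needed to hold file_bytes (ceiling division)."""
    return -(-file_bytes // page_bytes)

pages_decimal = pages_needed(100 * MB)
pages_binary = pages_needed(100 * MIB)
```

Under these assumptions a 100 MB file needs 12,208 pages and a 100 MiB file needs 12,800, a gap of 592 page I/Os on every full scan.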

T2 · Storage engines and physical data organisation

Ramakrishnan & Gehrke (storage); Petrov, Database Internals

Heap versus sorted file cost table; fixed and variable-length records; slotted page and TID; PostgreSQL TOAST; row versus column store and PAX; the buffer manager (pin count, dirty bit); replacement policies FIFO, LRU, MRU, LFU, CLOCK and GCLOCK.

High exam weight

T3 · Tree-based indexing (B+-trees)

Ramakrishnan & Gehrke (tree indexing)

Index as an access path via a search key; ISAM versus B+-tree; entries only in leaves with sibling pointers; search cost = height + 1; insertion (copy up) and deletion (redistribute or merge); index classification (primary/secondary, clustered/unclustered, dense/sparse).

High exam weight
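
The search cost = height + 1 rule can be worked through with integers only. The sketch below assumes the common reading of the height formula, ceil(log_F B), where F is the average fan-out and B is the number of leaf pages; the function names and sample figures are hypothetical.

```python
def btree_height(leaf_pages, fanout):
    """Smallest h with fanout**h >= leaf_pages, i.e. ceil(log_F(leaf_pages))."""
    height, reach = 0, 1
    while reach < leaf_pages:
        reach *= fanout
        height += 1
    return height

def search_cost(leaf_pages, fanout):
    """Page reads for one lookup under the height + 1 rule."""
    return btree_height(leaf_pages, fanout) + 1

exact_fit = search_cost(10_000, 100)      # 100**2 leaf pages fit under two index levels
one_page_more = search_cost(10_001, 100)  # one extra leaf page forces a third level
```

Using a loop rather than a floating-point logarithm avoids rounding surprises at exact powers of the fan-out, which is where exam questions tend to sit.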

T4 · Hash-based and bitmap indexing

Ramakrishnan & Gehrke (hash indexing)

Static hashing and overflow chains; extendible hashing with a directory of size 2^(global depth), local versus global depth, and the directory-doubling rule; bitmap indexes for low-cardinality attributes; bitmap versus Bloom filter; the EGSR index-choice heuristic and covering indexes.

High exam weight
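
The directory-doubling rule is easiest to see in a small trace. The sketch below is one possible implementation under stated assumptions: distinct integer keys, the key itself used as the hash value, the low-order bits used to index the directory, and a bucket capacity of 2. The class and method names are made up, and the worksheets may index by different bits.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHash:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.global_depth = 0
        self.directory = [Bucket(0)]  # directory size is 2 ** global_depth

    def _slot(self, key):
        # Index the directory with the low-order global_depth bits of the key.
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._slot(key)]
        if len(bucket.keys) < self.capacity:
            bucket.keys.append(key)
            return
        # Overflow. Double the directory only when local depth equals global depth.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split the bucket on its next bit and repoint half of its directory slots.
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        split_bit = 1 << (bucket.local_depth - 1)
        for slot, target in enumerate(self.directory):
            if target is bucket and slot & split_bit:
                self.directory[slot] = sibling
        # Redistribute the old keys plus the new one (may trigger a further split).
        pending, bucket.keys = bucket.keys + [key], []
        for pending_key in pending:
            self.insert(pending_key)

table = ExtendibleHash(capacity=2)
for key in [0, 1, 2, 3, 4]:
    table.insert(key)
```

After these five inserts the directory has four slots, and slots 1 and 3 still share one bucket with local depth 1. Inserting 5 next overflows that bucket and splits it without doubling the directory, because its local depth is below the global depth.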

T5 · Query processing, relational algebra and external sorting

Ramakrishnan & Gehrke (external sorting, query evaluation)

SQL to relational algebra to physical plan; the six basic RA operators; access paths and CNF matching; materialisation versus pipelining; external merge sort with pass-0 runs = ceil(N/B), merging B-1 runs at a time, passes = 1 + ceil(log_(B-1) ceil(N/B)) and total I/O = 2N times passes.

High exam weight
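
The pass formula can be cross-checked by simulating the run counts. A minimal sketch, with N as the number of data pages and B as the number of buffer pages as in the formula above; the function name and the 108-page, 5-buffer example are illustrative.

```python
def external_sort_cost(n_pages, buffers):
    """Return (passes, total_io) for external merge sort with the given buffer pool."""
    runs = -(-n_pages // buffers)             # pass 0 produces ceil(N/B) sorted runs
    passes = 1
    while runs > 1:
        runs = -(-runs // (buffers - 1))      # each merge pass combines B-1 runs at a time
        passes += 1
    return passes, 2 * n_pages * passes       # every pass reads and writes all N pages

passes, total_io = external_sort_cost(108, 5)
```

For 108 pages and 5 buffers the run counts go 22, 6, 2, 1, so the sort takes 4 passes and 864 page I/Os, which matches 1 + ceil(log_4 22) = 4.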

T6 · Query execution and join algorithms

Ramakrishnan & Gehrke (join algorithms)

The exam gold. Join types (theta, equi, natural, outer); simple, page and block nested-loops join; index nested-loops join; sort-merge join; grace/hash join at 3*(b_R+b_S); the canonical Student-Enrolled cost chain; choosing the smaller relation as outer.

High exam weight

T7 · Query optimisation and cost estimation

Ramakrishnan & Gehrke (query optimisation)

Cost-based optimisation; RA equivalence rules (cascade, commute, push selections and projections down); selectivity = 1/V on V distinct values; left-deep join trees and System-R dynamic programming; statistics and the uniform-distribution assumption; the Car/Trip selectivity chain. Assignment 1 due.

High exam weight
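
The selectivity = 1/V estimate turns into a result-size estimate by multiplication. The sketch below assumes uniform value distributions and, for a conjunction of predicates, independence between the attributes (the usual textbook simplification). The function names and statistics are hypothetical and are not the unit's Car/Trip figures.

```python
def equality_selectivity(distinct_values):
    """Estimated fraction of tuples matching attribute = constant."""
    return 1 / distinct_values

def estimated_rows(n_tuples, *distinct_counts):
    """Estimated result size after applying one equality predicate per attribute."""
    rows = n_tuples
    for distinct_values in distinct_counts:
        rows *= equality_selectivity(distinct_values)
    return rows

# Hypothetical table: 100,000 tuples, 20 distinct colours, 25 distinct years.
after_colour = estimated_rows(100_000, 20)
after_colour_and_year = estimated_rows(100_000, 20, 25)
```

The same chain drives plan comparison: the predicate with the most distinct values shrinks the intermediate result fastest, which is the argument for pushing it down first.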

T8 · Distributed data management: partitioning and replication

Lecture notes on distributed data management

Parallel architectures (shared-memory, shared-disk, shared-nothing); speed-up versus scale-up; the CAP theorem; horizontal (sharding) versus vertical partitioning; round-robin, hash and range placement; co-partitioning on the join key; synchronous versus asynchronous replication; cluster MTBF scaling.

Lower exam weight
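
The three placement schemes and the effect of co-partitioning can be sketched in a few lines. This is a toy model under stated assumptions: integer keys, `key % n_nodes` as a stand-in hash function, and made-up table contents and function names.

```python
from bisect import bisect_right

def round_robin_node(row_number, n_nodes):
    """Round-robin placement: rows go to nodes in turn, ignoring their values."""
    return row_number % n_nodes

def hash_node(key, n_nodes):
    """Hash placement (stand-in hash for integer keys)."""
    return key % n_nodes

def range_node(key, split_points):
    """Range placement: node i holds keys below split_points[i] and at or above the previous split."""
    return bisect_right(split_points, key)

def hash_partition(rows, key, n_nodes):
    """Distribute rows over n_nodes by hashing the given attribute."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        nodes[hash_node(row[key], n_nodes)].append(row)
    return nodes

# Co-partitioning: both tables are hashed on the join key with the same function.
trips = [{"trip_id": t, "region_id": t % 5} for t in range(10)]
regions = [{"region_id": r} for r in range(5)]
trip_nodes = hash_partition(trips, "region_id", 3)
region_nodes = hash_partition(regions, "region_id", 3)
```

Because both tables use the same function on the join key, every trip sits on the node that holds its matching region, so the join can run locally on each node with no data movement.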

T9 · Distributed query and data processing; MapReduce

Lecture notes; Apache Spark / MapReduce material

Distributed join approaches (local reference, broadcast, distributed-shuffle, fragment-and-replicate); distributed parallel hash join; the MapReduce model (map, shuffle and sort, reduce) and WordCount; data locality and shared-nothing execution; the 9-node R-join-S cost comparison.

Lower exam weight
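
The map, shuffle-and-sort, reduce structure of WordCount can be imitated in plain Python. This is a single-process sketch of the model only, not Hadoop or Spark code; the function names and the sample documents are made up.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_and_sort(pairs):
    """Shuffle and sort: group all values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)

def word_count(documents):
    pairs = [pair for document in documents for pair in map_phase(document)]
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(pairs))

counts = word_count(["the quick brown fox", "the lazy dog", "the fox"])
```

On a real cluster the map calls run where the input blocks are stored, and the shuffle is the only step that moves data between nodes.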

W10

T10 · Distributed dataflow platforms: Spark and Flink

Apache Spark / Databricks documentation

RDD (immutable, partitioned, in-memory, fault-tolerant via lineage); lazy transformations versus actions and the DAG; the DataFrame API and .explain(); Spark join strategies (broadcast, shuffle sort-merge, shuffle hash); the Catalyst optimiser and Adaptive Query Execution; HDFS with NameNode and DataNodes, 64 MB blocks and 3x replication.

Lower exam weight

W11

T11 · Data stream management systems

Lecture notes on data stream processing

Stream versus batch processing; Spark Streaming; the pipelined dataflow runtime (Flink) for streams; continuous, unbounded data and the motivation for windowing.

Lower exam weight

W12

T12 · NoSQL stores

Lecture notes on NoSQL stores

NoSQL as non-relational or Not-Only-SQL; the drivers of scalability and schema flexibility; families (key-value, wide-column, document, graph); decoupled lakehouse architecture; MongoDB sharding and asynchronous primary-copy replication. Assignment 2 due Week 13.

Lower exam weight

W13

T13 · Unit review and Assignment 2 due

Revision across all lectures

Synthesis across storage, indexing, query processing, optimisation and the distributed/Spark half; exam-style cost-model revision. Assignment 2 due.

Lower exam weight

How it's assessed

Assessment structure

Component	Weight	Format & timing
Weekly homework quizzes	6%	Individual Canvas quizzes on database-engine concepts; once started, a single timed submission with no resubmission. Weekly from Week 2. Closed-knowledge concept tests; together with the exam, about 66% of the unit is individually examined under time pressure.
Weekly participation	5%	Submit a completed (unmarked) worksheet, or take part in the Week 12 or 13 tutorial peer activity. Mere attendance does not count. Marked as a percentage of the possible weekly participations. Weekly from Week 2.
Practical Assignment 1: Database Programming and Testing	12%	Groupwork in threes: extend the SimpleDB educational engine in Python by implementing one chosen component such as a buffer replacer, external sort, a join algorithm, or an index structure. Due Week 7.
Practical Assignment 2: Complex Querying and Performance Tuning	12%	Groupwork in threes: complex querying and performance tuning in SQL and Apache Spark (PostgreSQL versus Databricks/Spark) with query-plan analysis and a report. Due Week 13.
Presentation of DB Concepts	5%	Groupwork in threes: a roughly 6-minute video explaining a data-processing concept, marked by tutors and peers. Generative AI is allowed for research and design but not to generate the video or audio. Across multiple weeks.
Final Exam	60%	Paper-based, 2 hours, semi-open book; mostly short answer with possibly some MCQ; does not assess coding. Covers lectures, labs, homework and assignments. Formal exam period. Hurdle task: at least 40% required on this exam.

You must achieve at least 40% in the final exam AND an overall final mark of 50 or more. A student who misses either may have their final mark capped at 45 regardless of their average. The final exam is flagged as a hurdle task.
Paper-based, 2 hours, semi-open book, mostly short answer with possibly some MCQ, testing hand-worked engine internals and cost models rather than coding.
Calculator policy: Not specified in the available course pages; confirm with the unit coordinator.
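
The weighting and the hurdle rule combine as in the sketch below. It assumes each component is scored as a percentage, that the cap is always applied when a hurdle is missed (the rule above says "may"), and that no rounding happens before the comparison; the function name and sample scores are hypothetical.

```python
WEIGHTS = {
    "quizzes": 0.06,
    "participation": 0.05,
    "assignment_1": 0.12,
    "assignment_2": 0.12,
    "presentation": 0.05,
    "final_exam": 0.60,
}

def final_mark(scores):
    """scores maps each component name to a percentage between 0 and 100."""
    weighted = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    passed_hurdles = scores["final_exam"] >= 40 and weighted >= 50
    # Assumption: a missed hurdle always caps the mark at 45.
    return weighted if passed_hurdles else min(weighted, 45)

# Perfect coursework but 35% on the exam: the weighted average is about 61.
perfect_coursework_weak_exam = {name: 100 for name in WEIGHTS}
perfect_coursework_weak_exam["final_exam"] = 35
capped = final_mark(perfect_coursework_weak_exam)
```

In this example full marks on every non-exam component still end at 45, because the exam score is under the 40% hurdle.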

If you read nothing else

This is an exam-heavy unit. With the final exam alone worth 60% of the grade, your result is overwhelmingly decided by how well you perform under time pressure. Hurdle task: at least 40% required on this exam.

Final exam timing: Formal examination period, Semester 1 2026 (exact date set by the USyd exams timetable). Confirm the exact date and venue on the official exam timetable.

How to actually pass it

A weekly rhythm, two checklists, and the traps to avoid

The unit rewards consistency over cramming, and practice over re-reading. Here is the loop that works, then what to have nailed by mid-semester and before the final exam.

The weekly loop

During the lecture

Attend or watch the 2-hour Tuesday lecture and note the week's cost formula or mechanic (buffer trace, B+-tree rule, join formula, selectivity).

Same week, by hand

Do the lab worksheet by hand (not just reading the solution) and submit it for the 5% participation while the cost method is fresh.

Same week, timed

Sit the Canvas quiz seriously as timed exam rehearsal for the 6%.

End of each week

Add the week's cost formula plus one worked example to a running master one-pager, and re-do one earlier topic's calculation so the mechanics do not decay.

Checklist: before mid-semester

Do each weekly worksheet by hand and submit it for the 5% participation
Sit each weekly quiz as timed exam rehearsal (6%)
After each lecture add its cost formula and one worked example to a master one-pager
Lock in your SimpleDB component for Assignment 1 (Week 7) well before the deadline
Drill the single-node mechanics early: B+-tree height and insertion, extendible-hashing splits, external-sort passes, and the storage cost table

Checklist: before the final exam (heaviest topics)

Re-derive every join cost formula from memory and know which join works for equi versus non-equi conditions
Drill B+-tree insertion and deletion and height estimation; extendible-hashing splits and the directory-doubling rule; LRU, CLOCK and GCLOCK traces; external-sort passes and total I/O
Work a Car/Trip-style optimisation chain: selectivity = 1/V, push selections down, pick the index pair, compare plan costs
Practise distributed joins: compute per-node cost for local-reference, broadcast, shuffle-hash and fragment-and-replicate and compare to centralised
Confirm the exact semi-open-book permitted materials with the coordinator and prepare your allowed notes accordingly
Remember the hurdle: target well above 40% on the exam and 50% overall, since failing either caps you at 45

The mistakes that cost marks

Counting on your code to carry you. The 60% exam does not assess your code. Strong assignment marks will not save you if you cannot hand-compute cost models. Skew revision to cost estimation, join I/O counting, indexing maths, external-sort passes and query-optimisation rewriting.

Assuming the exam is fully open book. It is described as semi-open book, but the available course pages do not specify which materials are permitted. Do not assume you can bring everything; confirm the allowed materials with the coordinator first.

Using the wrong block size in BNLJ. This course's worksheets use M-2 (one input page and one output page reserved), not M-1. Mixing them up changes every block nested-loops answer.

Forgetting to put the smaller relation as the outer. In nested-loops joins the outer relation should be the smaller one. Forgetting this gives a much larger and wrong cost.

Trusting the published week numbering. The Unit Schedule page's week numbers can disagree with the quiz and worksheet ordering. Follow the lecture and worksheet sequence: sorting, then joins, then optimisation, then distributed.

Leaving the cost models until swotvac. They compound. Joins reuse sorting, distributed joins reuse single-node join cost, so they cannot be crammed cold. Build the master one-pager from Week 2.

Teaching team

Who teaches DATA3404

The bios below are factual. The star ratings are not ours: they are impressions from students who have taken the unit, so you can hear from people who sat in the lectures.

Unit coordinator and lecturer

Uwe Roehm

Associate Professor in the School of Computer Science. Research in database systems and data management, scalable data science and big data, cloud computing, machine learning with database systems, data management on multicore hardware, and replication and caching. Organises the School's Database Research Group.

Student rating: no student ratings yet

Teaching team as listed in the unit materials reviewed. AskSia does not rate lecturers; star ratings are submitted by students who have taken DATA3404.

Formula & concept sheet

The vocabulary and formulas you must own

Slotted page: A page layout for variable-length records: a header plus a slot directory of (offset, length) entries growing from one end and records from the other. A record can move within the page without changing its TID.
TID (Tuple ID): A row identifier of the form (page id, slot number); the slot-directory indirection lets a record move within its page without the TID changing.
Buffer manager: The component that brings disk pages into memory frames, tracks a pin count and dirty bit per frame, and evicts unpinned frames by a replacement policy (writing back dirty pages first).
CLOCK (second chance): A buffer-replacement policy using a circular frame list and a reference bit: on eviction a set bit is cleared and the hand advances; a clear bit, if unpinned, is evicted.
B+-tree: A dynamic, balanced, multi-level index with entries only in the leaves and sibling pointers between leaves for range scans; search cost is about height + 1.
Clustered index: An index whose data records are stored in the same order as the index entries; at most one per table. It makes range scans cheap because matches sit on consecutive pages.
Extendible hashing: A dynamic hash index using a directory of size 2^(global depth); a bucket overflow splits the bucket and doubles the directory only when the bucket's local depth equals the global depth.

External merge sort: Sorting data larger than memory: a pass-0 sort phase producing ceil(N/B) runs, then merge passes combining B-1 runs at a time.
Selectivity: The fraction of input tuples a predicate keeps; for an equality on an attribute with V distinct values it is estimated as 1/V, assuming a uniform distribution.
Left-deep join tree: A join order in which the right input of every join is a base relation, enabling fully pipelined plans; System-R's dynamic-programming optimiser considers only these.
Shared-nothing: A parallel architecture where each node has its own CPU, memory and disk and communicates only by network; it scales to thousands of nodes and is the standard in practice for big data.
CAP theorem: In a distributed data system you can guarantee at most two of Consistency, Availability and Partition tolerance.
RDD (Resilient Distributed Dataset): Spark's immutable, partitioned, in-memory collection that is fault-tolerant via lineage: a lost partition is rebuilt from its derivation.
Lazy evaluation: In Spark, transformations only record a DAG of operations; nothing executes until an action is called.
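
The CLOCK entry above can be turned into a trace you can check by hand. The sketch below makes several assumptions that the worksheets may not share: every frame is unpinned, a page's reference bit is set to 1 both when it is loaded and when it is hit, and the hand advances past a frame after filling it. The function name and the request sequence are made up.

```python
def clock_trace(requests, n_frames):
    """Return (page_faults, final_frames) for the CLOCK policy on a request sequence."""
    frames = [None] * n_frames   # page held by each frame
    ref = [0] * n_frames         # reference bit per frame
    hand = 0
    faults = 0
    for page in requests:
        if page in frames:
            ref[frames.index(page)] = 1       # hit: give the page a second chance
            continue
        faults += 1
        # Sweep: clear set bits until a frame that is empty or has a clear bit is found.
        while frames[hand] is not None and ref[hand] == 1:
            ref[hand] = 0
            hand = (hand + 1) % n_frames
        frames[hand] = page                    # evict (or fill) and load the new page
        ref[hand] = 1
        hand = (hand + 1) % n_frames
    return faults, frames

faults, frames = clock_trace([1, 2, 3, 1, 4, 2, 5], n_frames=3)
```

In this trace page 2 is hit just before page 5 arrives, so its set reference bit saves it and page 3 is evicted instead.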

Key formulas:
B+-tree search cost = ceil(log_F B) + 1 = index height + 1 (F is the fan-out and B the number of leaf pages)
External sort passes = 1 + ceil(log_(B-1) ceil(N/B)); total I/O = 2N * passes (N data pages, B buffer pages)
Simple nested-loops join = b_R + |R| * b_S (smaller relation as outer)
Page nested-loops join = b_R + b_R * b_S
Block nested-loops join = b_R + ceil(b_R/(M-2)) * b_S
Index nested-loops join = b_R + |R| * c, with c = 1 + h + matches
Sort-merge join = b_R + b_S (sorted), else sort(R) + sort(S) + b_R + b_S
Grace/hash join = 3 * (b_R + b_S)
Equality selectivity on V distinct values = 1/V; matches = |R|/V
Cluster MTBF for k identical nodes = MTBF_1 / k

What students say

What students actually say about DATA3404

Recurring themes from student reviews, paraphrased in our own words.

On difficulty

No first-hand student difficulty rating is currently available for this unit, so the difficulty read is editorial: a 60% semi-open-book final, a double hurdle, and hand-worked cost models place it in the harder quant/CS cluster

How students revise

A substantial library of student-shared materials exists (lecture notes, topic summaries, tutorial and worksheet solutions, and final-exam study notes), pointing to heavy demand for revision support in this unit
Organised, cost-model-focused revision notes are what students reach for, with at least one widely viewed compiled note set described by its author as having achieved a high distinction

Before the exams

The most-shared materials cluster around the exam-heavy mechanics: buffer management and page structures, the storage layer, query processing, and distributed data management

Recurring student opinions, paraphrased and aggregated, not official course information.

Set texts

The prescribed reading

The syllabus references map straight onto these.

Core reference paper

Architecture of a Database System

Hellerstein, J. M., Stonebraker, M. & Hamilton, J. (2007), Foundations and Trends in Databases.

Recommended

Database Management Systems (3rd ed.)

Ramakrishnan, R. & Gehrke, J. (2003), McGraw-Hill.

Recommended

Database Internals

Petrov, A. (2019), O'Reilly.

Recommended

Database Systems: The Complete Book (2nd ed.)

Garcia-Molina, H., Ullman, J. D. & Widom, J.

Where it fits

Prerequisites, related units & why it matters

Formal prerequisite: DATA2001 or DATA2901 or ISYS2120 or INFO2120 or INFO2820. The unit assumes you already know SQL, schema design, relational algebra, and can program in Python or Java. It is prohibited against INFO3504 and INFO3404.

QBUS5001 · Foundations of Data Analytics for Business · Adjacent data unit (not same discipline)
BUSS6002 · Data Science in Business · Adjacent data unit (not same discipline)
BUSS1020 · Quantitative Business Analysis · Adjacent quantitative unit (not same discipline)
Explore · All Information Technology units · USyd discipline hub

Why it matters beyond the grade. The cost-model fluency this unit builds (join I/O, index access, external sort, Spark execution plans) is exactly what data-engineering and backend-performance roles screen for. It turns a SQL user into someone who understands why a query is slow and how to engineer data systems at scale.

Your DATA3404 study toolkit

Study the unit with Sia, not just read about it

Each tool already knows DATA3404: your syllabus, your texts, and where the marks are. Grouped by how you study, from first contact to exam week.

1 · Learn it: understand the material

💬 AI tutor: Ask anything about DATA3404 and get step-by-step answers.
📤 Explain my notes: Upload your slides or lecture and Sia breaks them down.
📑 Topic summariser: Condense a week into the essentials you actually need.

2 · Practise it: test yourself

📝 Practice quiz generator: Unlimited exam-style MCQs and short-answer on any topic.
📊 Past paper analysis: See what the exams actually test, topic by topic.
✓ Assignment & problem help: Work through problem sets one step at a time.

3 · Revise & cram: lock it in before the exam

🃏 Flashcards: Key concepts and formulas as a spaced-repetition deck.
📋 Cheatsheet maker (new): Auto-build a one-page exam cheatsheet for the unit.
🧠 Mindmap generator: See how the topics connect on one visual map.

4 · Discuss it: compare notes

👥 Community Q&A: Ask other DATA3404 students and share what worked.

FAQ

Frequently asked questions

How is DATA3404 assessed?

A 60% paper-based, 2-hour semi-open-book final exam, plus two 12% group practical assignments (a SimpleDB engine extension due Week 7 and SQL plus Apache Spark performance tuning due Week 13), a 5% group video presentation, 6% weekly Canvas quizzes, and 5% weekly participation.

What is the hurdle to pass DATA3404?

You must score at least 40% on the final exam AND at least 50% overall. If you miss either, you may be capped at a maximum final mark of 45 regardless of your average. The exam is officially flagged a hurdle task.

Is the DATA3404 exam open book?

It is described as semi-open book and is paper-based, 2 hours, mostly short answer. However, the available course pages do not specify which materials are permitted, so confirm the exact allowed materials with the unit coordinator before relying on bringing notes.

Does the DATA3404 exam test coding?

No. The final exam explicitly does not assess coding. Coding is assessed in the two group practical assignments. The exam tests hand-worked engine internals: cost models, indexing, sorting and join calculations, and query-optimisation reasoning.

What are the most important things to drill for the DATA3404 exam?

The join I/O cost formulas (nested-loops, block nested-loops, index nested-loops, sort-merge and hash), external-merge-sort pass counting, B+-tree mechanics, extendible-hashing splits and directory doubling, buffer-replacement traces (LRU, CLOCK and GCLOCK), and selectivity-based query-optimisation costing.

What background do I need for DATA3404?

The formal prerequisite is DATA2001 or DATA2901 or ISYS2120 or INFO2120 or INFO2820. The unit assumes you already know SQL, schema design and relational algebra, and can program in Python or Java. It is prohibited against INFO3504 and INFO3404.

Is there a required textbook for DATA3404?

There is no single required textbook. The course leans on the paper Architecture of a Database System (Hellerstein, Stonebraker and Hamilton, 2007) and recommends references including Ramakrishnan and Gehrke's Database Management Systems and Petrov's Database Internals.

Study DATA3404 with Sia

Work through storage engines, tree-based indexing (B+-trees), hash-based and bitmap indexing, and the rest of the unit with a tutor that knows it and quizzes you on the topics the assessments weight most heavily.

Start studying with Sia