University of Sydney · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

DATA3404 · Scalable Data Management

- one subject, every graph, every model, every mark

50% final exam · hurdle14 Chapters8-page Bible

Our own words - no uploaded lecturer files

Built to mirror S1 2026 · updated this semester

Chapter 1 of 7 · DATA3404

Data Storage and Files

Everything DATA3404 charges you for is measured in page I/Os — the number of fixed-size pages moved between disk and memory. This chapter builds that cost vocabulary from the ground up: how a record is laid out inside a slotted page, which file organisation (heap, sorted, or hashed) makes a given operation cheap, how a row-store and a column-store differ for analytical workloads, and how the buffer manager decides which page to evict when memory is full. Get the counting right here and the indexing, sorting, join and optimisation chapters are just the same arithmetic at scale. The exam tasks are concrete: trace a buffer-replacement policy over a reference string and count the misses, compare heap-vs-sorted access cost, explain the slotted-page and record formats, and decide row-store vs column-store for a workload.

In this chapter

What this chapter covers

011.1 The storage hierarchy — why I/O is the only cost that matters
02Records & the slotted-page layout (fixed vs variable length, NULL bitmap, TID)
03File organisations: heap vs sorted vs hashed — the access-cost table
04Row-store vs column-store for OLTP vs OLAP workloads
05The buffer pool and replacement policies (LRU, CLOCK / GCLOCK)

Worked example · free

Worked example: heap vs sorted file — equality search cost

Q [4 marks]. A relation occupies B = 1,000 data pages. (a) What is the cost of an equality search on a non-key attribute in a heap file? (b) What is the cost of an equality search on the sort key of a sorted file? (c) Why is an insert cheaper in the heap file?

+1(a) Heap, equality search: records are unordered, so in the worst case you must scan every page — cost = B = 1,000 page I/Os (on average B/2 if you can stop at the first match, but the exam usually counts the full scan for a non-key attribute).
+1(b) Sorted file, search on the sort key: binary search reaches the page in ⌈log₂ B⌉ = ⌈log₂ 1,000⌉ = 10 page I/Os.
+1Compare: 10 vs 1,000 — ordering on the search key turns an O(B) scan into an O(log₂ B) descent, the same lever an index pulls.
+1(c) Insert: a heap insert just appends to the last page (cost ~2 I/O: read + write), whereas a sorted file must find the position and shift records to keep the order — so the heap trades cheap inserts for expensive searches.

Heap equality search ≈ B = 1,000 I/Os; sorted-file search on the sort key = ⌈log₂ 1,000⌉ = 10 I/Os; the heap insert is cheap because it appends rather than maintaining order.

Sia tip — Always say which attribute the search is on: a sorted file only helps when you search (or range-scan) on its sort key — a query on any other attribute is back to a full scan.

Glossary

Key terms

Page (block): The fixed-size unit of transfer between disk and memory — typically a few kilobytes. All DATA3404 cost is counted in pages read or written, never in bytes or rows.
Slotted page: The standard page layout for variable-length records: a slot directory at the end of the page points to each record's offset, so records can grow, shrink or be deleted without rewriting the whole page and a record is addressed by a TID (page id + slot number).
Heap file: An unordered file of records. Inserts are cheap (append to the last page) but an equality or range search must scan every page — cost O(B) in pages.
Buffer pool: The region of memory holding pages read from disk. When it is full the buffer manager must evict a page using a replacement policy; a page marked dirty must be written back, costing an extra I/O.
Column store: A layout that stores each column contiguously rather than each row. Analytical (OLAP) queries that touch few columns read far less data and compress better; row stores still win for transactional (OLTP) point reads and writes of whole rows.

FAQ

Data Storage and Files FAQ

Why does DATA3404 only count page I/Os and ignore CPU and the output?

Because disk transfer dominates the cost of a data-management query by orders of magnitude, so a simplified model that counts only pages read and written from disk captures almost all of the real cost and makes the arithmetic tractable in an exam. The unit's cost model explicitly ignores CPU time, the difference between sequential and random access, and the cost of writing the final output — so never add output pages to a join or sort cost.

What is the difference between a heap, a sorted and a hashed file?

A heap file keeps records in no particular order — cheap inserts, expensive searches. A sorted file keeps them ordered on a sort key — a binary search on that key is O(log B), but inserts must maintain the order. A hashed file places each record in a bucket by a hash of the search key — near O(1) for equality on that key, but useless for range queries. The right choice depends on the workload's dominant operation.

When does a column store beat a row store?

On analytical (OLAP) workloads: queries that scan many rows but only a few columns (aggregations, reports). A column store reads only the columns it needs and compresses each column well, so it moves far fewer pages. A row store wins for transactional (OLTP) workloads — point lookups and inserts/updates of whole rows — because the entire row is on one page.

How do I trace a CLOCK / LRU replacement policy in the exam?

Walk the page-reference string one access at a time, marking hits (page already in a frame) and misses (page must be loaded, evicting one if the pool is full). LRU evicts the least-recently-used frame; CLOCK approximates LRU with a reference bit and a moving hand, giving it a second chance before eviction. Count the misses — that count is the page-I/O cost, and the policy that produces fewer misses wins for that string.

Study strategy

Exam move

Anchor everything to the single currency of the unit: page I/Os. For file organisations, memorise the small access-cost table (scan / equality / range / insert / delete for heap vs sorted vs hashed) so you can read off the cheapest organisation for a stated workload. For the buffer pool, practise tracing LRU and CLOCK over a reference string until counting misses is mechanical, and remember a dirty-page eviction costs an extra write. For storage layout, be able to draw a slotted page and say why variable-length records and the NULL bitmap need it, and state the one rule for row vs column: few-columns-many-rows analytics → column store; whole-row transactions → row store. This chapter sets the cost vocabulary every later chapter reuses, so make the counting automatic here.

Keep going — explore the course

A+Everything unlocked

Unlocks this Bible + all 8 of your University of Sydney subjects - and 1,000+ Bibles across every Australian university.

Sia - your DATA3404 tutor, unlimited, worked the way the exam marks it

The full 8-page Bible + practice bank with worked solutions

Chrome extension - sync your LMS so Sia knows your deadlines

Bilingual EN / Chinese on every Bible and every Sia answer

$25/ month

30-day money-back · cancel in one tap · how it works