DATA3404 · Scalable Data Management
Data Storage and Files
Everything DATA3404 charges you for is measured in page I/Os — the number of fixed-size pages moved between disk and memory. This chapter builds that cost vocabulary from the ground up: how a record is laid out inside a slotted page, which file organisation (heap, sorted, or hashed) makes a given operation cheap, how a row-store and a column-store differ for analytical workloads, and how the buffer manager decides which page to evict when memory is full. Get the counting right here and the indexing, sorting, join and optimisation chapters are just the same arithmetic at scale. The exam tasks are concrete: trace a buffer-replacement policy over a reference string and count the misses, compare heap-vs-sorted access cost, explain the slotted-page and record formats, and decide row-store vs column-store for a workload.
What this chapter covers
- 011.1 The storage hierarchy — why I/O is the only cost that matters
- 02Records & the slotted-page layout (fixed vs variable length, NULL bitmap, TID)
- 03File organisations: heap vs sorted vs hashed — the access-cost table
- 04Row-store vs column-store for OLTP vs OLAP workloads
- 05The buffer pool and replacement policies (LRU, CLOCK / GCLOCK)
Worked example: heap vs sorted file — equality search cost
- +1(a) Heap, equality search: records are unordered, so in the worst case you must scan every page — cost = B = 1,000 page I/Os (on average B/2 if you can stop at the first match, but the exam usually counts the full scan for a non-key attribute).
- +1(b) Sorted file, search on the sort key: binary search reaches the page in ⌈log2 B⌉ = ⌈log2 1,000⌉ = 10 page I/Os.
- +1Compare: 10 vs 1,000 — ordering on the search key turns an O(B) scan into an O(log₂ B) descent, the same lever an index pulls.
- +1(c) Insert: a heap insert just appends to the last page (cost ~2 I/O: read + write), whereas a sorted file must find the position and shift records to keep the order — so the heap trades cheap inserts for expensive searches.
Key terms
- Page (block)
- The fixed-size unit of transfer between disk and memory — typically a few kilobytes. All DATA3404 cost is counted in pages read or written, never in bytes or rows.
- Slotted page
- The standard page layout for variable-length records: a slot directory at the end of the page points to each record's offset, so records can grow, shrink or be deleted without rewriting the whole page and a record is addressed by a TID (page id + slot number).
- Heap file
- An unordered file of records. Inserts are cheap (append to the last page) but an equality or range search must scan every page — cost O(B) in pages.
- Buffer pool
- The region of memory holding pages read from disk. When it is full the buffer manager must evict a page using a replacement policy; a page marked dirty must be written back, costing an extra I/O.
- Column store
- A layout that stores each column contiguously rather than each row. Analytical (OLAP) queries that touch few columns read far less data and compress better; row stores still win for transactional (OLTP) point reads and writes of whole rows.
Data Storage and Files FAQ
Why does DATA3404 only count page I/Os and ignore CPU and the output?
Because disk transfer dominates the cost of a data-management query by orders of magnitude, so a simplified model that counts only pages read and written from disk captures almost all of the real cost and makes the arithmetic tractable in an exam. The unit's cost model explicitly ignores CPU time, the difference between sequential and random access, and the cost of writing the final output — so never add output pages to a join or sort cost.
What is the difference between a heap, a sorted and a hashed file?
A heap file keeps records in no particular order — cheap inserts, expensive searches. A sorted file keeps them ordered on a sort key — a binary search on that key is O(log B), but inserts must maintain the order. A hashed file places each record in a bucket by a hash of the search key — near O(1) for equality on that key, but useless for range queries. The right choice depends on the workload's dominant operation.
When does a column store beat a row store?
On analytical (OLAP) workloads: queries that scan many rows but only a few columns (aggregations, reports). A column store reads only the columns it needs and compresses each column well, so it moves far fewer pages. A row store wins for transactional (OLTP) workloads — point lookups and inserts/updates of whole rows — because the entire row is on one page.
How do I trace a CLOCK / LRU replacement policy in the exam?
Walk the page-reference string one access at a time, marking hits (page already in a frame) and misses (page must be loaded, evicting one if the pool is full). LRU evicts the least-recently-used frame; CLOCK approximates LRU with a reference bit and a moving hand, giving it a second chance before eviction. Count the misses — that count is the page-I/O cost, and the policy that produces fewer misses wins for that string.
Exam move
Anchor everything to the single currency of the unit: page I/Os. For file organisations, memorise the small access-cost table (scan / equality / range / insert / delete for heap vs sorted vs hashed) so you can read off the cheapest organisation for a stated workload. For the buffer pool, practise tracing LRU and CLOCK over a reference string until counting misses is mechanical, and remember a dirty-page eviction costs an extra write. For storage layout, be able to draw a slotted page and say why variable-length records and the NULL bitmap need it, and state the one rule for row vs column: few-columns-many-rows analytics → column store; whole-row transactions → row store. This chapter sets the cost vocabulary every later chapter reuses, so make the counting automatic here.