Scalable Data Management
Sem 1 2026 · Side 1 of 2
Whole-unit revision · all topics
0 · Exam Blueprintread first
DATA3404 is about database-engine internals + scaling out — "learn the principles, not the software." The graded weight: Final 60% (paper-based, 2h, semi-open book[confirm permitted materials]) + 2×12% group practicals (SimpleDB engine; SQL + Spark) + 6% quizzes + 5% participation + 5% presentation.
Double hurdle: you must score ≥40% on the exam AND ≥50% overall — fail either and the final mark is capped at 45.
The exam costs engine internals, NOT SQL: it is mostly short-answer (some MCQ) and explicitly does not assess your code. Marks live in hand-worked cost models: buffer/CLOCK, B+-tree & hash math, external-sort passes, the join cost formulas, and query-optimisation costing.
1 · Storage & FilesLec 02
DB file = a sequence of fixed-size pages; each page holds many records, aligned to block-oriented disk. The access gap: disk latency ≫ memory ⇒ minimise page I/O.
File organisations
- Heap — record placed anywhere with free space; best when access is a full scan.
- Sorted — ordered by search key; binary search, good for range/ordered retrieval; insert shifts records (expensive).
- Index — tree/hash access path; faster updates than a sorted file.
Access cost (B = #pages)
| Op | Heap | Sorted |
|---|---|---|
| Scan all | O(B) | O(B) |
| Equality | B/2 avg | log₂B |
| Insert | 2 (rd+wr last) | log₂B + B |
| Delete | B/2 + 1 | log₂B + 1 |
Query types: exact-match / point (key ⇒ 0–1 row) vs range (BETWEEN, LIKE 'x%' ⇒ many).
Raw flat files (CSV/JSON) = a linear heap: full scan to find a row, inserts shift everything, no index. Hence pages + indexes. Sorted files are rarely used directly — instead use an index-organised (clustered) file.
1b · Records & Slotted Pagevar-length
Variable-length record best format = an array of field offsets in the header (direct i-th field access, cheap NULLs). NULL-bitmap: 1 bit/field, 1 byte covers 8.
Slotted page = Page Header + Slot Directory (grows one way) + Records (grow the other). Slot entry = (offset, length). TID = ⟨page id, slot #⟩; a record can move within a page without changing its TID (the slot still points to it).
Big values: overflow or out-of-line storage (PostgreSQL TOAST: ~2 KiB chunks, varlena pointer, compressed). A record must fit a page (≈ 8 KiB − header).
Fixed-length record = array of same-size records (struggles with var strings / NULLs); variable-length uses delimiters, length-prefixes, or the offset array.
PostgreSQL row = RowHeader + NULL-bitmap (⌈(#cols+7)/8⌉ B, 0 if all NOT NULL) + RowData (fixed then var cols). Page slot = 4 B ItemId (offset+length).
2 · Row vs Column StoreOLTP vs OLAP
| Row store | Column store | |
|---|---|---|
| Layout | all attrs together | one file/attr |
| Good for | OLTP, full-row | OLAP, few cols |
| I/O | reads whole row | reads only needed cols |
| Compress | mixed types | better (uniform) |
| Update | cheap | expensive |
PAX (Partition Attributes Across) = hybrid: rows in a page, page split into n minipages, one per attr ⇒ better cache locality. Parquet = open column format (nested data, à la Dremel). Wide-column (BigTable, HBase) ≠ column DBMS: column families stored row-wise, for sparse data.
Column DBMS: MonetDB, Vertica, SAP HANA, Druid. Column-store cons: updates & full-row reads are expensive; bad for small tables.
3 · Buffer ManagementLec 02
Buffer manager loads pages into frames. On a request: in buffer? return it; else pick an unpinned frame, if dirty write it back, then load. Pin count > 0 ⇒ not evictable; requestor must unpin & flag modification. Dirty pages written deferred by a background writer.
Read efficiency (hit ratio)(req − physical I/O) / req · aim ≥ 80%
Why DBMS not OS: needs to pin, force-to-disk for recovery, tune replacement, prefetch, avoid double-buffering.
3b · Replacement Policiesmemorise CLOCK
| Policy | Evicts |
|---|---|
| FIFO | oldest arrival (by age) |
| LRU | least-recently-used (locality) |
| MRU | most-recently-used |
| LFU | smallest use count |
| CLOCK | "second chance" (below) |
| GCLOCK | CLOCK + ref counter |
CLOCK = circular list + hand; each frame has a reference bit (set 1 on access). On eviction: if R=1 clear it to 0 and advance; if R=0 (unpinned) evict. GCLOCK uses a counter decremented each sweep, evict at 0 ⇒ protects hot pages. Most DBMS use a CLOCK variant.
3c · Trace Methodhow to score it
Tabulate the reference string left→right with one column per access; mark hit (in buffer) or miss (disk I/O). On a miss with a full buffer apply the policy to choose the victim. Count misses = disk I/Os (cold-start misses included).
FIFO ignores re-use (by age); LRU/MRU reorder a recency queue on every access; LFU/GCLOCK track counts since load. GCLOCK typically yields fewer misses than CLOCK because its counter protects hot pages.
Classify policies by age (FIFO, since arrival) vs references (LRU/MRU/CLOCK by last reference; LFU/GCLOCK by all references).
4 · B+-TreeLec 03 · every year
Dynamic, balanced, multi-level. All root→leaf paths equal length; entries only in leaves; double-linked sibling pointers between leaves for range scans. Interior nodes = separators only.
Layout P₀ K₁ P₁ … Kₙ Pₙ: go P₀ if k<K₁; Pₙ if k≥Kₙ; else Pᵢ for Kᵢ≤k<Kᵢ₊₁. Each node (order d): leaf holds d–2d entries; non-root ≥ half full; fan-out F = avg children (often >100).
Search cost (one match)≈ ⌈log_F N⌉ + 1 (N = #leaf pages)
= index height + 1
In practice order 100, fill 66% ⇒ F≈133. Height 3 ⇒ 133³ ≈ 2.35M rows; height 4 ⇒ ≈313M. Height rarely > 3–4.
Insert
Find leaf; if room insert sorted; else split leaf (keep first ⌈n/2⌉, copy up new node's min key); if parent full split index node (remove middle kₘ, push it up). Root split ⇒ height +1.
Delete
Remove; if ≥ half full done; else redistribute (borrow from sibling) else merge + drop separator. Many systems skip rebalance on delete.
4b · Worked · Heightre-numbered
Table = 800,000 records, unclustered B+-tree order 50 (≤101 children, leaf ≤100 entries).
#leaf pages = 800,000 / 100 = 8,000
height = ⌈log₁₀₁ 8,000⌉ = 2 ⇒ 3 levels
search 1 match = h + 1 = 3 page reads
Top levels are usually cached (L1=1, L2=F, L3=F² pages) ⇒ root + interior rarely cost real I/O.
ISAM = the static cousin: sparse separators over clustered leaves, but inserts spawn overflow chains ⇒ degrades for dynamic data.
4c · Storage HierarchyLec 01
Registers → CPU cache → main memory → disk → tertiary. The access gap: secondary-storage latency ≫ memory latency, so data moves in page units.
Units trap: SI MB = 10⁶ B; IEC MiB = 1024² = 1,048,576 B; KiB=1024, GiB=1024³. OSes report binary but label "MB"; drive makers use decimal.
Prefetching reads several pages ahead on predicted sequential access (e.g. DB2 8/32 pages).
4d · Split Rules · Order 2leaf vs node
Order d=2 ⇒ node holds 2d=4 entries. Insert into a full leaf {40,40,42,45,50}:
Leaf split (copy up)keep first ⌈5/2⌉=3 = {40,40,42},
new leaf {45,50}, copy up 45 to parent
Index-node split (push up)root {18,40,45,73,91} full ⇒ keep {18,40},
new node {73,91}, push up middle 45 as new root ⇒ height +1
Key difference: leaf split copies the separator up (it still lives in a leaf); index split moves it up (removed from the node).
Range search: descend to the first matching leaf, then follow sibling pointers along the leaf level to the end of the range — no re-descending.
5 · Hash IndexingLec 04
Entries → buckets by h(v); address = h(v) mod M. Equality cost = #pages in bucket = 1 if no overflow ⇒ beats a B+-tree for equality. No range search.
Static hashing: fixed M, overflow chains grow long (like ISAM). Fixed by dynamic schemes.
Extendible hashing
Directory of size 2^(global depth); pick bucket by the last global-depth bits of h(r). Local depth = #bits deciding one bucket; global depth = max over all.
Insert: room ⇒ insert. Full ⇒ split bucket (rehash on 1 extra bit, both local depths +1).
Directory-doubling ruledouble IFF local depth = global depth
(before split). If local < global → just
re-point one directory entry, NO doubling.
Equality = 1 disk access if directory in memory (else 2). Delete: empty bucket merges with split image; halve directory if all entries equal their image.
5b · Worked · Extendible Buildre-numbered
Insert 4,6,10,14 then 22,34,38 with h(v)=v mod 8, bucket capacity 3, start 1 bucket.
- 4,6,10 fill bucket (global=local 0).
- Insert 14 ⇒ full & local=global ⇒ split + double directory to size 2, depth 1.
- Insert 22 ⇒ a bucket's local depth → 2 > global(1) ⇒ double again to size 4, depth 2.
- 34,38 then fit by re-pointing (local < global) — no doubling.
5c · Bitmap & BloomOLAP
Bitmap index: for field with m distinct values, m bit-vectors of length n (#rows); vector v has 1 where row=v. Answer queries with AND/OR/NOT/COUNT bit-ops. Space O(m·n) ⇒ compress (RLE) when m large. Best for few distinct values.
e.g. males with rating<4 = (R1 OR R2 OR R3) AND Gender_M, then COUNT the result vector.
Bloom filter = probabilistic membership: no false negatives, possible false positives ("definitely not" / "possibly in"). Element hashed to several bits in a small array.
5d · Hash vs B+-Treepick the index
| Need | Use |
|---|---|
| Equality only | hash (1 page/bucket) |
| Range / ordered | B+-tree (hash can't) |
| Prefix of composite | B+-tree |
| Group-by / sort | clustered B+-tree |
Extendible-hash equality = 1 disk access if the directory fits in memory (else 2). 1M rows / 4K-row pages ⇒ ~25,000 directory entries ⇒ fits.
Linear hashing = directory-free dynamic scheme (round-robin bucket splits); both fix static hashing's overflow chains.
Mini-trace (cap-2, h=v mod 8): 0,4 fill a bucket; insert 1 ⇒ local=global ⇒ split + double (g=1) ⇒ {0,4}/{1}; insert 3 after 5 ⇒ local=global=1 ⇒ double again (g=2) ⇒ {1,5}/{3}.
6 · Index Classification4 dimensions
| Dimension | Meaning |
|---|---|
| Primary/secondary | key sets file order (often integrated) vs not |
| Unique/non-unique | over a candidate key or not |
| Clustered/unclustered | data & index ordered same; ≤1 clustered/table |
| Dense/sparse | entry per record vs per page |
Rules: a main/integrated index is always clustered; unclustered must be dense; sparse needs a candidate-key search key. Search key ≠ key (search key need not be unique).
EGSR choice rule = Equality, GroupBy, Sort, Range. Equality attrs first (any type); range ⇒ tree index; group-by ⇒ index on those attrs; sort ⇒ clustered B-tree.
Composite (A1,A2) = lexical; supports prefix-key & index-only (covering) plans. Order: equality keys before range keys.
Worthwhile if lookup + retrieval < full-scan cost.
6b · Range-Scan Costfavourite
10,000 data pages, 100 rows in range, 20 rows/page:
| Access path | Page I/O |
|---|---|
| Heap | 10,000 (whole file) |
| Sorted on key | log₂10000 + 5 ≈ 19 |
| Unclustered idx | (h+1) + 100 (1 I/O / row) |
| Clustered idx | (h+1) + 5 (rows packed) |
The unclustered penalty = one random I/O per matching row; clustered packs matches on consecutive pages.
6c · Multi-Attr Index Matchprefix rule
Selection salary<50000 ∧ age=21 ∧ make='x'. Match rule: all predicate attrs must be a prefix of the key, AND any range (inequality) attr must be last.
| Index key | Matches? |
|---|---|
| (salary,age,make) | ✗ range not last |
| (age,salary,make) | ✗ range not last |
| (age,make,salary) | ✓ prefix + range last |
Hash index matches only on equality over every key attr; a tree index matches a prefix (e.g. <a,b,c> matches a=5 ∧ b>3, not b=3 alone).
6d · Worked · File Scanheap vs sorted
Person: 1,000,000 rows, 10/page ⇒ 100,000 pages; age uniform 0–99.
- age=30 on heap (1% match) ⇒ still scan all 100,000 (unsorted).
- Sorted on name ⇒ no help for an age predicate ⇒ 100,000.
- Sorted on age ⇒ binary search + clustered ⇒ ~1,000 pages (range [30,35) ⇒ ~5,000).
Lesson: an index/order only helps if it matches the query's predicate attribute — a name-sorted file is useless for an age query.
6e · Covering Indexindex-only
An index-only (covering) plan answers a query from the index alone, never fetching the heap tuple. e.g. (city,name) answers SELECT name WHERE city=:v. Clustering not needed for index-only plans — usually a tree index.
7 · External Merge SortLec 05
For ORDER BY, DISTINCT, sort-merge join, bulk B+-tree load. Sort N pages with B buffer frames:
Pass 0 (sort): read all N (B at a time), sort each chunk, write ⇒ ⌈N/B⌉ runs of B pages. Cost 2N.
Merge passes: merge B−1 runs at a time (1 output buffer); each pass reads+writes all N ⇒ 2N.
#passes & total I/O#passes = 1 + ⌈log_(B−1) ⌈N/B⌉⌉
Total I/O = 2N · (#passes)
Run length after merge pass i = (B−1)ⁱ·B. In practice rarely > 2–3 passes. Clustered B+-tree on the sort col ⇒ read leaves in order (good); unclustered ⇒ random I/O/record (bad).
7b · Worked · Sortcount the passes
N = 10,000 pages, B = 30:
Pass 0 runs = ⌈10000/30⌉ = 334
merge B−1 = 29 at a time
P1: ⌈334/29⌉ = 12 runs
P2: ⌈12/29⌉ = 1 ⇒ 2 merge passes
total passes = 1 + 2 = 3
Total I/O = 2·10000·3 = 60,000
8 · Query PipelineLec 05
SQL (declarative) → Parser → Optimizer → Executor. SQL → relational-algebra logical tree → physical plan (operators + algorithm). Many plans per SQL; optimiser picks one.
Materialization (set-at-a-time): write temp relation, next op reads it — always works, costly I/O. Pipelining (tuple-at-a-time): pass each row on — cheap, but the inner join input, sorts, hash joins & aggregations can't pipeline their input.
Reduction factor = #out / #in (= selectivity). Cost = #block transfers, output ignored.
8b · Relational Algebrathe operators
6 basic: ∪ union, − difference, σ select (rows), π project (cols), × cross-product, ρ rename. Derived: ∩, ⋈ join, ÷ division. Set ops need union-compatible schemas (same arity, names, domains).
Join R⋈_θ S = σ_θ(R×S); equi-join = only equalities; natural join = equi-join on all same-named attrs, keep one copy. Complex conditions → CNF to match access paths.
Access paths: table scan, scan+filter, index scan, index-only (covering) scan.
σ (selection) = subset of rows; π (projection) = subset of columns (strict RA removes duplicates ⇒ SQL DISTINCT). σ and π are usually fused into the access-path loop.
An expression tree shows logical RA operators; a query execution plan shows the physical operators + algorithms chosen.
Formula Beltside 1
heap eq = B/2 · sorted eq = log₂B
B+tree search = ⌈log_F N⌉+1 = h+1
ext-sort #passes = 1+⌈log_(B−1)⌈N/B⌉⌉
total I/O = 2N·#passes
ext-hash: double iff local=global depth
hit ratio = (req−I/O)/req ≥ 80%