DATA3404 · Scalable Data Management
Indexing
An index is an access path that finds the rows matching a search key without reading the whole table — the single biggest lever on query cost, turning an O(B) scan of every page into an O(logF B) descent of a few. This chapter covers what a data entry is, the dense-vs-sparse and clustered-vs-unclustered distinctions that decide range-scan cost, the B+-tree (the database's default index) with its fan-out, height and split rules, and extendible hashing with bucket splits and directory doubling. The exam lives here: it hands you a query plus some indexes and asks “what is the access cost in page I/Os?” You must compute a B+-tree's height and search cost, trace an insertion through a leaf split and an index-node split, trace extendible-hashing splits to a final global depth, and decide which index suits a given query.
What this chapter covers
- 012.1 Why index — the lookup-vs-scan trade
- 02Data entries, search keys, and dense vs sparse indexes
- 03Clustered vs unclustered — and why it dominates range-scan cost
- 04B+-trees: fan-out, height ⌈logF B⌉ + 1, leaf split (copy-up) & node split (push-up)
- 05Extendible hashing: bucket splits, local vs global depth, directory doubling
- 06Choosing an index: equality / range / group-by, and the composite-key prefix rule
Worked example: B+-tree height and search cost
- +1(a) Height from fan-out: a B+-tree of fan-out F holding N data entries has height about ⌈logF N⌉. Here log100 1,000,000 = log(106)/log(102) = 6/2 = 3 levels below the root.
- +1Read it as levels: root → one internal level → leaf — a handful of nodes regardless of how big the table is. That is the whole point of the high fan-out.
- +1(b) Search cost: you read one page per level on the way down, then the leaf, so cost = ⌈logF N⌉ + 1. With height 3 that is 3 + 1 = 4 page I/Os.
- +1Contrast with a scan: a full heap scan of the same data could be hundreds of thousands of page I/Os — the index reaches the row in 4. State that comparison; it is where the marks are.
Key terms
- Data entry
- What an index leaf actually stores for a key value — either the full record, the record id (a TID), or a list of TIDs. The choice (Alternative 1/2/3) decides whether the index is also the file (clustered) or a separate structure pointing into it.
- Clustered index
- An index whose order matches the physical order of the data file, so matching rows of a range query lie on consecutive pages. A clustered range scan reads only the qualifying pages; an unclustered one may pay a separate page I/O for every matching row — the single biggest cost difference in the chapter.
- Fan-out (F)
- The number of children an internal B+-tree node points to. A high fan-out keeps the tree shallow: a table of N entries has height about ⌈logF N⌉, so even millions of rows are reached in three or four page reads.
- Leaf split (copy-up)
- When a B+-tree leaf overflows on insert, it splits in two and the first key of the new leaf is copied up into the parent (the key stays in the leaf too). An internal-node split instead pushes up the middle key, which is why the two cases are traced differently.
- Extendible hashing
- A dynamic hash index in which a directory of 2global depth pointers maps the top bits of the hash to buckets. When a bucket overflows it splits; if its local depth equalled the global depth, the directory doubles. The exam traces inserts to a final global depth and directory size.
Indexing FAQ
What is the difference between a clustered and an unclustered index, and why does it matter so much?
A clustered index stores (or orders) the data in the same order as the index key, so a range query's matching rows sit on consecutive pages and the scan reads only those pages. An unclustered index just points into an unordered file, so each matching row can be on a different page — a range that returns r rows can cost up to r extra page I/Os. This single distinction often decides whether an index helps a range query at all, so always check whether the index is clustered before costing a range scan.
How do I compute a B+-tree's height and search cost?
Height is about ⌈logF N⌉, where F is the fan-out (entries per node) and N is the number of data entries. The equality-search cost is height + 1 page I/Os — one read per level down to and including the leaf. Quote the formula ⌈logF N⌉ + 1; the marks are in showing the fan-out reasoning, not just the final number.
When does the B+-tree help and when does a hash index help?
A B+-tree supports both equality and range queries and keeps data sorted, so it helps ORDER BY, GROUP BY and range predicates. A hash index supports only equality (and only on the full key) but is slightly faster for it; it cannot answer range queries because hashing destroys order. So: range or sort needed → B+-tree; pure equality on a complete key → hashing is a fine choice.
What is the prefix-match rule for a composite (multi-attribute) index?
A composite index on (A, B, C) can be used for a query only if the query constrains a prefix of the key — A alone, or A and B, or all three — in that order. A query that constrains only B cannot use the index, because the index is sorted by A first. So order the columns of a composite key to match the most common query patterns, putting the equality-tested columns first.
Exam move
Make the B+-tree your default mental model and the cost formula ⌈logF N⌉ + 1 reflexive. Drill three traces until they are automatic: a B+-tree insert that causes a leaf split (copy-up) and a node split (push-up); an extendible-hashing sequence of inserts that splits a bucket and doubles the directory, tracking local vs global depth; and an index-choice decision for equality, range and group-by queries. The most marks-losing trap is forgetting that clustered vs unclustered changes a range scan's cost dramatically, so always state which one you have before costing a range. For composite keys, apply the prefix-match rule. Tie every answer back to the page-I/O count — that is the number the question wants.