University of Sydney · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

DATA3404 · Scalable Data Management

Q: How do I compute a B+-tree's height and search cost?

Height is about &lceil;logF N&rceil;, where F is the fan-out (entries per node) and N is the number of data entries. The equality-search cost is height + 1 page I/Os — one read per level down to and including the leaf. Quote the formula &lceil;logF N&rceil; + 1; the marks are in showing the fan-out reasoning, not just the final number.

- one subject, every graph, every model, every mark

50% final exam · hurdle14 Chapters9-page Bible

Our own words - no uploaded lecturer files

Built to mirror S1 2026 · updated this semester

Chapter 2 of 7 · DATA3404

Indexing

An index is an access path that finds the rows matching a search key without reading the whole table — the single biggest lever on query cost, turning an O(B) scan of every page into an O(log_F B) descent of a few. This chapter covers what a data entry is, the dense-vs-sparse and clustered-vs-unclustered distinctions that decide range-scan cost, the B+-tree (the database's default index) with its fan-out, height and split rules, and extendible hashing with bucket splits and directory doubling. The exam lives here: it hands you a query plus some indexes and asks “what is the access cost in page I/Os?” You must compute a B+-tree's height and search cost, trace an insertion through a leaf split and an index-node split, trace extendible-hashing splits to a final global depth, and decide which index suits a given query.

In this chapter

What this chapter covers

012.1 Why index — the lookup-vs-scan trade
02Data entries, search keys, and dense vs sparse indexes
03Clustered vs unclustered — and why it dominates range-scan cost
04B+-trees: fan-out, height ⌈log_F B⌉ + 1, leaf split (copy-up) & node split (push-up)
05Extendible hashing: bucket splits, local vs global depth, directory doubling
06Choosing an index: equality / range / group-by, and the composite-key prefix rule

Worked example · free

Worked example: B+-tree height and search cost

Q [4 marks]. A B+-tree indexes a relation of 1,000,000 records. Each index node holds up to F = 100 entries (fan-out 100). (a) Estimate the height of the tree. (b) What is the cost in page I/Os of an equality search that finds one matching data entry?

+1(a) Height from fan-out: a B+-tree of fan-out F holding N data entries has height about ⌈log_F N⌉. Here log₁₀₀ 1,000,000 = log(10⁶)/log(10²) = 6/2 = 3 levels below the root.
+1Read it as levels: root → one internal level → leaf — a handful of nodes regardless of how big the table is. That is the whole point of the high fan-out.
+1(b) Search cost: you read one page per level on the way down, then the leaf, so cost = ⌈log_F N⌉ + 1. With height 3 that is 3 + 1 = 4 page I/Os.
+1Contrast with a scan: a full heap scan of the same data could be hundreds of thousands of page I/Os — the index reaches the row in 4. State that comparison; it is where the marks are.

Height ≈ ⌈log₁₀₀ 1,000,000⌉ = 3; equality search cost = height + 1 = ⌈log_F N⌉ + 1 = 4 page I/Os.

Sia tip — Quote the formula ⌈log_F N⌉ + 1 explicitly — method marks are awarded for showing the fan-out reasoning even if you round the height differently. And note this is the cost to a single matching entry; an unclustered range scan then pays one extra I/O per matching row.

Glossary

Key terms

Data entry: What an index leaf actually stores for a key value — either the full record, the record id (a TID), or a list of TIDs. The choice (Alternative 1/2/3) decides whether the index is also the file (clustered) or a separate structure pointing into it.
Clustered index: An index whose order matches the physical order of the data file, so matching rows of a range query lie on consecutive pages. A clustered range scan reads only the qualifying pages; an unclustered one may pay a separate page I/O for every matching row — the single biggest cost difference in the chapter.
Fan-out (F): The number of children an internal B+-tree node points to. A high fan-out keeps the tree shallow: a table of N entries has height about ⌈log_F N⌉, so even millions of rows are reached in three or four page reads.
Leaf split (copy-up): When a B+-tree leaf overflows on insert, it splits in two and the first key of the new leaf is copied up into the parent (the key stays in the leaf too). An internal-node split instead pushes up the middle key, which is why the two cases are traced differently.
Extendible hashing: A dynamic hash index in which a directory of 2^{global depth} pointers maps the top bits of the hash to buckets. When a bucket overflows it splits; if its local depth equalled the global depth, the directory doubles. The exam traces inserts to a final global depth and directory size.

FAQ

Indexing FAQ

What is the difference between a clustered and an unclustered index, and why does it matter so much?

A clustered index stores (or orders) the data in the same order as the index key, so a range query's matching rows sit on consecutive pages and the scan reads only those pages. An unclustered index just points into an unordered file, so each matching row can be on a different page — a range that returns r rows can cost up to r extra page I/Os. This single distinction often decides whether an index helps a range query at all, so always check whether the index is clustered before costing a range scan.

How do I compute a B+-tree's height and search cost?

Height is about ⌈log_F N⌉, where F is the fan-out (entries per node) and N is the number of data entries. The equality-search cost is height + 1 page I/Os — one read per level down to and including the leaf. Quote the formula ⌈log_F N⌉ + 1; the marks are in showing the fan-out reasoning, not just the final number.

When does the B+-tree help and when does a hash index help?

A B+-tree supports both equality and range queries and keeps data sorted, so it helps ORDER BY, GROUP BY and range predicates. A hash index supports only equality (and only on the full key) but is slightly faster for it; it cannot answer range queries because hashing destroys order. So: range or sort needed → B+-tree; pure equality on a complete key → hashing is a fine choice.

What is the prefix-match rule for a composite (multi-attribute) index?

A composite index on (A, B, C) can be used for a query only if the query constrains a prefix of the key — A alone, or A and B, or all three — in that order. A query that constrains only B cannot use the index, because the index is sorted by A first. So order the columns of a composite key to match the most common query patterns, putting the equality-tested columns first.

Study strategy

Exam move

Make the B+-tree your default mental model and the cost formula ⌈log_F N⌉ + 1 reflexive. Drill three traces until they are automatic: a B+-tree insert that causes a leaf split (copy-up) and a node split (push-up); an extendible-hashing sequence of inserts that splits a bucket and doubles the directory, tracking local vs global depth; and an index-choice decision for equality, range and group-by queries. The most marks-losing trap is forgetting that clustered vs unclustered changes a range scan's cost dramatically, so always state which one you have before costing a range. For composite keys, apply the prefix-match rule. Tie every answer back to the page-I/O count — that is the number the question wants.

Keep going — explore the course

A+Everything unlocked

Unlocks this Bible + all 8 of your University of Sydney subjects - and 1,000+ Bibles across every Australian university.

Sia - your DATA3404 tutor, unlimited, worked the way the exam marks it

The full 9-page Bible + practice bank with worked solutions

Chrome extension - sync your LMS so Sia knows your deadlines

Bilingual EN / Chinese on every Bible and every Sia answer

$25/ month

30-day money-back · cancel in one tap · how it works