COMP9120 · Database Management Systems
Storage & Indexing
Storage & Indexing is the physical layer of COMP9120 Database Management Systems at the University of Sydney: how a table's rows pack onto fixed-size pages, and how a B+-tree or hash index turns a full scan into a short lookup. Because a database is I/O-bound, this topic measures every access path in one currency — the number of page I/Os — and the final exam asks you to count those pages by hand. Master the floor/ceil arithmetic and the clustered-vs-unclustered distinction and you own the 28-mark Storage & Indexing question.
What this chapter covers
- 01Why databases are I/O-bound and why cost is counted in page I/Os
- 02Heap vs sorted vs index file organisations and their search costs
- 03Records-per-page and data-page counts: floor for capacity, ceil for page counts
- 04Fill-factor, blocking factor, and record spanning
- 05The four index axes: primary/secondary, clustered/unclustered, dense/sparse, tree/hash
- 06Dense vs sparse data entries and when each is legal
- 07Clustered vs unclustered range-scan cost: the ~one-I/O-per-row gap
- 08B+-tree fan-out, bottom-up build, height, and equality-search cost
- 09Composite search keys and the consecutive-match (prefix) rule
- 10Choosing an access path: equality vs range, clustered vs unclustered
Counting data pages plus a secondary B+-tree index
- +1Records per page uses floor (a partial record does not fit): floor(4000 / 160) = 25 records per page.
- +1Apply the fill-factor, then use ceil for the page count: 25 x 0.75 = 18.75 effective records per page, so data pages = ceil(48,000 / 18.75) = 2,560.
- +1Index entry size = search key + rowid = 8 + 4 = 12 B; entries per page = floor(4000 / 12) = 333; fan-out F = 333 x 0.75 = 249.75.
- +1The index is dense (one entry per record), so the leaf level covers all 48,000 records: leaf pages = ceil(48,000 / 249.75) = 193.
- +1Climb one level: ceil(193 / 249.75) = 1 page = the root. So the index has 2 levels and 193 + 1 = 194 pages.
- +1Total = data 2,560 + index 194 = 2,754 pages. An equality lookup on guest_id descends 2 index levels then reads 1 record page = 3 page I/Os.
Key terms
- Page (block)
- The fixed-size unit of transfer between disk and memory. The DBMS reads a whole page, never a single record, so all cost is counted in page I/Os.
- Fill-factor
- The fraction of each page actually filled (often 75%), leaving slack for future inserts. Multiply the per-page capacity by the fill-factor before counting pages.
- Heap file
- Rows stored in insertion order. Cheap to insert, but equality on a unique key scans about half the file and a range scans every page.
- Clustered (primary) index
- The data records are stored in the same order as the index, so a range match sits on consecutive pages. There is at most one clustered index per table.
- Unclustered (secondary) index
- Points into scattered data, so it must be dense; a range or non-unique match costs up to one page I/O per matching record.
- Dense vs sparse
- A dense index keeps one entry per record; a sparse index keeps one entry per page and is only legal when the data is sorted on the search key.
- Fan-out (F)
- The number of entries per B+-tree page after the fill-factor, roughly floor((pageSize - header) / entrySize) x f. High fan-out keeps the tree shallow.
- B+-tree
- A balanced, dynamic index whose data entries live in a linked leaf chain. It answers both equality and range queries; equality cost = tree levels + one record page.
Storage & Indexing FAQ
Why does COMP9120 measure everything in page I/Os instead of CPU time?
A database is I/O-bound: reading a page from disk (seek + rotational delay + transfer) is far slower than any in-memory work, and one seek dwarfs the CPU cost of processing the page. So the unit ignores CPU and counts only the number of fixed-size pages moved between disk and memory. Every file-organisation and index cost in the topic is expressed as a page count, which is exactly what the final exam asks you to compute by hand.
What is the difference between a clustered and an unclustered index for a range query?
A clustered (primary) index stores the data rows in the same order as the index, so the matches for a range lie on consecutive pages and you read about ceil(m / rows-per-page) of them. An unclustered (secondary) index points into scattered data, so each of the m matching rows can sit on a different page, costing up to one page I/O per row. That is why an unclustered range scan can become more expensive than reading the whole file, at which point the optimiser prefers a full scan.
Can AI help me with storage and indexing in COMP9120?
Yes, as a study aid. Sia, the AskSia tutor, explains storage and indexing step by step: it can walk you through why databases are I/O-bound, re-derive the floor/ceil page-counting arithmetic on a fresh example, and check your reasoning on a clustered-vs-unclustered cost comparison. Use it to understand the method and practise, not to obtain answers to a graded assignment or exam; the exam is closed book, so the goal is to be able to do the counting yourself.
Studying with AI? Sia — free AI data science tutor works through COMP9120 step by step.
Exam move
Storage & Indexing is Part 2, Question 2 of the final exam, worth 28 marks on a 2-hour (120-minute), closed-book paper of 131 marks; at roughly 0.8 minutes per mark, budget about 22 minutes for it and keep time to re-check your other answers. Drill two moves until they are automatic: page counting (records/page with floor, fill-factor, then data pages with ceil; then fan-out and B+-tree levels) and access costing (equality vs range, clustered vs unclustered). Always show your working line by line, because the marks are method marks. Remember the double hurdle: you must score at least 40% on the final exam AND at least 50% overall, or the unit mark is capped at 45. Confirm the exact weights, format and exam-period dates on Canvas for your offering.