University of Sydney · FACULTY OF COMPUTER SCIENCE

COMP9120 · Database Management Systems

- one subject, every graph, every model, every mark
50% final exam · hurdle14 Chapters7-page Bible
Our own words - no uploaded lecturer files
Updated for this semester
Chapter 11 of 12 · COMP9120

Storage & Indexing

Storage & Indexing is the physical layer of COMP9120 Database Management Systems at the University of Sydney: how a table's rows pack onto fixed-size pages, and how a B+-tree or hash index turns a full scan into a short lookup. Because a database is I/O-bound, this topic measures every access path in one currency — the number of page I/Os — and the final exam asks you to count those pages by hand. Master the floor/ceil arithmetic and the clustered-vs-unclustered distinction and you own the 28-mark Storage & Indexing question.

In this chapter

What this chapter covers

  • 01Why databases are I/O-bound and why cost is counted in page I/Os
  • 02Heap vs sorted vs index file organisations and their search costs
  • 03Records-per-page and data-page counts: floor for capacity, ceil for page counts
  • 04Fill-factor, blocking factor, and record spanning
  • 05The four index axes: primary/secondary, clustered/unclustered, dense/sparse, tree/hash
  • 06Dense vs sparse data entries and when each is legal
  • 07Clustered vs unclustered range-scan cost: the ~one-I/O-per-row gap
  • 08B+-tree fan-out, bottom-up build, height, and equality-search cost
  • 09Composite search keys and the consecutive-match (prefix) rule
  • 10Choosing an access path: equality vs range, clustered vs unclustered
Worked example · free

Counting data pages plus a secondary B+-tree index

Q [6 marks]. Table Booking(booking_id, guest_id, nights). Each record is 160 B; a page is 4096 B with a 96 B header (usable 4000 B); there are 48,000 records with a 75% fill-factor and a 4 B rowid. A dense secondary B+-tree indexes guest_id (key 8 B). How many pages does the data file plus this index occupy, and what does an equality lookup on guest_id cost?
  • +1Records per page uses floor (a partial record does not fit): floor(4000 / 160) = 25 records per page.
  • +1Apply the fill-factor, then use ceil for the page count: 25 x 0.75 = 18.75 effective records per page, so data pages = ceil(48,000 / 18.75) = 2,560.
  • +1Index entry size = search key + rowid = 8 + 4 = 12 B; entries per page = floor(4000 / 12) = 333; fan-out F = 333 x 0.75 = 249.75.
  • +1The index is dense (one entry per record), so the leaf level covers all 48,000 records: leaf pages = ceil(48,000 / 249.75) = 193.
  • +1Climb one level: ceil(193 / 249.75) = 1 page = the root. So the index has 2 levels and 193 + 1 = 194 pages.
  • +1Total = data 2,560 + index 194 = 2,754 pages. An equality lookup on guest_id descends 2 index levels then reads 1 record page = 3 page I/Os.
The data file is 2,560 pages and the dense secondary B+-tree is 194 pages (193 leaf + 1 root, 2 levels), for a total of 2,754 pages. An equality lookup on guest_id costs 2 index levels + 1 record page = 3 page I/Os.
Sia tip — Keep floor and ceil straight: capacity per page rounds DOWN, page and level counts round UP, and apply the fill-factor before you take the ceiling. A dense index sizes over records; a sparse index sizes over data pages.
Glossary

Key terms

Page (block)
The fixed-size unit of transfer between disk and memory. The DBMS reads a whole page, never a single record, so all cost is counted in page I/Os.
Fill-factor
The fraction of each page actually filled (often 75%), leaving slack for future inserts. Multiply the per-page capacity by the fill-factor before counting pages.
Heap file
Rows stored in insertion order. Cheap to insert, but equality on a unique key scans about half the file and a range scans every page.
Clustered (primary) index
The data records are stored in the same order as the index, so a range match sits on consecutive pages. There is at most one clustered index per table.
Unclustered (secondary) index
Points into scattered data, so it must be dense; a range or non-unique match costs up to one page I/O per matching record.
Dense vs sparse
A dense index keeps one entry per record; a sparse index keeps one entry per page and is only legal when the data is sorted on the search key.
Fan-out (F)
The number of entries per B+-tree page after the fill-factor, roughly floor((pageSize - header) / entrySize) x f. High fan-out keeps the tree shallow.
B+-tree
A balanced, dynamic index whose data entries live in a linked leaf chain. It answers both equality and range queries; equality cost = tree levels + one record page.
FAQ

Storage & Indexing FAQ

Why does COMP9120 measure everything in page I/Os instead of CPU time?

A database is I/O-bound: reading a page from disk (seek + rotational delay + transfer) is far slower than any in-memory work, and one seek dwarfs the CPU cost of processing the page. So the unit ignores CPU and counts only the number of fixed-size pages moved between disk and memory. Every file-organisation and index cost in the topic is expressed as a page count, which is exactly what the final exam asks you to compute by hand.

What is the difference between a clustered and an unclustered index for a range query?

A clustered (primary) index stores the data rows in the same order as the index, so the matches for a range lie on consecutive pages and you read about ceil(m / rows-per-page) of them. An unclustered (secondary) index points into scattered data, so each of the m matching rows can sit on a different page, costing up to one page I/O per row. That is why an unclustered range scan can become more expensive than reading the whole file, at which point the optimiser prefers a full scan.

Can AI help me with storage and indexing in COMP9120?

Yes, as a study aid. Sia, the AskSia tutor, explains storage and indexing step by step: it can walk you through why databases are I/O-bound, re-derive the floor/ceil page-counting arithmetic on a fresh example, and check your reasoning on a clustered-vs-unclustered cost comparison. Use it to understand the method and practise, not to obtain answers to a graded assignment or exam; the exam is closed book, so the goal is to be able to do the counting yourself.

Studying with AI? Sia — free AI data science tutor works through COMP9120 step by step.

Study strategy

Exam move

Storage & Indexing is Part 2, Question 2 of the final exam, worth 28 marks on a 2-hour (120-minute), closed-book paper of 131 marks; at roughly 0.8 minutes per mark, budget about 22 minutes for it and keep time to re-check your other answers. Drill two moves until they are automatic: page counting (records/page with floor, fill-factor, then data pages with ceil; then fan-out and B+-tree levels) and access costing (equality vs range, clustered vs unclustered). Always show your working line by line, because the marks are method marks. Remember the double hurdle: you must score at least 40% on the final exam AND at least 50% overall, or the unit mark is capped at 45. Confirm the exact weights, format and exam-period dates on Canvas for your offering.

A+Everything unlocked
Unlocks this Bible + all 25 of your University of Sydney subjects - and 1,000+ Bibles across every Australian university.
Sia - your COMP9120 tutor, unlimited, worked the way the exam marks it
The full 7-page Bible + practice bank with worked solutions
Chrome extension - sync your LMS so Sia knows your deadlines
Bilingual EN / Chinese on every Bible and every Sia answer
$25/ month
30-day money-back · cancel in one tap · how it works
Unlock the full COMP9120 Bible + 25 University of Sydney subjects解锁完整 COMP9120 Bible + University of Sydney 25 门科目
$25/mo