University of Sydney · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

DATA3404 · Scalable Data Management

- one subject, every graph, every model, every mark
50% final exam · hurdle14 Chapters6-page Bible
Our own words - no uploaded lecturer files
Built to mirror S1 2026 · updated this semester
Chapter 3 of 7 · DATA3404

Sorting and Query Processing

SQL is declarative — you say what you want, not how to get it — and the engine compiles it into an imperative plan: a tree of physical operators it actually runs. This chapter walks the path parse → optimise → execute, shows how operators stream tuples to one another under the iterator model, explains how the engine picks an access path, and then drills the headline algorithm of the unit: external merge sort. Sorting is not a niche topic — it underlies ORDER BY, DISTINCT, GROUP BY, sort-merge join and bulk B+-tree loading — and its cost formula is exam bread-and-butter. The two near-guaranteed exam moves are computing the number of sort passes and total I/O from N pages and B buffer frames, and choosing the right access path for a query.

In this chapter

What this chapter covers

  • 013.1 From SQL to a plan: parse → optimise → execute
  • 02The iterator (pull) model — open / next / close and pipelining
  • 03Access paths: when a full scan, a B+-tree, or a hash index wins
  • 04External merge sort: the pass formula and total I/O = 2N × #passes
  • 05Pipelining vs materialisation — and the extra I/O materialising costs
  • 06The relational-algebra operators and selection/projection push-down
Worked example · free

Worked example: external merge sort passes and total I/O

Q [5 marks]. You must sort a relation of N = 100 pages using a buffer of B = 5 pages. (a) How many runs are produced by the first (sorting) pass, and how big is each? (b) How many passes does external merge sort take in total? (c) What is the total I/O cost?
  • +1(a) Pass 0 (create runs): read B = 5 pages at a time, sort them in memory and write them out as a sorted run. Number of runs = ⌈N/B⌉ = ⌈100/5⌉ = 20 runs, each 5 pages long.
  • +1Merge fan-in: each later pass merges up to B − 1 = 4 runs at a time (one buffer frame is the output). So the run count shrinks by a factor of 4 per pass.
  • +1(b) Count the passes: total passes = 1 + ⌈logB−1 ⌈N/B⌉⌉ = 1 + ⌈log4 20⌉ = 1 + ⌈2.16⌉ = 1 + 3 = 4 passes (20 → 5 → 2 → 1 run).
  • +1(c) Total I/O: every pass reads and writes all N pages, so cost = 2N × (number of passes) = 2 × 100 × 4 = 800 page I/Os.
  • +1State it: 4 passes, 800 page transfers — show the 2N·passes formula and the run-count reduction, because that is where the method marks are.
Pass 0 makes ⌈100/5⌉ = 20 runs of 5 pages; total passes = 1 + ⌈log4 20⌉ = 4; total I/O = 2N × passes = 2 × 100 × 4 = 800 page transfers.
Sia tip — The merge fan-in is B − 1, not B — one frame is reserved for the output. Forgetting the −1 is the classic slip that throws the pass count off. And every pass touches all N pages, so the 2N multiplier never changes.
Glossary

Key terms

External merge sort
The algorithm for sorting more data than fits in memory. Pass 0 makes ⌈N/B⌉ sorted runs of B pages; each later pass merges B−1 runs at a time. Total passes = 1 + ⌈logB−1 ⌈N/B⌉⌉ and total I/O = 2N × passes.
Iterator (pull) model
The execution model in which each physical operator exposes open / next / close, and an operator pulls one tuple at a time from its child. This lets results stream (pipeline) through the plan tree without writing intermediate results to disk.
Access path
The method an operator uses to read a relation — a full table scan, a B+-tree index lookup, or a hash index probe. The optimiser picks the cheapest access path for each table given the available indexes and the query's predicates.
Materialisation
Writing an operator's intermediate result to disk instead of streaming it on. It costs an extra read and write of that result (extra I/O) but is sometimes forced — e.g. a sort or a hash build must consume its whole input before producing output.
Selection push-down
A relational-algebra rewrite that moves a selection (filter) as far down the plan tree as possible, so rows are discarded before the expensive joins above. It shrinks every intermediate result and is the optimiser's cheapest, most reliable win.
FAQ

Sorting and Query Processing FAQ

What is the formula for the number of passes in external merge sort?

Total passes = 1 + ⌈logB−1 ⌈N/B⌉⌉, where N is the number of pages to sort and B is the number of buffer frames. The first pass creates ⌈N/B⌉ sorted runs; each subsequent pass merges up to B−1 runs at once (one frame is the output), so the number of runs falls by a factor of B−1 per pass until a single sorted run remains.

Why is the merge fan-in B−1 and not B?

Because one buffer frame must hold the output of the merge while the other frames hold one input page from each run being merged. So with B frames you can merge at most B−1 runs at a time. Using B by mistake inflates the fan-in and underestimates the pass count — a very common arithmetic slip in the exam.

How do I compute the total I/O of a sort once I know the passes?

Every pass reads all N pages and writes all N pages, so each pass costs 2N page I/Os and the total is 2N × (number of passes). The page count N is the same on every pass because no rows are added or removed by sorting — only their order changes — so the 2N multiplier is constant.

Why does sorting matter beyond ORDER BY?

Because so many operators are built on a sorted input: DISTINCT and GROUP BY can be done by sorting then scanning for duplicates/groups; sort-merge join sorts both inputs first; and bulk-loading a B+-tree sorts the data entries. So the external-sort cost reappears inside other operators' costs, which is why DATA3404 drills its formula so hard.

Study strategy

Exam move

Make external merge sort automatic, because its cost reappears inside sort-merge join, DISTINCT, GROUP BY and bulk B+-tree loads. Memorise the two formulas as a pair — passes = 1 + ⌈logB−1 ⌈N/B⌉⌉ and total I/O = 2N × passes — and always use fan-in B−1 (reserve one frame for output). Practise the run-count reduction (e.g. 20 → 5 → 2 → 1) so you can sanity-check the log. For query processing, be able to read a small plan tree under the iterator model, say which access path is cheapest for a predicate, and explain why materialising an intermediate result costs extra I/O while pipelining does not. Finally, learn the six relational-algebra operators and the push-down rules — they set up the optimisation chapter.

A+Everything unlocked
Unlocks this Bible + all 8 of your University of Sydney subjects - and 1,000+ Bibles across every Australian university.
Sia - your DATA3404 tutor, unlimited, worked the way the exam marks it
The full 6-page Bible + practice bank with worked solutions
Chrome extension - sync your LMS so Sia knows your deadlines
Bilingual EN / Chinese on every Bible and every Sia answer
$25/ month
30-day money-back · cancel in one tap · how it works
Unlock the full DATA3404 Bible + 8 University of Sydney subjects解锁完整 DATA3404 Bible + University of Sydney 8 门科目
$25/mo