DATA3404 · Scalable Data Management
Sorting and Query Processing
SQL is declarative — you say what you want, not how to get it — and the engine compiles it into an imperative plan: a tree of physical operators it actually runs. This chapter walks the path parse → optimise → execute, shows how operators stream tuples to one another under the iterator model, explains how the engine picks an access path, and then drills the headline algorithm of the unit: external merge sort. Sorting is not a niche topic — it underlies ORDER BY, DISTINCT, GROUP BY, sort-merge join and bulk B+-tree loading — and its cost formula is exam bread-and-butter. The two near-guaranteed exam moves are computing the number of sort passes and total I/O from N pages and B buffer frames, and choosing the right access path for a query.
What this chapter covers
- 013.1 From SQL to a plan: parse → optimise → execute
- 02The iterator (pull) model — open / next / close and pipelining
- 03Access paths: when a full scan, a B+-tree, or a hash index wins
- 04External merge sort: the pass formula and total I/O = 2N × #passes
- 05Pipelining vs materialisation — and the extra I/O materialising costs
- 06The relational-algebra operators and selection/projection push-down
Worked example: external merge sort passes and total I/O
- +1(a) Pass 0 (create runs): read B = 5 pages at a time, sort them in memory and write them out as a sorted run. Number of runs = ⌈N/B⌉ = ⌈100/5⌉ = 20 runs, each 5 pages long.
- +1Merge fan-in: each later pass merges up to B − 1 = 4 runs at a time (one buffer frame is the output). So the run count shrinks by a factor of 4 per pass.
- +1(b) Count the passes: total passes = 1 + ⌈logB−1 ⌈N/B⌉⌉ = 1 + ⌈log4 20⌉ = 1 + ⌈2.16⌉ = 1 + 3 = 4 passes (20 → 5 → 2 → 1 run).
- +1(c) Total I/O: every pass reads and writes all N pages, so cost = 2N × (number of passes) = 2 × 100 × 4 = 800 page I/Os.
- +1State it: 4 passes, 800 page transfers — show the 2N·passes formula and the run-count reduction, because that is where the method marks are.
Key terms
- External merge sort
- The algorithm for sorting more data than fits in memory. Pass 0 makes ⌈N/B⌉ sorted runs of B pages; each later pass merges B−1 runs at a time. Total passes = 1 + ⌈logB−1 ⌈N/B⌉⌉ and total I/O = 2N × passes.
- Iterator (pull) model
- The execution model in which each physical operator exposes open / next / close, and an operator pulls one tuple at a time from its child. This lets results stream (pipeline) through the plan tree without writing intermediate results to disk.
- Access path
- The method an operator uses to read a relation — a full table scan, a B+-tree index lookup, or a hash index probe. The optimiser picks the cheapest access path for each table given the available indexes and the query's predicates.
- Materialisation
- Writing an operator's intermediate result to disk instead of streaming it on. It costs an extra read and write of that result (extra I/O) but is sometimes forced — e.g. a sort or a hash build must consume its whole input before producing output.
- Selection push-down
- A relational-algebra rewrite that moves a selection (filter) as far down the plan tree as possible, so rows are discarded before the expensive joins above. It shrinks every intermediate result and is the optimiser's cheapest, most reliable win.
Sorting and Query Processing FAQ
What is the formula for the number of passes in external merge sort?
Total passes = 1 + ⌈logB−1 ⌈N/B⌉⌉, where N is the number of pages to sort and B is the number of buffer frames. The first pass creates ⌈N/B⌉ sorted runs; each subsequent pass merges up to B−1 runs at once (one frame is the output), so the number of runs falls by a factor of B−1 per pass until a single sorted run remains.
Why is the merge fan-in B−1 and not B?
Because one buffer frame must hold the output of the merge while the other frames hold one input page from each run being merged. So with B frames you can merge at most B−1 runs at a time. Using B by mistake inflates the fan-in and underestimates the pass count — a very common arithmetic slip in the exam.
How do I compute the total I/O of a sort once I know the passes?
Every pass reads all N pages and writes all N pages, so each pass costs 2N page I/Os and the total is 2N × (number of passes). The page count N is the same on every pass because no rows are added or removed by sorting — only their order changes — so the 2N multiplier is constant.
Why does sorting matter beyond ORDER BY?
Because so many operators are built on a sorted input: DISTINCT and GROUP BY can be done by sorting then scanning for duplicates/groups; sort-merge join sorts both inputs first; and bulk-loading a B+-tree sorts the data entries. So the external-sort cost reappears inside other operators' costs, which is why DATA3404 drills its formula so hard.
Exam move
Make external merge sort automatic, because its cost reappears inside sort-merge join, DISTINCT, GROUP BY and bulk B+-tree loads. Memorise the two formulas as a pair — passes = 1 + ⌈logB−1 ⌈N/B⌉⌉ and total I/O = 2N × passes — and always use fan-in B−1 (reserve one frame for output). Practise the run-count reduction (e.g. 20 → 5 → 2 → 1) so you can sanity-check the log. For query processing, be able to read a small plan tree under the iterator model, say which access path is cheapest for a predicate, and explain why materialising an intermediate result costs extra I/O while pipelining does not. Finally, learn the six relational-algebra operators and the push-down rules — they set up the optimisation chapter.