COMP9120 · Database Management Systems
Query Processing & Optimisation
Query processing is how a declarative SQL statement becomes a running physical plan, and optimisation is how the engine chooses a cheap one. In COMP9120 Database Management Systems at the University of Sydney this is the Week 12 topic that ties the whole unit together: relational algebra becomes an execution plan, and storage-level page-I/O costs decide which plan wins. This chapter covers the parse → optimise → execute pipeline, heuristic and cost-based optimisation, the join-algorithm cost formulas, and external merge sort — the costing toolkit the final paper tests by hand.
What this chapter covers
- 01The three steps of query processing: parse & translate SQL to relational algebra, optimise, then execute
- 02Logical expression tree vs physical execution plan — and why one query has many plans
- 03Heuristic (algebraic) optimisation: push selections and projections down to shrink intermediate results
- 04The equivalence rules — selection cascade, join commutativity/associativity, selection distributing over a join
- 05Cost-based optimisation: annotate operators with physical algorithms and pick the lowest estimated page-I/O
- 06Join costs: tuple nested-loop bR+|R|·bS, block nested-loop bR+bR·bS, index nested-loop bR+|R|·c
- 07Why choosing the outer relation (and an index on the inner) changes the join cost
- 08External merge sort: #passes = 1 + ceil(log_(B-1) ceil(N/B)) and total I/O = 2N · #passes
- 09Sort-merge join, and materialisation vs pipelining between operators
Cost a join four ways and pick the cheapest plan
- +1Set up the formulas. Tuple nested-loop = b_outer + |outer|·b_inner (inner scanned once per outer ROW); block nested-loop = b_outer + b_outer·b_inner (inner scanned once per outer PAGE, this unit's convention).
- +1Tuple NLJ, Order outer = 500 + 5,000·1,000 = 500 + 5,000,000 = 5,000,500 I/O.
- +1Tuple NLJ, Item outer = 1,000 + 40,000·500 = 1,000 + 20,000,000 = 20,001,000 I/O. So the smaller-row relation (Order) is the better outer.
- +1Block NLJ, Order outer = 500 + 500·1,000 = 500 + 500,000 = 500,500 I/O.
- +1Block NLJ, Item outer = 1,000 + 1,000·500 = 1,000 + 500,000 = 501,000 I/O.
- +1Compare all four and state the winner: block NLJ with Order as the outer = 500,500 I/O, the cheapest.
Key terms
- Query execution plan
- The tree of physical operators the engine actually runs; produced by the optimiser from the logical relational-algebra tree, it commits to concrete algorithms and access paths, and that is where the page-I/O cost lives.
- Heuristic (algebraic) optimisation
- A logical rewrite layer that applies equivalence rules — mainly pushing selections and projections down — to produce smaller intermediate results before any physical operator is chosen.
- Cost-based optimisation
- The physical layer that annotates each operator with an algorithm and access path, estimates the page-I/O of each candidate from catalogue statistics, and keeps the plan with the lowest estimated cost.
- Selection push-down
- Moving a selection as early as possible, legal only when its predicate uses attributes of a single relation, so fewer rows reach an expensive join. Projection push-down likewise drops unneeded columns early.
- Tuple nested-loop join
- For every outer tuple, scan the whole inner relation; cost = bR + |R|·bS. Works for any join condition but is the most expensive because the inner is rescanned once per outer row.
- Block nested-loop join
- In this unit, cost = bR + bR·bS — the inner relation is scanned once per outer page rather than per outer row, which cuts the tuple-NLJ cost by roughly the blocking factor.
- Index nested-loop join
- For each outer tuple, probe an index on the inner relation's join attribute and fetch matches; cost = bR + |R|·c. Needs an equi/natural join and an index on the inner — often the cheapest when both hold.
- External merge sort
- Disk-based sort for a file too large for RAM: Pass 0 makes ceil(N/B) sorted runs; each (B-1)-way merge pass costs 2N. #passes = 1 + ceil(log_(B-1) ceil(N/B)); total I/O = 2N · #passes. It underlies ORDER BY, DISTINCT, GROUP BY and sort-merge join.
Query Processing & Optimisation FAQ
What is the difference between heuristic and cost-based optimisation?
Heuristic (algebraic) optimisation is a logical rewrite: it applies equivalence rules — mostly pushing selections and projections down — to shrink intermediate results, without knowing any page counts. Cost-based optimisation comes next: it annotates each operator with a physical algorithm and access path, estimates the page-I/O of each candidate plan from catalogue statistics such as cardinalities and blocking factors, and keeps the cheapest. Heuristics decide the shape of the tree; cost estimates decide the algorithms.
Which join algorithm is cheapest?
It depends on the inputs, so you should compute each one. Tuple nested-loop (bR + |R|·bS) is the baseline and usually the most expensive; block nested-loop (bR + bR·bS in this unit) is far cheaper because it rescans the inner once per outer page; index nested-loop (bR + |R|·c) is often cheapest when the join is an equi/natural join and there is an index on the inner relation's join attribute. Sort-merge is attractive for equi-joins when an input is already sorted or the output must be sorted. Always try both relations as the outer and pick the minimum.
Can AI help me with query processing and optimisation in COMP9120?
Yes — used as a study aid. An AI tutor such as Sia can explain the parse-optimise-execute pipeline step by step, walk you through why pushing a selection below a join is cheaper, and check your reasoning as you re-derive the join and external-sort cost formulas on your own practice numbers. It cannot sit the closed-book final exam for you and it will not guarantee a grade, so treat it as a way to understand the method and rehearse it, then practise the past-paper costing questions yourself and confirm the current unit details on Canvas.
Studying with AI? Sia — free AI data science tutor works through COMP9120 step by step.
Exam move
Treat this chapter as two skills to rehearse until they are automatic: rewriting and costing. For rewriting, practise taking a relational-algebra tree or an SFW query and pushing selections and projections below the join, naming the equivalence rule at each step. For costing, memorise the four formulas — tuple NLJ bR+|R|·bS, block NLJ bR+bR·bS, index NLJ bR+|R|·c, and external merge sort #passes = 1 + ceil(log_(B-1) ceil(N/B)) with total I/O = 2N·#passes — then drill them on your own page counts, always trying both relations as the outer and writing every intermediate value so a slip costs one line, not the whole answer. Since the final exam covers the full unit at roughly 0.9 minutes per mark, budget about nine minutes for a ten-mark costing part and show the formula, the substitution and the arithmetic. Confirm the current assessment details on Canvas.