University of Sydney · FACULTY OF COMPUTER SCIENCE

COMP9120 · Database Management Systems

- one subject, every graph, every model, every mark
Computer Science14 Chapters9-page Bible
Our own words - no uploaded lecturer files
Updated for this semester
Chapter 12 of 12 · COMP9120

Query Processing & Optimisation

Query processing is how a declarative SQL statement becomes a running physical plan, and optimisation is how the engine chooses a cheap one. In COMP9120 Database Management Systems at the University of Sydney this is the Week 12 topic that ties the whole unit together: relational algebra becomes an execution plan, and storage-level page-I/O costs decide which plan wins. This chapter covers the parse → optimise → execute pipeline, heuristic and cost-based optimisation, the join-algorithm cost formulas, and external merge sort — the costing toolkit the final paper tests by hand.

In this chapter

What this chapter covers

  • 01The three steps of query processing: parse & translate SQL to relational algebra, optimise, then execute
  • 02Logical expression tree vs physical execution plan — and why one query has many plans
  • 03Heuristic (algebraic) optimisation: push selections and projections down to shrink intermediate results
  • 04The equivalence rules — selection cascade, join commutativity/associativity, selection distributing over a join
  • 05Cost-based optimisation: annotate operators with physical algorithms and pick the lowest estimated page-I/O
  • 06Join costs: tuple nested-loop bR+|R|·bS, block nested-loop bR+bR·bS, index nested-loop bR+|R|·c
  • 07Why choosing the outer relation (and an index on the inner) changes the join cost
  • 08External merge sort: #passes = 1 + ceil(log_(B-1) ceil(N/B)) and total I/O = 2N · #passes
  • 09Sort-merge join, and materialisation vs pipelining between operators
Worked example · free

Cost a join four ways and pick the cheapest plan

Q [6 marks]. An Order relation has |R| = 5,000 tuples in bR = 500 pages; an Item relation has |S| = 40,000 tuples in bS = 1,000 pages; they join on order_id. Using this unit's cost formulas, cost the join by tuple nested-loop and by block nested-loop with EACH relation as the outer, then state the cheapest plan. (Cost = page I/Os; output is ignored.)
  • +1Set up the formulas. Tuple nested-loop = b_outer + |outer|·b_inner (inner scanned once per outer ROW); block nested-loop = b_outer + b_outer·b_inner (inner scanned once per outer PAGE, this unit's convention).
  • +1Tuple NLJ, Order outer = 500 + 5,000·1,000 = 500 + 5,000,000 = 5,000,500 I/O.
  • +1Tuple NLJ, Item outer = 1,000 + 40,000·500 = 1,000 + 20,000,000 = 20,001,000 I/O. So the smaller-row relation (Order) is the better outer.
  • +1Block NLJ, Order outer = 500 + 500·1,000 = 500 + 500,000 = 500,500 I/O.
  • +1Block NLJ, Item outer = 1,000 + 1,000·500 = 1,000 + 500,000 = 501,000 I/O.
  • +1Compare all four and state the winner: block NLJ with Order as the outer = 500,500 I/O, the cheapest.
Block nested-loop join with Order as the outer relation costs 500,500 page I/Os — the cheapest of the four options (tuple NLJ Order-outer 5,000,500; tuple NLJ Item-outer 20,001,000; block NLJ Item-outer 501,000). Block NLJ beats tuple NLJ by roughly 10x here because the inner relation is rescanned once per outer PAGE (bR) instead of once per outer ROW (|R|).
Sia tip — Always try BOTH relations as the outer and state which you chose — the cost changes and the marks depend on it. Make the smaller relation the outer, and use THIS unit's block NLJ formula bR + bR·bS (inner scanned once per outer page); if a suitable index exists on the inner join attribute, also cost index nested-loop = bR + |R|·c, which is often cheaper still.
Glossary

Key terms

Query execution plan
The tree of physical operators the engine actually runs; produced by the optimiser from the logical relational-algebra tree, it commits to concrete algorithms and access paths, and that is where the page-I/O cost lives.
Heuristic (algebraic) optimisation
A logical rewrite layer that applies equivalence rules — mainly pushing selections and projections down — to produce smaller intermediate results before any physical operator is chosen.
Cost-based optimisation
The physical layer that annotates each operator with an algorithm and access path, estimates the page-I/O of each candidate from catalogue statistics, and keeps the plan with the lowest estimated cost.
Selection push-down
Moving a selection as early as possible, legal only when its predicate uses attributes of a single relation, so fewer rows reach an expensive join. Projection push-down likewise drops unneeded columns early.
Tuple nested-loop join
For every outer tuple, scan the whole inner relation; cost = bR + |R|·bS. Works for any join condition but is the most expensive because the inner is rescanned once per outer row.
Block nested-loop join
In this unit, cost = bR + bR·bS — the inner relation is scanned once per outer page rather than per outer row, which cuts the tuple-NLJ cost by roughly the blocking factor.
Index nested-loop join
For each outer tuple, probe an index on the inner relation's join attribute and fetch matches; cost = bR + |R|·c. Needs an equi/natural join and an index on the inner — often the cheapest when both hold.
External merge sort
Disk-based sort for a file too large for RAM: Pass 0 makes ceil(N/B) sorted runs; each (B-1)-way merge pass costs 2N. #passes = 1 + ceil(log_(B-1) ceil(N/B)); total I/O = 2N · #passes. It underlies ORDER BY, DISTINCT, GROUP BY and sort-merge join.
FAQ

Query Processing & Optimisation FAQ

What is the difference between heuristic and cost-based optimisation?

Heuristic (algebraic) optimisation is a logical rewrite: it applies equivalence rules — mostly pushing selections and projections down — to shrink intermediate results, without knowing any page counts. Cost-based optimisation comes next: it annotates each operator with a physical algorithm and access path, estimates the page-I/O of each candidate plan from catalogue statistics such as cardinalities and blocking factors, and keeps the cheapest. Heuristics decide the shape of the tree; cost estimates decide the algorithms.

Which join algorithm is cheapest?

It depends on the inputs, so you should compute each one. Tuple nested-loop (bR + |R|·bS) is the baseline and usually the most expensive; block nested-loop (bR + bR·bS in this unit) is far cheaper because it rescans the inner once per outer page; index nested-loop (bR + |R|·c) is often cheapest when the join is an equi/natural join and there is an index on the inner relation's join attribute. Sort-merge is attractive for equi-joins when an input is already sorted or the output must be sorted. Always try both relations as the outer and pick the minimum.

Can AI help me with query processing and optimisation in COMP9120?

Yes — used as a study aid. An AI tutor such as Sia can explain the parse-optimise-execute pipeline step by step, walk you through why pushing a selection below a join is cheaper, and check your reasoning as you re-derive the join and external-sort cost formulas on your own practice numbers. It cannot sit the closed-book final exam for you and it will not guarantee a grade, so treat it as a way to understand the method and rehearse it, then practise the past-paper costing questions yourself and confirm the current unit details on Canvas.

Studying with AI? Sia — free AI data science tutor works through COMP9120 step by step.

Study strategy

Exam move

Treat this chapter as two skills to rehearse until they are automatic: rewriting and costing. For rewriting, practise taking a relational-algebra tree or an SFW query and pushing selections and projections below the join, naming the equivalence rule at each step. For costing, memorise the four formulas — tuple NLJ bR+|R|·bS, block NLJ bR+bR·bS, index NLJ bR+|R|·c, and external merge sort #passes = 1 + ceil(log_(B-1) ceil(N/B)) with total I/O = 2N·#passes — then drill them on your own page counts, always trying both relations as the outer and writing every intermediate value so a slip costs one line, not the whole answer. Since the final exam covers the full unit at roughly 0.9 minutes per mark, budget about nine minutes for a ten-mark costing part and show the formula, the substitution and the arithmetic. Confirm the current assessment details on Canvas.

A+Everything unlocked
Unlocks this Bible + all 25 of your University of Sydney subjects - and 1,000+ Bibles across every Australian university.
Sia - your COMP9120 tutor, unlimited, worked the way the exam marks it
The full 9-page Bible + practice bank with worked solutions
Chrome extension - sync your LMS so Sia knows your deadlines
Bilingual EN / Chinese on every Bible and every Sia answer
$25/ month
30-day money-back · cancel in one tap · how it works
Unlock the full COMP9120 Bible + 25 University of Sydney subjects解锁完整 COMP9120 Bible + University of Sydney 25 门科目
$25/mo