University of Sydney · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

DATA3404 · Scalable Data Management

- one subject, every graph, every model, every mark
50% final exam · hurdle14 Chapters8-page Bible
Our own words - no uploaded lecturer files
Built to mirror S1 2026 · updated this semester
Chapter 5 of 7 · DATA3404

Query Optimisation

SQL is declarative — you say what rows you want, not how to compute them — and the query optimiser turns that declaration into the cheapest physical plan it can find. It does this in two moves: a logical rewrite (re-arrange the relational-algebra expression into a cheaper but equivalent one) and a physical plan choice (pick join orders and algorithms over an enormous search space). This chapter teaches you to estimate costs and reason like the optimiser: push selections and projections down below joins, estimate selectivity and the size of each intermediate result, cost a plan by plugging sizes into the join I/O formulas, and understand why the optimiser prefers left-deep plans and uses System-R dynamic programming over join orders. The headline worked chain in the unit reduces a Car/Trip query through the estimates 62,625 → 2,125 → 100,500 → 38.

In this chapter

What this chapter covers

  • 015.1 What the optimiser does: logical rewrite + physical plan choice
  • 02Relational-algebra equivalences and pushing selections/projections down
  • 03Selectivity and reduction factors — estimating result sizes
  • 04Costing a plan: sizes → join I/O formulas → summed page transfers
  • 05Left-deep plans and System-R dynamic programming over join orders
  • 06Interesting orders — why a sorted intermediate can be worth keeping
Worked example · free

Worked example: selectivity and intermediate-result size

Q [4 marks]. A relation R has |R| = 100,000 rows. A query applies the predicate year = 2024, and the attribute year has V = 50 distinct values, uniformly distributed. (a) What is the selectivity (reduction factor) of the predicate? (b) Estimate the number of rows that survive the selection. (c) Why does pushing this selection below a later join reduce total cost?
  • +1(a) Reduction factor of an equality on a value: with V uniformly distributed distinct values, an equality predicate keeps a fraction 1/V of the rows. So selectivity = 1/50 = 0.02.
  • +1(b) Surviving rows: |result| = selectivity × |R| = 0.02 × 100,000 = 2,000 rows.
  • +1(c) Why push it down: if the join above runs on 2,000 rows instead of 100,000, every join-cost formula is fed a far smaller input — fewer pages to scan, fewer tuples to match.
  • +1State the rule: selections shrink intermediate results, and intermediate-result size drives every downstream join cost, so the optimiser pushes selections as far down the plan tree as the algebra allows.
Selectivity = 1/V = 1/50 = 0.02; surviving rows = 0.02 × 100,000 = 2,000; pushing the selection below the join shrinks the join's input from 100,000 to 2,000 rows, cutting total cost.
Sia tip — Equality on a key (all values distinct) has selectivity 1/|R|; equality on a value with V distinct values has selectivity 1/V; a range predicate's selectivity comes from the value range. Multiply reduction factors across independent predicates to size a result.
Glossary

Key terms

Query optimiser
The engine component that turns a declarative SQL query into the cheapest physical plan it can find, by rewriting the relational-algebra expression and choosing join orders and algorithms. It estimates costs from statistics rather than running the query, so its estimates can be wrong if the statistics are stale.
Selectivity (reduction factor)
The fraction of input rows a predicate keeps. An equality on an attribute with V uniformly distributed distinct values has selectivity 1/V; the optimiser multiplies the reduction factors of independent predicates to estimate the size of each intermediate result.
Selection / projection push-down
Relational-algebra rewrites that move selections (filters) and projections (column drops) below the joins, so rows and columns are discarded before the expensive joins run. They shrink every intermediate result and are the optimiser's cheapest reliable wins.
Left-deep plan
A join tree in which every join's right input is a base relation (the tree leans left). Left-deep plans let the optimiser pipeline results and keep the System-R dynamic-programming search tractable, which is why real optimisers restrict their search to them.
System-R dynamic programming
The classic join-order search: build the cheapest plan for every subset of relations bottom-up, reusing the best sub-plans, while keeping any plan with an interesting (useful sorted) order. It avoids enumerating all n! join orders by reusing optimal sub-solutions.
FAQ

Query Optimisation FAQ

How does the optimiser estimate the size of an intermediate result?

It multiplies the input size by the selectivity (reduction factor) of each predicate. An equality on an attribute with V distinct uniformly distributed values has selectivity 1/V; a join on a foreign key has selectivity 1/max(V_A, V_B). Multiplying the reduction factors of independent predicates gives the estimated output size, which then feeds the cost of every operator above it.

Why does the optimiser push selections and projections down?

Because shrinking the data early shrinks everything above it. Pushing a selection below a join means the join runs on far fewer rows; pushing a projection below it means fewer columns (narrower tuples, more rows per page). Both reduce intermediate-result sizes, and since intermediate-result size drives join cost, these rewrites are the cheapest and most reliable optimisations.

Why do optimisers only consider left-deep plans?

Restricting the search to left-deep trees (where each join's inner input is a base table) keeps the number of plans manageable and lets results pipeline from one join straight into the next without materialising. Considering all bushy plans would blow up the search space, so System-R-style optimisers trade a little plan quality for a tractable search.

Do I need to memorise the exact numbers in the Car/Trip cost chain?

Not the digits, but you should be able to reproduce the method: estimate each operator's input size from selectivities, plug those sizes into the relevant join I/O formula, and sum the page transfers down the plan. The chain (62,625 → 2,125 → 100,500 → 38) is there to show how a pushed-down selection and a different join order change the totals — understand why each step moves, not the exact figures.

Study strategy

Exam move

Optimisation is where the storage, indexing, sorting and join chapters all come together, so treat it as integration practice. Master three skills: (1) rewrite an RA expression to push selections and projections below the joins and say why it is cheaper; (2) estimate sizes using reduction factors — 1/V for an equality on V values, multiplied across independent predicates; and (3) cost a plan end to end by feeding those sizes into the join I/O formulas from Chapter 4 and summing the page transfers. Be able to explain in words why the optimiser prefers left-deep plans and uses dynamic programming over join orders, and why an interesting (sorted) order can be worth keeping. Work the unit's Car/Trip chain until the method — not the digits — is automatic.

A+Everything unlocked
Unlocks this Bible + all 8 of your University of Sydney subjects - and 1,000+ Bibles across every Australian university.
Sia - your DATA3404 tutor, unlimited, worked the way the exam marks it
The full 8-page Bible + practice bank with worked solutions
Chrome extension - sync your LMS so Sia knows your deadlines
Bilingual EN / Chinese on every Bible and every Sia answer
$25/ month
30-day money-back · cancel in one tap · how it works
Unlock the full DATA3404 Bible + 8 University of Sydney subjects解锁完整 DATA3404 Bible + University of Sydney 8 门科目
$25/mo