DATA3404 · Scalable Data Management
Query Optimisation
SQL is declarative — you say what rows you want, not how to compute them — and the query optimiser turns that declaration into the cheapest physical plan it can find. It does this in two moves: a logical rewrite (re-arrange the relational-algebra expression into a cheaper but equivalent one) and a physical plan choice (pick join orders and algorithms over an enormous search space). This chapter teaches you to estimate costs and reason like the optimiser: push selections and projections down below joins, estimate selectivity and the size of each intermediate result, cost a plan by plugging sizes into the join I/O formulas, and understand why the optimiser prefers left-deep plans and uses System-R dynamic programming over join orders. The headline worked chain in the unit reduces a Car/Trip query through the estimates 62,625 → 2,125 → 100,500 → 38.
What this chapter covers
- 015.1 What the optimiser does: logical rewrite + physical plan choice
- 02Relational-algebra equivalences and pushing selections/projections down
- 03Selectivity and reduction factors — estimating result sizes
- 04Costing a plan: sizes → join I/O formulas → summed page transfers
- 05Left-deep plans and System-R dynamic programming over join orders
- 06Interesting orders — why a sorted intermediate can be worth keeping
Worked example: selectivity and intermediate-result size
- +1(a) Reduction factor of an equality on a value: with V uniformly distributed distinct values, an equality predicate keeps a fraction 1/V of the rows. So selectivity = 1/50 = 0.02.
- +1(b) Surviving rows: |result| = selectivity × |R| = 0.02 × 100,000 = 2,000 rows.
- +1(c) Why push it down: if the join above runs on 2,000 rows instead of 100,000, every join-cost formula is fed a far smaller input — fewer pages to scan, fewer tuples to match.
- +1State the rule: selections shrink intermediate results, and intermediate-result size drives every downstream join cost, so the optimiser pushes selections as far down the plan tree as the algebra allows.
Key terms
- Query optimiser
- The engine component that turns a declarative SQL query into the cheapest physical plan it can find, by rewriting the relational-algebra expression and choosing join orders and algorithms. It estimates costs from statistics rather than running the query, so its estimates can be wrong if the statistics are stale.
- Selectivity (reduction factor)
- The fraction of input rows a predicate keeps. An equality on an attribute with V uniformly distributed distinct values has selectivity 1/V; the optimiser multiplies the reduction factors of independent predicates to estimate the size of each intermediate result.
- Selection / projection push-down
- Relational-algebra rewrites that move selections (filters) and projections (column drops) below the joins, so rows and columns are discarded before the expensive joins run. They shrink every intermediate result and are the optimiser's cheapest reliable wins.
- Left-deep plan
- A join tree in which every join's right input is a base relation (the tree leans left). Left-deep plans let the optimiser pipeline results and keep the System-R dynamic-programming search tractable, which is why real optimisers restrict their search to them.
- System-R dynamic programming
- The classic join-order search: build the cheapest plan for every subset of relations bottom-up, reusing the best sub-plans, while keeping any plan with an interesting (useful sorted) order. It avoids enumerating all n! join orders by reusing optimal sub-solutions.
Query Optimisation FAQ
How does the optimiser estimate the size of an intermediate result?
It multiplies the input size by the selectivity (reduction factor) of each predicate. An equality on an attribute with V distinct uniformly distributed values has selectivity 1/V; a join on a foreign key has selectivity 1/max(V_A, V_B). Multiplying the reduction factors of independent predicates gives the estimated output size, which then feeds the cost of every operator above it.
Why does the optimiser push selections and projections down?
Because shrinking the data early shrinks everything above it. Pushing a selection below a join means the join runs on far fewer rows; pushing a projection below it means fewer columns (narrower tuples, more rows per page). Both reduce intermediate-result sizes, and since intermediate-result size drives join cost, these rewrites are the cheapest and most reliable optimisations.
Why do optimisers only consider left-deep plans?
Restricting the search to left-deep trees (where each join's inner input is a base table) keeps the number of plans manageable and lets results pipeline from one join straight into the next without materialising. Considering all bushy plans would blow up the search space, so System-R-style optimisers trade a little plan quality for a tractable search.
Do I need to memorise the exact numbers in the Car/Trip cost chain?
Not the digits, but you should be able to reproduce the method: estimate each operator's input size from selectivities, plug those sizes into the relevant join I/O formula, and sum the page transfers down the plan. The chain (62,625 → 2,125 → 100,500 → 38) is there to show how a pushed-down selection and a different join order change the totals — understand why each step moves, not the exact figures.
Exam move
Optimisation is where the storage, indexing, sorting and join chapters all come together, so treat it as integration practice. Master three skills: (1) rewrite an RA expression to push selections and projections below the joins and say why it is cheaper; (2) estimate sizes using reduction factors — 1/V for an equality on V values, multiplied across independent predicates; and (3) cost a plan end to end by feeding those sizes into the join I/O formulas from Chapter 4 and summing the page transfers. Be able to explain in words why the optimiser prefers left-deep plans and uses dynamic programming over join orders, and why an interesting (sorted) order can be worth keeping. Work the unit's Car/Trip chain until the method — not the digits — is automatic.