University of Sydney · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

DATA3404 · Scalable Data Management

- one subject, every graph, every model, every mark
50% final exam · hurdle14 Chapters4-page Bible
Our own words - no uploaded lecturer files
Built to mirror S1 2026 · updated this semester
Chapter 4 of 7 · DATA3404

Join Algorithms

A join R ⋈ S pairs every matching tuple of two relations, and it is the single most expensive thing a query does. The exam does not ask you to run a join; it asks you to cost one. This is the #1 topic in DATA3404: you memorise five cost formulas — tuple/block nested-loop, index nested-loop, sort-merge, and grace hash — learn which join condition each one needs, and then read off the cheapest plan in seconds. The chapter teaches each algorithm the same way: mechanism → formula → a worked number on real page counts → when it wins. The exam gives you the page counts bR, bS, the row counts |R|, |S| and a buffer of M pages, and asks you to compute each I/O cost (output not counted), state which condition each join needs, and pick the cheapest — always naming which relation you made the outer one.

In this chapter

What this chapter covers

  • 014.1 Why joins are expensive — the I/O cost model (output not counted)
  • 02Tuple nested-loop join and why it wastes the buffer
  • 03Block nested-loop join: cost = bR + ⌈bR/(M−2)⌉ × bS
  • 04Index nested-loop join: bR + |R| × (index lookup cost)
  • 05Sort-merge join: bR + bS (+ the cost of sorting each input)
  • 06Grace hash join: 3 × (bR + bS), and which condition each join needs
Worked example · free

Worked example: block nested-loop join I/O cost

Q [5 marks]. Join Student (bR = 100 pages, the outer relation) with Enrollment (bS = 400 pages) using a block nested-loop join and a buffer of M = 10 pages. Compute the total I/O in page transfers.
  • +1Cost formula: BNLJ cost = bR + ⌈bR / (M − 2)⌉ × bS. One frame reads the inner relation, one is the output buffer, so each outer block is M − 2 pages.
  • +1Outer-block size: M − 2 = 10 − 2 = 8 pages.
  • +1Number of outer blocks: ⌈bR / (M−2)⌉ = ⌈100 / 8⌉ = 13.
  • +1Plug in: cost = 100 + 13 × 400 = 100 + 5,200.
  • +1Answer: 5,300 page I/Os — Student is read once; Enrollment is re-scanned once per outer block (13 times).
Cost = bR + ⌈bR/(M−2)⌉ × bS = 100 + ⌈100/8⌉ × 400 = 100 + 13 × 400 = 5,300 page transfers.
Sia tip — Make the smaller relation the outer one to minimise re-scans of the inner, and always state your choice — it changes the block count and the answer. The output is never counted in DATA3404's cost model.
Glossary

Key terms

Block nested-loop join (BNLJ)
A nested-loop join that fills the buffer with a block of the outer relation (M−2 pages) and scans the inner relation once per block. Cost = bR + ⌈bR/(M−2)⌉ × bS; it works for any join condition.
Index nested-loop join (INLJ)
A nested-loop join that, instead of scanning the inner relation, probes an index on the inner's join attribute for each outer tuple. Cost = bR + |R| × (index lookup cost); it needs an index on the inner relation and is cheap when that index is selective.
Sort-merge join (SMJ)
A join that sorts both relations on the join attribute then merges them in one pass. Cost = bR + bS plus the cost of sorting each input; it works only for equi/natural joins and leaves the result sorted, which can help later operators.
Grace hash join
A join that partitions both relations into buckets with the same hash function, then joins matching buckets in memory. Cost = 3 × (bR + bS) — read, write-partitions, read-back — for equi/natural joins when no useful index or sort order exists.
I/O cost model (joins)
DATA3404 costs a join in page transfers only: pages read from and written to disk. CPU work, the difference between sequential and random I/O, and the cost of writing the join output are all ignored — so never add output pages to a join cost.
FAQ

Join Algorithms FAQ

Which join algorithm is cheapest, and how do I decide?

There is no single winner — you compute each cost from the given page counts, row counts and buffer size and compare. As a rule of thumb: with a selective index on the inner relation, index nested-loop wins; for a large equi-join with enough memory, grace hash or sort-merge (each roughly 3(bR+bS) or bR+bS+sorts) beats block nested-loop; and sort-merge is attractive when an input is already sorted or the output needs to be. Always plug in the numbers rather than guessing.

Why is the output not counted in the join I/O cost?

Because DATA3404's cost model counts only the pages read and written to execute the join, and the cost of writing the final result is the same whichever algorithm you choose — so it cancels out when you compare algorithms and is conventionally dropped. Adding output pages to a join cost is a common mistake that loses marks.

Which joins work for which conditions?

Nested-loop joins (tuple and block) work for any join condition, including inequalities and theta-joins. Sort-merge and hash joins work only for equi-joins and natural joins, because both rely on equal keys lining up (by sorting or by hashing). Index nested-loop needs an index on the inner relation's join attribute. State the condition each join needs — it is an explicit exam ask.

Why does the choice of outer relation matter in a nested-loop join?

Because the inner relation is re-scanned once per block of the outer, the number of re-scans is ⌈bouter/(M−2)⌉. Making the smaller relation the outer one minimises that block count and therefore the total cost. The two choices can give noticeably different numbers, so always say which relation you made the outer one and, ideally, pick the cheaper assignment.

Study strategy

Exam move

This is the #1 exam topic, so over-prepare it. Memorise the five cost formulas as a single table — tuple-NLJ, BNLJ bR+⌈bR/(M−2)⌉bS, INLJ bR+|R|×c, SMJ bR+bS(+sorts), grace hash 3(bR+bS) — together with the join condition each one requires. Then drill the standard exam item: given bR, bS, |R|, |S| and M, compute every cost and pick the cheapest, explicitly stating which relation is outer. Watch the two classic traps: never count the output, and use M−2 (not M) for the outer block. Show the formula and every page count — method marks are real even when the final number slips.

A+Everything unlocked
Unlocks this Bible + all 8 of your University of Sydney subjects - and 1,000+ Bibles across every Australian university.
Sia - your DATA3404 tutor, unlimited, worked the way the exam marks it
The full 4-page Bible + practice bank with worked solutions
Chrome extension - sync your LMS so Sia knows your deadlines
Bilingual EN / Chinese on every Bible and every Sia answer
$25/ month
30-day money-back · cancel in one tap · how it works
Unlock the full DATA3404 Bible + 8 University of Sydney subjects解锁完整 DATA3404 Bible + University of Sydney 8 门科目
$25/mo