University of Sydney · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

DATA3404 · Scalable Data Management

Q: Which joins work for which conditions?

Nested-loop joins (tuple and block) work for any join condition, including inequalities and theta-joins. Sort-merge and hash joins work only for equi-joins and natural joins, because both rely on equal keys lining up (by sorting or by hashing). Index nested-loop needs an index on the inner relation's join attribute. State the condition each join needs — it is an explicit exam ask.

- one subject, every graph, every model, every mark

50% final exam · hurdle14 Chapters4-page Bible

Our own words - no uploaded lecturer files

Built to mirror S1 2026 · updated this semester

Chapter 4 of 7 · DATA3404

Join Algorithms

A join R ⋈ S pairs every matching tuple of two relations, and it is the single most expensive thing a query does. The exam does not ask you to run a join; it asks you to cost one. This is the #1 topic in DATA3404: you memorise five cost formulas — tuple/block nested-loop, index nested-loop, sort-merge, and grace hash — learn which join condition each one needs, and then read off the cheapest plan in seconds. The chapter teaches each algorithm the same way: mechanism → formula → a worked number on real page counts → when it wins. The exam gives you the page counts b_R, b_S, the row counts |R|, |S| and a buffer of M pages, and asks you to compute each I/O cost (output not counted), state which condition each join needs, and pick the cheapest — always naming which relation you made the outer one.

In this chapter

What this chapter covers

014.1 Why joins are expensive — the I/O cost model (output not counted)
02Tuple nested-loop join and why it wastes the buffer
03Block nested-loop join: cost = b_R + ⌈b_R/(M−2)⌉ × b_S
04Index nested-loop join: b_R + |R| × (index lookup cost)
05Sort-merge join: b_R + b_S (+ the cost of sorting each input)
06Grace hash join: 3 × (b_R + b_S), and which condition each join needs

Worked example · free

Worked example: block nested-loop join I/O cost

Q [5 marks]. Join Student (b_R = 100 pages, the outer relation) with Enrollment (b_S = 400 pages) using a block nested-loop join and a buffer of M = 10 pages. Compute the total I/O in page transfers.

+1Cost formula: BNLJ cost = b_R + ⌈b_R / (M − 2)⌉ × b_S. One frame reads the inner relation, one is the output buffer, so each outer block is M − 2 pages.
+1Outer-block size: M − 2 = 10 − 2 = 8 pages.
+1Number of outer blocks: ⌈b_R / (M−2)⌉ = ⌈100 / 8⌉ = 13.
+1Plug in: cost = 100 + 13 × 400 = 100 + 5,200.
+1Answer: 5,300 page I/Os — Student is read once; Enrollment is re-scanned once per outer block (13 times).

Cost = b_R + ⌈b_R/(M−2)⌉ × b_S = 100 + ⌈100/8⌉ × 400 = 100 + 13 × 400 = 5,300 page transfers.

Sia tip — Make the smaller relation the outer one to minimise re-scans of the inner, and always state your choice — it changes the block count and the answer. The output is never counted in DATA3404's cost model.

Glossary

Key terms

Block nested-loop join (BNLJ): A nested-loop join that fills the buffer with a block of the outer relation (M−2 pages) and scans the inner relation once per block. Cost = b_R + ⌈b_R/(M−2)⌉ × b_S; it works for any join condition.
Index nested-loop join (INLJ): A nested-loop join that, instead of scanning the inner relation, probes an index on the inner's join attribute for each outer tuple. Cost = b_R + |R| × (index lookup cost); it needs an index on the inner relation and is cheap when that index is selective.
Sort-merge join (SMJ): A join that sorts both relations on the join attribute then merges them in one pass. Cost = b_R + b_S plus the cost of sorting each input; it works only for equi/natural joins and leaves the result sorted, which can help later operators.
Grace hash join: A join that partitions both relations into buckets with the same hash function, then joins matching buckets in memory. Cost = 3 × (b_R + b_S) — read, write-partitions, read-back — for equi/natural joins when no useful index or sort order exists.
I/O cost model (joins): DATA3404 costs a join in page transfers only: pages read from and written to disk. CPU work, the difference between sequential and random I/O, and the cost of writing the join output are all ignored — so never add output pages to a join cost.

FAQ

Join Algorithms FAQ

Which join algorithm is cheapest, and how do I decide?

There is no single winner — you compute each cost from the given page counts, row counts and buffer size and compare. As a rule of thumb: with a selective index on the inner relation, index nested-loop wins; for a large equi-join with enough memory, grace hash or sort-merge (each roughly 3(b_R+b_S) or b_R+b_S+sorts) beats block nested-loop; and sort-merge is attractive when an input is already sorted or the output needs to be. Always plug in the numbers rather than guessing.

Why is the output not counted in the join I/O cost?

Because DATA3404's cost model counts only the pages read and written to execute the join, and the cost of writing the final result is the same whichever algorithm you choose — so it cancels out when you compare algorithms and is conventionally dropped. Adding output pages to a join cost is a common mistake that loses marks.

Which joins work for which conditions?

Nested-loop joins (tuple and block) work for any join condition, including inequalities and theta-joins. Sort-merge and hash joins work only for equi-joins and natural joins, because both rely on equal keys lining up (by sorting or by hashing). Index nested-loop needs an index on the inner relation's join attribute. State the condition each join needs — it is an explicit exam ask.

Why does the choice of outer relation matter in a nested-loop join?

Because the inner relation is re-scanned once per block of the outer, the number of re-scans is ⌈b_outer/(M−2)⌉. Making the smaller relation the outer one minimises that block count and therefore the total cost. The two choices can give noticeably different numbers, so always say which relation you made the outer one and, ideally, pick the cheaper assignment.

Study strategy

Exam move

This is the #1 exam topic, so over-prepare it. Memorise the five cost formulas as a single table — tuple-NLJ, BNLJ b_R+⌈b_R/(M−2)⌉b_S, INLJ b_R+|R|×c, SMJ b_R+b_S(+sorts), grace hash 3(b_R+b_S) — together with the join condition each one requires. Then drill the standard exam item: given b_R, b_S, |R|, |S| and M, compute every cost and pick the cheapest, explicitly stating which relation is outer. Watch the two classic traps: never count the output, and use M−2 (not M) for the outer block. Show the formula and every page count — method marks are real even when the final number slips.

Keep going — explore the course

A+Everything unlocked

Unlocks this Bible + all 8 of your University of Sydney subjects - and 1,000+ Bibles across every Australian university.

Sia - your DATA3404 tutor, unlimited, worked the way the exam marks it

The full 4-page Bible + practice bank with worked solutions

Chrome extension - sync your LMS so Sia knows your deadlines

Bilingual EN / Chinese on every Bible and every Sia answer

$25/ month

30-day money-back · cancel in one tap · how it works