University of Sydney · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

DATA3404 · Scalable Data Management

- one subject, every graph, every model, every mark
50% final exam · hurdle14 Chapters8-page Bible
Our own words - no uploaded lecturer files
Built to mirror S1 2026 · updated this semester
Chapter 6 of 7 · DATA3404

Parallel Databases

When the data no longer fits — or no longer finishes — on one machine, you scale out: spread the data across many nodes and run the query on all of them at once. The payoff is parallelism; the price is coordination and communication cost. The whole chapter is one trade-off: scaling out buys throughput but forces you to choose how data is partitioned and how a join is executed across the network — and to cost those choices. You learn partitioning schemes (round-robin, range, hash) and what each can and cannot prune; speed-up vs scale-up and what limits the ideal line; the distributed join strategies (local-reference, broadcast, distributed-shuffle, fragment-and-replicate) and how to cost them; and the foundations of the big-data stack — HDFS block size and 3× replication, the NameNode/DataNode split, and the MapReduce map → shuffle/sort → reduce dataflow.

In this chapter

What this chapter covers

  • 016.1 Why scale out: scale-up (vertical) vs scale-out (horizontal)
  • 02Partitioning: round-robin, range, hash — and what each can prune
  • 03Speed-up vs scale-up — the ideal line and what limits it
  • 04Distributed joins: local-reference, broadcast, distributed-shuffle, fragment-and-replicate
  • 05HDFS: block size, 3× replication, NameNode vs DataNode
  • 06MapReduce: map → shuffle/sort → reduce, and how a join maps onto it
Worked example · free

Worked example: choosing a partitioning scheme

Q [4 marks]. A 1 TB Orders relation is spread across a 10-node cluster. The dominant queries are (i) equality lookups on customer_id and (ii) joins of Orders to Customers on customer_id. (a) Which partitioning scheme should you choose, and why? (b) What can it prune that round-robin cannot? (c) What is the cost if you pick round-robin instead?
  • +1(a) Choose hash partitioning on customer_id: rows with the same customer_id land on the same node, which matches both the equality lookups and the join key.
  • +1(b) What it prunes: an equality query on customer_id goes to the single node that holds that hash bucket — the other 9 nodes are skipped entirely. Round-robin scatters a customer's rows across all nodes, so it can prune nothing.
  • +1Co-located join: because Customers can be hash-partitioned on the same key, matching Orders and Customers rows sit on the same node — the join runs locally with no network shuffle.
  • +1(c) Round-robin cost: every equality lookup must touch all 10 nodes (no pruning), and the join needs a full network shuffle to bring matching rows together — far more communication cost than the hash-partitioned plan.
Hash-partition on customer_id: equality lookups hit one node (the rest are pruned) and the join is co-located with no shuffle. Round-robin prunes nothing and forces a full shuffle for the join, so it is much more expensive for this workload.
Sia tip — Match the partitioning key to the workload's dominant predicate and join key. Hash prunes equality and enables co-located joins; range prunes range queries; round-robin balances load but prunes nothing. State explicitly what each scheme can and cannot skip.
Glossary

Key terms

Scale-out (horizontal)
Adding more commodity nodes to a cluster rather than buying one bigger machine (scale-up). It gives low initial cost and incremental growth but demands software that handles scalability, consistency and availability across nodes — the foundation of big-data systems.
Hash partitioning
Distributing a relation across nodes by a hash of a partitioning attribute, so equal keys land on the same node. It prunes equality queries to one node and enables co-located (shuffle-free) joins on that key, but cannot prune range queries.
Speed-up
Running a fixed problem on more nodes to finish faster — ideally an N-times speed-up on N nodes. Communication and skew bend the curve below the ideal line; contrast with scale-up, which grows the problem and the cluster together to hold response time constant.
Distributed (shuffle) join
A join across nodes that re-partitions (shuffles) both relations on the join key over the network so matching rows meet on the same node. Its cost is dominated by the network transfer; a broadcast join instead sends the small relation to every node, avoiding shuffling the large one.
HDFS
The Hadoop Distributed File System: large files are split into big blocks (e.g. 128 MB), each replicated 3× across DataNodes for fault tolerance, with a single NameNode holding the metadata (the block-to-node map). It favours large sequential reads over random access.
FAQ

Parallel Databases FAQ

What is the difference between scaling up and scaling out?

Scaling up (vertical) means buying a single bigger machine — simple, but expensive and capped by what one box can be. Scaling out (horizontal) means adding many commodity nodes — cheap to start and able to grow incrementally, but it requires software that manages scalability, consistency and availability across the cluster. Big-data systems are built on scale-out.

How do I choose a partitioning scheme, and what does each prune?

Match the scheme to the workload. Hash partitioning sends equal keys to the same node, so it prunes equality queries to one node and enables co-located joins on that key, but cannot help range queries. Range partitioning prunes range queries (only the nodes covering the range are touched). Round-robin spreads rows evenly for balanced load but prunes nothing, so every query must touch every node.

When should I use a broadcast join versus a distributed-shuffle join?

Broadcast the small relation when one side is small enough to fit in memory on every node: each node joins its local partition of the large relation against the full small one, with no shuffle of the large relation. Use a distributed-shuffle join when both relations are large — re-partition (shuffle) both on the join key over the network so matching rows meet. Cost both and compare the network transfer.

What do the NameNode and DataNodes do in HDFS?

The NameNode holds the metadata — the directory tree and the map from each file's blocks to the DataNodes storing their replicas — but no file data. The DataNodes store the actual blocks (each replicated 3× by default) and serve reads and writes. A client asks the NameNode where a block's replicas are, then reads directly from a DataNode; the single NameNode is the metadata bottleneck and single point of failure.

Study strategy

Exam move

Frame the whole chapter as one trade-off: scale-out buys throughput at the price of communication cost. For partitioning, build a small table of round-robin / range / hash against what each can prune and which joins each enables, and practise picking one for a stated workload. For the distributed-join drill, be ready to cost the broadcast and distributed-shuffle options on a small cluster (the headline 9-node, per-node I/O example) and pick the cheaper. Learn speed-up vs scale-up as the two scaling questions — fixed problem faster vs growing problem at constant time — and what bends each below ideal. Finally, lock in the HDFS facts (block size, 3× replication, NameNode vs DataNode) and the MapReduce map → shuffle/sort → reduce flow, including how a join maps onto it — this sets up the Spark chapter.

A+Everything unlocked
Unlocks this Bible + all 8 of your University of Sydney subjects - and 1,000+ Bibles across every Australian university.
Sia - your DATA3404 tutor, unlimited, worked the way the exam marks it
The full 8-page Bible + practice bank with worked solutions
Chrome extension - sync your LMS so Sia knows your deadlines
Bilingual EN / Chinese on every Bible and every Sia answer
$25/ month
30-day money-back · cancel in one tap · how it works
Unlock the full DATA3404 Bible + 8 University of Sydney subjects解锁完整 DATA3404 Bible + University of Sydney 8 门科目
$25/mo