University of Sydney · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

DATA3404 · Scalable Data Management

- one subject, every graph, every model, every mark
50% final exam · hurdle14 Chapters7-page Bible
Our own words - no uploaded lecturer files
Built to mirror S1 2026 · updated this semester
Chapter 7 of 7 · DATA3404

Big Data with Spark

Spark is the modern engine of the big-data stack. It keeps everything the earlier chapters taught — partitioning, cost-based optimisation, the join algorithms — but wraps them in a fault-tolerant, in-memory, lazily-optimised dataflow. The mental model the exam rewards is small: a job is a DAG of transformations on immutable RDDs; nothing runs until an action fires; wide (shuffle) dependencies cut the DAG into stages; and a lost partition is rebuilt from its lineage, not from a replica. The chapter explains why Spark beats MapReduce on iterative and interactive work (no disk round-trip per step), narrow vs wide dependencies and stage boundaries, Spark's four join strategies and what Catalyst/AQE do, the NoSQL families and the CAP / BASE-vs-ACID trade-off, and streaming windows. This area is lighter and more conceptual than the cost-formula chapters.

In this chapter

What this chapter covers

  • 017.1 Why Spark: the limits of MapReduce's disk round-trips
  • 02RDDs, lazy evaluation, transformations vs actions
  • 03Narrow vs wide (shuffle) dependencies and stage boundaries
  • 04Lineage-based fault tolerance — recompute, don't replicate
  • 05Spark's four join strategies; Catalyst & adaptive query execution
  • 06NoSQL families, the CAP theorem & ACID-vs-BASE, and streaming windows
Worked example · free

Worked example: narrow vs wide dependencies and stage count

Q [4 marks]. A Spark job runs: rdd.map(f).filter(g).reduceByKey(h).map(k), then collect(). (a) Which transformations are narrow and which are wide? (b) How many stages does the DAG have, and where is the boundary? (c) When does any of this actually execute?
  • +1(a) Narrow vs wide: map and filter are narrow (each output partition depends on one input partition). reduceByKey is wide — it groups by key across partitions, so it shuffles. The final map is narrow again.
  • +1(b) Stage boundary: a wide dependency cuts the DAG into stages, so the single shuffle at reduceByKey creates 2 stages — map+filter before the shuffle, the post-shuffle map after it.
  • +1Count the shuffles: exactly one wide dependency → one shuffle → one stage boundary.
  • +1(c) When it runs: all the transformations are lazy — they only build the DAG. Nothing executes until the action collect() fires, which triggers the two stages.
map and filter are narrow, reduceByKey is wide (a shuffle), the final map is narrow; the one wide dependency gives 2 stages with the boundary at reduceByKey; and nothing runs until the action collect() triggers the DAG.
Sia tip — Count shuffles to count stage boundaries: stages = wide dependencies + 1. Remember transformations are lazy and only actions (collect, count, save) trigger execution — a job with no action does nothing.
Glossary

Key terms

RDD
Resilient Distributed Dataset — Spark's core abstraction: an immutable, partitioned collection computed lazily. Transformations build a lineage DAG without running anything; the data is only materialised when an action fires.
Lazy evaluation
Spark records transformations (map, filter, join) as a DAG but does not execute them until an action (collect, count, save) is called. This lets Catalyst optimise the whole DAG before any work runs, and lets Spark pipeline narrow transformations together.
Narrow vs wide dependency
A narrow dependency means each output partition depends on one input partition (map, filter) — no shuffle, pipelined within a stage. A wide dependency (groupByKey, reduceByKey, join) needs data from many partitions, forcing a network shuffle that cuts the DAG into a new stage.
Lineage
The recorded chain of transformations that produced an RDD. If a partition is lost, Spark recomputes just that partition from its lineage rather than restoring a replica — fault tolerance without the 3× storage cost of HDFS replication.
CAP theorem
A distributed store can guarantee at most two of Consistency, Availability and Partition-tolerance at once. Since network partitions are unavoidable at scale, real systems trade consistency for availability (BASE, eventual consistency) or vice versa — the lens for choosing a NoSQL family.
FAQ

Big Data with Spark FAQ

Why is Spark faster than MapReduce?

MapReduce materialises every intermediate result back to HDFS (3× replicated) between steps, so an iterative or interactive job pays a full disk round-trip per iteration. Spark keeps intermediate RDDs in memory across steps and only writes to disk when it must, so iterative machine-learning and graph jobs and interactive queries run far faster. For a single batch pass the gap is smaller; the win is in repeated passes over the same data.

What is the difference between a transformation and an action in Spark?

Transformations (map, filter, join, reduceByKey) are lazy — they only add to the lineage DAG and return a new RDD without running anything. Actions (collect, count, save, take) force the DAG to execute and return a value to the driver or write output. A Spark program with no action does no work, which is a common source of confusion when nothing seems to happen.

How does Spark recover from a lost partition without replication?

Through lineage. Each RDD records the exact sequence of transformations that built it, so when a node fails and a partition is lost, Spark recomputes only that partition by replaying its lineage from the last available data. This avoids the storage cost of HDFS-style 3× replication, at the price of recomputation time when a failure happens.

What does the CAP theorem mean for choosing a NoSQL store?

CAP says a distributed store can guarantee at most two of Consistency, Availability and Partition-tolerance simultaneously. Because network partitions are unavoidable at scale, you effectively choose between consistency and availability: a CP store (e.g. for correctness-critical data) may reject reads during a partition, while an AP store (BASE / eventual consistency) stays available but may return slightly stale data. Match the choice to whether your use-case tolerates stale reads.

Study strategy

Exam move

This chapter is conceptual, not cost-formula heavy, so optimise for crisp explanations rather than arithmetic. Lock in the core mental model: a job is a DAG of transformations on immutable RDDs; transformations are lazy and only actions trigger execution; wide (shuffle) dependencies cut the DAG into stages (stages = shuffles + 1); and a lost partition is recomputed from lineage, not a replica. Be able to take a small Spark snippet, label narrow vs wide dependencies, draw the stage boundaries and count the shuffles. Know why Spark beats MapReduce on iterative/interactive work, name Spark's join strategies and what Catalyst/AQE do, and be ready to pick a NoSQL family and state the CAP / ACID-vs-BASE trade-off. Because this area is medium-to-lower weight, secure the cost-formula marks from the earlier chapters first, then bank these conceptual marks.

A+Everything unlocked
Unlocks this Bible + all 8 of your University of Sydney subjects - and 1,000+ Bibles across every Australian university.
Sia - your DATA3404 tutor, unlimited, worked the way the exam marks it
The full 7-page Bible + practice bank with worked solutions
Chrome extension - sync your LMS so Sia knows your deadlines
Bilingual EN / Chinese on every Bible and every Sia answer
$25/ month
30-day money-back · cancel in one tap · how it works
Unlock the full DATA3404 Bible + 8 University of Sydney subjects解锁完整 DATA3404 Bible + University of Sydney 8 门科目
$25/mo