DATA3404 · Scalable Data Management
Big Data with Spark
Spark is the modern engine of the big-data stack. It keeps everything the earlier chapters taught — partitioning, cost-based optimisation, the join algorithms — but wraps them in a fault-tolerant, in-memory, lazily-optimised dataflow. The mental model the exam rewards is small: a job is a DAG of transformations on immutable RDDs; nothing runs until an action fires; wide (shuffle) dependencies cut the DAG into stages; and a lost partition is rebuilt from its lineage, not from a replica. The chapter explains why Spark beats MapReduce on iterative and interactive work (no disk round-trip per step), narrow vs wide dependencies and stage boundaries, Spark's four join strategies and what Catalyst/AQE do, the NoSQL families and the CAP / BASE-vs-ACID trade-off, and streaming windows. This area is lighter and more conceptual than the cost-formula chapters.
What this chapter covers
- 017.1 Why Spark: the limits of MapReduce's disk round-trips
- 02RDDs, lazy evaluation, transformations vs actions
- 03Narrow vs wide (shuffle) dependencies and stage boundaries
- 04Lineage-based fault tolerance — recompute, don't replicate
- 05Spark's four join strategies; Catalyst & adaptive query execution
- 06NoSQL families, the CAP theorem & ACID-vs-BASE, and streaming windows
Worked example: narrow vs wide dependencies and stage count
- +1(a) Narrow vs wide: map and filter are narrow (each output partition depends on one input partition). reduceByKey is wide — it groups by key across partitions, so it shuffles. The final map is narrow again.
- +1(b) Stage boundary: a wide dependency cuts the DAG into stages, so the single shuffle at reduceByKey creates 2 stages — map+filter before the shuffle, the post-shuffle map after it.
- +1Count the shuffles: exactly one wide dependency → one shuffle → one stage boundary.
- +1(c) When it runs: all the transformations are lazy — they only build the DAG. Nothing executes until the action collect() fires, which triggers the two stages.
Key terms
- RDD
- Resilient Distributed Dataset — Spark's core abstraction: an immutable, partitioned collection computed lazily. Transformations build a lineage DAG without running anything; the data is only materialised when an action fires.
- Lazy evaluation
- Spark records transformations (map, filter, join) as a DAG but does not execute them until an action (collect, count, save) is called. This lets Catalyst optimise the whole DAG before any work runs, and lets Spark pipeline narrow transformations together.
- Narrow vs wide dependency
- A narrow dependency means each output partition depends on one input partition (map, filter) — no shuffle, pipelined within a stage. A wide dependency (groupByKey, reduceByKey, join) needs data from many partitions, forcing a network shuffle that cuts the DAG into a new stage.
- Lineage
- The recorded chain of transformations that produced an RDD. If a partition is lost, Spark recomputes just that partition from its lineage rather than restoring a replica — fault tolerance without the 3× storage cost of HDFS replication.
- CAP theorem
- A distributed store can guarantee at most two of Consistency, Availability and Partition-tolerance at once. Since network partitions are unavoidable at scale, real systems trade consistency for availability (BASE, eventual consistency) or vice versa — the lens for choosing a NoSQL family.
Big Data with Spark FAQ
Why is Spark faster than MapReduce?
MapReduce materialises every intermediate result back to HDFS (3× replicated) between steps, so an iterative or interactive job pays a full disk round-trip per iteration. Spark keeps intermediate RDDs in memory across steps and only writes to disk when it must, so iterative machine-learning and graph jobs and interactive queries run far faster. For a single batch pass the gap is smaller; the win is in repeated passes over the same data.
What is the difference between a transformation and an action in Spark?
Transformations (map, filter, join, reduceByKey) are lazy — they only add to the lineage DAG and return a new RDD without running anything. Actions (collect, count, save, take) force the DAG to execute and return a value to the driver or write output. A Spark program with no action does no work, which is a common source of confusion when nothing seems to happen.
How does Spark recover from a lost partition without replication?
Through lineage. Each RDD records the exact sequence of transformations that built it, so when a node fails and a partition is lost, Spark recomputes only that partition by replaying its lineage from the last available data. This avoids the storage cost of HDFS-style 3× replication, at the price of recomputation time when a failure happens.
What does the CAP theorem mean for choosing a NoSQL store?
CAP says a distributed store can guarantee at most two of Consistency, Availability and Partition-tolerance simultaneously. Because network partitions are unavoidable at scale, you effectively choose between consistency and availability: a CP store (e.g. for correctness-critical data) may reject reads during a partition, while an AP store (BASE / eventual consistency) stays available but may return slightly stale data. Match the choice to whether your use-case tolerates stale reads.
Exam move
This chapter is conceptual, not cost-formula heavy, so optimise for crisp explanations rather than arithmetic. Lock in the core mental model: a job is a DAG of transformations on immutable RDDs; transformations are lazy and only actions trigger execution; wide (shuffle) dependencies cut the DAG into stages (stages = shuffles + 1); and a lost partition is recomputed from lineage, not a replica. Be able to take a small Spark snippet, label narrow vs wide dependencies, draw the stage boundaries and count the shuffles. Know why Spark beats MapReduce on iterative/interactive work, name Spark's join strategies and what Catalyst/AQE do, and be ready to pick a NoSQL family and state the CAP / ACID-vs-BASE trade-off. Because this area is medium-to-lower weight, secure the cost-formula marks from the earlier chapters first, then bank these conceptual marks.