Skip to content

Reading notes: MapReduce (Dean and Ghemawat, 2004)

I still assign this to undergraduates and it still divides opinion among my colleagues. The argument against: MapReduce is dead, Spark won, why are we teaching obsolete frameworks?

The argument for, which I find compelling: the paper is not really about MapReduce the system. It is about the model of expressing data-parallel computation, the fault tolerance design, and the conceptual move of bringing computation to data instead of the other way round. Those ideas are not obsolete.

What it teaches well

  • The map/reduce primitives as a functional decomposition.
  • Re-execution as a fault tolerance strategy (vs checkpointing).
  • The straggler problem and backup tasks.
  • Locality optimisation — reading from local disk is orders of magnitude faster.

What is no longer accurate

  • Three replicas of every chunk in GFS — modern storage uses erasure coding.
  • Disk-based shuffle is a serious latency cost. In-memory shuffle is the modern default.
  • The combiner optimisation is rarely needed in practice when working sets fit in memory.

Connections to other things

The functional programming lineage is explicit but worth dwelling on. map and reduce come from Lisp. The paper makes this connection in section 2. Students who have written Haskell or used Array.map in JavaScript get it immediately.

Spark replaced MapReduce's execution model with RDDs and lazy evaluation. Dataflow systems (Beam, Flink) generalised further. But all of them inherit the basic decomposition.