MapReduce

A framework for large-scale parallel processing.

Goal: Create a distributed computing framework to process data on a massive scale.

MapReduce is a software framework for processing large data sets in a distributed fashion across several machines.

MapReduce = high-level programming model and implementation for large-scale parallel data processing

The MapReduce framework/library allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

image

Motivation

Context: multi-hour computations on multi-terabyte data sets, e.g. building a search index, sorting, or analyzing the structure of the web. Such jobs are only practical with thousands of computers.

image

image

MapReduce Etymology

image

Programming Model

image

map      (k1,v1)       → list(k2,v2)
reduce   (k2,list(v2)) → list(v2)

Programmer specifies two primary methods: Map and Reduce.

All intermediate values v2 with the same key k2 are reduced together. (Remember the invisible “Shuffle and Sort” step.)
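
These signatures can be written down concretely, e.g. in Python (a hypothetical typing sketch; the original Google implementation is C++):

```python
from typing import Callable, List, Tuple

# Type aliases mirroring the signatures above; concrete key/value
# types vary per job, strings are used here for illustration.
K1 = V1 = K2 = V2 = str

MapFn = Callable[[K1, V1], List[Tuple[K2, V2]]]    # map(k1, v1) -> list((k2, v2))
ReduceFn = Callable[[K2, List[V2]], List[V2]]      # reduce(k2, list(v2)) -> list(v2)

# An example pair satisfying these signatures (word count):
wc_map: MapFn = lambda k, v: [(w, "1") for w in v.split()]
wc_reduce: ReduceFn = lambda k, vals: [str(len(vals))]
```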

image

Word-count example

Counting the number of occurrences of each word in a large collection of documents.

  Input1 -> Map -> a,1 b,1
  Input2 -> Map ->     b,1
  Input3 -> Map -> a,1     c,1
                    |   |   |
                    |   |   -> Reduce -> c,1
                    |   -----> Reduce -> b,2
                    ---------> Reduce -> a,2

Abstract view of a MapReduce job – word count
1) Input is (already) split into M pieces.
2) MR calls Map() for each input split, producing a list of (k, v) pairs (“intermediate” data); each Map() call is a “task”.
3) When the Maps are done, MR gathers all intermediate values for each key and passes each key + its values to a Reduce call.
4) Final output is the set of <k, v> pairs from the Reduce()s.

Word-count code

  Map(d)
    chop d into words
    for each word w
      emit(w, "1")

  Reduce(k, v[])
    emit(len(v[]))
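
The pseudocode above can be exercised end-to-end with a small single-process sketch (an illustration only, assuming whitespace tokenization; real MapReduce runs these phases on many machines):

```python
from collections import defaultdict

def map_fn(doc: str):
    # Map(d): chop d into words and emit (w, "1") for each word.
    for w in doc.split():
        yield (w, "1")

def reduce_fn(key: str, values: list):
    # Reduce(k, v[]): the count is just the number of "1" values.
    yield str(len(values))

def mapreduce(inputs):
    # Map phase: run map_fn over every input split.
    intermediate = defaultdict(list)
    for doc in inputs:
        for k, v in map_fn(doc):
            # The invisible "Shuffle and Sort": group values by key.
            intermediate[k].append(v)
    # Reduce phase: one Reduce call per distinct key.
    return {k: next(reduce_fn(k, vs)) for k, vs in sorted(intermediate.items())}

counts = mapreduce(["a b", "b", "a c"])
# -> {'a': '2', 'b': '2', 'c': '1'}, matching the dataflow diagram above
```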

image

More details

image

image

image

image

image

Ref: https://web.stanford.edu/class/archive/cs/cs110/cs110.1204/static/lectures/cs110-lecture-17-mapreduce.pdf

Other examples

image

Map Reduce Notes

image

Data Storage

image

Data Model

image

Map Phase

image

Reduce Phase

image

image

image

Execution Overview

image

image

Ref: http://www.cohenwang.com/edith/bigdataclass2013/lectures/MapReduce_Kryzanowski.pdf

One master, many workers

Master assigns each map task to a free worker

Master assigns each reduce task to a free worker

Input and output are stored on the GFS cluster file system
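
A toy sketch of the master's assignment loop (hypothetical names, single process; a real master tracks worker state over RPC and waits for completion reports):

```python
from collections import deque

def schedule(tasks, workers):
    """Assign each task to a free worker; in this toy version a
    worker is free again as soon as its task 'completes'."""
    free = deque(workers)
    assignment = {}
    for task in tasks:
        worker = free.popleft()      # master picks any free worker
        assignment[task] = worker
        # A real master waits for the worker's completion RPC before
        # considering it free; here completion is immediate.
        free.append(worker)
    return assignment

# Map tasks are assigned first; reduce tasks go through the same loop.
map_assignment = schedule([f"map-{i}" for i in range(4)], ["w1", "w2"])
```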

Scalability

MapReduce scales well:

MapReduce hides much complexity:

To get these benefits, MapReduce restricts applications:

image

image

image

image

Ref: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/32721.pdf

MR writes Map() output to local disk

The shuffle

The “Coordinator” manages all the steps in a job.
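
During the shuffle, each intermediate key must be routed to exactly one of the R reduce tasks; the paper's default partitioning function is hash(key) mod R. A deterministic Python sketch (using md5 instead of Python's per-process randomized built-in hash()):

```python
import hashlib

def partition(key: str, R: int) -> int:
    """Map an intermediate key to one of R reduce tasks.
    Deterministic, so every map task routes a given key to the
    same reducer."""
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % R

# All occurrences of a key land in the same partition:
assert partition("apple", 4) == partition("apple", 4)
```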

MapReduce Granularity

Fine granularity tasks: many more map tasks than machines
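
For scale, the original paper reports typical figures of M = 200,000 map tasks and R = 5,000 reduce tasks on 2,000 worker machines, i.e. on the order of 100 map tasks per machine, which lets the master balance load and work around slow workers:

```python
M, R, machines = 200_000, 5_000, 2_000  # figures from the MapReduce paper
map_tasks_per_machine = M / machines    # many more tasks than machines
```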

image

Skew

image

image

image

image

Straggler

image

MapReduce: Fault Tolerance via Re-Execution

Worker failure:

Master failure:

Very Robust: lost 1600 of 1800 machines once, but finished fine
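
Re-execution is safe because Map() and Reduce() are deterministic functions of their inputs with no side effects, so running a task twice yields the same output. A minimal retry sketch (hypothetical helper, assuming a failure surfaces as an exception):

```python
def run_with_reexecution(task_fn, task_input, max_attempts=3):
    """Re-execute a task until it succeeds. Safe only because tasks
    are deterministic functions of their input (no side effects)."""
    for attempt in range(max_attempts):
        try:
            return task_fn(task_input)
        except Exception:
            # Master notices the worker failed and reschedules the
            # task, possibly on a different machine.
            continue
    raise RuntimeError("task failed after re-execution attempts")
```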

Typical problem solved by MapReduce

Outline stays the same, Map and Reduce functions change to fit the problem

Not used much currently

image

Every modern distributed system mentioned (Spark, Snowflake, and BigQuery) is essentially a more refined, faster evolution of the core ideas MapReduce pioneered:

Reference: