MapReduce

A framework for large-scale parallel processing.

Goal: Create a distributed computing framework to process data on a massive scale.

MapReduce is a software framework for processing (large) data sets in a distributed fashion over a several machines.

MapReduce framework/library allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Motivation

MapReduce Etymology

Programming Model

image

map      (k1,v1)       → list(k2,v2)
reduce   (k2,list(v2)) → list(v2)

Programmer specifies two primary methods:

All v' with same k' are reduced together. (Remember the invisible “Shuffle and Sort” step).

Word-count example

Counting the number of occurrences of each word in a large collection of documents.

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

image

Execution Overview

image

One master, many workers

Master assigns each map task to a free worker

Master assigns each reduce task to a free worker

Mapreduce Granularity

Fine granularity tasks: many more map tasks than machines

image

MapReduce: Fault Tolerance via Re-Execution

Worker failure:

Master failure:

Very Robust: lost 1600 of 1800 machines once, but finished fine

Typical problem solved by MapReduce

Outline stays the same, Map and Reduce functions change to fit the problem

Reference: