MapReduce is the architecture that Google uses to do things like index the web and calculate PageRank. It’s a somewhat popular topic for developers and bloggers, for two reasons. First, because Google uses it to such dramatic effect, and it’s easy to think that it must be the greatest and most powerful way to handle distributed processing. Second, it is a little hard to understand at first, which means that there is always a market for “intro to MapReduce” blog posts.
The thing is, there is nothing magical about MapReduce. It is fairly simple on the surface, once you understand a few basic concepts (like map and reduce, though MapReduce != map + reduce, as we’ll see soon). It also isn’t the “best” approach to distributed processing, because there are so many types of problems that need distributed processing, and MapReduce is only appropriate for a small subset.
So this post isn’t about DIY MapReduce. Instead, it’s about understanding MapReduce. Specifically, understanding it from a languages standpoint, with reference to map and reduce, rather than understanding it from a systems standpoint.
Iterators and MapReduce
If you understand map and reduce, you’re about a third of the way to understanding MapReduce. But it is also important to note that MapReduce doesn’t even strictly need map or reduce functions, and is implemented at Google in C++ (not exactly a functional language). So map and reduce are more the conceptual foundation of MapReduce, rather than the underlying code.
But still, the MapReduce framework gets its name from these higher-order functions, and the basic pattern is simple:
- Split a large problem into smaller chunks, and perform a function on each chunk (
map) - Aggregate the output (
reduce)
Because map just applies a function to an element in an array, with no side effects, order doesn’t matter. You could run map backwards or forwards, and the result would be the same. Therefore, map operations can easily be parallelized over different CPU cores, or across multiple machines. (This is one of the advantages we get from greater abstraction. We could replace map with an each iterator, reduce, or even a for or while loop, but by using map, we know immediately that parallelization is possible.)
MapReduce also parallelizes its reduce stage, even though reduce is not inherently parallelizable. It does this not by parallelizing a single reduce, but by distributing each reduction to a different machine. So while the map stage is a single map, distributed over several computers, the reduce stage is multiple reduces, each operating on a single machine, and each independent of the reduction happening on the next machine.
An example: from local reduce to MapReduce
What kind of problem can be solved with MapReduce? Counting words is a basic example. Let’s say you want to count the number of occurrences of each word in War and Peace. You could do this simply in terms of inject, returning a hash of key-value pairs that list a word and its number of instances, like {"war" => 225, "peace" => 341}.
1 2 3 4 5 |
words = File.open("/path/to/war_and_peace.txt", "r").to_a.join(" ").split(" ") word_counts = words.reduce(Hash.new(0)) do |results, word| results[word.to_sym] += 1 results end |
This code is not parallelizable. Fortunately, it only takes Ruby about a second to count the instances of each word in War and Peace, which means that distributed MapReduce is only needed for much larger problems. But what if we wanted word counts of every book in Project Gutenberg? Or of every page on the entire internet? Or what if our calculation function took longer? There are 574,780 total words in the English translation of War and Peace that I’m using; if each word took a second to process, due to a network call or a complex calculation, it would take 6.5 days to process the book. Proust would take three weeks. Yikes!
That’s where MapReduce comes in. Instead of processing the entire list with a single reduce, imagine splitting the text of War and Peace into 200 even chunks. These chunks would then be mapped to 200 different servers, with each server doing its own parallel word counting, like this:
1 2 3 |
word_chunks.map do |chunk| assign_to_server(count_words(chunk)) #[{"the" => 1}, {"cat" => 1}, {"the" => 1], {"dog" => 1}] end |
In reality, MapReduce works slightly differently; each chunk is represented as the value in a key-value pair, with the key being the identifier for that chunk (like 1..200 when using 200 even chunks, or the ID of a Google FileSystem cluster, or a filename when each server gets a different file). This key is useful for managing a MapReduce operation; if server XYZ is goes down, the master program knows that XYZ was handling chunk 19, and we can process chunk 19 again. So what we have is more like this:
1 2 3 |
word_chunks.each do |chunk_key, words| assign_to_server(count_words(chunk_key, words)) end |
You might be asking yourself: if the map phase always creates values of “1” attached to a particular word, so a mapper might end up with [{"the" => 1}, {"the" => 1], {"the" => 1}], why even use a hash? Why not just create a big nested array of words ([["the","the","the"]]), group by word, and count the elements in each array?
Well, first, MapReduce isn’t always used for word counts, so in another use, the values returned by map might be more significant. Second, this lets us introduce a “combiner” stage between map and reduce to optimize the process. This stage creates a local count for a particular server, which reduces bandwidth and makes life easier for the reduce stage. Without this stage, a mapper might return [{"the" => 1}, {"cat" => 1}, {"the" => 1], {"dog" => 1}, {"the" => 1}], leaving it up to which ever reducer handles “the” to sum these numbers (along with other instances of “the”). But the combiner creates local sums, meaning the mapper will actually return [{"the" => 3}, {"cat" => 1}, {"dog" => 1}].
When all the distributed word counts finish, the results are grouped by key. So all the “cat” key/value pairs are grouped into one list, and the “dog” key/value pairs are grouped into another list. These are then reduced for a final word count. Our reduce pseudocode might look something like this.
1 2 3 4 5 |
grouped_results.each do |key, values| #key: "cat" #values: [1,3,12,9,1,2] total[key] = values.sum end |
These reductions can be distributed, so each pass through grouped_results can be handled by a different server, because the summing of instances of “war” is completely independent of the summing of instances of “peace”.
What’s next?
There is more to MapReduce than this; other pieces handle the fault tolerance, the grouping of keys between the map and reduce stages, etc. And the actual distribution of processing introduces a lot more complexity. If you actually want to use MapReduce, take a look at Hadoop, or a few Ruby distributed processing systems inspired by MapReduce (Skynet, Starfish).
Hopefully this post will give you a conceptual understanding of MapReduce; it’s an interesting and powerful architecture. Just remember that it is not the end-all of distributed processing, and just because it’s appropriate for Google doesn’t mean it is appropriate for you. In fact, MapReduce can only handle a certain array of problems; if you want to distribute video transcoding across multiple machines, for example, MapReduce can’t really help you. Keep in mind too that you shouldn’t reinvent the wheel – if Hadoop can help you, it’s already built and built well. But it never hurts to understand something new.
More Resources
Cluster Computing and MapReduce (video series)

If you’re a programmer, you’ve probably worked through one or more books teaching you the syntax of a new language. I’ve had this experience with half a dozen languages, like C, Javascript, and Perl. These books are typically introduce loops midway through the syntax discussion, after datatypes and control flow, but before I/O and advanced features.