What is MapReduce?
For those of you who don't know what MapReduce is: MapReduce is a simplified interface for parallel data processing. MapReduce was initially described by the Google engineers Jeffrey Dean and Sanjay Ghemawat in the 2004 paper titled [MapReduce: Simplified data processing on large clusters](http://labs.google.com/papers/mapreduce.html).MapReduce processes data by splitting the processing in to a set of transformations (in functional programming, this is called the "map" function (it maps or transforms an input to an output)). The results of the transformations are then combined into a single result (in functional programming, this is called the "reduce" function (it reduces a set of values to a single value)). On a sidenote, Linq has equivalent functions, but the names are different, presumably to make them more familiar to people with SQL knowledge. In Linq, map is called `Select`, and reduce is called `Aggregate`.
Shortly put, to process a huge set of data, you split the data into chunks and process each chunk in parallel. This eventually creates a new set of intermediary results, which is reduced to a single result.
Implementing a minimalistic MapReduce in .NET 4.0
The signature of my MapReduce function is static TaskIn other words, to start a MapReduce run, you supply a map function, a reduce function, and a set of inputs. Each input will be turned into an intermediate result (of type TPartial). Inputs are transformed concurrently. When all inputs are transformed, the reduce function is called to transform the partial results into a final result (of type TResult). Cool!
The map part is implemented by starting a task for each supplied input using Task.Factory.StartNew(() => map(input)).
The reduce part is implemented as a continuation of all the map tasks, meaning that the reduce task waits for all the map tasks to complete, and then executes. This is achieved using Task.Factory.ContinueWhenAll(mapTasks, tasks => PerformReduce(reduce, tasks)).
As you can see, the implementation is minimalistic and simple, and usage is likewise.
Here's a simple example using MapReduce to calculate the root mean square (MSE) of a set of values:
Actual applications of MapReduce are of course far more interesting than this simple example.
Applications of MapReduce
MapReduce can essentially be applied to any problem where you need a number of things to be done in parallel. It can even be applied in cases where you don't need a final result. Just return an arbitrary value as the result (or even better, implement a variant of my MapReduce which uses ActionA few obvious use cases:
- Distributed search
- Distributed sort
- Tokenization
- Indexing
- Log processing
- Machine learning
- General artificial intelligence
- General data mining
- Large scale image processing
- ...
You can grab the source code for MapReduce here (requires .NET 4.0 or later):
As usual, play around with it, have fun, and let me know if you find it useful!