04 November, 2009

Minimalistic MapReduce in .NET 4.0 with the new Task Parallel Library(TPL)

Among the news in .NET 4.0 are several additions by the [Parallel Computing Platform Team](http://blogs.msdn.com/pfxteam/). As I wandered through the documentation of the Task library with cloud computing and parallelism buzz in the back of my head, I got the idea of using tasks to create a minimalistic MapReduce. Here's the result, a rather crude and simple, but efficient MapReduce for you to play with and utilize!

What is MapReduce? 

For those of you who don't know what MapReduce is: MapReduce is a simplified interface for parallel data processing. MapReduce was initially described by the Google engineers Jeffrey Dean and Sanjay Ghemawat in the 2004 paper titled [MapReduce: Simplified data processing on large clusters](http://labs.google.com/papers/mapreduce.html).

MapReduce processes data by splitting the processing in to a set of transformations (in functional programming, this is called the "map" function (it maps or transforms an input to an output)). The results of the transformations are then combined into a single result (in functional programming, this is called the "reduce" function (it reduces a set of values to a single value)). On a sidenote, Linq has equivalent functions, but the names are different, presumably to make them more familiar to people with SQL knowledge. In Linq, map is called `Select`, and reduce is called `Aggregate`.

Shortly put, to process a huge set of data, you split the data into chunks and process each chunk in parallel. This eventually creates a new set of intermediary results, which is reduced to a single result.

Implementing a minimalistic MapReduce in .NET 4.0

The signature of my MapReduce function is static Task Start(Func map, Func reduce, params TInput[] inputs);

In other words, to start a MapReduce run, you supply a map function, a reduce function, and a set of inputs. Each input will be turned into an intermediate result (of type TPartial). Inputs are transformed concurrently. When all inputs are transformed, the reduce function is called to transform the partial results into a final result (of type TResult). Cool!

The map part is implemented by starting a task for each supplied input using Task.Factory.StartNew(() => map(input)).

The reduce part is implemented as a continuation of all the map tasks, meaning that the reduce task waits for all the map tasks to complete, and then executes. This is achieved using Task.Factory.ContinueWhenAll(mapTasks, tasks => PerformReduce(reduce, tasks)).
As you can see, the implementation is minimalistic and simple, and usage is likewise.

Here's a simple example using MapReduce to calculate the root mean square (MSE) of a set of values:

Actual applications of MapReduce are of course far more interesting than this simple example.

Applications of MapReduce

MapReduce can essentially be applied to any problem where you need a number of things to be done in parallel. It can even be applied in cases where you don't need a final result. Just return an arbitrary value as the result (or even better, implement a variant of my MapReduce which uses Action).

A few obvious use cases:
  • Distributed search
  • Distributed sort
  • Tokenization
  • Indexing
  • Log processing
  • Machine learning
  • General artificial intelligence
  • General data mining
  • Large scale image processing
  • ...
The list goes on and on, these are just a few things off the top of my head.

You can grab the source code for MapReduce here (requires .NET 4.0 or later):


As usual, play around with it, have fun, and let me know if you find it useful!