04 November, 2009

Minimalistic MapReduce in .NET 4.0 with the new Task Parallel Library (TPL)

Among the new features in .NET 4.0 are several additions by the [Parallel Computing Platform Team](http://blogs.msdn.com/pfxteam/). As I wandered through the documentation of the Task library with cloud computing and parallelism buzz in the back of my head, I got the idea of using tasks to create a minimalistic MapReduce. Here's the result: a rather crude and simple, but efficient, MapReduce for you to play with and use!

What is MapReduce? 

For those of you who don't know what MapReduce is: MapReduce is a simplified interface for parallel data processing. MapReduce was initially described by the Google engineers Jeffrey Dean and Sanjay Ghemawat in the 2004 paper titled [MapReduce: Simplified data processing on large clusters](http://labs.google.com/papers/mapreduce.html).

MapReduce processes data by splitting the work into a set of independent transformations. In functional programming, this is called the "map" function: it maps, or transforms, an input to an output. The results of the transformations are then combined into a single result. In functional programming, this is called the "reduce" function: it reduces a set of values to a single value. As a side note, LINQ has equivalent functions under different names, presumably to make them more familiar to people with SQL knowledge. In LINQ, map is called `Select` and reduce is called `Aggregate`.
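To make the LINQ correspondence concrete, here is a small sequential sketch. The class and method names are made up for illustration; only `Select` and `Aggregate` come from LINQ itself.

```csharp
using System;
using System.Linq;

public static class LinqMapReduce
{
    // Hypothetical helper: squares every value (map), then sums the squares (reduce).
    public static int SumOfSquares(int[] values)
    {
        return values
            .Select(v => v * v)                        // "map": one output per input
            .Aggregate(0, (sum, square) => sum + square); // "reduce": fold to one value
    }
}
```

This runs on a single thread; the point is only to show which LINQ operator plays which role.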

In short, to process a huge data set, you split the data into chunks and process each chunk in parallel. This produces a set of intermediate results, which are then reduced to a single result.

Implementing a minimalistic MapReduce in .NET 4.0

The signature of my MapReduce function is `static Task<TResult> Start<TInput, TPartial, TResult>(Func<TInput, TPartial> map, Func<TPartial[], TResult> reduce, params TInput[] inputs);`

In other words, to start a MapReduce run, you supply a map function, a reduce function, and a set of inputs. Each input will be turned into an intermediate result (of type TPartial). Inputs are transformed concurrently. When all inputs are transformed, the reduce function is called to transform the partial results into a final result (of type TResult). Cool!

The map part is implemented by starting a task for each supplied input using `Task.Factory.StartNew(() => map(input))`.

The reduce part is implemented as a continuation of all the map tasks, meaning that the reduce task waits for all the map tasks to complete, and then executes. This is achieved using `Task.Factory.ContinueWhenAll(mapTasks, tasks => PerformReduce(reduce, tasks))`.

As you can see, the implementation is minimalistic and simple, and so is its usage.
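Put together, the two pieces described above could look roughly like the sketch below. Treat it as a reconstruction under assumptions rather than the downloadable source: the `CreateMapTasks` helper and the exact shape of the reduce delegate (`Func<TPartial[], TResult>`) are guesses, while `Start`, `PerformReduce`, `StartNew`, and `ContinueWhenAll` follow the description.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class MapReduce
{
    // Starts one map task per input and returns the task that produces the reduced result.
    // Note: ContinueWhenAll rejects an empty task array, so at least one input is required.
    public static Task<TResult> Start<TInput, TPartial, TResult>(
        Func<TInput, TPartial> map,
        Func<TPartial[], TResult> reduce,
        params TInput[] inputs)
    {
        Task<TPartial>[] mapTasks = CreateMapTasks(map, inputs);
        return Task.Factory.ContinueWhenAll(
            mapTasks, tasks => PerformReduce(reduce, tasks));
    }

    // Hypothetical helper. Select gives each lambda its own 'input' variable,
    // which avoids capturing a shared loop variable in the closure.
    private static Task<TPartial>[] CreateMapTasks<TInput, TPartial>(
        Func<TInput, TPartial> map, TInput[] inputs)
    {
        return inputs
            .Select(input => Task.Factory.StartNew(() => map(input)))
            .ToArray();
    }

    // Collects the partial results (all map tasks are complete here) and reduces them.
    // If a map task failed, reading Result rethrows and faults the reduce task.
    private static TResult PerformReduce<TPartial, TResult>(
        Func<TPartial[], TResult> reduce, Task<TPartial>[] tasks)
    {
        TPartial[] partials = tasks.Select(task => task.Result).ToArray();
        return reduce(partials);
    }
}
```

A call such as `MapReduce.Start<string, int, int>(s => s.Length, lengths => lengths.Sum(), "a", "bb", "ccc")` returns a task whose `Result` is the total length of the strings.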

Here's a simple example using MapReduce to calculate the root mean square (RMS) of a set of values:
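One way to write such an example is sketched below. The chunked input, the `RmsExample` class, and the use of `Tuple<double, int>` as the partial result are assumptions made for illustration. So that the snippet compiles on its own, it carries a compact stand-in for the `Start` function described above.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class RmsExample
{
    // Compact stand-in for the Start function described in the post (an assumption).
    private static Task<TResult> Start<TInput, TPartial, TResult>(
        Func<TInput, TPartial> map,
        Func<TPartial[], TResult> reduce,
        params TInput[] inputs)
    {
        Task<TPartial>[] mapTasks = inputs
            .Select(input => Task.Factory.StartNew(() => map(input)))
            .ToArray();
        return Task.Factory.ContinueWhenAll(
            mapTasks, tasks => reduce(tasks.Select(t => t.Result).ToArray()));
    }

    // Each chunk is mapped to (sum of squares, element count);
    // the reduce step combines the partials into sqrt(total squares / total count).
    public static double Compute(params double[][] chunks)
    {
        Func<double[], Tuple<double, int>> map =
            chunk => Tuple.Create(chunk.Sum(v => v * v), chunk.Length);

        Func<Tuple<double, int>[], double> reduce = partials =>
            Math.Sqrt(partials.Sum(p => p.Item1) / partials.Sum(p => p.Item2));

        // Blocking on Result keeps the example short; real code could keep the task.
        return Start(map, reduce, chunks).Result;
    }
}
```

For the chunks `{1, 7}` and `{1, 7}` the squares sum to 100 over 4 values, so `Compute` returns 5.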

Actual applications of MapReduce are of course far more interesting than this simple example.

Applications of MapReduce

MapReduce can essentially be applied to any problem where you need a number of things to be done in parallel. It can even be applied in cases where you don't need a final result. Just return an arbitrary value as the result (or even better, implement a variant of my MapReduce which uses Action).
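A result-free variant along those lines might look like the sketch below. The `ParallelActions` name and the `Action<TInput>` parameter are assumptions; the post only says the variant would use `Action`.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelActions
{
    // Runs 'action' on every input concurrently and returns a task
    // that completes once all of them have finished.
    public static Task Start<TInput>(Action<TInput> action, params TInput[] inputs)
    {
        Task[] tasks = inputs
            .Select(input => Task.Factory.StartNew(() => action(input)))
            .ToArray();

        // WaitAll returns immediately here because every task is already complete,
        // but it rethrows any failures so the returned task is faulted as well.
        return Task.Factory.ContinueWhenAll(tasks, completed => Task.WaitAll(completed));
    }
}
```

Callers can `Wait()` on the returned task or chain further continuations onto it.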

A few obvious use cases:
  • Distributed search
  • Distributed sort
  • Tokenization
  • Indexing
  • Log processing
  • Machine learning
  • General artificial intelligence
  • General data mining
  • Large scale image processing
  • ...
The list goes on; these are just a few things off the top of my head.

You can grab the source code for MapReduce here (requires .NET 4.0 or later):


As usual, play around with it, have fun, and let me know if you find it useful!

19 March, 2009

RSA using BouncyCastle

Trying to do RSA using BouncyCastle, but struggling to find your way around the API? In a previous post (see [here](/posts/why-cripple-the-net-rsa-implementation)) I pondered why the RSA implementation in `System.Security.Cryptography` is restricted to only the most common usage scenarios. I mentioned [BouncyCastle](http://bouncycastle.org) as an alternative for those who wanted a more flexible API, but never got around to providing examples where BouncyCastle was used. By request, this post provides usage examples by building a crude and simple, but efficient set of methods for RSA key generation, encryption, and decryption, all built on top of BouncyCastle.

NOTE: The general cryptographic security of the presented methods is beyond the scope of this article. The code presented is not cryptographically secure for large data sets. If you're here looking for a way to do cryptographically secure RSA in the general case, you should look into more complicated approaches including padding, blinding, and more sophisticated block cipher modes. Cryptography is a topic undergoing constant research, so stay up to date and be sure to evaluate the strength of your solution for the scenarios in which you apply it.

BouncyCastle provides flexibility and control over your encryption approach, which comes at a cost. The BouncyCastle API might be a bit hard to cope with at first, but if you know encryption in general you should be able to find your way around the API without too much effort. This post will be focusing on RSA, since that was my original need, but it should be mentioned that BouncyCastle provides many other asymmetric (and symmetric) algorithms for which the usage is similar to what you find below.

Creating RSA keys

Creating RSA keys is a simple task. The method below lets you specify the key size in bits, and creates a key pair for you.
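A minimal sketch of such a method, assuming the BouncyCastle C# library is referenced. The `RsaKeys` class and `GenerateKeys` method names are made up for illustration; passing plain `KeyGenerationParameters` leaves the public exponent and primality certainty at the generator's defaults.

```csharp
using Org.BouncyCastle.Crypto;
using Org.BouncyCastle.Crypto.Generators;
using Org.BouncyCastle.Security;

public static class RsaKeys
{
    // Creates an RSA key pair of the requested size (for example 2048 bits).
    public static AsymmetricCipherKeyPair GenerateKeys(int keySizeInBits)
    {
        var generator = new RsaKeyPairGenerator();
        generator.Init(new KeyGenerationParameters(new SecureRandom(), keySizeInBits));
        return generator.GenerateKeyPair();
    }
}
```

The returned pair exposes the two halves as `keys.Public` and `keys.Private`.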

That's all there is to it.

Encryption

Now that we have a key pair, we are ready to encrypt and decrypt using RSA. In the example below, we use a key (public or private) to encrypt a byte sequence. To encrypt a string, simply convert the string to a byte array using `Encoding.GetBytes` (for example, `Encoding.UTF8.GetBytes`).
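A sketch of what such an Encrypt method could look like follows. One deliberate assumption: the raw `RsaEngine` is wrapped in `Pkcs1Encoding`, because the bare engine drops leading zero bytes when a block is decrypted and blocks would then not round-trip exactly. The class name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Org.BouncyCastle.Crypto;
using Org.BouncyCastle.Crypto.Encodings;
using Org.BouncyCastle.Crypto.Engines;

public static class RsaEncryptSketch
{
    // Encrypts 'data' block by block with the given key (public or private).
    public static byte[] Encrypt(byte[] data, AsymmetricKeyParameter key)
    {
        IAsymmetricBlockCipher cipher = new Pkcs1Encoding(new RsaEngine());
        cipher.Init(true, key); // true = encryption

        int blockSize = cipher.GetInputBlockSize(); // depends on the key size
        var output = new List<byte>();

        for (int offset = 0; offset < data.Length; offset += blockSize)
        {
            // The final block may be shorter than blockSize.
            int length = Math.Min(blockSize, data.Length - offset);
            output.AddRange(cipher.ProcessBlock(data, offset, length));
        }

        return output.ToArray();
    }
}
```

Each input block produces one output block as long as the key's modulus, so the ciphertext is somewhat larger than the plaintext.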

The approach above uses a list to gather output for the sake of simplicity. Note that the RSA engine can only process a limited block size at a time (block size depends on the key size). The approach above processes a data set of an arbitrary size.

The above method does not impose constraints on which key you use for encryption. Use the public key or the private key as you see fit for your solution.

Decryption

The Decrypt method is very similar to the Encrypt method:
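A matching Decrypt sketch, under the same assumption that blocks were produced through `Pkcs1Encoding` around `RsaEngine` (the class name is again hypothetical). The only differences from encryption are the `false` passed to `Init` and the input block size, which is now the full modulus length.

```csharp
using System;
using System.Collections.Generic;
using Org.BouncyCastle.Crypto;
using Org.BouncyCastle.Crypto.Encodings;
using Org.BouncyCastle.Crypto.Engines;

public static class RsaDecryptSketch
{
    // Decrypts 'data' block by block with the key opposite to the one used for encryption.
    public static byte[] Decrypt(byte[] data, AsymmetricKeyParameter key)
    {
        IAsymmetricBlockCipher cipher = new Pkcs1Encoding(new RsaEngine());
        cipher.Init(false, key); // false = decryption

        int blockSize = cipher.GetInputBlockSize();
        var output = new List<byte>();

        for (int offset = 0; offset < data.Length; offset += blockSize)
        {
            int length = Math.Min(blockSize, data.Length - offset);
            // Throws InvalidCipherTextException if the padding does not check out.
            output.AddRange(cipher.ProcessBlock(data, offset, length));
        }

        return output.ToArray();
    }
}
```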

Again, it's up to you which key you choose to use. If you want to use the common approach, encrypt using a symmetric cipher, hash the data, and sign the hash with your private key using the above Encrypt method. If you want to use another approach like encrypting the actual data using your private key, you are of course free to do so.

I hope this post helps those of you who want to apply RSA (or any other asymmetric cipher) to more subtle cases than those supported by the .NET framework.

08 March, 2009

Mocking HtmlHelper in ASP.NET MVC RC1 using Moq

For those of you trying to mock HtmlHelper, but finding it difficult, here's a mock that works in ASP.NET MVC RC1.
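A sketch of such a mock is shown below. The helper class name is made up, and the constructor signatures used for `ViewContext`, `ControllerContext`, and `HtmlHelper` are assumed to be those of MVC RC1; later MVC versions changed them, so adjust the arguments to your version.

```csharp
using System.Web;
using System.Web.Mvc;
using System.Web.Routing;
using Moq;

public static class HtmlHelperMock
{
    // Builds an HtmlHelper backed by mocked context objects and the supplied view data.
    public static HtmlHelper Create(ViewDataDictionary viewData)
    {
        var controllerContext = new ControllerContext(
            new Mock<HttpContextBase>().Object,
            new RouteData(),
            new Mock<ControllerBase>().Object);

        // Assumed RC1 signature: (ControllerContext, IView, ViewDataDictionary, TempDataDictionary).
        var viewContext = new Mock<ViewContext>(
            controllerContext,
            new Mock<IView>().Object,
            viewData,
            new TempDataDictionary());

        var viewDataContainer = new Mock<IViewDataContainer>();
        viewDataContainer.Setup(c => c.ViewData).Returns(viewData);

        return new HtmlHelper(viewContext.Object, viewDataContainer.Object);
    }
}
```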

The ViewDataDictionary that is passed to the HtmlHelper can be empty, or made to contain the data you want for your test.