26 October, 2007

Five tips for implementing a cache

I've spent the last few days at work implementing a cache in the data access layer (DAL) of one of our services. The cache works great, and speeds up our service considerably in some cases and somewhat in all cases. I've implemented caches before, and experienced many of the difficulties that arise when introducing one. It always seems rather easy, and it always has unwanted side effects. The general advice is of course not to do it (and the advice from the database guys is always not to do it), but here are my five best tips on what to consider if you decide to do it anyway.

1. Make sure the cache is transparent.

The system should not in any way notice that a cache, rather than the database, is handing it objects, nor should the introduction of a cache require changes in any layer above the one where the cache resides. If you decide that changes are required, be aware that you are changing code that already works, and that re-verifying its behavior is a hard and expensive task.
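One way to keep the cache transparent is to hide it behind the same interface the rest of the system already depends on. This is a minimal sketch with hypothetical names (`IUserRepository`, `User`), not the actual code from our service:

```csharp
using System.Collections.Generic;

public class User { public int Id; }

public interface IUserRepository
{
    User GetById(int id);
}

public class DatabaseUserRepository : IUserRepository
{
    public User GetById(int id)
    {
        // ... actual database access would go here ...
        return new User { Id = id };
    }
}

// Drop-in replacement: the layers above only ever see IUserRepository,
// so swapping this in requires no changes anywhere else.
public class CachedUserRepository : IUserRepository
{
    private readonly IUserRepository _inner;
    private readonly Dictionary<int, User> _cache = new Dictionary<int, User>();

    public CachedUserRepository(IUserRepository inner)
    {
        _inner = inner;
    }

    public User GetById(int id)
    {
        User user;
        if (!_cache.TryGetValue(id, out user))
        {
            // Cache miss: fall through to the database and remember the result.
            user = _inner.GetById(id);
            _cache[id] = user;
        }
        return user;
    }
}
```

Because the cached repository implements the same interface as the plain one, it can be wired in (or pulled out) at construction time without touching any calling code.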

2. If your system is transactional, make sure that the lifetime and mutability of objects in your cache matches your transaction isolation level.

The obvious solution for a cache in a transactional system is one that lives per transaction, but even this is not guaranteed to work. As an example, say that transaction 1 reads some data and starts operating on it. Meanwhile, transaction 2 reads, alters, and commits data that partially or fully overlaps with the data read by transaction 1. After that commit, transaction 1 reads some other data that depends on the new data committed by transaction 2. In this case, a cache in transaction 1 alters the system's behavior if the isolation level is too low; the altered behavior is most likely correct, but it does not have to be, and it is a change in behavior nonetheless (see [this Wikipedia page](http://en.wikipedia.org/wiki/Isolation_(computer_science) "Transaction isolation") for an explanation of transaction isolation levels). Be aware of your isolation level, and know that the default differs from RDBMS to RDBMS: MySQL defaults to repeatable read, while MS SQL Server defaults to read committed. With an isolation level below repeatable read, a cache holding any mutable data is essentially useless.
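The per-transaction lifetime itself is easy to sketch. The following is a simplified illustration with hypothetical names, assuming the surrounding code creates one cache per transaction and disposes it on commit or rollback; note that even this only matches the database's semantics at repeatable read or stronger, as argued above:

```csharp
using System;
using System.Collections.Generic;

// A cache whose contents never outlive a single transaction.
public class TransactionScopedCache : IDisposable
{
    private readonly Dictionary<string, object> _entries =
        new Dictionary<string, object>();

    public object GetOrLoad(string key, Func<object> load)
    {
        object value;
        if (!_entries.TryGetValue(key, out value))
        {
            // First read within this transaction: go to the database.
            value = load();
            _entries[key] = value;
        }
        // Repeated reads return the first value seen -- which is exactly
        // what repeatable read promises, and exactly what read committed
        // does not.
        return value;
    }

    // Called when the transaction commits or rolls back, so no entry
    // survives into the next transaction.
    public void Dispose()
    {
        _entries.Clear();
    }
}
```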

3. Use the cache for immutable data as much as possible.

Conversely, use the cache for mutable data as little as possible: caching mutable data significantly increases the difficulty of the cache implementation, and with it the risk of errors.

4. Give the objects in the cache as short a lifetime as possible.

When implementing a cache, you want the objects in it to live as long as possible, since accessing the cache is much faster than accessing the database. But think about this: a cache whose objects live forever is actually a replacement for your database, which is not what you want to achieve. To avoid an implementation that is difficult, hard to verify, and unnecessarily complex, give objects as short a lifetime as possible while still maintaining the increase in speed.
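A short lifetime is easiest to express as an expiry timestamp per entry. A minimal sketch (the class and its names are hypothetical, not from our service):

```csharp
using System;
using System.Collections.Generic;

// Each entry carries an expiry time; once it passes, the next lookup
// misses and falls through to the database again.
public class ExpiringCache<TKey, TValue>
{
    private struct Entry
    {
        public TValue Value;
        public DateTime ExpiresAt;
    }

    private readonly Dictionary<TKey, Entry> _entries =
        new Dictionary<TKey, Entry>();
    private readonly TimeSpan _timeToLive;

    public ExpiringCache(TimeSpan timeToLive)
    {
        _timeToLive = timeToLive;
    }

    public bool TryGet(TKey key, out TValue value)
    {
        Entry entry;
        if (_entries.TryGetValue(key, out entry) &&
            entry.ExpiresAt > DateTime.UtcNow)
        {
            value = entry.Value;
            return true;
        }
        // Expired or absent: drop the stale entry so it cannot be served.
        _entries.Remove(key);
        value = default(TValue);
        return false;
    }

    public void Put(TKey key, TValue value)
    {
        _entries[key] = new Entry
        {
            Value = value,
            ExpiresAt = DateTime.UtcNow + _timeToLive
        };
    }
}
```

Tuning `timeToLive` is then the whole game: short enough that the cache never drifts far from the database, long enough that the hit rate still pays for the bookkeeping.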

5. Be absolutely certain that the keys you use in your cache identify objects uniquely and unambiguously.

This sounds obvious, but with complex object hierarchies and caching of different parts of the hierarchy at different levels, it suddenly becomes very hard. In general, cache either the top-most or the bottom-most objects in your hierarchy. Which approach to take depends on how you access your data; the best way to decide is a thorough analysis of your data and the objects that represent it, how you access and use those objects, and in which cases a cache is most likely to gain you speed.
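As a small illustration of how keys go wrong: a key built from a bare id collides as soon as two object types, or two parents, share an id space. Qualifying the key with everything that identifies the object keeps it unambiguous. The helper below is hypothetical, just to show the shape of the fix:

```csharp
public static class CacheKeys
{
    // Ambiguous: key "7" could be an Order or an Invoice, under any parent.
    public static string Naive(int id)
    {
        return id.ToString();
    }

    // Unambiguous: type and parent are part of the key, so "Order/42/7"
    // and "Invoice/42/7" can coexist in the same cache.
    public static string Qualified(string typeName, int parentId, int id)
    {
        return typeName + "/" + parentId + "/" + id;
    }
}
```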

08 August, 2007

LINQ vs Loop - A performance test

I just installed Visual Studio 2008 beta 2 to see what the future holds for C#. The addition of LINQ has brought a variety of query keywords to the language. "Anything" can be queried; SQL databases (naturally), XML documents, and regular collections. Custom queryable objects can also be created by implementing IQueryable. Sadly, like every abstraction, these goodies all come at a cost. The question is how much?

I decided to create a simple test to see how much of a performance hit LINQ incurs: find the numbers in an array that are less than 10.

Initially, I assumed the performance impact would not be too large, since the query's equivalent is a straightforward imperative loop, which should not be too hard for a compiler to deduce given static typing and a single collection to iterate across. Or so I thought. The results:
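The original test code is not shown here, so the sketch below is a reconstruction: the array size is a guess, and the iteration count of 1000 is inferred from the totals and averages reported further down. It times the same "less than 10" filter both ways:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class LinqVsLoop
{
    static void Main()
    {
        var random = new Random(0);
        int[] numbers = new int[100000];   // size is an assumption
        for (int i = 0; i < numbers.Length; i++)
            numbers[i] = random.Next(100);

        const int runs = 1000;

        // LINQ version: the query is deferred, so we enumerate to force it.
        var watch = Stopwatch.StartNew();
        for (int run = 0; run < runs; run++)
        {
            var result = from n in numbers where n < 10 select n;
            foreach (int n in result) { }
        }
        watch.Stop();
        Console.WriteLine("LINQ: {0}, avg. {1}", watch.Elapsed,
            TimeSpan.FromTicks(watch.Elapsed.Ticks / runs));

        // Loop version: the classical imperative equivalent.
        watch = Stopwatch.StartNew();
        for (int run = 0; run < runs; run++)
        {
            var result = new List<int>();
            foreach (int n in numbers)
                if (n < 10)
                    result.Add(n);
        }
        watch.Stop();
        Console.WriteLine("Loop: {0}, avg. {1}", watch.Elapsed,
            TimeSpan.FromTicks(watch.Elapsed.Ticks / runs));
    }
}
```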

    LINQ: 00:00:04.1052060, avg. 00:00:00.0041052
    Loop: 00:00:00.0790965, avg. 00:00:00.0000790

As you can see, the performance impact is huge: LINQ performs about 50 times worse than the traditional loop! This seems rather wild at first glance, but the explanation is this: the keywords introduced by LINQ are syntactic sugar for method invocations on a set of generic routines that iterate across collections and filter through lambda expressions. Naturally, this does not perform as well as a traditional imperative loop, and it leaves less room for optimization.
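To make the desugaring concrete, here is what the query syntax turns into. The `where` clause becomes a call to `Enumerable.Where` with a lambda, which is invoked through a delegate once per element; that per-element delegate call, rather than an inlined comparison, is where most of the overhead comes from:

```csharp
using System;
using System.Linq;

class Desugared
{
    static void Main()
    {
        int[] numbers = { 5, 12, 3, 99, 7 };

        // Query syntax...
        var q1 = from n in numbers where n < 10 select n;

        // ...is sugar for method syntax: an extension-method call
        // taking a lambda that runs once per element.
        var q2 = numbers.Where(n => n < 10);

        Console.WriteLine(string.Join(",", q1));  // 5,3,7
        Console.WriteLine(string.Join(",", q2));  // 5,3,7
    }
}
```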

Having seen the performance impact, I am still of the view that LINQ is a great step towards a more declarative world for developers. Instead of saying "take these numbers, iterate over all of them, and insert them into this list if they are less than ten", which is an informal description of the classical imperative loop, you can now say "from these numbers, give me those that are less than ten". The difference may be subtle, but the latter is in my opinion far more declarative and easier to read.

This may very well be the next big thing, but it comes at a cost. So far, my advice is to create simple performance tests for the cases where you consider adopting LINQ, so you spot pitfalls as early as possible.