Problems worthy of attack prove their worth by hitting back. —Piet Hein

Tuesday, 27 November 2007

Eels in Many Worlds

I caught the second half of Parallel Worlds, Parallel Lives on BBC Four last night, where Mark Everett of the band Eels travelled around the US to find out about his father Hugh Everett III, the physicist who developed the Many Worlds Interpretation of quantum mechanics. (New Scientist has a good interview with him too.)

I read Everett senior's landmark paper when I was doing my Master's degree, and at the time I agreed with Paul Davies' view that Many Worlds is "cheap on assumptions but expensive on universes". I later read David Deutsch's The Fabric of Reality, and I was prepared to give the theory more time - but even so, I still find it a lot to ask. (It's better than the Copenhagen Interpretation though.) Not long after finishing my degree I got into Eels and saw them play live in London.

Only now do I know the connection. A fascinating programme.

Consistent Hashing

I've bumped into consistent hashing a couple of times lately. The paper that introduced the idea (Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web by David Karger et al) appeared ten years ago, although recently it seems the idea has quietly been finding its way into more and more services, from Amazon's Dynamo to memcached (courtesy of So what is consistent hashing and why should you care?

The need for consistent hashing arose from limitations experienced while running collections of caching machines - web caches, for example. If you have a collection of n cache machines then a common way of load balancing across them is to put object o in cache machine number hash(o) mod n. This works well until you add or remove cache machines (for whatever reason), for then n changes and every object is hashed to a new location. This can be catastrophic since the originating content servers are swamped with requests from the cache machines. It's as if the cache suddenly disappeared. Which it has, in a sense. (This is why you should care - consistent hashing is needed to avoid swamping your servers!)

It would be nice if, when a cache machine was added, it took its fair share of objects from all the other cache machines. Equally, when a cache machine was removed, it would be nice if its objects were shared between the remaining machines. This is exactly what consistent hashing does - consistently maps objects to the same cache machine, as far as is possible, at least.

The basic idea behind the consistent hashing algorithm is to hash both objects and caches using the same hash function. The reason to do this is to map the cache to an interval, which will contain a number of object hashes. If the cache is removed then its interval is taken over by a cache with an adjacent interval. All the other caches remain unchanged.


Let's look at this in more detail. The hash function actually maps objects and caches to a number range. This should be familiar to every Java programmer - the hashCode method on Object returns an int, which lies in the range -231 to 231-1. Imagine mapping this range into a circle so the values wrap around. Here's a picture of the circle with a number of objects (1, 2, 3, 4) and caches (A, B, C) marked at the points that they hash to (based on a diagram from Web Caching with Consistent Hashing by David Karger et al):

To find which cache an object goes in, we move clockwise round the circle until we find a cache point. So in the diagram above, we see object 1 and 4 belong in cache A, object 2 belongs in cache B and object 3 belongs in cache C. Consider what happens if cache C is removed: object 3 now belongs in cache A, and all the other object mappings are unchanged. If then another cache D is added in the position marked it will take objects 3 and 4, leaving only object 1 belonging to A.

This works well, except the size of the intervals assigned to each cache is pretty hit and miss. Since it is essentially random it is possible to have a very non-uniform distribution of objects between caches. The solution to this problem is to introduce the idea of "virtual nodes", which are replicas of cache points in the circle. So whenever we add a cache we create a number of points in the circle for it.

You can see the effect of this in the following plot which I produced by simulating storing 10,000 objects in 10 caches using the code described below. On the x-axis is the number of replicas of cache points (with a logarithmic scale). When it is small, we see that the distribution of objects across caches is unbalanced, since the standard deviation as a percentage of the mean number of objects per cache (on the y-axis, also logarithmic) is high. As the number of replicas increases the distribution of objects becomes more balanced. This experiment shows that a figure of one or two hundred replicas achieves an acceptable balance (a standard deviation that is roughly between 5% and 10% of the mean).


For completeness here is a simple implementation in Java. In order for consistent hashing to be effective it is important to have a hash function that mixes well. Most implementations of Object's hashCode do not mix well - for example, they typically produce a restricted number of small integer values - so we have a HashFunction interface to allow a custom hash function to be used. MD5 hashes are recommended here.

import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHash<T> {

private final HashFunction hashFunction;
private final int numberOfReplicas;
private final SortedMap<Integer, T> circle =
new TreeMap<Integer, T>();

public ConsistentHash(HashFunction hashFunction,
int numberOfReplicas, Collection<T> nodes) {

this.hashFunction = hashFunction;
this.numberOfReplicas = numberOfReplicas;

for (T node : nodes) {

public void add(T node) {
for (int i = 0; i < numberOfReplicas; i++) {
circle.put(hashFunction.hash(node.toString() + i),

public void remove(T node) {
for (int i = 0; i < numberOfReplicas; i++) {
circle.remove(hashFunction.hash(node.toString() + i));

public T get(Object key) {
if (circle.isEmpty()) {
return null;
int hash = hashFunction.hash(key);
if (!circle.containsKey(hash)) {
SortedMap<Integer, T> tailMap =
hash = tailMap.isEmpty() ?
circle.firstKey() : tailMap.firstKey();
return circle.get(hash);


The circle is represented as a sorted map of integers, which represent the hash values, to caches (of type T here).
When a ConsistentHash object is created each node is added to the circle map a number of times (controlled by numberOfReplicas). The location of each replica is chosen by hashing the node's name along with a numerical suffix, and the node is stored at each of these points in the map.

To find a node for an object (the get method), the hash value of the object is used to look in the map. Most of the time there will not be a node stored at this hash value (since the hash value space is typically much larger than the number of nodes, even with replicas), so the next node is found by looking for the first key in the tail map. If the tail map is empty then we wrap around the circle by getting the first key in the circle.


So how can you use consistent hashing? You are most likely to meet it in a library, rather than having to code it yourself. For example, as mentioned above, memcached, a distributed memory object caching system, now has clients that support consistent hashing.'s ketama by Richard Jones was the first, and there is now a Java implementation by Dustin Sallings (which inspired my simplified demonstration implementation above). It is interesting to note that it is only the client that needs to implement the consistent hashing algorithm - the memcached server is unchanged. Other systems that employ consistent hashing include Chord, which is a distributed hash table implementation, and Amazon's Dynamo, which is a key-value store (not available outside Amazon).

Sunday, 18 November 2007

Back on the net

Yesterday, after a month of waiting, we got broadband at our new house. Hooray!

Last time I was without net access it was fun - it was just for a two week holiday. This time the whole experience was frustrating, since it wasn't self-imposed: we were dragged into the labyrinthine machine of BT/Openreach and of course we didn't know how long it would take to get out. Eliane's got the gory details - all I have to add is that Eclipse Internet, our broadband supplier, provided excellent support every time I rang.

Still, it gave me the chance to practise lighting fires in the wood-burning stove.

Wednesday, 31 October 2007

Mixing with MD5

MD5 and cryptographic hash functions in general have many uses - not least integrity checking and password storage. They have the property of mixing the input in an unpredictable but deterministic way. This can be very useful - and not just in security applications. Here are two great examples I came across recently: partitioning users into groups for controlled experiments on the web, and Dopplr's colour-coding of cities. Neat.

Moved to Wales

Well the phone line is yet to be enabled, but in the meantime I've got this view to stare at.

Sunday, 14 October 2007

Moving to Wales

Next week we are moving to the Brecon Beacons in Wales. This is a big change for us: I've been living in London for the past 11 years and Eliane for the last 17. But in some ways it's not so big as we both grew up in small towns. We're renting a barn on a hill, and Eliane is already planning the veg patch.

Work-wise I shall be leaving Kizoom at the end of November. I am one of the three original members (with Nick Knowles and Mike Storey) of the team that started Kizoom in September 1999, in Nick's library (Islington people have libraries and garages).

After that I shall be working freelance and specializing in Hadoop. The plan is to work from home as much as possible, but I'm realistic about the need to travel to make this happen. So - if you need help with a Hadoop-related project, please get in touch.

Tuesday, 2 October 2007

Zero-width Negative Lookbehind

Mainly for my own reference:

perl -ple 's/(?<![+-\d])(\d+)/\+$1/g'

This will ensure all integers in a string have a leading sign. So for example, "+163, 163, -163" becomes "+163, +163, -163". It works using a zero-width negative lookbehind assertion: in this case the only "163" in the string that matches is the one that is not preceded by a plus sign, a minus sign, or a digit.

This came up at work today (thanks Robin).

Friday, 28 September 2007

A Java Servlet for Thrift

I've been playing around with Thrift (the new version that came out a few days ago), mainly to see how it might be used as a serialization mechanism in Hadoop, but also because the RPC looks useful. It doesn't come with a Java HTTP server transport, so I whipped up a servlet to make it easy. Exposing a service is then as simple as subclassing the base servlet to supply a Thrift processor to service the request. For the calculator example from the tutorial:
package server;

import tutorial.Calculator;

public class CalculatorTServlet extends TServlet {
public CalculatorTServlet() {
super(new Calculator.Processor(new CalculatorHandler()));

Invoking the service is easy - you just use the THttpClient transport. Using Thrift over HTTP allows you to use all your existing high-availability and failover infrastructure, which can be attractive. (But also see this Thrift mailing list thread which gives some more detail on how Facebook tackles high-availability and failover.)

Saturday, 22 September 2007

Lucene Layer Cake

With a proposal for Pig (a query language interface for very large datasets) to go into the Apache Incubator, it looks like the Lucene family is growing once more. With so many members it gets harder to track the inter-project dependencies, so I created a quick family portrait.

Update: not long after writing this I noticed this patch to run Lucene on Hbase (a part of Hadoop) - so now my diagram's wrong. It was a bit of an oversimplification anyway - it's meant to give a rough idea of the building blocks of Lucene.

Tuesday, 18 September 2007

Debugging with XStream

A little while back a colleague and I had a problem with some data in a large object graph in our system. There were tens of thousands of objects in the graph so we didn't fancy pointing a graphical debugger at it to find where the problem was residing. Most of the objects didn't define a toString method, so we used XStream to get a representation of the object graph and dump it to the console where we picked through it with command line tools and a XML editor. (We found the problem!) The idiom we used was

new XStream().toXML(objectGraph, System.out);

Thursday, 13 September 2007

Ohloh's Visualizations

Ohloh seems to be positioning itself as the social networking site of the open source project world. It's also worth having a look at for its neat visualizations. I particularly like the sparklines showing commit activity for each committer on a project, the codebase history showing the number of lines of code over time, and the ability to compare projects graphically (a bit like Google Trends). These tools are great for getting a quick feel for a project that you can't really find from its website or source code repository.

Friday, 7 September 2007

RESTful Web Services

RESTful Web Services by Leonard Richardson and Sam Ruby is a great book. It's about how to make the web more programmable, and tells you, through a great mix of theory and practical advice, how you can achieve this for the part of the web you're building.

It'll make you think about URLs. (Do you put state in query parameters or the path? See page 121.) It'll make you think about HTTP. (For example, the response code 201 Created, is used to show that the server created a new resource in response to a client POST request.) It'll make you think about the web.

I found it gave me a (conceptual) framework to design a RESTful API for an product we're building at Kizoom. Sometimes it can be a bit of a struggle to see how to make some operations RESTful (only four verbs remember!), but the design that emerged was actually very simple once I starting thinking about things the right way. Seems like James Strachan had a similar experience; after a rocky start he persevered and now has what looks like a great design for a Pure RESTful API to ActiveMQ via AtomPub.

Sunday, 26 August 2007

Two weeks off the net

I've just arrived back from a two-week family holiday in the Brecon Beacons in Wales. Before I went I confess I was a little worried that two weeks without a computer or net access would be difficult. The last two-week holiday was four years ago to see family in Thailand, and we ended up using their computer now and again. But this time it was fine - and I resisted the occasional urge to nip into the local resource centre to check email. (There were of course lots of other things to do as you can see in the picture.)

Also on holiday I read a Guardian article about BlackBerry addiction (in a real newspaper, of course) and was glad I don't have one of the blasted things...

Monday, 6 August 2007

Tim O'Reilly on Hadoop

Tim O'Reilly has a written a nice piece about Hadoop, following its starring role at OSCON. I was tickled pink to see that he linked to my article about running Hadoop on Amazon EC2 to do log analysis!

Also, from the comments: Google has some courseware on MapReduce and Hadoop. Everyone seems to love Hadoop at the moment.

Saturday, 28 July 2007

Hadoop at OSCON 2007

I wasn't there, but Hadoop had two airings at OSCON this week. Doug Cutting was a part of the Open Source Radar Executive Briefing with Tim O'Reilly to talk about scaling.

He also gave a talk, with Eric Baldeschwieler, entitled "Meet Hadoop" where he gave a great exposition of the problem that Hadoop is designed to solve. In short: disk seeks are expensive, so databases built using sort-merge, which is limited by transfer speed not seek speed, scale better than traditional B-tree databases, which are limited by seek speed. More details and examples on the slides.

Eric's half of the talk gave some interesting tidbits about how Hadoop is being used at Yahoo! For example, they are running Hadoop on about 10,000 machines, and the biggest cluster is 1600 machines! With these kind of numbers I can see how Nigel Daley came to coin Nigel's Law:

In a large enough cluster, there are NO corner cases

JUnit gets flexible and philisophical

My colleague Robert Chatley pointed me to an InfoQ announcement about the release of JUnit 4.4. It looks interesting. For a start it includes Hamcrest matchers which allow you to write flexible assertions, using the assertThat construct. I've had some involvement with Hamcrest, and I really like using it since it allows me to write tests that I can read, so I'm really pleased that it's got into JUnit as this can only increase its takeup. Well done Joe for having the idea in the first place!

The new release also includes theories, adopted from the Popper JUnit extension, which are tests that apply to a (potentially infinite) set of data points. This reminds me of Andreas Leitner's talk AutoTest: Push-button testing using contracts from the Google London Test Automation Conference (in 2006) where he talked about using Eiffel contracts to generate test cases to look for contract violations. David Saff is one of the creators of JUnit theories and he also talks about the relation to contracts in his paper (authored with Marat Boshernitsan), The Practice of Theories: Adding "For-all" Statements to "There-Exists" Tests. I'm looking forward to trying this out.

Thursday, 19 July 2007

Articles and blogs

Coinciding nicely with Jakob Nielson's admonition Write Articles, Not Blog Postings (via Steve Loughran) here's my latest article, Running Hadoop MapReduce on Amazon EC2 and Amazon S3, published on Amazon Web Services Developer Connection.

It's almost a year and a half since I wrote my last article and one of the reasons is because I've blogged more. Not much, but more.

Jakob's got a point about being directed to short, old, irrelevant blog postings when you're looking for something. He's not saying don't blog, but rather try to write stuff that will be long-lived. Definitely something to aim for for tech writers - but not at the cost of not writing anything (so it's OK to sprinkle blogs with lighter weight stuff).

Monday, 16 July 2007

Why are there no Amazon S3/EC2 competitors?

Amazon's storage and compute services (S3 and EC2 respectively) are widely seen as game changers. So, almost one year on from EC2's launch, why is it that there are no competitors in this space? One commenter on Artur Bergman's post entitled Amazon Web Services and the lack of a SLA made the good point that a "competitive utility computing market" would effectively solve any disaster recovery problems, and make such services even more popular.

Meanwhile, TechCrunch reports a rumour that Amazon will offer a MySQL web service by the end of the year.

Sunday, 15 July 2007

Hadoop Development Steadily Rising

Judging by this graph showing posts to the dev list (on Gmane) the rate is currently at about 50 posts a day. This has roughly doubled since the beginning of the year. Some of the increase is down to the momentum behind Hbase (which provides Bigtable-like capabilities on top of Hadoop), but I think it is also down to general growth - more people seem to be participating in development than ever before. This is great! Obviously the risk is that at such a rate of development Hadoop becomes unstable, but so far it looks like it's under control - the (informal) feedback I've had tells me the 0.13.0 release is one of the most stable we've done.