Pluggable Hadoop

Update: This quote from Tim O'Reilly in his OSCON keynote today sums up the changes I describe below: "Do less and then create extensibility mechanisms." (via Matt Raible)

I'm noticing an increased desire to make Hadoop more modular. I'm not sure why this is happening now, but it's probably because as more people start using Hadoop it needs to be more malleable (people want to plug in their own implementations of things), and the way to do that in software is through modularity.

Some examples:

Job scheduling

The current scheduler is a simple FIFO scheduler which is adequate for small clusters with a few cooperating users. On larger clusters the best advice has been to use HOD (Hadoop On Demand), but that has its own problems with inefficient cluster utilization. This situation led to a number of proposals to make the scheduler pluggable (HADOOP-2510, HADOOP-3412, HADOOP-3444). Already there is a fair scheduler implementation (like the Completely Fair Scheduler in Linux) from Facebook.

HDFS block placement

Today the algorithm for placing a file's blocks across datanodes in the cluster is hardcoded into HDFS, and while it has evolved, it is clear that a one-size-fits-all approach is not necessarily the best approach. Hence the new proposal to support pluggable block placement algorithms.

Instrumentation

Finding out what is happening in a distributed system is a hard problem. Today, Hadoop has a metrics API (for gathering statistics from the main components of Hadoop), but there is interest in adding other logging systems, such as X-Trace, via a new instrumentation API.

Serialization

The ability to use pluggable serialization frameworks in MapReduce appeared in Hadoop 0.17.0, but has received renewed interest due to the talk around Apache Thrift and Google Protocol Buffers.

Component lifecycle

There is work being done to add a lifecyle interface to Hadoop components. One of the goals is to make it easier to subclass components, so they can be customized.

Remove dependency cycles

This is really just good engineering practice, but the existence of dependencies makes it harder to understand, modify and extend code. Bill de hÓra did a great analysis of Hadoop's code structure (and its deficiencies), which has lead to some work to enforce module dependencies and remove the cycles.