Problems worthy of attack prove their worth by hitting back. —Piet Hein

Tuesday, 8 July 2008

RPC and Serialization with Hadoop, Thrift, and Protocol Buffers

Hadoop and related projects like Thrift provide a choice of protocols and formats for doing RPC and serialization. In this post I'll briefly run through them and explain where they came from, how they relate to each other and how Google's newly released Protocol Buffers might fit in.

RPC and Writables

Hadoop has its own RPC mechanism that dates back to when Hadoop was a part of Nutch. It's used throughout Hadoop as the mechanism by which daemons talk to each other. For example, a DataNode communicates with the NameNode using the RPC interface DatanodeProtocol.

Protocols are defined using Java interfaces whose arguments and return types are primitives, Strings, Writables, or arrays. These types can all be serialized using Hadoop's specialized serialization format, based on Writable. Combined with the magic of Java dynamic proxies, we get a simple RPC mechanism which for the caller appears to be a Java interface.
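To make the dynamic proxy idea concrete, here is a minimal sketch of how a caller can hold what looks like a plain Java interface while every call is routed through a single handler. It is not Hadoop's RPC code: EchoProtocol, EchoServer, and ProxyRpcSketch are hypothetical names invented for illustration, and the "wire" is just a reflective call on a local object where a real client would serialize the method name and arguments and send them over a socket.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.Arrays;

// Hypothetical protocol interface, standing in for something like DatanodeProtocol.
interface EchoProtocol {
    String echo(String message);
    int add(int a, int b);
}

// Server-side implementation of the protocol.
class EchoServer implements EchoProtocol {
    public String echo(String message) { return message; }
    public int add(int a, int b) { return a + b; }
}

class ProxyRpcSketch {
    // Builds a client-side proxy for the protocol. Every method call on the
    // returned object lands in invoke(), which records the call and forwards it.
    // A real RPC client would serialize and send the call here instead.
    static EchoProtocol getProxy(final EchoProtocol server, final StringBuilder log) {
        InvocationHandler handler = new InvocationHandler() {
            public Object invoke(Object proxy, Method method, Object[] args)
                    throws Throwable {
                log.append(method.getName())
                   .append(Arrays.toString(args))
                   .append('\n');
                return method.invoke(server, args);
            }
        };
        return (EchoProtocol) Proxy.newProxyInstance(
                EchoProtocol.class.getClassLoader(),
                new Class<?>[] { EchoProtocol.class },
                handler);
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        EchoProtocol client = getProxy(new EchoServer(), log);
        System.out.println(client.echo("hello"));
        System.out.println(client.add(2, 3));
        System.out.print(log);
    }
}
```

The caller only ever sees EchoProtocol, which is the property the post describes: the transport is hidden behind the interface.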

MapReduce and Writables

Hadoop uses Writables for another, quite different, purpose: as a serialization format for MapReduce programs. If you've ever written a Hadoop MapReduce program you will have used Writables for the key and value types. For example:


public class MapClass
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  // ...

}

(Text is just a Writable version of Java String.)

The primary benefit of using Writables is their efficiency. Compared to Java serialization, which would have been an obvious alternative choice, they have a more compact representation. Writables don't store their type in the serialized representation, since at the point of deserialization it is known which type is expected. For the MapReduce code above, the input key is a LongWritable, so an empty LongWritable instance is asked to populate itself from the input data stream.
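The size difference is easy to see with a rough sketch that uses only the JDK. The class name CompactnessSketch and its methods are hypothetical. The first method writes a long the way a LongWritable-style format does (just the eight bytes of the value), and the second writes the same value as a boxed Long using Java serialization, which also records a stream header and class information.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

class CompactnessSketch {
    // Writes only the value itself: no type information is stored.
    static byte[] writeRaw(long value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeLong(value);
        out.flush();
        return bytes.toByteArray();
    }

    // Reads the value back. The reader must already know a long is expected,
    // which is exactly the assumption Writables rely on.
    static long readRaw(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        return in.readLong();
    }

    // Java serialization of a boxed Long: the stream carries a header and
    // class descriptors in addition to the value.
    static byte[] writeJavaSerialized(long value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(Long.valueOf(value));
        out.close();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("raw bytes:             " + writeRaw(42L).length);
        System.out.println("Java-serialized bytes: " + writeJavaSerialized(42L).length);
    }
}
```

The trade-off is that the raw form is only readable by code that already knows which type comes next in the stream.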

More flexible MapReduce

There are downsides to having to use Writables for MapReduce types, however. For a newcomer to Hadoop it's another hurdle: something else to learn ("why can't I just use a String?"). More serious, perhaps, is that it's hard to use different binary storage formats for MapReduce input and output. For example, Apache Thrift (see below) is an increasingly popular way of storing binary data. It's possible, but cumbersome and inefficient, to read or write Thrift data from MapReduce.

From Hadoop 0.17.0 onwards you no longer have to use Writables for key and value types in MapReduce programs. You can use any serialization framework. (Note that this change is completely independent of Hadoop's RPC mechanism, which still uses Writables - and can only use Writables - as its on-wire format.) So it's easier to use Thrift types, say, throughout your MapReduce program. Or you can even use Java serialization (with some limitations which will be fixed). What's more, you can add your own serialization framework if you like.
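As a rough illustration of what "pluggable" means here, the sketch below shows one way a framework can pick a serializer based on the class of the key or value. It only conveys the general idea and is not Hadoop's API: SimpleSerialization, StringSerialization, JavaObjectSerialization, and SerializationRegistry are all hypothetical names. Each framework says which classes it accepts, and the registry uses the first one that accepts a given type.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical plug-in point: one implementation per serialization framework.
interface SimpleSerialization {
    boolean accept(Class<?> c);
    void serialize(Object value, DataOutputStream out) throws IOException;
    Object deserialize(Class<?> c, DataInputStream in) throws IOException;
}

// Compact, type-specific encoding for Strings.
class StringSerialization implements SimpleSerialization {
    public boolean accept(Class<?> c) { return c == String.class; }
    public void serialize(Object value, DataOutputStream out) throws IOException {
        out.writeUTF((String) value);
    }
    public Object deserialize(Class<?> c, DataInputStream in) throws IOException {
        return in.readUTF();
    }
}

// Fallback that handles anything implementing java.io.Serializable.
class JavaObjectSerialization implements SimpleSerialization {
    public boolean accept(Class<?> c) { return Serializable.class.isAssignableFrom(c); }
    public void serialize(Object value, DataOutputStream out) throws IOException {
        ObjectOutputStream oos = new ObjectOutputStream(out);
        oos.writeObject(value);
        oos.flush();
    }
    public Object deserialize(Class<?> c, DataInputStream in) throws IOException {
        try {
            return new ObjectInputStream(in).readObject();
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        }
    }
}

class SerializationRegistry {
    private final List<SimpleSerialization> frameworks =
            new ArrayList<SimpleSerialization>();

    void register(SimpleSerialization s) { frameworks.add(s); }

    // Returns the first registered framework that accepts the class.
    SimpleSerialization lookup(Class<?> c) {
        for (SimpleSerialization s : frameworks) {
            if (s.accept(c)) return s;
        }
        throw new IllegalArgumentException("No serialization for " + c.getName());
    }

    byte[] toBytes(Object value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        lookup(value.getClass()).serialize(value, out);
        out.flush();
        return bytes.toByteArray();
    }

    Object fromBytes(Class<?> c, byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        return lookup(c).deserialize(c, in);
    }
}
```

Registration order matters in this sketch: String is also Serializable, so StringSerialization has to be registered before the Java fallback for it to be chosen.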

Record I/O, Thrift and Protocol Buffers

Another problem with Writables, at least for the MapReduce programmer, is that creating new types is a burden. You have to implement the Writable interface, which means designing the on-wire format, and writing two methods: one to write the data in that format and one to read it back.
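For a sense of what that burden looks like, here is a small sketch of a hand-written type. To keep it runnable without Hadoop on the classpath it declares its own SimpleWritable interface, a hypothetical stand-in with the same two methods (write and readFields) that Hadoop's Writable declares. PageVisit and PageVisitDemo are likewise invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in mirroring the two methods of Hadoop's Writable.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// A custom type: the author decides the on-wire format (here a UTF string
// followed by an int) and must keep write() and readFields() in step.
class PageVisit implements SimpleWritable {
    String url = "";
    int count;

    PageVisit() {}

    PageVisit(String url, int count) {
        this.url = url;
        this.count = count;
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeInt(count);
    }

    public void readFields(DataInput in) throws IOException {
        // Fields must be read in exactly the order they were written.
        url = in.readUTF();
        count = in.readInt();
    }
}

class PageVisitDemo {
    static byte[] toBytes(SimpleWritable w) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        w.write(out);
        out.flush();
        return bytes.toByteArray();
    }

    // Populates an existing (typically empty) instance from the bytes.
    static void fromBytes(SimpleWritable w, byte[] data) throws IOException {
        w.readFields(new DataInputStream(new ByteArrayInputStream(data)));
    }

    public static void main(String[] args) throws IOException {
        byte[] data = toBytes(new PageVisit("http://example.com/", 3));
        PageVisit copy = new PageVisit();
        fromBytes(copy, data);
        System.out.println(copy.url + " " + copy.count);
    }
}
```

Every new field means touching both methods, and nothing checks that they agree, which is the kind of bookkeeping a record compiler can take over.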

Hadoop's Record I/O was created to solve this problem. You write a definition of your types using a record definition language, then run a record compiler to generate Java source code representations of your types. All Record I/O types are Writable, so they plug into Hadoop very easily. As a bonus, you can generate bindings for other languages, so it's easy to read your data files from other programs.

For whatever reason, Record I/O never really took off. It's used in ZooKeeper, but that's about it (and ZooKeeper will move away from it someday). Momentum has switched to Thrift (from Facebook, now in the Apache Incubator), which offers a very similar proposition, but in more languages. Thrift also makes it easy to build a (cross-language) RPC mechanism.

Yesterday, Google open sourced Protocol Buffers, its "language-neutral, platform-neutral, extensible mechanism for serializing structured data". Record I/O, Thrift and Protocol Buffers are really solving the same problem, so it will be interesting to see how this develops. Of course, since we're talking about persistent data formats, nothing's going to go away in the short or medium term while people have significant amounts of data locked up in these formats.

That's why it makes sense to add support in Hadoop for MapReduce using Thrift and Protocol Buffers: so people can process data in the format they have it in. This will be a relatively simple addition.

What Next?

For RPC, where a message is short-lived, changing the mechanism is more viable in the short term. Going back to Hadoop's RPC mechanism, now that both Thrift and Protocol Buffers offer an alternative, it may well be time to evaluate them to see if either can offer a performance boost. It would be a big job to retrofit RPC in Hadoop with another implementation, but if there are significant performance gains to be had, then it would be worth doing.

4 comments:

jeff said...

hey tom,

good post, this is an area in need of clarification.

protocol buffers and record io are merely data exchange formats, while thrift includes a lot more.

in addition to a data exchange format, thrift includes a code generator, networking transport code, and a suite of robust server skeleton implementations in a variety of languages. it should be noted that you can use the servers and networking code to exchange data in any format, including protocol buffers.

there's also been a lot of debate about serialization formats for putting data on the wire versus putting data on disk.

for the latter case, we've built support into hive for general serialization formats, so while a lot of our data is serialized in thrift's format, we can support record io, protocol buffers, csv, or whatever.

Tom White said...

Thanks Jeff,

Yes, you're right that Thrift provides more in the way of providing code for doing network services. I didn't know that you could use Thrift transport with Protocol Buffers - that's pretty neat.

I'm sure I heard Google say that there isn't RPC support in Protocol Buffers, but there do seem to be some RPC-related Java classes in there. Haven't tried it though.

And well done on making Hive support lots of serialization formats. As I said, I think we need to support all of them, and let people choose what they want to use.

Anonymous said...

Has anybody tried Etch?

http://developer.cisco.com/web/cuae/etch
http://developer.cisco.com/web/cuae/devconf2008_session_3

Jay Vyas said...

I think the current hadoop build (3.0.0-SNAPSHOT) actually does use protocol buffers for RPC?