Problems worthy of attack prove their worth by hitting back. —Piet Hein

Sunday, 2 March 2008

MapReduce without the Reduce

There's a class of MapReduce applications that use Hadoop just for its distributed processing capabilities. Telltale signs are:

1. Little or no input data of note. (Certainly not large files stored in HDFS.)
2. Map tasks are therefore not limited by their ability to consume input, but by their ability to run the task, which depending on the application may be CPU-bound or IO-bound.
3. Little or map output.
4. No reducers (set by conf.setNumReduceTasks(0)).

This seems to work well - indeed the CopyFiles program in Hadoop (aka distcp) follows this pattern to efficiently copy files between distributed filesystems:

1. The input to each map task is a source file and a destination.
2. The map task is limited by its ability to copy the source to the destination (IO-bound).
3. The map output is used as a convenience to record files that were skipped.
4. There are no reducers.

Combined with Streaming this is a neat way to distribute your processing in any language. You do need a Hadoop cluster, it is true, but CPU-intensive jobs would happily co-exist with more traditional MapReduce jobs, which are typically fairly light on CPU usage.

No comments: