Problems worthy of attack prove their worth by hitting back. —Piet Hein

Wednesday, 30 January 2008

Hadoop and Log File Analysis

I've always thought that Hadoop is a great fit for analyzing log files (I even wrote an article about it). The big win is that you can write ad hoc MapReduce queries against huge datasets and get results in minutes or hours. So I was interested to read Stu Hood's recent post about using Hadoop to analyze email log data:
Here at Mailtrust, Rackspace’s mail division, we are taking advantage of Hadoop to help us wrangle several hundred gigabytes of email log data that our mail servers generate each day. We’ve built a great tool for our support team that lets them search mail logs in order to troubleshoot problems for customers. Until recently, this log search and storage system was centered around a traditional relational database, which worked fine until the exponential growth in the volume of our dataset overcame what a single machine could cope with. The new logging backend we’ve developed based on Hadoop gives us virtually unlimited scalability.
The best bit was when they wrote a MapReduce query to find the geographic distribution of their users.
This data was so useful that we’ve scheduled the MapReduce job to run monthly and we will be using this data to help us decide which Rackspace data centers to place new mail servers in as we grow.
It's great when a technology has the ability to make such a positive contribution to your business. In Doug Cutting's words, it is "transformative".

Can we take this further? It seems to me that there is a gap in the market for an open source web traffic analysis tool. Think Google Analytics where you can write your own queries. I wonder who's going to build such a thing?


Tom White said...

Shortly after I wrote this, High Scalability covered the story with extra details from Mailtrust CTO Bill Boebel. Well worth a read.

Tony Czarnik said...

Tom, what would be needed to build an enterprise log management system on Hadoop? Can it ingest any log type, including custom formats? Would I need to build the interfaces myself? Does Hadoop include a query and reporting system? If not, what are your recommendations for query/reporting?

Raffy said...

At Loggly we are building a cloud-based log management and analysis platform. You will not only be able to run large-scale processing jobs (through MapReduce), but you can also do much more based on our real-time indexing and data processing capabilities. To let you build your own use cases around your log data, we provide an API that gives you access to all the data you have Loggly manage for you. Would be curious to get your feedback!

Troy said...

Hey Tony, here's a fairly straightforward doc on how to load your system and app logs into Hadoop (Apache Hive) for SQL-style analysis:
Log analytics with Hadoop and Hive

The doc is written for customers of our hosted "log aggregator in the cloud" service, Papertrail, but it covers how the data gets there, the formatting (TSV), and Hive's LOAD DATA INPATH process.
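As a rough illustration of the TSV step Troy describes, here's a small sketch that turns raw syslog-style lines into tab-separated rows of the kind Hive can load with LOAD DATA. The line format and column layout are assumptions for the example, not Papertrail's actual output.

```python
# Hypothetical sketch: convert syslog-style lines to TSV rows for Hive.
# The assumed input format is "timestamp host program: message".
import csv
import io
import re

LINE = re.compile(r"^(?P<ts>\S+) (?P<host>\S+) (?P<prog>\S+): (?P<msg>.*)$")

def to_tsv(lines):
    # Write matching lines as tab-separated rows; drop anything malformed.
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for line in lines:
        m = LINE.match(line)
        if m:
            writer.writerow([m.group("ts"), m.group("host"),
                             m.group("prog"), m.group("msg")])
    return out.getvalue()

sample = [
    "2008-01-30T12:00:01 mx1 postfix/smtp: delivered to alice@example.com",
    "not a log line",
]
print(to_tsv(sample), end="")
```

Once rows like these land in HDFS, a Hive table declared with tab-delimited fields can query them with ordinary GROUP BY-style SQL.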

We've been spoiled by Amazon Elastic MapReduce, but Cloudera's distro is very powerful and would work fine.

Whether or not you're logging to Papertrail, something like that would be my recommendation for reporting.