Problems worthy of attack prove their worth by hitting back. —Piet Hein

Friday, 16 January 2015

Hadoop for Science

Some of the largest datasets are generated by the sciences. For example, the Large Hadron Collider produces around 30PB of data a year. I'm interested in the technologies and tools for analyzing these kinds of datasets, and how they work with Hadoop, so here's a brief post.

Open Data

Amazon S3 seems to be emerging as the de facto solution for sharing large datasets. In particular, AWS curates a variety of public data sets that can be accessed for free (from within AWS; there are egress charges otherwise). To take one example from genomics, the 1000 Genomes project hosts a 200TB dataset on S3.

Hadoop has long supported S3 as a filesystem, but recently there has been a lot of work to make it more robust and scalable. It’s natural to process S3-resident data in the cloud, and here there are many options for Hadoop. The recently released Cloudera Director, for example, makes it possible to run all the components of CDH in the cloud.

Notebooks

By "notebooks" I mean web-based, computational scientific notebooks, exemplified by the IPython Notebook. Notebooks have been around in the scientific community for a long time (they were added to IPython in 2011), but increasingly they seem to be reaching the larger data scientist and developer community. Notebooks combine prose and computation, which is great for exposition and interactivity. They are also easy to share, which helps foster collaboration and reproducibility of research.

It’s possible to run IPython against PySpark (notebooks are inherently interactive, so working with Spark is the natural Hadoop lead-in), but it requires a bit of manual setup. Hopefully that will get easier; ideally, Hadoop distributions like CDH will come with packages to run an appropriately configured IPython notebook server.

Distributed Data Frames

IPython supports many different languages and libraries. (Despite its name IPython is not restricted to Python; in fact, it is being refactored into more modular pieces as a part of the Jupyter project.) Most notebook users are data scientists, and the central abstraction that they work with is the data frame. Both R and pandas, for example, use data frames, although both systems were designed to work on a single machine.

The challenge is to make systems like R and pandas work with distributed data. Many of the solutions to date have addressed this problem by adding MapReduce user libraries. This is unsatisfactory for several reasons, primarily because the user has to think explicitly about the distributed case and can’t use the existing libraries on distributed data. Instead, what’s needed is a deeper integration so that the same R and pandas libraries work on local and distributed data.

There are several projects and teams working on distributed data frames, including Sparkling Pandas (which has the best name), Adatao’s distributed data frame, and Blaze. All are at an early stage, but as they mature the experience of working with distributed data frames from R or Python will become practically seamless. Of course, Spark already provides machine learning libraries for Scala, Java, and Python, which is a different approach from getting existing libraries like R or pandas running on Hadoop. Having multiple competing solutions is broadly a good thing, and something that we see a lot of in open source ecosystems.

Combining the Pieces

Imagine if you could share a large dataset and the notebooks containing your work in a form that makes it easy for anyone to run them—it’s a sort of holy grail for researchers.

To see what this might look like, have a look at the talk by Andy Petrella and Xavier Tordoir on Lightning fast genomics, where they used a Spark Notebook and the ADAM genomics processing engine to run a clustering algorithm over a part of the 1000 Genomes dataset. It combines all the topics above—open data, cloud computing, notebooks, and distributed data frames—into one.

There’s still work to be done to expand the tooling and to make the whole experience smoother. Nevertheless, this demo shows that it's possible for scientists to analyse large amounts of data, on demand and in a way that is repeatable, using powerful high-level machine learning libraries. I'm optimistic that tools like this will become commonplace in the not-too-distant future.

Sunday, 11 January 2015

Marmalade

I made some marmalade. I've never made it before, although I have memories of my parents making it every January, and how slicing the peel seemed to take hours. I used this meta recipe from Felicity Cloake that Eliane found, and it seemed to work pretty well.

Sunday, 13 October 2013

Five years at Cloudera

Five years ago today was my first day at Cloudera. The team I joined consisted of the four founders—Mike Olson, Amr Awadallah, Jeff Hammerbacher, Christophe Bisciglia—as well as Aaron Kimball who had joined a week or so before, Alex Loddengaard who was working as an intern, and Matei Zaharia who joined on the same day as me as a part-time consultant.

Before I joined I had been working as an independent Apache Hadoop consultant for a year (probably the first Hadoop consultant anywhere), and was halfway through writing a book on Hadoop. The interview process had involved speaking to all four founders, and I remember when I came off the phone after the last call it was late in the UK but I couldn't sleep because the vision they had described was exactly what I wanted to see for Hadoop: a company that wanted to make Hadoop accessible to everyone, by making it easier to use and run, while maintaining a strong commitment to open source. The last point sealed the deal for me, and really at that point there was no way I could not join, and five years on I can say without exaggeration that it was the best decision of my professional life.

When I started I was living in Wales, which meant that on my first day I didn't see any of my new colleagues! That was remedied a few weeks later on when I visited California (and ApacheCon in New Orleans) in early November 2008. Initially the others were working out of a single room in AdMob's offices in San Mateo, but it wasn't long before we moved to a smart brick-lined office in Burlingame. I was around for the moving in day, which involved more flatpack assembly skills than programming.

From the very beginning we worked on making Hadoop easier to use, run, and support, and better integrated with other systems, so that it could enjoy broader adoption. That was borne out in the early projects at Cloudera which included creating training material, creating packages for Red Hat and Debian (CDH, and later Bigtop), writing tools for data ingest (Flume and Sqoop), creating a rich web UI for Hadoop users (Hue), as well as making contributions to the core project. I was mainly involved in the latter, which I did at the same time as completing the book in time for the Hadoop Summit 2009, which would never have been possible without the time and space my teammates gave me.

Over the first year I would visit every three months or so, and naturally each time the team would have grown. I always enjoyed meeting the new people who had joined since my last visit, but I realized that at such a formative time in a company's life, when the culture was being laid down, being closer to the team would make it easier for me to stay involved. The opportunity to move to California came up, and on the last day of October 2009 I arrived in San Francisco with my wife, Eliane, and two girls.

As anyone who has moved to a new country knows, there are a lot of things to sort out (somewhere to live, a school for the girls, reams of paperwork), and during this time the folks at Cloudera were incredibly helpful and supportive. When we moved into our new apartment (which Eliane had found a mere two weeks after we arrived) half of the engineering team turned up to help with Ikea flatpack assembly.

At the end of our three-year sojourn in the US, we left having made many friends, sad to leave, but happy knowing we'd be living closer to our family again. Cloudera was an order of magnitude larger than when I had arrived, and was now an international company with offices in several countries across the world.

Over the last five years I've been lucky enough to have been given the freedom to work on many parts of the Hadoop stack, in different parts of the Hadoop community, and with different teams at Cloudera. In the course of doing so, I've worked with the most talented and intelligent group of people in my life. It's hard work, and challenging, but also a lot of fun and incredibly enriching. I have every reason to expect it to continue. Thanks Cloudera!

Update on October 14: reworded to state that ApacheCon 2008 was held in New Orleans, not California. Thanks to Isabel Drost-Fromm for pointing out the error.

Saturday, 30 March 2013

Making a Kitchen Table


A couple of weeks ago I made a new kitchen table.



It was much easier than it looks as all I had to do was attach some hairpin legs to a worktop. If you haven't seen hairpin legs before, here's a closeup:



Eliane got the idea for the design after seeing something similar on the web, and she ordered the worktop from Worktop Express, and the hairpin legs from the Iron Mill.

I worked out what size screws to use (#12) and the pilot drill size using this handy chart. I also found a tip somewhere that said putting a little wax on the screw makes it easier to drive in with hardwoods (our worktop is oak).

The table is pretty sturdy, and hasn't collapsed! It was quicker to put together than some Ikea furniture, and it's very satisfying having an everyday piece of furniture that we designed and built ourselves.

Sunday, 3 February 2013

Have you put the chickens to bed?

"Have you put the chickens to bed?" -- it's a question we ask each other frequently in our house, since we are the proud owners of seven beautiful hens. Normally Eliane has, but when Lottie, our younger daughter, asked long after it had got dark one evening last week, it turned out that none of us had, despite having IFTTT alerts set up to remind us.

The problem with the alert is that it is set to go off at sunset, which is all that IFTTT allows, and that's a bit too early as it's not dark enough for the chickens to be in their house. So we wait a bit, then we forget.

So I decided to write an Android app to send an alert a fixed amount of time (say 45 minutes) after sunset, so that when we received it, it would be dark, the chickens would be in their house, and we could close the door there and then.

This is the result:

Eliane is currently beta testing it, so we'll see how well it works. (Obviously the long term goal is an automatic sensor to open and close the chicken house door, but we're not there yet.)
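The timing rule itself is small enough to sketch in plain Java. This is only an illustration, not the app's actual code: the class and method names are made up, the sunset time is passed in rather than calculated, and it assumes Java 8 or later for java.time instead of using any Android API.

```java
import java.time.Duration;
import java.time.LocalDateTime;

// Hypothetical helper, for illustration only; not taken from the app.
final class ChickenAlert {
    // The fixed delay mentioned above ("say 45 minutes").
    static final Duration DELAY_AFTER_SUNSET = Duration.ofMinutes(45);

    // The moment the reminder should fire: sunset plus the delay.
    static LocalDateTime alertTime(LocalDateTime sunset, Duration delay) {
        return sunset.plus(delay);
    }

    // Milliseconds to wait from 'now' until the reminder is due.
    // Clamped to zero if the alert time has already passed.
    static long millisUntilAlert(LocalDateTime now, LocalDateTime sunset, Duration delay) {
        long millis = Duration.between(now, alertTime(sunset, delay)).toMillis();
        return Math.max(0L, millis);
    }

    public static void main(String[] args) {
        // Made-up example values; a real app would compute sunset for its location.
        LocalDateTime sunset = LocalDateTime.of(2013, 2, 3, 16, 50);
        LocalDateTime now = LocalDateTime.of(2013, 2, 3, 15, 0);
        System.out.println("Alert at: " + alertTime(sunset, DELAY_AFTER_SUNSET));
        System.out.println("Wait (ms): " + millisUntilAlert(now, sunset, DELAY_AFTER_SUNSET));
    }
}
```

On Android, a delay like this would presumably be handed to something like AlarmManager rather than waited out in the app's own process.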

Writing Android Apps

This is the first Android app I've written, and overall I found the process very straightforward. A couple of years ago I ran a "Hello World" Android tutorial, and I seem to remember most of the time taken to get the app running was installing the Eclipse plugin. This time the Android Developer Tools (ADT) include a customized version of Eclipse, making the getting started process much smoother. 

The Android API is huge and fairly intimidating. It is, however, incredibly well documented, and the user guides are invaluable. The hardest part of writing the app was figuring out which parts of the API to use: do I need a BroadcastReceiver or a Service? How do AlarmManager and Notification interact? That kind of thing. There's a lot of material online covering how to do various things in Android, and it offered general pointers, but not necessarily useful code, since the API evolves rapidly from release to release. And although the older code is generally supported, since compatibility is taken very seriously, there may be a better way of doing things in later versions.

The ADT tooling is good and encourages you to do the right thing - for example, extracting natural language strings from your app so it's easy to change them (or translate them) later. In this case, a class called R is generated which has references to all the assets that you need in your app: icons, sound files, strings, etc. For example, the audio file which plays when the notification is received is referred to with:

R.raw.cluck

To generate the icons I drew a chicken on a piece of paper with a sharpie, then took a photo of it and used an online image editor to make the background transparent. The Android Asset Studio completed the job of converting the image to a set of icons. (I didn't use Inkscape in the end, but this blog entry shows how to convert from an Inkscape drawing.) 

What's Next?

The biggest limitation in the app at the moment is that the calculation for sunset time is hardcoded for the UK. Using the Location API is the obvious next step there.

There are also some complications to do with making sure that notifications will still be sent even if the phone is rebooted. I want to make sure that works properly before putting the app on Google Play.

The UI is pretty rudimentary too and could do with some work.

And before we get to the fully-automated solution, we could have a sensor that detects if the door is open or closed and only sends the reminder if the door hasn't been closed for the night.

Source is on GitHub.