Problems worthy of attack prove their worth by hitting back. —Piet Hein

Sunday, 19 April 2015

The Hay Dark Skies Festival, Reverend Thomas William Webb, and Jupiter

In 2013, the Brecon Beacons was designated a Dark Sky Reserve, and a year later the first Dark Skies Festival was held in Hay-on-Wye. The second festival took place this weekend, and my family went along to some of the activities.

Young stargazers, Lottie and Millie
In the morning, we found ourselves in a planetarium tent, then we looked at sunspots, and held pieces of meteorite.

The evening event was stargazing at Holy Trinity Church in Hardwicke, just outside Hay. Quite apart from the lack of light pollution, the location was a special one, since the vicar of the parish from 1856 until 1885 was Reverend Thomas William Webb, who in his spare time observed the night sky with telescopes and an observatory he had built himself.
Holy Trinity Church, Hardwicke

In 1859, while at Hardwicke he wrote the classic book, Celestial Objects for the Common Telescope, the object of which was "to furnish the possessors of ordinary telescopes with plain directions for their use, and a list of objects for their advantageous employment".

The book remained in print well into the following century (and was recently republished by Cambridge University Press), and it's probably difficult to overemphasise the importance of this book in encouraging generation after generation of amateur stargazers.

In the words of Janet and Mark Robinson, who used to live in the vicarage and have edited a book about Webb,
Like Patrick Moore, he was an enthusiast who wanted to inspire as many people as possible to look through a telescope. Even at the choir party he "arranged the telescope and acted as showman and all in turn had a look at Saturn".
Webb would no doubt have been pleased to see yesterday's gathering of enthusiastic amateurs (including the Robinsons) with an impressive range of telescopes, on a cold but very clear night. The highlight for us was seeing Jupiter and its four brightest moons (Io, Europa, Ganymede and Callisto) through a large reflecting telescope. We could even see the north and south belts, and the Great Red Spot (or Pink Splodge as Lottie named it).

Sunset. Venus is visible top centre
Thank you to the organisers of the Hay Dark Skies Festival, and the volunteers from the Usk Astronomical Society (the oldest astronomical society in the UK), the Abergavenny Astronomy Society and the Heads of the Valleys Astronomical Society.

Sunday, 8 March 2015

Tennis Ball Parabola

Here's an image of me throwing a tennis ball to Lottie:

Millie filmed the video and edited it down to a shorter segment. I turned the resulting video frames into a series of JPEGs by running:

ffmpeg -i Tennis\ Ball.mp4 tennis-%03d.jpeg

Then I composed them into a single image using ImageMagick:

convert -compose lighten tennis-014.jpeg tennis-015.jpeg \
-composite tennis-016.jpeg \
-composite tennis-017.jpeg \

-composite tennis-043.jpeg \
-composite result.jpeg

Millie then used Desmos (an online graphing editor) to superimpose a parabola on the image.

Update: Dima Spivak suggested I use the picture to estimate g, the acceleration due to gravity.
  • My head measures 0.22 m (chin to crown), and is 49 pixels on the picture.
  • The vertical distance, d, from the highest ball to the ball above Lottie's hands is 204 pixels, or 0.916 m.
  • The time, t, it took to travel this distance was between 12 and 13 frames (it's hard to say more precisely than this from the picture), which at 29.97 frames per second is between 0.4 and 0.434 seconds.
The acceleration is 2d/t2, which works out at between 9.7 and 11.4 m/s2. This range contains the accepted value of g, which is 9.8 m/s2.

Friday, 16 January 2015

Hadoop for Science

Some of the largest datasets are generated by the sciences. For example, the Large Hadron Collider produces around 30PB of data a year. I'm interested in the technologies and tools for analyzing these kind of datasets, and how they work with Hadoop, so here's a brief post.

Open Data

Amazon S3 seems to be emerging as the de facto solution for sharing large datasets. In particular, AWS curates a variety of public data sets that can be accessed for free (from within AWS; there are egress charges otherwise). To take one example from genomics, the 1000 Genomes project hosts a 200TB dataset on S3.

Hadoop has long supported S3 as a filesystem, but recently there has been a lot of work to make it more robust and scalable. It’s natural to process S3-resident data in the cloud, and here there are many options for Hadoop. The recently released Cloudera Director, for example, makes it possible to run all the components of CDH in the cloud.


By "notebooks" I mean web-based, computational scientific notebooks, exemplified by the IPython Notebook. Notebooks have been around in the scientific community for a long time (they were added to IPython in 2011), but increasingly they seem to be reaching the larger data scientist and developer community. Notebooks combine prose and computation, which is great for exposition and interactivity. They are also easy to share, which helps foster collaboration and reproducibility of research.

It’s possible to run IPython against PySpark (notebooks are inherently interactive, so working with Spark is the natural Hadoop lead in), but it requires a bit of manual set up. Hopefully that will get easierideally Hadoop distributions like CDH will come with packages to run an appropriately-configured IPython notebook server.

Distributed Data Frames

IPython supports many different languages and libraries. (Despite its name IPython is not restricted to Python; in fact, it is being refactored into more modular pieces as a part of the Jupyter project.) Most notebook users are data scientists, and the central abstraction that they work with is the data frame. Both R and pandas, for example, use data frames, although both systems were designed to work on a single machine.

The challenge is to make systems like R and pandas work with distributed data. Many of the solutions to date have addressed this problem by adding MapReduce user libraries. However, this is unsatisfactory for several reasons, but primarily because the user has to explicitly think about the distributed case and can’t use the existing libraries on distributed data. Instead, what’s needed is a deeper integration so that the same R and pandas libraries work on local and distributed data.

There are several projects and teams working on distributed data frames, including Sparkling Pandas (which has the best name), Adatao’s distributed data frame, and Blaze. All are at an early stage, but as they mature the experience of working with distributed data frames from R or Python will become practically seamless. Of course, Spark already provides machine learning libraries for Scala, Java, and Python, which is a different approach to getting existing libraries like R or Pandas running on Hadoop. Having multiple competing solutions is broadly a good thing, and something that we see a lot of in open source ecosystems.

Combining the Pieces

Imagine if you could share a large dataset and the notebooks containing your work in a form that makes it easy for anyone to run them—it’s a sort of holy grail for researchers.

To see what this might look like, have a look at the talk by Andy Petrella and Xavier Tordoir on Lightning fast genomics, where they used a Spark Notebook and the ADAM genomics processing engine to run a clustering algorithm over a part of the 1000 Genomes dataset. It combines all the topics above—open data, cloud computing, notebooks, and distributed data frames—into one.

There’s still work to be done to expand the tooling and to make the whole experience smoother, nevertheless this demo shows that it's possible for scientists to analyse large amounts of data, on demand and in a way that is repeatable, using powerful high-level machine learning libraries. I'm optimistic that tools like this will become commonplace in the not-to-distant future.

Sunday, 11 January 2015


I made some marmalade. I've never made it before, although I have memories of my parents making it every January, and how slicing the peel seemed to take hours. I used this meta recipe from Felicity Cloake that Eliane found, and it seemed to work pretty well.

Sunday, 13 October 2013

Five years at Cloudera

Five years ago today was my first day at Cloudera. The team I joined consisted of the four founders—Mike Olson, Amr Awadallah, Jeff Hammerbacher, Christophe Bisciglia—as well as Aaron Kimball who had joined a week or so before, Alex Loddengaard who was working as an intern, and Matei Zaharia who joined on the same day as me as a part-time consultant.

Before I joined I had been working as an independent Apache Hadoop consultant for a year (probably the first Hadoop consultant anywhere), and was halfway through writing a book on Hadoop. The interview process had involved speaking to all four founders, and I remember when I came off the phone after the last call it was late in the UK but I couldn't sleep because the vision they had described was exactly what I wanted to see for Hadoop: a company that wanted to make Hadoop accessible to everyone, by making it easier to use and run, while maintaining a strong commitment to open source. The last point sealed the deal for me, and really at that point there was no way I could not join, and five years on I can say without exaggeration that it was the best decision of my professional life.

When I started I was living in Wales, which meant that on my first day I didn't see any of my new colleagues! That was remedied a few weeks later on when I visited California (and ApacheCon in New Orleans) in early November 2008. Initially the others were working out of a single room in AdMob's offices in San Mateo, but it wasn't long before we moved to a smart brick-lined office in Burlingame. I was around for the moving in day, which involved more flatpack assembly skills than programming.

From the very beginning we worked on making Hadoop easier to use, run, and support, and better integrated with other systems, so that it could enjoy broader adoption. That was borne out in the early projects at Cloudera which included creating training material, creating packages for Red Hat and Debian (CDH, and later Bigtop), writing tools for data ingest (Flume and Sqoop), creating a rich web UI for Hadoop users (Hue), as well as making contributions to the core project. I was mainly involved in the latter, which I did at the same time as completing the book in time for the Hadoop Summit 2009, which would never have been possible without the time and space my teammates gave me.

Over the first year I would visit every three months or so, and naturally each time the team would have grown. I always enjoyed meeting the new people who had joined since my last visit, but I realized that at such a formative time in a company's life, when the culture was being laid down that being closer to the team would make it easier for me to stay involved. The opportunity to move to California came up, and on the last day of October 2009 I arrived in San Francisco with my wife, Eliane, and two girls.

As anyone who has moved to a new country knows, there's a lot of things to sort out—somewhere to live, a school for the girls, reams of paperwork—and during this time the folks at Cloudera were incredibly helpful and supportive. When we moved into our new apartment  (which Eliane had found a mere two weeks after we arrived) half of the engineering team turned up to help with Ikea flatpack assembly.

At the end of our three year sojourn in the US, we left having made many friends, sad to leave, but happy knowing we'd be living closer to our family again. Cloudera was an order of magnitude larger than when I had arrived, and was now an international company with offices in several countries across the world.

Over the last five years I've been lucky enough to have been given the freedom to work on many parts of the Hadoop stack, in different parts of the Hadoop community, and with different teams at Cloudera. In the course of doing so, I've worked with the most talented and intelligent group of people in my life. It's hard work, and challenging, but also a lot of fun and incredibly enriching. I have every reason to expect it to continue. Thanks Cloudera!

Update on October 14: reworded to state that ApacheCon 2008 was held in New Orleans, not California. Thanks to Isabel Drost-Fromm for pointing out the error.