Problems worthy of attack prove their worth by hitting back. —Piet Hein

Saturday, 4 June 2011

What's new in Apache Whirr 0.5.0-incubating

Apache Whirr 0.5.0-incubating is now available. Whirr is a library and command line interface for running distributed services like Apache Hadoop in the cloud. Note that Whirr is currently undergoing Incubation at the Apache Software Foundation, which means that, in particular, the project has yet to be fully endorsed by the ASF. Please read the full disclaimer.

In this release the Whirr development team have added many new features while still making the core more solid. This post covers some of the more important changes. The full list can be found in the release notes.

Improving the new user experience

Orchestrating multiple services on cloud instances is a challenge to make simple, and Whirr has sometimes been a little fiddly to get running. SSH settings, in particular, have been a common sticking point with new users. The new Whirr in 5 Minutes guide walks through the minimum number of commands you need to type to get a simple 3-node ZooKeeper cluster running in a few minutes. From there you can move on to the Quick Start Guide and the Configuration Guide.
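To give a feel for how little configuration a 3-node ZooKeeper cluster needs, here is a minimal sketch that assembles and prints such a recipe using only the Java standard library. The property names and the `aws-ec2` provider value are written as I recall them from the sample recipes, and the class and method names are hypothetical, so treat the output as illustrative and check it against the recipes directory of your distribution.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: builds the text of a small Whirr recipe.
// It does not call any Whirr API; it just shows the shape of the settings.
class ZooKeeperRecipeSketch {

    // Returns the settings in insertion order so the printed file reads naturally.
    static Map<String, String> recipe(String clusterName, int nodes) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("whirr.cluster-name", clusterName);
        // One instance template: "<count> <role>"
        p.put("whirr.instance-templates", nodes + " zookeeper");
        // Assumption: provider id and environment-variable lookups as used in the sample recipes
        p.put("whirr.provider", "aws-ec2");
        p.put("whirr.identity", "${env:AWS_ACCESS_KEY_ID}");
        p.put("whirr.credential", "${env:AWS_SECRET_ACCESS_KEY}");
        return p;
    }

    // Renders the settings as key=value lines, one per setting.
    static String render(Map<String, String> settings) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : settings.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(recipe("zookeeper", 3)));
    }
}
```

Saving the printed lines to a properties file and passing it to the launch command is the general idea; the guide itself has the exact commands.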

The sample configurations in the recipes directory in the distribution contain useful settings for running the services on a variety of cloud providers. Users are always encouraged to share their working configurations with the community.

New services

Elastic Search and Voldemort have been added to the roster of services that come with Whirr. They join Apache Cassandra, Apache Hadoop, Apache HBase, and Apache ZooKeeper, bringing the total to six.

API improvements

Whirr is still a young project so it is not surprising that its API is rapidly evolving. In WHIRR-245, the demarcation between the user API (for users who control Whirr clusters from Java) and the service API (for developers writing new Whirr services) was clarified. The user API can be found in the org.apache.whirr package, whereas the service API is in org.apache.whirr.service.

You can find out more about writing Whirr services in this presentation (PDF).

The firewall API that service writers use to open ports for services was simplified and made more powerful in WHIRR-275.

Overriding scripts

This feature was actually introduced in Whirr 0.4.0-incubating, but it's useful enough to mention here. In older versions of Whirr, if you wanted to make a modification to the scripts that run on cloud instances - to tweak some settings, for instance - you would have to upload your modifications (as well as all the other scripts) to a publicly available web server (Amazon S3 was a common choice), then point Whirr at the new location. Not particularly difficult, but a big enough barrier to discourage users from trying it.

The new approach is to push scripts to nodes from the launching machine, so you can just edit them locally before launch. Full instructions are covered in the FAQ.

Running scripts on nodes

In 0.5.0 the scripts that run on cloud instances have been broken up to be more fine-grained, so many services have individual start and stop scripts (WHIRR-266). Combined with the ability to run scripts on sets of nodes in the cluster (by ID or role), users now have more control of the cluster once it has launched (WHIRR-173). Try running whirr run-script at the command line to use this feature. There's a contrib script to run the Yahoo! Cloud Serving Benchmark (YCSB) against an HBase cluster, which takes advantage of the run-script command (WHIRR-287).

Also useful is WHIRR-291, which allows you to launch "blank" nodes with no services running on them (in a "noop" role), and then, with whirr run-script, run arbitrary scripts on them to bring them into the state you want.

Custom service builds

Developers who work on services supported in Whirr will find the ability to push a custom build to a cluster very useful for testing (WHIRR-220). For example, if you are working on a ZooKeeper feature, you can build a ZooKeeper tarball with your new feature, then launch a cluster that uses this tarball by specifying whirr.zookeeper.tarball.url as a local file:// URL pointing to your tarball. Whirr will push the tarball to a temporary blob store container, then each node will download from there.

I used a variation of this feature to try out a nightly Hadoop 0.22 build on a small Whirr cluster. In this case the tarball URL is not a local file, so Whirr doesn't copy the tarball to a blob store since it is already accessible from the cloud.
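The difference between the two cases above comes down to the URL scheme. The following is a rough sketch of that decision, not Whirr's actual implementation: the class, the method name, and both example URLs are hypothetical.

```java
import java.net.URI;

// Illustrative only: decides whether a tarball must be copied to a blob store
// before cloud nodes can fetch it. This is not Whirr's own code.
class TarballUrlSketch {

    // A file:// URL points at the launching machine, which nodes cannot reach,
    // so the tarball has to be uploaded first. Other schemes are assumed to be
    // reachable from the cloud already.
    static boolean needsUpload(URI tarball) {
        return "file".equalsIgnoreCase(tarball.getScheme());
    }

    public static void main(String[] args) {
        // Hypothetical locations, for illustration
        URI local = URI.create("file:///tmp/zookeeper-custom.tar.gz");
        URI remote = URI.create("http://example.com/hadoop-nightly.tar.gz");

        System.out.println(local + " needs upload: " + needsUpload(local));
        System.out.println(remote + " needs upload: " + needsUpload(remote));
    }
}
```

In a configuration file, the local case corresponds to setting whirr.zookeeper.tarball.url to a file:// URL, as described above.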

Service improvements

Whirr is only able to exist because of the powerful abstraction that jclouds provides for interacting with cloud providers. A great example of this power is the API that jclouds provides for discovering the hardware capabilities of an instance running on any provider. WHIRR-282 took advantage of the jclouds API to find the number of cores on a node to dynamically configure the number of slots in a Hadoop cluster. Previously, you had to set this manually for each cluster to take full advantage of larger image sizes.

This is just the beginning - there is more work to use memory capabilities to set configuration (WHIRR-229), and to use hardware capabilities generally in services other than Hadoop.
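To make the cores-to-slots idea concrete, here is a small sketch of the kind of calculation involved. The ratios are my own assumptions for illustration (one map slot per core, half as many reduce slots) and are not necessarily what WHIRR-282 implements; the class and method names are hypothetical. The two property names are the standard Hadoop TaskTracker slot settings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: derives Hadoop slot settings from a node's core count.
// The ratios below are assumptions, not the values Whirr uses.
class SlotSizingSketch {

    static Map<String, String> slotsFor(int cores) {
        if (cores < 1) {
            throw new IllegalArgumentException("cores must be at least 1");
        }
        int mapSlots = cores;                     // assumption: one map slot per core
        int reduceSlots = Math.max(1, cores / 2); // assumption: half as many, never zero

        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("mapred.tasktracker.map.tasks.maximum", Integer.toString(mapSlots));
        conf.put("mapred.tasktracker.reduce.tasks.maximum", Integer.toString(reduceSlots));
        return conf;
    }

    public static void main(String[] args) {
        // Compare a small node with a larger one
        System.out.println("2 cores: " + slotsFor(2));
        System.out.println("8 cores: " + slotsFor(8));
    }
}
```

The point is that the same cluster definition can produce sensible settings on different hardware without hand-tuning each one.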

Cluster state storage

In previous releases of Whirr, information about launched instances was stored in a file on the machine that launched the cluster (~/.whirr/<cluster-name>/instances). With WHIRR-288, it's now possible to store this information in a blob store instead (such as Amazon S3, although any jclouds-supported blob store can be used), which is useful if you want to control clusters from multiple machines.

Bring Your Own Nodes

Or just BYON, for short. Many users have requested the ability to deploy to privately owned hardware - and jclouds added this feature in 1.0-beta-9. Whirr now has preliminary support for BYON clusters. In a nutshell, you write a YAML file enumerating the nodes to deploy to - their addresses, access credentials, etc. - then Whirr will start services on them. The nodes just need to have a base OS like CentOS or Ubuntu installed. You can find an example BYON configuration in the recipes directory of the download.

BYON is also useful for testing locally by using VMware or VirtualBox to host target nodes.

A hummingbird

Last, but not least, Whirr finally has a logo! Many thanks to Alison Wong, who designed it and donated it to the ASF.


I would like to thank everyone who helped with the 0.5.0-incubating release. We have a growing community, and we welcome feedback and help from new users and developers. If you'd like to get involved you can start by downloading the new release and joining us on the mailing lists.

What's next?

It's difficult to make firm predictions about the contents of the next release since Whirr is an open source project with many open issues, but the general themes include:
  • Adding more services. In tandem, we want to make it easier to write new services by pushing common patterns into the core (WHIRR-326 is one example of this).
  • Improving existing services by making them more flexible, better configured, and easier to manage.
  • Adding more cloud providers. The latest release of jclouds supports 30 providers, and we need help testing more of them with Whirr.
  • Implementing services using other configuration management tools, rather than bash scripting. Andrei Savu is working on using Puppet to write new services (WHIRR-255).
  • Supporting elastic clusters, so new nodes can be added to running clusters (WHIRR-214).