Problems worthy of attack prove their worth by hitting back. —Piet Hein

Thursday, 4 September 2008

Hosting Large Public Datasets on Amazon S3

Update: I just thought of a quick and dirty way of doing this: just store your content on an extra large EC2 instance (which holds up to 1690 GB) and make the image public. Anyone can access it from their own EC2 account; you just get charged for hosting the image.

There's a great deal of interest in large, publicly available datasets (see, for example, this thread from theinfo.org), but for very large datasets it is still expensive to provide the bandwidth to distribute them. Imagine if you could get your hands on the data from a large web crawl, the kind of thing that the Internet Archive produces. I'm sure people would discover some interesting things from it.

Amazon S3 is an obvious choice for storing data for public consumption, but while the storage cost may be reasonable, the transfer cost can be crippling: it is incurred on every download, and downloads are initiated by users, so the cost is not under the data provider's control.

For example, consider a 1TB dataset. With storage running at $0.15 per GB per month, this works out at around $150 per month to host. With transfer priced at $0.18 per GB, each full transfer out of Amazon costs around $180! It's not surprising that large datasets are not publicly hosted on S3.
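The arithmetic is simple enough to check (a sketch using the 2008 prices quoted above, and treating 1 TB as 1000 GB):

```python
# Back-of-the-envelope costs for hosting a 1 TB dataset on S3
# at 2008 prices: $0.15/GB-month storage, $0.18/GB transfer out.
dataset_gb = 1000              # 1 TB, in GB
storage_per_gb_month = 0.15    # USD
transfer_out_per_gb = 0.18     # USD

monthly_storage = dataset_gb * storage_per_gb_month
cost_per_download = dataset_gb * transfer_out_per_gb

print(f"storage: ${monthly_storage:.0f}/month")        # storage: $150/month
print(f"one full download: ${cost_per_download:.0f}")  # one full download: $180
```

The storage cost is fixed, but the transfer cost repeats for every full download, which is what makes public hosting expensive.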

However, transferring data between S3 and EC2 is free, so could we restrict transfers from S3 so they are possible only from EC2? You (or anyone else) could run an analysis on EC2 (using Hadoop, say) and pay only for the EC2 time. Or you could transfer the data out of EC2 at your own expense. S3 doesn't support this option directly, but it is possible to emulate it with a bit of code.

The idea (suggested by Doug Cutting) is to make objects private on S3 to restrict access generally, then run a proxy on EC2 that is authorized to access the objects. The proxy only accepts connections from within EC2: any client that is outside Amazon's cloud is firewalled out. This combination ensures only EC2 instances can access the S3 objects, thus removing any bandwidth costs.

Implementation

I've written such a proxy. It's a Java servlet that uses the JetS3t library to add the correct Amazon S3 Authorization HTTP header to gain access to the owner's objects on S3. If the proxy is running on the EC2 instance with hostname ec2-67-202-43-67.compute-1.amazonaws.com, for example, then a request for
http://ec2-67-202-43-67.compute-1.amazonaws.com/bucket/object
is proxied to the protected object at
http://s3.amazonaws.com/bucket/object
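The core of the proxy is just this rewrite plus request signing. The real implementation is a Java servlet that delegates the signing to JetS3t; the sketch below is illustrative only, written in Python, and follows the HMAC-SHA1 "string to sign" scheme that S3's REST API documented at the time (omitting the optional x-amz headers for brevity):

```python
import base64
import hashlib
import hmac


def sign(secret_key: str, verb: str, resource: str, date: str,
         content_md5: str = "", content_type: str = "") -> str:
    # S3 REST authentication (as documented in 2008): HMAC-SHA1 over a
    # newline-joined "string to sign", base64-encoded.
    string_to_sign = "\n".join([verb, content_md5, content_type, date, resource])
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(), hashlib.sha1)
    return base64.b64encode(digest.digest()).decode()


def proxy_request(path: str, access_key: str, secret_key: str, date: str) -> dict:
    # Map the proxy path /bucket/object straight onto S3, attaching the
    # owner's credentials as an Authorization header.
    signature = sign(secret_key, "GET", path, date)
    return {
        "url": "http://s3.amazonaws.com" + path,
        "headers": {
            "Date": date,
            "Authorization": f"AWS {access_key}:{signature}",
        },
    }
```

So a GET for http://ec2-67-202-43-67.compute-1.amazonaws.com/bucket/object becomes a signed GET for http://s3.amazonaws.com/bucket/object, made with the owner's credentials rather than the client's.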
To ensure that only clients on EC2 can get access to the proxy I set up an EC2 security group (which limits access to port 80):
ec2-add-group ec2-private-subnet -d "Group for all Amazon EC2 instances."
ec2-authorize ec2-private-subnet -p 80 -s 10.0.0.0/8
Then, by launching the proxy in this group, I ensure that only machines on EC2 can connect. (Initially I thought I had to add public IP addresses to the group, an approach I found in this forum posting, but this is not necessary since the public DNS name of an EC2 instance resolves to the private IP address from within EC2.) The AWS credentials needed to access the S3 objects are passed in the user data, along with the hostname of S3:
ec2-run-instances -k gsg-keypair -g ec2-private-subnet \
-d "<aws_access_key> <aws_secret_key> s3.amazonaws.com" ami-fffd1996
This AMI (ID ami-fffd1996) is publicly available, so anyone can launch it with the commands shown here. (The code is available here, under an Apache 2.0 license, but you don't need it if you only intend to run or use a proxy.)
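On startup, the proxy can read its credentials back from the instance user data, which EC2 serves to the instance at the standard metadata address 169.254.169.254. A sketch of that bootstrap step (the parsing matches the "<aws_access_key> <aws_secret_key> s3.amazonaws.com" format passed with -d above; the function names are my own):

```python
from urllib.request import urlopen

# EC2 instance metadata service: only reachable from inside the instance.
USER_DATA_URL = "http://169.254.169.254/latest/user-data"


def parse_user_data(data: str) -> dict:
    # User data is "<aws_access_key> <aws_secret_key> <s3_hostname>",
    # exactly as passed to ec2-run-instances with -d.
    access_key, secret_key, s3_host = data.split()
    return {"access_key": access_key, "secret_key": secret_key, "s3_host": s3_host}


def load_credentials() -> dict:
    with urlopen(USER_DATA_URL) as resp:
        return parse_user_data(resp.read().decode())
```

Because the credentials live in user data rather than in the image, the AMI itself can be public without leaking any secrets.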

Demo

Here's a resource on S3 that is protected: http://s3.amazonaws.com/tiling/private.txt. When you try to retrieve it you get an authorization error:
% curl http://s3.amazonaws.com/tiling/private.txt
<?xml version="1.0" encoding="UTF-8"?>
<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>57E370CDDD9FE044</RequestId>
<HostId>dA+9II1dYAjPE5aNsnRxhVoQ5qy3KCa6frkLg3SyTwzP3i2SQNCU534/v8NXXEnN</HostId>
</Error>
With a proxy running, I still can't retrieve the resource via the proxy from outside EC2. It just times out due to the firewall rule:
% curl http://ec2-67-202-56-11.compute-1.amazonaws.com/tiling/private.txt
curl: (7) couldn't connect to host
But it does work from an EC2 machine (any EC2 machine):
% curl http://ec2-67-202-56-11.compute-1.amazonaws.com/tiling/private.txt
secret

Conclusion

By running a proxy on EC2 at 10 cents per hour for a small instance (about $72 a month), you can let anyone using EC2 access your data on S3 for free. Running the proxy is not free, but it is a fixed cost that might be acceptable to some organizations, particularly those with an interest in making data publicly available who can't stomach large bandwidth costs.
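To put that fixed cost in perspective, a rough break-even calculation using the 1 TB example above (my own arithmetic, assuming a 30-day month):

```python
# Fixed cost of the proxy vs. the per-download transfer cost it avoids.
proxy_per_month = 0.10 * 24 * 30      # small instance at $0.10/hour, ~$72/month
transfer_per_download = 1000 * 0.18   # 1 TB out of S3 at $0.18/GB = $180

# Full-dataset downloads per month at which the proxy pays for itself.
break_even_downloads = proxy_per_month / transfer_per_download
print(f"{break_even_downloads:.1f}")  # 0.4
```

In other words, for this dataset the proxy is cheaper than paying transfer charges as soon as it serves the equivalent of less than half a full download per month.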

A few questions:
  • Is this useful?
  • Is there a better way of doing it?
  • Can we have this built into S3 (please, Amazon)?

4 comments:

Yoav said...

Very useful, thanks!

flip said...

We'll be launching the first version of http://infochimps.org soon, and will be able to host, for free, any free open dataset of any size.

The goal of infochimps is a 'wikipedia for data' plus a 'flickr for data' -- if the data is out there, the goal of infochimps.org is to have a pointer to it and as much metadata as the community contributes. If the data can be shared freely, we will host it (and if someone converts it to say XML, we can host that and the conversion scripts). No additional restrictions of any sort are placed on contributed data.

Email me as flip@ the site described above if you'd like an invite code (later this month).

Doug Cutting said...

Have you seen this yet?

http://aws.amazon.com/publicdatasets/

Tom White said...

Thanks for the link Doug. Amazon's program looks like a good way of doing what I was trying to achieve.