Wednesday, November 19, 2008

hadoop s3 integration not quite there

We rely pretty heavily on hadoop for processing lots of log data. It's a really great project and a critical part of our infrastructure.

I think we've hinted at this before, but here at bizo, the only machines we have are our laptops. Everything else is at amazon. All of our servers are ec2 instances and all of our data is in s3.

In some of our larger hadoop jobs, we're processing hundreds of gigs worth of input data, and it literally takes a couple of hours to copy data from s3 to a local machine, and then to hdfs. And once the job is done, it's back to s3. Additionally, we're starting and stopping clusters on demand, per job, so we're really not getting any data locality benefit from hdfs.

How do we speed this up? Hadoop comes with a tool, distcp, which is basically a Map/Reduce job that copies files into hdfs, so there's at least some benefit in distributing the copy.

It would be great to skip all this copying and just use s3 directly. All of our input is already there, and that's where we want our output. And once the data's on s3, it's easy to move around, copy, rename, etc.

Hadoop does include s3 support, with two filesystems: the s3 Block Filesystem (s3) and the s3 Native Filesystem (s3n).


The S3 Block Filesystem is basically HDFS, using s3 as the datastore. Files are stored as blocks directly in s3. This means that you need to dedicate a bucket to the filesystem, and you can't use any s3 tools to read or write the data. I believe the filesystem was designed this way to overcome the s3 object size limit, as well as to support renames (the rename support mentioned above wasn't available in the original s3 api). I don't quite get the point of this filesystem now, though. There doesn't seem to be much benefit over plain hdfs, except I guess you don't need the NameNode and DataNode processes up and running.

The S3 Native Filesystem is what you want -- it works with real s3 objects directly, meaning you can use the amazon apis and any s3 tools to work with your job inputs and outputs. This is great!
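
To make that concrete, here's a rough sketch of what pointing a job directly at s3n would look like, using the old org.apache.hadoop.mapred API (the bucket, paths, credentials, and class name below are made up, and the mapper/reducer setup is omitted):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class S3nJobSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(S3nJobSketch.class);

    // aws credentials for the native s3 filesystem -- these could also
    // live in hadoop-site.xml instead of being set in code.
    conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
    conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

    // read input straight from s3 objects and write output back to s3,
    // skipping the copy into hdfs entirely (bucket and paths are made up).
    FileInputFormat.setInputPaths(conf, new Path("s3n://my-bucket/logs/2008-11-19/"));
    FileOutputFormat.setOutputPath(conf, new Path("s3n://my-bucket/output/report/"));

    // ... set mapper, reducer, and input/output formats here ...

    JobClient.runJob(conf);
  }
}

In theory, that's all there is to it.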

Unfortunately, it's not quite there yet as a replacement for hdfs, and this isn't totally clear from the docs... In fact, the example in the config just says to replace "s3" with "s3n". So, I spent some time today trying to get it working. It worked great for my map tasks, but at reduce, I kept getting:

08/11/19 22:19:09 INFO mapred.JobClient: Task Id : attempt_200811192212_0002_r_000000_0, Status : FAILED
Failed to rename output with the exception: java.io.IOException: Not supported
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.rename(NativeS3FileSystem.java:457)
at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:587)
at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:604)
at org.apache.hadoop.mapred.Task.saveTaskOutput(Task.java:561)
at org.apache.hadoop.mapred.JobTracker$TaskCommitQueue.run(JobTracker.java:2314)


I figured, why would there be an s3n filesystem if you couldn't use it? The docs mention that rename isn't supported, but it's not clear that this means you can't use it as an hdfs replacement. Searching on this got me nothing, but after reading HADOOP-336 and other docs more closely, it became clear this wasn't going to work.

Anyway, I thought I'd throw this post together and update the wiki to hopefully save someone else some time.

It looks like support for renaming will be added in 0.19.0, which should then let you go all s3(n) with your hadoop jobs!

Tuesday, November 18, 2008

Try not to crash the IDE...

We use Eclipse as our IDE. We also use Ruby pretty extensively; we do all of our deployments and configuration using Capistrano.

The problem I've been having is with mysterious tabbing. In spite of having set my Eclipse preferences to use spaces for tabs (Preferences -> General -> Editors -> Text Editors -> Insert spaces for tabs), my files edited in the Ruby editor have been saving with tabs.

Huh.

Now this normally might not be a problem, but we're not all on the same version of Eclipse, and for some strange reason, loading files with tabs in them in the Ruby editor seems to crash the IDE.

It turns out that there is yet another preference you need to set to make tabs go away. Under Preferences -> Ruby -> Editor, there is an Indentation preference. Set Tab policy to Spaces only, and you should be good to go.

Here's a handy perl one-liner to find and fix any random tabs you may have lying around in your source (I'm excluding .svn from the search, obviously you can chain together additional exclusions using multiple pipes to grep -v):

perl -i -pe 's/\t/ /gs' `find . -type f | grep -v ".svn" `

Friday, November 14, 2008

Video Standups Rock

It's pretty amazing how well video stand-ups have been working for us. We have one developer, Timo, who lives and works remotely in Hawaii, and we video conference him in every day for our afternoon stand-up. Video stand-ups also work well when we occasionally have a WFH Friday.

We highly recommend doing stand-ups in video if you have part (or all) of your team working remotely!

Monday, November 3, 2008

disk write performance on amazon ec2

For a few different projects here at Bizo, we've relied on Berkeley DB. For key/value storage and lookup, it's incredibly fast.

When using any DB, one of your main performance concerns for writes is going to be disk I/O. So, how fast is the I/O at amazon ec2?

I decided to do a quick test using BerkeleyDB's writetest program. This is a small C program meant to simulate transaction writes to the BDB log file by repeatedly performing the following operations: 1. seek to the beginning of a file, 2. write to the file, 3. flush the write to disk. Their documentation suggests that "the number of times you can perform these three operations per second is a rough measure of the minimum number of transactions per second of which the hardware is capable." You can find more details in their reference guide under Transaction throughput.
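
To make that concrete, here's a rough Java sketch of that seek/write/flush loop (writetest itself is a small C program; the file path, record size, and operation count below are just placeholder values, not my actual test setup):

import java.io.RandomAccessFile;

public class WriteTestSketch {
  public static void main(String[] args) throws Exception {
    int ops = 10000;                 // number of seek/write/flush cycles
    byte[] record = new byte[256];   // simulated log record size, in bytes

    RandomAccessFile file = new RandomAccessFile("/mnt/writetest.dat", "rw");
    long start = System.currentTimeMillis();
    for (int i = 0; i < ops; i++) {
      file.seek(0);                  // 1. seek to the beginning of the file
      file.write(record);            // 2. write the record
      file.getFD().sync();           // 3. force the write all the way to disk
    }
    long elapsedMs = System.currentTimeMillis() - start;
    file.close();

    System.out.printf("%d ops in %d ms (%.1f ops/sec)%n",
        ops, elapsedMs, ops * 1000.0 / elapsedMs);
  }
}

The ops/sec number that falls out is the rough transactions-per-second ceiling their docs are talking about.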

A quick disclaimer: This is not an exhaustive performance test! I am only measuring the above operations using the provided test program on a stock machine setup and single configuration.

For this test, I'm using an image based on Eric Hammond's Ubuntu 8.04 image, on an ec2 small instance.

Anecdotally, I've noticed that writes to /tmp (/dev/sda1) seemed a lot faster than writes to /mnt (/dev/sda2, the large disk in an ec2 instance), so we'll be testing both of these, as well as a 100 gig Elastic Block Store (EBS) volume mounted on /vol. All three are formatted as ext3 and mounted with defaults.

I made a small change to the test program (diff here), to print out the data size and file location in its output.

Test


The writetest program was run with -o 10000 (10,000 operations) against each mount point with file sizes: 256, 512, 1024, 2048, 4096, and 8192 (bytes). Each run was repeated 50 times. You can download the test script here.

Results



You can view the raw results here. I hacked together a small perl script (here) to join together the rows from each run for a single file size. I then imported this data into Numbers to generate some graphs. You can download both my raw csv files and Numbers spreadsheets here.

On to some graphs!

[graphs: writetest operations per second for /dev/sda1, /mnt, and the EBS volume, at each file size]

Conclusion



As you can see from this test, for BerkeleyDB log writes, /dev/sda1 (mounted as /) seems to have the most variance, but is also clearly faster than any of the other devices. Unfortunately, in an ec2 small instance, this is only 10G. You're expected to do all of your storage on /dev/sda2 (mounted as /mnt), which is much slower. Rounding it out is the EBS volume, which has a ton of cool features, but is slower still.

As a follow-up, it would be interesting to try EBS again using XFS, which most of the EBS guides recommend, due to its ability to freeze file writes for snapshots. I'm not sure if it's any better than ext3 for our BerkeleyDB write operations, but it's worth a shot.