Wednesday, December 17, 2008

From the productivity improvement department...

As a software developer, one of the things I rely on to get the job done is autocomplete.  It's a feature I use every day, whether in the bash shell or in Eclipse.  The bottom line is that autocomplete saves keystrokes and therefore helps me get the job done just a little bit faster.  Autocomplete is a close second to sliced bread.

Now, something that is actually better than sliced bread is Capistrano.  If you've followed our blog, you know that we use Capistrano to deploy our code to EC2 and for other system administration tasks.

So, what if we could take the utility of autocomplete and marry it with the power of Capistrano?  Well, I came across a site that provides a ruby script that does just this.  The script supports Capistrano namespaces and uses a bash capability known as programmable completion.  The result?  Pure joy for yours truly.

Here's an example of the autocomplete capability in action:



In the terminal window, I typed cap common:s[TAB] and was shown a list of possible completions.

If you are a Capistrano user, you should definitely check it out.  I highly recommend it.

Monday, December 15, 2008

Open Source Simple DB Firefox Plugin: SDB Tool

We are huge fans of Amazon Web Services here at Bizo. We run our entire infrastructure on EC2, S3, SQS and SimpleDB.

We are also fans of some of the Firefox plugin tools that make it easier to work with AWS: ElasticFox and S3 Organizer.

So when we needed a GUI tool that would let us work better with SimpleDB, we figured we'd extend the "Firefox Plugin Suite" and build a Firefox plugin of our own. Today we've officially released our SimpleDB Firefox plugin, called "SDB Tool". You can install it by clicking here in your Firefox 3.0+ browser.

The code is released under the Apache 2 license and is available on GitHub. We are using Google Code for issue management.

We hope you find it useful and would love to hear how you are using it.

-Donnie (email me)

Wednesday, November 19, 2008

hadoop s3 integration not quite there

We rely pretty heavily on hadoop for processing lots of log data. It's a really great project and a critical part of our infrastructure.

I think we've hinted at this before, but here at bizo, the only machines we have are our laptops. Everything else is at amazon. All of our servers are ec2 instances and all of our data is in s3.

In some of our larger hadoop jobs, we're processing hundreds of gigs worth of input data, and it literally takes a couple of hours to copy data from s3 to a local machine, and then to hdfs. And once the job is done, it's back to s3. Additionally, we're starting and stopping clusters on demand, per job, so we're really not getting any data locality benefit from hdfs.

So how do we speed this up?  Hadoop comes with a tool, distcp, which is basically a Map/Reduce job that copies files into hdfs, so at least the copy itself is distributed.

It would be great to skip all this copying and just use s3 directly. All of our input is already there, and that's where we want our output. And once the data's on s3, it's easy to move around, copy, rename, etc.

Hadoop does include s3 support, with two filesystems: the S3 Block Filesystem (s3) and the S3 Native Filesystem (s3n).


The S3 Block Filesystem is basically HDFS, using s3 as the datastore. Files are stored as blocks directly in s3, which means you need to dedicate a bucket to the filesystem and can't use any s3 tools to read or write the data. I believe the filesystem was designed this way to overcome the s3 object size limit, as well as to support renames (the rename support mentioned above wasn't available in the original s3 api). I don't quite get the point of this filesystem now, though. There doesn't seem to be much benefit to using s3, except I guess you don't need the NameNode and DataNode processes up and running.

The S3 Native Filesystem is what you want -- files map directly to real s3 objects, meaning you can use the amazon apis and any s3 tools to work with your job inputs and outputs. This is great!
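
In theory, going straight at s3 is just a matter of using s3n:// paths in your job. Here's a minimal sketch using the old mapred API with identity map/reduce steps -- the bucket names are made up, and the credentials could just as easily live in your hadoop config instead of code:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class S3nJobDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(S3nJobDriver.class);
    conf.setJobName("s3n-example");

    // s3n credentials (these could live in hadoop-site.xml instead)
    conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
    conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

    // identity map/reduce, just to keep the example self-contained
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    // read input straight from s3, and write the results back to s3
    FileInputFormat.setInputPaths(conf, new Path("s3n://my-log-bucket/input/"));
    FileOutputFormat.setOutputPath(conf, new Path("s3n://my-results-bucket/output/"));

    JobClient.runJob(conf);
  }
}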

Unfortunately, it's not quite there yet as a replacement for hdfs. This isn't totally clear from the docs... in fact, the example says to just replace "s3" with "s3n" in your config. So, I spent some time today trying to get it working. It worked great for my map tasks, but at reduce time, I kept getting:

08/11/19 22:19:09 INFO mapred.JobClient: Task Id : attempt_200811192212_0002_r_000000_0, Status : FAILED
Failed to rename output with the exception: java.io.IOException: Not supported
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.rename(NativeS3FileSystem.java:457)
at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:587)
at org.apache.hadoop.mapred.Task.moveTaskOutputs(Task.java:604)
at org.apache.hadoop.mapred.Task.saveTaskOutput(Task.java:561)
at org.apache.hadoop.mapred.JobTracker$TaskCommitQueue.run(JobTracker.java:2314)


I figured: why would there be an s3n filesystem if you couldn't use it? The docs mention that rename isn't supported, but it's not clear that this means you can't use it as an hdfs replacement. Searching on this got me nothing, but after reading HADOOP-336 and other docs more closely, it became clear this wasn't going to work.

Anyway, I thought I'd throw this post together and update the wiki to hopefully save someone else some time.

It looks like support for renaming will be added in 0.19.0, which should then allow you to go all S3(n)! with your hadoop jobs.

Tuesday, November 18, 2008

Try not to crash the IDE...

We use Eclipse as our IDE. We also use Ruby pretty extensively; we do all of our deployments and configuration using Capistrano.

The problem I've been having is with mysterious tabbing. In spite of having set my Eclipse preferences to use spaces for tabs (Preferences -> General -> Editors -> Text Editors -> Insert spaces for tabs), my files edited in the Ruby editor have been saving with tabs.

Huh.

Now this normally might not be a problem, but we're not all on the same version of Eclipse, and for some strange reason, loading files with tabs in them in the Ruby editor seems to crash the IDE.

It turns out that there is yet another preference you need to set to make tabs go away. Under Preferences -> Ruby -> Editor, there is an Indentation preference. Set Tab policy to Spaces only, and you should be good to go.

Here's a handy perl one-liner to find and fix any random tabs you may have lying around in your source (I'm excluding .svn from the search, obviously you can chain together additional exclusions using multiple pipes to grep -v):

perl -i -pe 's/\t/ /gs' `find . -type f | grep -v ".svn" `

Friday, November 14, 2008

Video Standups Rock

It's pretty amazing how well video stand-ups have been working for us. We have one developer, Timo, who lives and works remotely in Hawaii, and we video conference him in every day for our afternoon stand-up. Video stand-ups also work well when we occasionally have a WFH Friday.

We highly recommend doing stand-ups over video if you have part (or all) of your team working remotely!

Monday, November 3, 2008

disk write performance on amazon ec2

For a few different projects here at Bizo, we've relied on Berkeley DB. For key/value storage and lookup, it's incredibly fast.

When using any DB, one of your main performance concerns for writes is going to be disk I/O. So, how fast is the I/O at amazon ec2?

I decided to do a quick test using BerkeleyDB's writetest program. This is a small C program meant to simulate transaction writes to the BDB log file by repeatedly performing the following operations: 1. seek to the beginning of a file, 2. write to the file, 3. flush the write to disk. Their documentation suggests that "the number of times you can perform these three operations per second is a rough measure of the minimum number of transactions per second of which the hardware is capable." You can find more details in their reference guide under Transaction throughput.
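
If you don't want to dig into the C source, the loop being measured is roughly the following (a Java sketch just to illustrate the three operations -- all of the numbers below come from the original writetest program, not from this code):

import java.io.File;
import java.io.RandomAccessFile;

public class WriteLoop {
  public static void main(String[] args) throws Exception {
    File file = new File(args.length > 0 ? args[0] : "writetest.dat");
    int ops = 10000;                // number of seek/write/flush cycles
    byte[] record = new byte[256];  // simulated log record size

    RandomAccessFile raf = new RandomAccessFile(file, "rw");
    long start = System.currentTimeMillis();
    for (int i = 0; i < ops; i++) {
      raf.seek(0);                  // 1. seek to the beginning of the file
      raf.write(record);            // 2. write the record
      raf.getChannel().force(true); // 3. flush the write to disk
    }
    long elapsed = System.currentTimeMillis() - start;
    raf.close();

    System.out.println(ops + " ops in " + elapsed + " ms ("
        + (ops * 1000L / Math.max(elapsed, 1)) + " ops/sec)");
  }
}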

A quick disclaimer: This is not an exhaustive performance test! I am only measuring the above operations using the provided test program on a stock machine setup and single configuration.

For this test, I'm using an image based on Eric Hammond's Ubuntu 8.04 image, on an ec2 small instance.

Anecdotally, I've noticed that writes to /tmp (/dev/sda1) seem a lot faster than writes to /mnt (/dev/sda2, the large disk on an ec2 instance), so we'll be testing both of these, as well as a 100 gig Elastic Block Store (EBS) volume mounted on /vol. All three are formatted as ext3 and mounted with defaults.

I made a small change to the test program (diff here), to print out the data size and file location in its output.

Test


The writetest program was run with -o 10000 (10,000 operations) against each mount point with file sizes: 256, 512, 1024, 2048, 4096, and 8192 (bytes). Each run was repeated 50 times. You can download the test script here.

Results



You can view the raw results here. I hacked together a small perl script (here), to join the rows together for a single run for a single file size. I then imported this data into Numbers, to generate some graphs. You can download both my raw csv files and Numbers spreadsheets here.

On to some graphs!









Conclusion



As you can see from this test, for BerkeleyDB log writes, /dev/sda1 (mounted as /) has the most variance, but it is also clearly faster than any of the other devices. Unfortunately, on an ec2 small instance, this is only 10G. You're expected to do the bulk of your storage on /dev/sda2 (mounted as /mnt), which is much slower. Rounding it out is the EBS volume, which has a ton of cool features, but is slower still.

As a follow-up, it would be interesting to try EBS again using XFS, which most of the EBS guides recommend, due to its ability to freeze file writes for snapshots. I'm not sure if it's any better than ext3 for our BerkeleyDB write operations, but it's worth a shot.

Wednesday, October 22, 2008

Growl Notifications for Ant


Most of our code is written in Java here at Bizo. This means we heavily rely on Ant for building, testing and packaging our applications.

Sometimes I found myself watching the scrolling terminal waiting for a build to complete. Other times, I left the terminal to go off and do some other task while the build was happening, only to realize some minutes later that my build was probably done. What I wanted was a way for Ant to notify me when the build was complete so I didn't have to stare at a terminal or forget about the build. What I wanted was Growl notifications from Ant!

If you don't know what Growl is, you can read about it here. It's basically a notification API for Mac OS X, and it's very widely supported and used (other applications that use it include Adium and Quicksilver, for example). (BTW - all Bizo developers get Macs...)

Doing a little searching turned up this blog post. It was pretty easy to get this task integrated with our build system... And now, growl tells me when my build is done!

Monday, September 29, 2008

'Fluidize' Your Favorite Web Apps

Ever since the advent of tabbed browsing, I've had this nagging problem with having too many open tabs in Firefox. I typically have 15 tabs open at a time. For whatever reason, I seem to have an aversion to closing tabs. In all fairness though, part of this proliferation of tabs has to do with the fact that many of my 'critical' apps are web based. For example, at Bizo, we use the google apps stack (gmail, google docs, google calendar, google sites), fogbugz for project management and bug tracking, and hyperic for monitoring our applications.

There have been numerous times where I've been reading some documentation in one tab and then needed to look at something in gmail. This process requires that I:
  1. find the tab that has gmail on it
  2. do what I need to do in gmail
  3. navigate back to the original tab
Because I have so many tabs open, I found that this process was taking entirely too long and slowing me down.

Not too long ago, I came across Fluid and realized that this might be an answer to my problem. The basic premise of Fluid is that you can elevate the status of any web application into a desktop application by wrapping the web application with a Site Specific Browser (SSB). Brilliant!

Since then, I've 'fluidized' some of my most frequently used web apps and it has eliminated the need to find the browser tab with 'application x' on it. Now, I can just take comfort in the fact that the application is in my dock waiting for me. Sweet.

This is a screenshot of the part of my dock that has been fluidized:
from left to right: gmail, gcal, gdocs, fogbugz, hyperic





I highly recommend Fluid, it's a great little product.

Friday, September 26, 2008

generic method call timeouts using java.lang.reflect.Proxy

In a future post, I hope to share some code and details on a generic wrapper we built for generated Thrift clients. You pass in a generated Thrift class, and it returns an instance that layers connection pooling, fail-over, connection timeouts, and method call timeouts on top of the thrift code, all at runtime, without changing the original thrift service API.

It's pretty cool! Check back later for that.

In this post, I wanted to talk about a small part of that project, a wrapper we built to dynamically add timeouts to the method calls of an instance. I wanted to make this functionality available separately from the thrift wrapper code, so one could easily specify timeouts dynamically, per call.

Timeouts for tasks have been pretty easy to deal with using an ExecutorService since Java 1.5, and this is the timeout mechanism we used.
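
The basic pattern looks something like this (a toy example, not our actual code): submit the call to an executor, wait on the Future with a timeout, and cancel the task if time runs out.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutExample {
  public static void main(String[] args) throws Exception {
    ExecutorService executor = Executors.newSingleThreadExecutor();

    Future<String> future = executor.submit(new Callable<String>() {
      public String call() throws Exception {
        Thread.sleep(50);        // pretend this is a slow remote call
        return "result";
      }
    });

    try {
      // wait at most 10ms for the call to complete
      String result = future.get(10, TimeUnit.MILLISECONDS);
      System.out.println("got: " + result);
    } catch (TimeoutException e) {
      future.cancel(true);       // interrupt the task and give up on the result
      System.out.println("call timed out");
    } finally {
      executor.shutdown();
    }
  }
}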

The interesting part is dynamically adding this behavior to an instance at runtime. I was unaware of it, but since JDK 1.3 there has been a mechanism to do this, the java.lang.reflect.Proxy class.

"Dynamic proxies provide an alternate, dynamic mechanism for implementing many common design patterns, including the Facade, Bridge, Interceptor, Decorator, Proxy (including remote and virtual proxies), and Adapter patterns. While all of these patterns can be easily implemented using ordinary classes instead of dynamic proxies, in many cases the dynamic proxy approach is more convenient and compact and can eliminate a lot of handwritten or generated classes."

From Brian Goetz's excellent article on the use of this class, Decorating with dynamic proxies.

Using this class, we can provide a generic wrapper that implements method call timeouts. Here are some examples:


TimeoutWrapper<Map> wrapper = new TimeoutWrapper(Map.class, new HashMap());
Map m = wrapper.withTimeout(10, TimeUnit.MILLISECONDS);
m.put("key", "value");
m.get("key");


We may want faster gets than puts:

TimeoutWrapper<Map<String, String>> wrapper = new TimeoutWrapper(Map.class, new HashMap());
wrapper.withTimeout(10, TimeUnit.MILLISECONDS).put("key", "value");
String v = wrapper.withTimeout(5, TimeUnit.MILLISECONDS).get("key");


TimeoutWrapper provides a small (configurable) cache, so that it will reuse the same proxy instances if you pass in the same timeout values when using the above form.
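
Under the hood, the invocation handler is basically doing the Future.get trick from above on every proxied method call. Here's a stripped-down sketch of the idea -- the class and method names are just illustrative, and the real TimedInvocationHandler linked below also handles executor reuse, proxy caching, and friendlier exception handling:

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SimpleTimeoutProxy {

  // Returns a proxy for 'iface' that runs every method call on the executor
  // and gives up after the specified timeout.
  @SuppressWarnings("unchecked")
  public static <T> T wrap(final Class<T> iface, final T target,
                           final long timeout, final TimeUnit unit,
                           final ExecutorService executor) {
    InvocationHandler handler = new InvocationHandler() {
      public Object invoke(Object proxy, final Method method, final Object[] args)
          throws Throwable {
        // run the real call on another thread so we can bound how long we wait
        Future<Object> future = executor.submit(new Callable<Object>() {
          public Object call() throws Exception {
            return method.invoke(target, args);
          }
        });
        try {
          return future.get(timeout, unit);   // enforce the timeout
        } catch (TimeoutException e) {
          future.cancel(true);                // interrupt the call
          throw new RuntimeException(method.getName() + " timed out", e);
        } catch (ExecutionException e) {
          // unwrap the reflection layer so callers see the original exception
          Throwable cause = e.getCause();
          throw (cause instanceof InvocationTargetException)
              ? ((InvocationTargetException) cause).getTargetException()
              : cause;
        }
      }
    };
    return (T) Proxy.newProxyInstance(iface.getClassLoader(),
        new Class[] { iface }, handler);
  }
}

With something like this, wrap(Map.class, new HashMap(), 10, TimeUnit.MILLISECONDS, executor) hands you back a Map whose calls fail fast instead of hanging.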

Here's the main code:

TimeoutWrapper.java
TimedInvocationHandler.java

If you want to compile, you'll need some small supporting classes: GenericProxyFactory.java, LRUCache.java.

Monday, September 22, 2008

Capistrano deployment with S3

We recently started using Capistrano for our deployments. It's a great tool and really simplifies doing remote deployments to ec2.

A lot of our projects are java or java web projects, so we need to deploy the result of a build, not the raw source. I really didn't want to store builds alongside the source code in subversion. Before capistrano, I had written a couple of install shell scripts that would fetch the latest project binary from S3, using a bucket name structure like:

${RELEASE_BUCKET}:${PROJECT}/${VERSION}

with 'current' always pointing to the latest version.

There is a capistrano-s3 project that checks out from your scm, creates a .tar.gz, pushes it to s3, and then pushes it out to your servers.

This isn't exactly what I wanted, since it's still missing a build step. I thought I could get by with something pretty simple -- just a scm implementation backed by S3. It turns out it was pretty easy.

Here's the S3 scm implementation. You need to have s3sync installed. Drop this in
$GEMPATH/capistrano-2.0.0/lib/capistrano/recipes/deploy/scm
and you can now use the following:

set :scm, :s3
set :repository, "my-bucket:my-path"

By default it will look in my-bucket:my-path/current/. You can also set

set :branch, "1.12"

and it will look in my-bucket:my-path/1.12/

If your AWS keys aren't available in the environment for s3sync, you can also set these in your capfile:

set :access_key, "my key"
set :secret_key, "my secret"

That's basically it. I'm pretty new to both capistrano and ruby, so any comments, feedback, etc. would be appreciated.

Hello Whirled

At Bizo we believe in sharing... This blog is to share the Bizo engineering experience with the world. We hope you find something interesting, helpful, inspiring or otherwise thought provoking.

-Bizo Engineering Team

Wednesday, September 17, 2008

Are we having fun yet?

This is a test of the bizo developer blogging system...

Thanks for setting this up Larry. Now my blogging career can finally take off!