Wednesday, July 22, 2009

Dependency management for Scala scripts using Ivy

I'm quickly becoming a huge fan of Scala scripting. Because Scala is Java-compatible, we can easily use our existing Java code base in scripts. This is especially convenient as we're moving our reporting to Hive, which supports script-based Hadoop streaming for custom Mappers and Reducers.

The one very annoying thing about Scala scripting is managing dependencies. My initial method was to have my bash preamble manually download the required libraries to the current directory and insert them onto the Scala classpath. So, my scripts looked something like this:


#!/bin/sh

if [ ! -f commons-lang.jar ]; then
  s3cmd get [s3-location]/commons-lang.jar commons-lang.jar
fi

if [ ! -f google-collect.jar ]; then
  s3cmd get [s3-location]/google-collect.jar google-collect.jar
fi

if [ ! -f hadoop-core.jar ]; then
  s3cmd get [s3-location]/hadoop-core.jar hadoop-core.jar
fi

exec /opt/local/bin/scala -classpath commons-lang.jar:google-collect.jar:hadoop-core.jar "$0" "$@"

!#
(scala code here)


This method has some rather severe scaling problems as the complexity of the dependency graph increases. I was about to step into the endless cycle of testing my script, finding the missing or conflicting dependencies, and re-editing it to download and include the appropriate files.

Fortunately, there was an easy solution. We're already using Ivy to manage our dependencies in our compiled projects, and Ivy can be run in standalone mode outside of Ant. The key piece is the "-cachepath" command line option, which tells Ivy to write a classpath referencing the cached dependency jars to a file you specify. So, now the preamble of my scripts looks like this:


#!/bin/bash

tempfile=`mktemp /tmp/tfile.XXXXXXXXXX`

/usr/bin/java -jar /mnt/bizo/ivy-script/ivy.jar -settings /mnt/bizo/ivy-script/ivyconf.xml -cachepath ${tempfile} > /dev/null

classpath=`cat ${tempfile} | tr -d "\n\r"`

rm ${tempfile}

exec /opt/local/bin/scala -classpath "${classpath}" "$0" "$@"

!#
(scala code here)


Now all I need is a standard ivy.xml file living next to my script, and Ivy will automagically resolve all of my dependencies and insert them into the script's classpath for me.
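
For completeness, here is roughly what an ivy.xml sitting next to the script might look like for the jars from the first example. The organisation, module, and revision values below are placeholders; the exact coordinates depend on the repositories and conventions set up in your ivyconf.xml.

<ivy-module version="2.0">
  <info organisation="com.example" module="reporting-script"/>
  <dependencies>
    <dependency org="commons-lang" name="commons-lang" rev="2.4"/>
    <dependency org="com.google.collections" name="google-collections" rev="1.0"/>
    <dependency org="org.apache.hadoop" name="hadoop-core" rev="0.20.0"/>
  </dependencies>
</ivy-module>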

Crisis averted. Life is once again filled with joy and happiness.

Thursday, July 16, 2009

Pruning EBS Snapshots

We've been using Amazon's Elastic Block Storage (EBS) for some time now. In a nutshell, EBS is like a "hard drive for the AWS cloud". You simply create an EBS volume and then mount it on your EC2 instance. You then read/write to it as if it were local storage. For a good intro to EBS, check out this RightScale blog post.

The snapshots feature of EBS is especially handy as it allows you to easily backup the data on your EBS volume. AWS provides an API that allows you to request a snapshot. The API call will return immediately and then, in the background, the backup will occur and eventually be uploaded to S3.

While the snapshots feature is useful, one of the issues that you will likely run into is the snapshot limit. A standard AWS account allows you to have 500 EBS snapshots at any given time. After this limit has been reached, you will no longer be able to create new snapshots. So, you will need to have a strategy to 'prune' (remove) snapshots.

I wasn't able to find any scripts for pruning EBS snapshots on the web so I ended up writing a little Ruby script to accomplish the task.

You can get the script here. It requires the excellent right_aws ruby gem.

Tuesday, July 14, 2009

custom map scripts and hive

First, I have to say that after using Hive for the past couple of weeks and actually writing some real reporting tasks with it, it would be really hard to go back. If you are writing straight Hadoop jobs for any kind of report, please give Hive a shot. You'll thank me.

Sometimes, you need to perform data transformation in a more complex way than SQL will allow (even with custom UDFs). Specifically, if you want to return a different number of columns, or a different number of rows for a given input row, then you need to perform what Hive calls a transform. This is basically a custom streaming map task.

The basics


1. You are not writing an org.apache.hadoop.mapred.Mapper class! This is just a simple script that reads rows from stdin (columns separated by \t) and writes rows to stdout (again, columns separated by \t). It's worth repeating: don't think in terms of keys and values; think in terms of columns.

2. You can write your script in any language you want, but it needs to be available on all machines in the cluster. An easy way to do this is to take advantage of Hadoop's distributed cache support and just use add file /path/to/script within Hive. The script will then be distributed and can be run as just ./script (assuming it is executable), or 'perl script.pl' if it's Perl, etc.

An example


This is a simplified example, but recently I had a case where one of my columns contained a bunch of key/value pairs separated by commas:

k1=v1,k2=v2,k3=v3,...
k1=v1,k2=v2,k3=v3,...
k1=v1,k2=v2,k3=v3,...

I wanted to transform these records into a 2 column table of k/v:

k1 v1
k2 v2
k3 v3
k1 v1
k2 v2
...

I wrote a simple Perl script to handle the map, created the two-column output table, then ran the following:

-- add script to distributed cache
add file /tmp/split_kv.pl

-- run transform
insert overwrite table test_kv_split
select
transform (d.kvs)
using './split_kv.pl'
as (k, v)
from
(select all_kvs as kvs from kv_input) d
;

As you can see, you can specify both the input and output columns as part of your transform statement.
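
The actual split_kv.pl isn't shown here, but to give a feel for what a transform script looks like, here is a rough sketch of an equivalent written as a Scala script (in the spirit of the first post above). This is an illustration rather than the original Perl, and it assumes the scala launcher is installed at the same path on every node in the cluster.

#!/bin/sh
exec /opt/local/bin/scala "$0" "$@"
!#

// Hive hands the selected columns to the script on stdin, tab-separated,
// one row per line. Here there is a single input column (the
// "k1=v1,k2=v2,..." string), and we emit one key<TAB>value row per pair.
import java.io.{BufferedReader, InputStreamReader}

val in = new BufferedReader(new InputStreamReader(System.in))
var line = in.readLine()
while (line != null) {
  for (pair <- line.trim.split(",")) {
    val kv = pair.split("=", 2)
    if (kv.length == 2) println(kv(0) + "\t" + kv(1))
  }
  line = in.readLine()
}

Each line printed to stdout becomes one row of test_kv_split, with output columns separated by tabs.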

And... that's all there is to it. Next time... a reducer?

Tuesday, July 7, 2009

Load testing with Tsung

One of the big issues with building scalable software is making tests scale along with the application. A high performance web application should be tested under heavy loads, preferably to the breaking point. Of course, now you need a second application that can generate lots of traffic. You could use something simple like httperf; however, this doesn't work so well with complex systems, since you're only hitting one URL at a time.

Enter Tsung. Tsung is a load testing tool written in Erlang (everybody's favorite scalable language) that can not only generate large amounts of traffic, but can also parameterize requests with data returned by your web application or pulled from external files. It can also generate very nice HTML reports using gnuplot.

Here's how we're running Tsung on Ubuntu in EC2:

  1. Start a new instance. We're using an Ubuntu Hardy instance built by Alestic.

  2. Download, configure, compile, and install Erlang.

  3. Get and install the Tsung dependencies: gnuplot and perl5.
  4. Download, configure, compile, and install Tsung.

  5. Install your favorite web server. I prefer Apache HTTPD; others in this office prefer nginx. If you want to be really Erlang-y, install Yaws or Mochiweb.

  6. Configure your ~/.tsung/tsung.xml configuration file for your test. The Tsung user manual has pretty good documentation on how to do this (a minimal example follows this list). Note that you do NOT want to use the vm-transport option for heavy loads, as it prevents Erlang from spawning additional virtual machines, which limits the number of concurrent requests you can generate. Disabling it does require you to set up passwordless SSH access to localhost.

  7. Point your web server at "~/.tsung/log/". Each test you run will log the results in a subdirectory of this location.

  8. Start your test with the "tsung start" command.

  9. Set the report-generating script /usr/lib/tsung/bin/tsung_stats.pl to run in the appropriate log directory every 10 seconds. You can do this via cron or simply by leaving a "watch" command running in the background.
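
For reference (see step 6), a minimal tsung.xml skeleton looks something like the following. The hostnames, rates, durations, and DTD path are placeholders, and attribute names can vary slightly between Tsung versions, so double-check against the manual and the DTD shipped with your install.

<?xml version="1.0"?>
<!DOCTYPE tsung SYSTEM "/usr/share/tsung/tsung-1.0.dtd">
<tsung loglevel="notice" version="1.0">

  <!-- use_controller_vm="false" lets Tsung spawn extra Erlang VMs (over ssh) -->
  <clients>
    <client host="localhost" use_controller_vm="false" maxusers="30000"/>
  </clients>

  <!-- the web application under test -->
  <servers>
    <server host="your-app-server" port="80" type="tcp"/>
  </servers>

  <!-- phase 1: add 10 new users per second for 10 minutes -->
  <load>
    <arrivalphase phase="1" duration="10" unit="minute">
      <users arrivalrate="10" unit="second"/>
    </arrivalphase>
  </load>

  <!-- each simulated user just requests the front page -->
  <sessions>
    <session name="front-page" probability="100" type="ts_http">
      <request> <http url="/" method="GET" version="1.1"/> </request>
    </session>
  </sessions>

</tsung>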



Now, you can just browse over to your machine to view the latest test report. Tsung exposes all of the statistics you would expect (requests/sec, throughput, latency, etc.) in both numerical and graphical form. All of the graphs can also be downloaded as high-quality PostScript files.

If you want to generate truly large amounts of traffic, Tsung supports distributed testing environments (as you might expect from an Erlang testing tool). Just make sure that you have passwordless SSH set up between your test machines and configure the client list in your tsung.xml file appropriately.