Tuesday, December 15, 2009

amazon ec2 spot instances

Yesterday Amazon announced EC2 Spot Instances. The idea is that you can bid on unused EC2 instance time. The 'Spot price' is determined periodically by Amazon based on availability and demand for the instances. If your bid is higher than the spot price, you get an instance and only pay the spot price. Of course, your instance may be terminated at any time (whenever the spot price rises above your bid), but the nice thing is that unlike normal ec2 pricing, you aren't charged for the partial hour of usage when Amazon terminates you.

To check out the price history of small linux instances, download the new release of the ec2-api-tools and run:


ec2-describe-spot-price-history --instance-type m1.small -d Linux/UNIX -H


Running this last night, I saw prices that looked like (times PST):



It looks like there's a substantial discount here with prices ranging from $0.025 to $0.035 per hour (the normal ec2 price is $0.085/hr).

Since I'm in the middle of reading How to Cheat at Everything, one of my first thoughts was why not just bid say $0.10/hour? In this way, you're unlikely to get outbid, but you'll probably stand to save significantly for a large part of the day. Now I'm thinking this probably isn't quite a free market... If amazon needs capacity to satisfy reserved instances, or even regular ec2 instances, maybe they'll just kill off these machines to make room.

Still, this is really very cool. A great option for doing a lot of offline batch processing. I hope we start to see support for taking advantage of this type of model in hadoop. It's also exciting to think that one day maybe we'll see something like this across providers -- bid for time across amazon, sun, etc.

Update: Some nice charts by Tim Lossen at cloudexchange.org.

Friday, December 4, 2009

github spam?

I just happened to land on the github recent repositories page, and noticed a ton of spam:



A bunch of different users and projects advertising movie downloads. There's no project content, of course, just a "homepage" that points to a target url...

At first I was thinking, wow, these are some crazy spammers -- using git as a tool for spam! But on closer look, it seems like they're just hitting the website, automating the account signup and new repository actions.

Still, spam on github? Crazy! I guess no website is safe these days. If you're hosting user generated content, you need to think about detecting and blocking spam and automation of user activity.

Monday, November 30, 2009

quick script: open hadoop jobtracker UI with elastic map reduce

If you've ever logged into the hadoop master with amazon's elastic map reduce, you'll see something like:

The Hadoop UI can be accessed via the command: lynx http://localhost:9100/

Great, but lynx?.. not as nice as firefox or safari...

It's easy enough to do some ssh port forwarding so you can use your browser of choice and access the hadoop UI from your machine.

But, after getting tired of typing in the ssh options a bunch of times, I finally put together a short script that automates it a bit. The script takes in the public hostname of your hadoop master (you can get this from elastic-mapreduce --list), then picks a random port number, sets up the ssh forwarding, and opens the page in a new browser window.

I call it hcon for 'hadoop console'. After configuring the script with the path to your emr key file, you run it like:

hcon ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com

Here's the full script, but in case you're curious the magic lines (wrapped) are:

ssh -f -N -o "StrictHostKeyChecking no" \
-L ${LPORT}:localhost:9100 \
-i ${KEYFILE} hadoop@${HOST}
$BROWSER http://localhost:${LPORT}

(Yes, for this, I turn off StrictHostKeyChecking).
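The full script is basically those lines plus a little setup. A minimal sketch (the key file path and browser command here are assumptions; adjust for your setup):

```shell
#!/bin/bash
# hcon sketch: open the hadoop jobtracker UI for an EMR master via an
# ssh tunnel. KEYFILE and BROWSER defaults below are assumptions.
KEYFILE=${KEYFILE:-$HOME/.ssh/emr.pem}
BROWSER=${BROWSER:-open}   # 'open' works on OS X; use firefox etc. elsewhere

pick_port() {
  # pick a random unprivileged local port in 20000-29999
  echo $(( (RANDOM % 10000) + 20000 ))
}

if [ $# -eq 1 ]; then
  HOST=$1
  LPORT=$(pick_port)
  ssh -f -N -o "StrictHostKeyChecking no" \
    -L ${LPORT}:localhost:9100 \
    -i ${KEYFILE} hadoop@${HOST}
  $BROWSER http://localhost:${LPORT}
fi
```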

Anyway, try it out and let me know if it's helpful at all.

Monday, November 2, 2009

Using Hudson to manage crons

We've been using Hudson for several months now to manage our builds -- we probably have 80-90 different projects that it's responsible for. It's an awesome system for continuous integration and testing.


It's also an awesome system for scheduling and managing generic jobs. We've only just begun to use it as a cron server, but it's clear that it has numerous advantages over the more traditional way of using the unix cron service directly.


  • Notification plugins -- Hudson can be easily configured to send email and Jabber notifications when cron jobs start, succeed, or fail. You can also track your scheduled jobs via RSS.

  • Stdout/Stderr logging -- Hudson saves the stdout and stderr from each run automatically.

  • SCM integration -- if you need to update a job, just check the changes into SVN (or whatever SCM system you use). Hudson will automatically pick up the changes the next time your job is run.

  • Nice web interface -- never underestimate the productivity gains from having a good UI. It can be surprisingly tricky to determine exactly which crons are running on a generic Unix box. Not so with Hudson.
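Schedules themselves are specified in a job's "Build periodically" field using the familiar cron-style syntax, so existing crontab entries port over directly, e.g.:

```
# minute hour day-of-month month day-of-week: run nightly at 2:30am
30 2 * * *
```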


At Bizo, we believe that developers should be getting their hands dirty in the operational aspects of their projects -- Hudson gives us an easy interface for managing our scheduled jobs using the same tools that we're familiar with for managing our build processes. Hudson is such a great tool for continuous integration that it's easy to overlook how good it is at the simpler task of managing generic scheduled jobs.

Tuesday, October 20, 2009

Clearing Linux Filesystem Cache

I was doing some performance tuning of our mysql db and was having some trouble consistently reproducing query performance due to IO caching that was occurring in Linux. In case you're wondering, you can clear this cache by executing the following command as root:

echo 1 > /proc/sys/vm/drop_caches
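For repeatable timings it also helps to flush dirty pages first with sync; and note the value controls what gets dropped (1 = page cache, 2 = dentries and inodes, 3 = both). A sketch:

```shell
# flush dirty pages first so pending writeback doesn't skew the next run
sync
# 1 = page cache, 2 = dentries+inodes, 3 = both; writing requires root
if [ "$(id -u)" -eq 0 ]; then
  echo 3 > /proc/sys/vm/drop_caches
fi
```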

Friday, October 16, 2009

bash, errors, and pipes

Our typical pattern for writing bash scripts has been to start off each script with:

#!/bin/bash -e

The -e option will cause the script to exit immediately if a command has exited with a non-zero status. This way your script will fail as early as possible, and you never get into a case where on the surface, it looks like the script completed, but you're left with an empty file, or missing lines, etc.

Of course, -e only applies to "simple" commands, so in practice you can think of the script as terminating immediately if an entire line fails. So a script like:

#!/bin/bash -e
/usr/bin/false || true
echo "i am still running"
will still print "i am still running," and the script will exit with a zero exit status.

Of course, if you wrote it that way, that's probably what you're expecting. And, it's easy enough to change (just change "||" to "&&").

The thing that was slightly surprising to me was how a script would behave using pipes.

#!/bin/bash -e
/usr/bin/false | sort > sorted.txt
echo "i am still running"
If your script is piping its output to another command, it turns out that the return status of a pipeline is the exit status of its last command. So, the script above will also print "i am still running" and exit with a 0 exit status.

Bash provides a PIPESTATUS variable, which is an array containing a list of the exit status values from the pipeline. So, if we checked ${PIPESTATUS[0]} it would contain 1 (the exit value of /usr/bin/false), and ${PIPESTATUS[1]} would contain 0 (exit value of sort). Of course, PIPESTATUS is volatile, so, you must check it immediately. Any other command you run will affect its value.
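For example, copying PIPESTATUS into another array right away preserves the per-stage exit codes before anything else clobbers them:

```shell
#!/bin/bash
false | sort > /dev/null
status=("${PIPESTATUS[@]}")   # copy immediately; the next command overwrites it
echo "false exited ${status[0]}, sort exited ${status[1]}"
# prints: false exited 1, sort exited 0
```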

This is great, but not exactly what I wanted. Luckily, there's another bash option, -o pipefail, which changes how the pipeline's exit status is derived: instead of the status of the last command, it becomes the status of the last command to exit non-zero (or zero if they all succeed). So

#!/bin/bash -e
set -o pipefail
/usr/bin/false | sort > sorted.txt
echo "this line will never execute"

(Note: Linux passes everything after the interpreter as a single argument, so "-e -o pipefail" can't reliably go on the shebang line itself; enable pipefail with set -o pipefail inside the script.)
So, thanks to pipefail, the above script will work as we expect. Since /usr/bin/false returns a non-zero exit status, the entire pipeline will return a non-zero exit status, the script will die immediately because of -e, and the echo will never execute.
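You can see the difference directly in a shell, without the -e wrapper:

```shell
#!/bin/bash
false | sort > /dev/null
echo "default: $?"        # prints: default: 0

set -o pipefail
false | sort > /dev/null
echo "pipefail: $?"       # prints: pipefail: 1
```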

Of course, all of this information is contained in the bash man page, but I had never really run into it / looked into it before, and I thought it was interesting enough to write up.

Monday, October 12, 2009

s3fsr 1.4 released

s3fsr is a tool we built at Bizo to help quickly get files into/out of S3. It's had a few 1.x releases, but by 1.4 we figured it was worth getting around to posting about.

Overview

While there are a lot of great S3 tools out there, s3fsr's niche is that it's a FUSE/Ruby user land file system.

For a command line user, this is handy, because it means you can do:
# mount yourbucket in ~/s3
s3fsr yourbucketname ~/s3

# see the directories/files
ls ~/s3/

# upload
mv ~/local.txt ~/s3/remotecopy.txt

# download
cp ~/s3/remote.txt ~/localcopy.txt
Behind the scenes, s3fsr is talking to the Amazon S3 REST API and getting/putting directory and file content. It will cache directory listings (not file content), so ls/tab completion will be quick after the initial short delay.

S3 And Directory Conventions

A unique aspect of s3fsr, and the specific annoyance it was written to address, is that it understands several different directory conventions used by various S3 tools.

This directory convention problem stems from Amazon's decision to forgo any explicit notion of directories in the API, and instead force everyone to realize that S3 is not a file system but a giant hash table of string key -> huge byte array.

Let's take an example--you want to store two files, "/dir1/foo.txt" and "/dir1/bar.txt" in S3. In a traditional file system, you'd have 3 file system entries: "/dir1", "/dir1/foo.txt", and "/dir1/bar.txt". Note that "/dir1" gets its own entry.

In S3, without tool-specific conventions, storing "/dir1/foo.txt" and "/dir1/bar.txt" really means only 2 entries. "/dir1" does not exist of its own accord. The S3 API, when reading and writing, never parses keys apart by "/", it just treats the whole path as one big key to get/set in its hash table.

For Amazon, this "no /dir1" approach makes sense due to the scale of their system. If they let you have a "/dir1" entry, pretty soon API users would want the equivalent of a "rm -fr /dir1", which, for Amazon, means instead of a relatively simple "remove the key from the hash table" operation, they have to start walking a hierarchical structure and deleting child files/directories as they go.

When the keys are strewn across a distributed hash table like Dynamo, this increases the complexity and makes the runtime nondeterministic.

Which Amazon, being a bit OCD about their SLAs and 99th percentiles, doesn't care for.

So, no S3 native directories.

There is one caveat--the S3 API lets you progressively infer the existence of directories by probing the hash table keys with prefixes and delimiters.

In our example, if you probe with "prefix=/" and "delimiter=/", S3 will then, and only then, split & group the "/dir1/foo.txt" and "/dir1/bar.txt" keys on "/" and return you just "/dir1/" as what the S3 API calls a "common prefix".

Which is kind of like a directory. Except that you have to create the children first, and then the directory pops into existence. Delete the children, and the directory pops out of existence.
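Concretely, this probing happens through the bucket-listing call; a sketch of the request and the interesting part of the response (bucket name hypothetical):

```
GET /?prefix=/&delimiter=/ HTTP/1.1
Host: yourbucket.s3.amazonaws.com

<ListBucketResult>
  ...
  <CommonPrefixes>
    <Prefix>/dir1/</Prefix>
  </CommonPrefixes>
</ListBucketResult>
```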

This brings us to the authors of tools like s3sync and S3 Organizer--their users want the familiar "make a new directory, double click it, make a new file in it" idiom, not a backwards "make the children files first" idiom that differs from what they expect.

So, the tool authors got creative and basically added their own "/dir1" marker entries to S3 when users perform a "new directory" operation to get back to the "directory first" idiom.

Note this is a hack, because issuing a "REMOVE /dir1" to S3 will not recursively delete the child files (to S3, "/dir1" is just a meaningless key with no relation to any other key in the hash table). So now the burden is on the tool to do its own recursive iteration/deletion of the directories.

Which is cool, and actually works pretty well, except that the two primary tools implemented marker entries differently:
  • s3sync created marker entries (e.g. a "/dir1" entry) with hard-coded content that etags (hashes) to a specific value. The known hash makes it easy to distinguish directory entries from file entries when listing S3 entries--since S3 knows nothing about directories, the tool has to infer on its own which keys represent files and which represent directories.
  • S3 Organizer created marker entries as well, but instead of a known etag/hash, they suffixed the directory name, so the key of "/dir1" is actually "/dir1_$folder$". It's then the job of the tool to recognize the suffix as a marker directory entry, strip off the suffix before showing the name to the user, and use a directory icon instead of a file icon.
So, if you use an S3 tool that does not understand these 3rd party conventions, browsing a well-used bucket will likely end up looking odd, with obscure/duplicate entries:
/dir1 # s3sync marker entry file
/dir1 # common prefix directory
/dir1/foo.txt # actual file entry
/dir2_$folder$ # s3 organizer marker entry file
/dir2 # common prefix directory
/dir2/foo.txt # actual file entry
This quickly becomes annoying.

And so s3fsr understands all three conventions, s3sync, S3 Organizer, and common prefixes, and just generally tries to do the right thing.

FUSE Rocks

One final note is that the FUSE project is awesome. Implementing mountable file systems that users can "ls" around in usually involves messy, error-prone kernel integration that is hard to write and, if the file system code misbehaves, can screw up your machine.

FUSE takes a different approach and does the messy kernel code just once, in the FUSE project itself, and then it acts as a proxy out to your user-land, process-isolated, won't-blow-up-the-box process to handle the file system calls.

This proxy/user land indirection does degrade performance, so you wouldn't use it for your main file system, but for scenarios like s3fsr, it works quite well.

And FUSE language bindings like fusefs for Ruby make it a cinch to develop too--s3fsr is all of 280 LOC.

Wrapping up

Let us know if you find s3fsr useful--hop over to the github site, install the gem, kick the tires, and submit any feedback you might have.


Want to be challenged at work?

We've got a few challenges and are looking to grow our (kick ass) engineering team. Check out the opportunities below and reach out if you think you've got what it takes...

Thursday, October 8, 2009

Efficiently selecting random sub-collections.

Here's a handy algorithm for randomly choosing k elements from a collection of n elements (assume k < n)


public static <T> List<T> pickRandomSubset(Collection<T> source, int k, Random r) {
  List<T> toReturn = new ArrayList<T>(k);
  double remaining = source.size();
  for (T item : source) {
    double nextChance = (k - toReturn.size()) / remaining;
    if (r.nextDouble() < nextChance) {
      toReturn.add(item);
      if (toReturn.size() == k) {
        break;
      }
    }
    --remaining;
  }
  return toReturn;
}

The basic idea is to iterate through the source collection only once. For each element, we can compute the probability that it should be selected, which simply equals the number of items left to pick divided by the total number of items left.

Another nice thing about this algorithm is that it also works efficiently if the source is too large to fit in memory, provided you know (or can count) how many elements are in the source.

This isn't exactly anything groundbreaking, but it's far better than my first inclination to use library functions to randomly sort my list before taking a leading sublist.

Wednesday, October 7, 2009

hive map reduce in java

In my last post, I went through an example of writing custom reduce scripts in hive.

Writing a streaming reducer requires a lot of the same work to check for when keys change. Additionally, in java, there's a decent amount of boilerplate to go through just to read the columns from stdin.

To help with this, I put together a really simple little framework that more closely resembles the hadoop Mapper and Reducer interfaces.

To use it, you just need to write a really simple reduce method:
  void reduce(String key, Iterator<String[]> records, Output output);

The helper code will handle all IO, as well as the grouping together of records that have the same key. The 'records' Iterator will run you through all rows that have the key specified in key. It is assumed that the first column is the key. Each element in the String[] record represents a column. These rows aren't buffered in memory or anything, so it can handle any arbitrary number of rows.

Here's the complete example from my reduce example, in java (even shorter than perl).
public class Condenser {
  public static void main(final String[] args) {
    new GenericMR().reduce(System.in, System.out, new Reducer() {
      public void reduce(String key, Iterator<String[]> records, Output output) throws Exception {
        final StringBuilder vals = new StringBuilder();
        while (records.hasNext()) {
          // note we use col[1] -- the key is provided again as col[0]
          vals.append(records.next()[1]);
          if (records.hasNext()) { vals.append(","); }
        }
        output.collect(new String[] { key, vals.toString() });
      }
    });
  }
}

Here's a wordcount reduce example:

public class WordCountReduce {
  public static void main(final String[] args) {
    new GenericMR().reduce(System.in, System.out, new Reducer() {
      public void reduce(String key, Iterator<String[]> records, Output output) throws Exception {
        int count = 0;
        
        while (records.hasNext()) {
          count += Integer.parseInt(records.next()[1]);
        }
        
        output.collect(new String[] { key, String.valueOf(count) });
      }
    });
  }
}


Although the real value is in making it easy to write reducers, there's also support for helping with mappers. Here's my key value split mapper from a previous example:

public class KeyValueSplit {
  public static void main(final String[] args) {
    new GenericMR().map(System.in, System.out, new Mapper() {
      public void map(String[] record, Output output) throws Exception {
        for (final String kvs : record[0].split(",")) {
          final String[] kv = kvs.split("=");
          output.collect(new String[] { kv[0], kv[1] });
        }
      }
    });
  }
}

The full source code is available here. Or you can download a prebuilt jar here.

The only dependency is apache commons-lang.

I'd love to hear any feedback you may have.

Tuesday, October 6, 2009

Simple DB Firefox Plugin -- New Release

I finally got around to updating our open-sourced Simple DB Firefox Plugin, creatively named SDB Tool.

obligatory screen shot

The major highlights include:
  • Runs in Firefox 3.5!
  • Support for "Select" Queries (i.e. Version 2009-04-15 of the API)
  • Lots of UI Tweaks and Refactoring...
Please report any issues here.

Click here to install.

reduce scripts in hive

In a previous post, I discussed writing custom map scripts in hive. Now, let's talk about reduce tasks.

The basics

As before, you are not writing an org.apache.hadoop.mapred.Reducer class. Your reducer is just a simple script that reads from stdin (columns separated by \t) and should write rows to stdout (again, columns separated by \t).

Another thing to mention is that you can't run a reduce without first doing a map.

The rows to your reduce script will be sorted by key (you specify which column this is), so that all rows with the same key will be consecutive. One thing that's kind of a pain with hive reducers is that you need to keep track of when keys change yourself. Unlike a hadoop reducer, where you get a (K key, Iterator<V> values), here you just get row after row of columns.

An example

We'll use a similar example to the map script.

We will attempt to condense a table (kv_input) that looks like:
k1 v1
k2 v1
k4 v1
k2 v3
k3 v1
k1 v2
k4 v2
k2 v2
...

into one (kv_condensed) that looks like:

k1 v1,v2
k2 v1,v2,v3
...

The reduce script

#!/usr/bin/perl                                                                                       

undef $currentKey;
@vals=();

while (<STDIN>) {
  chomp();
  processRow(split(/\t/));
}

output();

sub output() {
  print $currentKey . "\t" . join(",", sort @vals) . "\n";
}

sub processRow() {
  my ($k, $v) = @_;

  if (! defined($currentKey)) {
    $currentKey = $k;
    push(@vals, $v);
    return;
  }

  if ($currentKey ne $k) {
    output();
    $currentKey = $k;
    @vals=($v);
    return;
  }

  push(@vals, $v);
}

Please forgive my perl. It's been a long time (I usually write these in java, but thought perl would make for an easier blog example).

As you can see, a lot of the work goes into just keeping track of when the keys change.

The nice thing about these simple reduce scripts is that it's very easy to test locally, without going through hadoop and hive. Just call your script and pass in some example text separated by tabs. If you do this, you need to remember to sort the input by key before passing into your script (this is usually done by hadoop/hive).
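For example, the whole thing can be simulated in a single shell pipeline. Here awk stands in for condense.pl (same logic, just so the example is self-contained); note the sort before the reducer, exactly as hadoop would do it:

```shell
# local simulation of the hive reduce: sort by key, then condense values
condense_demo() {
  printf 'k2\tv1\nk1\tv1\nk1\tv2\n' | sort |
    awk -F'\t' '
      $1 != k { if (k != "") print k "\t" vals; k = $1; vals = $2; next }
      { vals = vals "," $2 }
      END { if (k != "") print k "\t" vals }'
}
condense_demo
# prints (tab-separated):
# k1	v1,v2
# k2	v1
```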

Reducing from Hive

Okay, now that we have our reduce script working, let's run it from Hive.

First, we need to add our map and reduce scripts:

add file identity.pl;
add file condense.pl;

Now for the real work:






01 from (
02   from kv_input
03   MAP k, v
04   USING './identity.pl'
05   as k, v
06  cluster by k) map_output
07 insert overwrite table kv_condensed
08 reduce k, v
09   using './condense.pl'
10   as k, v
11 ;


This is fairly dense, so I will attempt to give a line-by-line breakdown:

On line 3 we are specifying the columns to pass to our reduce script from the input table (specified on line 2).

As I mentioned, you must specify a map script in order to reduce. For this example, we're just using a simple identity perl script. On line 5 we name the columns the map script will output.

Line 6 specifies the column which is the key. This is how the rows will be sorted when passed to your reduce script.

Line 8 specifies the columns to pass into our reducer (from the map output columns on line 5).

Finally, line 10 names the output columns from our reducer.

(Here's my full hive session for this example, and an example input file).

I hope this was helpful. Next time, I'll talk about some java code I put together to simplify the process of writing reduce scripts.

Monday, October 5, 2009

Developing on the Scala console with JavaRebel

If you're the type of developer who likes to mess around interactively with your code, you should definitely be using the Scala console. Even if you're not actually using any Scala in your code, you can still instantiate your Java classes, call their methods, and play around with the results. Here's a handy script that I stick in the top-level of my Eclipse projects that will start an interactive console with my compiled code on the classpath:

#!/bin/bash

tempfile=`mktemp /tmp/tfile.XXXXXXXXXX`

/usr/bin/java -jar /mnt/bizo/ivy-script/ivy.jar -settings /mnt/bizo/ivy-script/ivyconf.xml -cachepath ${tempfile} > /dev/null

classpath=`cat ${tempfile} | tr -d "\n\r"`

rm ${tempfile}

exec /usr/bin/java -classpath /opt/local/share/scala/lib:target/classes:${classpath} -noverify -javaagent:/opt/javarebel/javarebel.jar scala.tools.nsc.MainGenericRunner

(Since we already use Ivy for dependency management, this script also pulls in the appropriate jar files from the Ivy cache. See this post for more details.)

The javaagent I'm using here is JavaRebel, a really awesome tool that provides automatic code reloading at runtime. Using the Scala console and JavaRebel, I can instantiate an object on the console and test a method. If I get an unexpected result, I can switch back to Eclipse, fix a bug or add some additional logging, and rerun the exact same method back on the console. JavaRebel will automagically detect that the class file was changed and reload it into the console, and the changes will even be reflected in the objects I created beforehand.

The icing on this cake is that Zero Turnaround (the makers of JavaRebel) is giving away free licenses to Scala developers. How awesome is that?

Thursday, September 10, 2009

Running ScalaTest BDD Tests from Eclipse

At Bizo, we're using Scala for a few things here and there. While investigating testing approaches for Scala, I came across ScalaTest and its Behavior Driven Development (BDD) spec approach.

While it's a small thing, I really like the sentence-based it "should do this and that" aspect of the spec approach. You get great readability compared to traditional "testDoThisAndThat" method names.

However, a large downside to the spec approach is that spec tests can't easily be run with a single keyboard shortcut from within Eclipse. The built-in Eclipse JUnit test runner does not understand the describe/it-based test structure.

To solve this, I wrote a class that can be used with JUnit's "RunWith" annotation to bridge the gap between JUnit and ScalaTest. It's not perfect, but you get back the one-shortcut/green-bar runner in Eclipse. So I can definitely see it being handy if we decide to do any spec-based testing here at Bizo.

Wednesday, September 9, 2009

GWT hosted mode on snow leopard

One of the first things I noticed after installing Snow Leopard was that GWT hosted mode no longer worked. You'll see the message "You must use a Java 1.5 runtime to use GWT Hosted Mode on Mac OS X." After spending about 10 minutes convincing myself that I was in fact using jdk1.5 for eclipse, ant, etc., and like, wasn't this working last week? I finally looked at the jdk symlinks in JavaVM.framework and figured out that 1.5 was just pointing to 1.6... interesting.

There was some discussion on the GWT group, along with a proposed fix of downloading someone's packaged leopard JDK and changing the symlinks. Not a great fix...

The Lombardi development team has come up with a great work-around.

I put together a jar with the modified BootStrapPlatform code (contains both .class and .java), or get just the src here.

Here are some step-by-step instructions for getting this working in Eclipse:

  1. add the gwt-dev-mac-snow jar to your Java Build path.

  2. in Java Build Path -> Order and Export, move the gwt-dev-mac-snow jar above the GWT SDK Library.

  3. go to Run->Run Configurations. In Web Applications->(your GWT project), click on Arguments, then add -d32 under VM arguments.


That's it! You should now be able to run GWT hosted mode on Snow Leopard.

Tuesday, August 11, 2009

Setting up AWS keys for Eclipse

One somewhat annoying thing about running JUnit tests in Eclipse is that they do not inherit your system's environment variables. There are good reasons for this, but we pass our AWS credentials to all of our applications via system variable, and it's a pain to add these to every single run configuration that needs them. This gets especially tedious when a significant number of your JUnit tests require AWS access.

As a workaround, you can add "Default VM Arguments" to the JVM you use to run your tests. Simply go to "Preferences->Java->Installed JREs" and edit your default JVM. Right under the JRE name is a space to add default VM arguments. I simply added "-DAWS_SECRET_ACCESS_KEY=foo -DAWS_ACCESS_KEY_ID=bar", and now I no longer need to manually edit individual run configurations.

This method seems a bit hacky to me, but until I can get a global run configuration, it definitely beats manually setting common environment variables for individual tests.

Wednesday, July 22, 2009

Dependency management for Scala scripts using Ivy

I'm quickly becoming a huge fan of Scala scripting. Because Scala is Java-compatible, we can easily use our existing Java code base in scripts. This is especially convenient as we're moving our reporting to Hive, which supports script-based Hadoop streaming for custom Mappers and Reducers.

The one very annoying thing about Scala scripting is managing dependencies. My initial method was to have my bash preamble manually download the required libraries to the current directory and insert them onto the Scala classpath. So, my scripts looked something like this:


#!/bin/sh

if [ ! -f commons-lang.jar ]; then
s3cmd get [s3-location]/commons-lang.jar commons-lang.jar
fi

if [ ! -f google-collect.jar ]; then
s3cmd get [s3-location]/google-collect.jar google-collect.jar
fi

if [ ! -f hadoop-core.jar ]; then
s3cmd get [s3-location]/hadoop-core.jar hadoop-core.jar
fi

exec /opt/local/bin/scala -classpath commons-lang.jar:google-collect.jar:hadoop-core.jar $0 $@

!#
(scala code here)


This method has some rather severe scaling problems as the complexity of the dependency graph increases. I was about to step into the endless cycle of testing my script, finding the missing or conflicting dependencies, and re-editing it to download and include the appropriate files.

Fortunately, there was an easy solution. We're already using Ivy to manage our dependencies in our compiled projects, and Ivy can be run in standalone mode outside of ant. The key option to use is the "-cachepath" command line option, which causes Ivy to write a classpath to the cached dependencies to a specified file. So, now the preamble of my scripts looks like this:


#!/bin/bash

tempfile=`mktemp /tmp/tfile.XXXXXXXXXX`

/usr/bin/java -jar /mnt/bizo/ivy-script/ivy.jar -settings /mnt/bizo/ivy-script/ivyconf.xml -cachepath ${tempfile} > /dev/null

classpath=`cat ${tempfile} | tr -d "\n\r"`

rm ${tempfile}

exec /opt/local/bin/scala -classpath ${classpath} $0 $@

!#
(scala code here)


Now all I need is a standard ivy.xml file living next to my script, and Ivy will automagically resolve all of my dependencies and insert them into the script's classpath for me.

Crisis averted. Life is once again filled with joy and happiness.

Thursday, July 16, 2009

Pruning EBS Snapshots

We've been using Amazon's Elastic Block Storage (EBS) for some time now. In a nutshell, EBS is like a "hard drive for the AWS cloud". You simply create an EBS volume and then mount it on your EC2 instance. You then read/write to it as if it were local storage. For a good intro to EBS, check out this RightScale blog post.

The snapshots feature of EBS is especially handy as it allows you to easily backup the data on your EBS volume. AWS provides an API that allows you to request a snapshot. The API call will return immediately and then, in the background, the backup will occur and eventually be uploaded to S3.

While the snapshots feature is useful, one of the issues that you will likely run into is the snapshot limit. A standard AWS account allows you to have 500 EBS snapshots at any given time. After this limit has been reached, you will no longer be able to create new snapshots. So, you will need to have a strategy to 'prune' (remove) snapshots.

I wasn't able to find any scripts for pruning EBS snapshots on the web so I ended up writing a little Ruby script to accomplish the task.

You can get the script here. It requires the excellent right_aws ruby gem.
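The Ruby script isn't reproduced here, but the core idea can be sketched with plain shell tools and the stock ec2-api-tools. The ec2-describe-snapshots column layout below (id in column 2, start time in column 5) is an assumption from memory, so check it against your own output; head -n -N also requires GNU head:

```shell
# emit the snapshot ids of all but the newest $1 snapshots, given
# ec2-describe-snapshots-style lines on stdin
# (assumed layout: SNAPSHOT <snap-id> <vol-id> <status> <start-time> ...)
snapshots_to_prune() {
  keep=$1
  sort -k5 | head -n "-${keep}" | awk '{ print $2 }'
}

# hypothetical usage:
#   ec2-describe-snapshots | snapshots_to_prune 400 | xargs -n1 ec2-delete-snapshot
```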

Tuesday, July 14, 2009

custom map scripts and hive

First, I have to say that after using Hive for the past couple of weeks and actually writing some real reporting tasks with it, it would be really hard to go back. If you are writing straight hadoop jobs for any kind of report, please give hive a shot. You'll thank me.

Sometimes, you need to perform data transformation in a more complex way than SQL will allow (even with custom UDFs). Specifically, if you want to return a different number of columns, or a different number of rows for a given input row, then you need to perform what hive calls a transform. This is basically a custom streaming map task.

The basics


1. You are not writing an org.apache.hadoop.mapred.Mapper class! This is just a simple script that reads rows from stdin (columns separated by \t) and writes rows to stdout (again, columns separated by \t). It's worth repeating: you shouldn't be thinking in terms of keys and values here, you need to think about columns.

2. You can write your script in any language you want, but it needs to be available on all machines in the cluster. An easy way to do this is to take advantage of the hadoop distributed cache support, and just use add file /path/to/script within hive. The script will then be distributed and can be run as just ./script (assuming it is executable), or 'perl script.pl' if it's perl, etc.

An example


This is a simplified example, but recently I had a case where one of my columns contained a bunch of key/value pairs separated by commas:

k1=v1,k2=v2,k3=v3,...
k1=v1,k2=v2,k3=v3,...
k1=v1,k2=v2,k3=v3,...

I wanted to transform these records into a 2 column table of k/v:

k1 v1
k2 v2
k3 v3
k1 v1
k2 v2
...

I wrote a simple perl script to handle the map, created the 2 column output table, then ran the following:

-- add script to distributed cache
add file /tmp/split_kv.pl

-- run transform
insert overwrite table test_kv_split
select
  transform (d.kvs)
  using './split_kv.pl'
  as (k, v)
from
  (select all_kvs as kvs from kv_input) d
;

As you can see, you can specify both the input and output columns as part of your transform statement.
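The actual map script was perl, but the core of it is just this transformation, sketched below in Java (my names, not the real split_kv.pl): split the kvs column on commas and emit one tab-separated k/v row per pair.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the transform logic behind split_kv.pl (the real script is perl).
// Input: one column of "k1=v1,k2=v2,...".
// Output: one "k<TAB>v" row per pair, as hive's transform expects on stdout.
public class SplitKv {
    public static List<String> splitRow(String kvs) {
        List<String> rows = new ArrayList<String>();
        for (String pair : kvs.split(",")) {
            int eq = pair.indexOf('=');
            if (eq < 0) continue; // skip malformed pairs
            rows.add(pair.substring(0, eq) + "\t" + pair.substring(eq + 1));
        }
        return rows;
    }

    // A streaming script just applies this to every line on stdin:
    public static void main(String[] args) throws Exception {
        java.io.BufferedReader in =
            new java.io.BufferedReader(new java.io.InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String row : splitRow(line.trim())) {
                System.out.println(row);
            }
        }
    }
}
```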

And... that's all there is to it. Next time... a reducer?

Tuesday, July 7, 2009

Load testing with Tsung

One of the big issues with building scalable software is making tests scale along with the application. A high performance web application should be tested under heavy loads, preferably to the breaking point. Of course, now you need a second application that can generate lots of traffic. You could use something simple like httperf; however, this doesn't work so well with complex systems, since you're only hitting one URL at a time.

Enter Tsung. Tsung is a load testing tool written in Erlang (everybody's favorite scalable language) that can not only generate large amounts of traffic, but it can parametrize requests based on data returned by your web application or with data pulled from external files. It also can generate very nice HTML reports using gnuplot.

Here's how we're running Tsung on Ubuntu in EC2:

  1. Start a new instance. We're using an Ubuntu Hardy instance built by Alestic.

  2. Download, configure, compile, and install Erlang.

  3. Get and install the Tsung dependencies: gnuplot and perl5.
  4. Download, configure, compile, and install Tsung.

  5. Install your favorite web server. I prefer Apache HTTPD; others in this office prefer nginx. If you want to be really Erlang-y, install Yaws or Mochiweb.

  6. Configure your ~/.tsung/tsung.xml configuration file for your test. The Tsung user manual has pretty good documentation about how to do this. Note that you do NOT want to use vm-transport for heavy loads: it prevents Erlang from spawning additional virtual machines, which limits the number of concurrent requests you can generate. Skipping vm-transport does require you to set up passwordless ssh access to localhost.

  7. Point your web server at "~/.tsung/log/". Each test you run will log the results in a subdirectory of this location.

  8. Start your test with the "tsung start" command.

  9. Set the report-generating script /usr/lib/tsung/bin/tsung_stats.pl to run in the appropriate log directory every 10 seconds. You can do this via cron or simply by having a "watch" command running in the background.



Now, you can just browse over to your machine to view the latest test report. Tsung exposes all of the statistics you would expect (req/sec, throughput, latency, etc) both in numerical and graphical form. All of the graphs can be downloaded as high quality postscript graphs, too.

If you want to generate truly large amounts of traffic, Tsung supports distributed testing environments (as you might expect from an Erlang testing tool). Just make sure that you have passwordless SSH set up between your test machines and configure the client list in your tsung.xml file appropriately.

Tuesday, June 23, 2009

custom UDFs and hive

We just started playing around with Hive. Basically, it lets you write your hadoop map/reduce jobs using a SQL-like language. This is pretty powerful. Hive also seems to be pretty extendable -- custom data/serialization formats, custom functions, etc.

It turns out that writing your own UDF (user defined function) for use in hive is actually pretty simple.

All you need to do is extend UDF, and write one or more evaluate methods with a hadoop Writable return type. Here's an example of a complete implementation for a lower case function:

package com.bizo.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
  public Text evaluate(final Text s) {
    if (s == null) { return null; }
    return new Text(s.toString().toLowerCase());
  }
}


(Note that there's already a built-in function for this; it's just an easy example.)

As you've probably noticed from the import statements, you'll need to add buildtime dependencies for hadoop and hive_exec.

The next step is to add the jar with your UDF code to the hive classpath. The easiest way I've found to do this is to set HIVE_AUX_JARS_PATH to a directory containing any jars you need to add before starting hive. Alternatively, you can edit $HIVE_HOME/conf/hive-site.xml with a hive.aux.jars.path property. Either way, you need to do this before starting hive. It looks like there's a patch out there to dynamically add/remove jars from the classpath, so hopefully this will get easier soon.

example:
# directory containing any additional jars you want in the classpath
export HIVE_AUX_JARS_PATH=/tmp/hive_aux

# start hive normally
/opt/hive/bin/hive

Once you have hive running, the last step is to register your function:
create temporary function my_lower as 'com.bizo.hive.udf.Lower';

Now, you can use it:
hive> select my_lower(title), sum(freq) from titles group by my_lower(title);

...

Ended Job = job_200906231019_0006
OK
cmo 13.0
vp 7.0


Although it's pretty simple, I didn't see this documented anywhere so I thought I would write it up. I also added it to the wiki.

Thursday, June 11, 2009

Force.com's SOAP/REST library for Google App Engine/Java

As long as I'm reflecting on our Google I/O experiences, I also want to point out what looks like a very useful library from Salesforce. The Force.com Web Services Connector is a toolkit designed to simplify calling WSDL-defined SOAP and REST services. The best part is that they have a version that works on Google App Engine for Java! (Make sure that you use wsc-gae-16_0.jar, not the regular version.)

I haven't had the chance to do a lot of development on GAE/J, but my colleagues have definitely had some headaches getting SOAP and REST calls working around the GAE/J whitelist. Maybe one of them can comment after we give this toolkit a whirl.

Google Visualizations Java Data Source Library

As with any data-oriented company, most of our projects revolve around collecting data, processing data, and exposing data to users. In that third category, we've been moving towards Google Visualizations to draw our pretty graphs and charts. So, while the free Android phone and Google Wave were attracting a lot of attention at Google I/O, from a practical standpoint, I was actually most excited about Google's new Data Source Java Library. We had previously written something similar to this in-house, but we were still working on some of the optional parts of the specification when this library was released.

In a nutshell, Google Visualizations is a Javascript library that draws charts and graphs. The data is inserted in one of three ways: programmatically in Javascript, via a JSON object, or by pointing the Javascript at a Data Source URL. For example, Google spreadsheets have built-in functionality to expose their contents as a Data Source, so you can just point the Javascript at a special URL, and a graph of your spreadsheet's data will pop up on your webpage. If you use the last method, you can use Gadgets to easily create custom dashboards displaying your data.

The Data Source Java Library makes it very easy to implement a Data Source backed by whatever internal data store you might be using -- it's just a matter of creating a DataTable object and populating it with data. The library provides everything else, up to and including the servlet to drop into your web container. (We ended up implementing a Spring controller instead. The library provides helper code for this; I estimate using a Spring controller instead of a servlet cost us four lines of code.)

The best part is that it also implements a SQL-like query language for you, so you can expose your data in different forms (which are required by different visualizations) based on the parameters to the URL you call. Dumping data into JSON objects is very straightforward. Writing a parser and interpreter for queries is a real pain.

The library lets you specify how much of the query language you want to implement and which parts you want the library to worry about. The only (small) complaint I have is that this configuration is rather coarsely defined -- we wanted to support basic column SELECTs (to improve performance on our backend) but have the library handle the aggregation functions (which our backend does not support). It wasn't too tough working around this restriction, although it does cost us a bit of extra parsing (so we can get a copy of the complete query) and column filtering (because both our code and the library process the SELECT clause).

Wednesday, May 20, 2009

new version of s3-simple

Just committed some small changes to the s3-simple library for specifying ACLs, and/or arbitrary request headers/meta-data while storing keys.

Example usage:


S3Store s3 = new S3Store("s3.amazonaws.com", ACCESS_KEY, SECRET_KEY);
s3.setBucket("my-bucket");

// upload an item as public-read
s3.storeItem("test", new String("hello").getBytes(), "public-read");

// upload a js file, with a cache control-header
final Map<String, List<String>> headers = new HashMap<String, List<String>>();
headers.put("Cache-Control", Collections.singletonList("max-age=300, must-revalidate"));
headers.put("Content-Type", Collections.singletonList("application/x-javascript"));

s3.storeItem("test2.js", new String("document.write('hello');").getBytes(), "public-read", headers);


Download it here.

Currently, you can only do this while storing keys, and there's no way to retrieve this data later. Still, it was enough of a pain to get this working correctly with the request signing, so I figured I'd share the code anyway.

Friday, May 8, 2009

Work @ Bizo

We’re looking for an out-of-the-box thinker with a good sense-of-humor and a great attitude to join our product development team. As one of five software engineers for Bizo, you will take responsibility for developing key components of the Bizographic Targeting Platform, a revolutionary new way to target business advertising online. You will be a key player on an incredible team as we build our world-beating, game-changing, and massively-scalable bizographic advertising and targeting platform. In a nutshell, you will be working on difficult problems with cool people.


The Team:

We’re a small team of very talented people (if we don’t say so ourselves!). We use Agile development methodologies. We care about high quality results, not how many hours you’re in the office. We develop on Mac, run on the Cloud, and use Google Apps. We don’t write huge requirements documents or TPS reports. We believe there is no spoon!


The Ideal Candidate:


  • Self-motivated

  • Entrepreneurial / Hacker spirit

  • Track record of achievement

  • Hands on problem solver that enjoys cracking difficult nuts

  • Enjoys working on teams

  • Experience working with highly scalable systems

  • Linux/Unix proficiency

  • Amazon Web Services experience

  • Bachelor's degree in Computer Science or related field – points for advanced degrees

  • Willing to blog about Bizo Dev :)

  • Gets stuff done!

  • Bonus points for MDS (Mad Bocce Skills)




Technical Highlights:


  • Languages: Java (90%), Javascript, Ruby (thinking about Scala, Clojure and Erlang too!)

  • All Clouds, all the time: Amazon Web Services, Google App Engine

  • Frameworks/Libraries: Hadoop, Thrift, Google Collections, Spring



Send me your resume and a cover letter if you are interested in joining our team.

Tuesday, May 5, 2009

Spring MVC on Google App Engine

I've been developing an application on Google App Engine and reached a point where I really wanted to be able to use Spring and Spring MVC on the server side.

To my surprise, I discovered that it was relatively easy to get Spring running on Google App Engine (at least for the basic functionality that I'm using it for). However, I did come across the javax.naming.InitialContextFactory issue.

The root cause of this particular issue is that, under certain circumstances, Spring will attempt to load a bean that depends on javax.naming.InitialContextFactory. The problem is that this class is not on GAE's JRE whitelist, and, therefore, results in a ClassNotFoundException.

For example, this occurs when you're using annotation based configuration and have a line similar to the following in your config:

<context:component-scan base-package="com.bizo.accounting.ui.server" />

I searched the forums to see if anyone else had found a solution. This post discusses a potential solution.

I ended up trying the suggested solution and found that it worked. The solution involves loading a 'dummy' bean with the same id as the bean that depends on javax.naming.InitialContextFactory (prior to the context:component-scan). This tricks the container into thinking that the bean has already been loaded.

For example, I put the following line at the top of my config:

<bean id="org.springframework.context.annotation.internalPersistenceAnnotationProcessor" class="java.lang.String"></bean>

I have no doubt that I'll run into other issues with GAE, but I was pleasantly surprised to find that basic Spring integration works without too much difficulty.

Friday, May 1, 2009

google app engine (java) and s3

After struggling for way too long, I finally (sort of) got app engine talking to s3.

Background


I've used the python app engine before for a few small personal projects, and it rocks. Not having to worry, at all, about physical machines, or deployment is a huge win. And, on top of that, they provide some really nice services right out of the box (authentication, memcache, a data store, etc.). So, when they announced the availability of app engine for java, we were all really excited.

Of course, there are some limitations. No Threads, no Sockets (this really sucks), and not all JRE classes.... BUT, it's probably enough for most applications...

And, it just so happens I'm working on a project where all this sounds okay.

Local Environment


They provide a really nice local development environment. Of course, there's not 100% correlation between things working locally and things working remotely. It's to be expected, but can be a pain.

Some things to watch out for:

Connecting to S3


We normally use jets3t for all of our s3 access. It's a great library. However, it's not a great app engine choice because it uses Threads, and Sockets... It seemed like a big task to modify it to work for app engine... I thought using a simpler library as a base would be better.

The author of jets3t has some s3 sample code he published as part of his AWS book. After making some small changes to get over the XPath problem, I just couldn't get it to work. The code worked fine locally, and it would work when I first deployed to app engine, but after that it would fail with IOException: Unknown... Everything looked pretty straightforward... I even tried setting up a fake s3 server to see if there were some weird issues with headers, or what... nothing...

So, I decided to try out some even simpler approaches. After all, it's just a simple REST service, right? That led me to two different paths that (sort of) worked.

J2ME and J2SE Toolkit for Amazon S3 -- this is a small SOAP library designed for use in a J2ME environment. This works in app engine! At least for GETs (all I tested). It is very basic and only has support for 1MB objects.

S3Shell in Java -- This is a small REST library designed for use as a shell. There's a small bug (mentioned in the comments), and you need to remove some references to com.sun classes (for Base64 encoding), but otherwise it seems to work pretty well! You will have problems using PUTs locally, but it works fine in production.

Putting it all together


I decided to go with the S3Shell code. I had to make a few changes (as mentioned above), but so far so good. I put up a modified copy of the code on github, or you can download the jar directly. This code should work fine as-is on google app engine. As mentioned, there's an issue with local PUTs (Please vote for it!).

The functionality is pretty basic. If you add any other app-engine supported features (would definitely like to get some meta-data support), definitely let me know.

Tuesday, April 28, 2009

Calling SOAP Web Services in Google App Engine

We have taken the plunge and are beginning to develop a couple of internal apps using GWT and Google App Engine (GAE) for Java.

The app that I'm writing needs to make SOAP calls to get external data from a service. The service provides a set of WSDLs and we use Apache Axis to generate code to call this service. However, when I tried using the generated code in the Google Eclipse Plugin environment, I received the following exception:

Caused by: java.security.AccessControlException: access denied (java.net.SocketPermission xxx.xxxx.com resolve)
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:264)
at java.security.AccessController.checkPermission(AccessController.java:427)
at java.lang.SecurityManager.checkPermission(SecurityManager.java:532)
at com.google.appengine.tools.development.DevAppServerFactory$CustomSecurityManager.checkPermission(DevAppServerFactory.java:76)
at java.lang.SecurityManager.checkConnect(SecurityManager.java:1031)
at java.net.InetAddress.getAllByName0(InetAddress.java:1134)
at java.net.InetAddress.getAllByName(InetAddress.java:1072)
at java.net.InetAddress.getAllByName(InetAddress.java:1008)
at java.net.InetAddress.getByName(InetAddress.java:958)
at java.net.InetSocketAddress.<init>(InetSocketAddress.java:124)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.<init>(SSLSocketImpl.java:350)
at com.sun.net.ssl.internal.ssl.SSLSocketFactoryImpl.createSocket(SSLSocketFactoryImpl.java:69)
at org.apache.axis.components.net.JSSESocketFactory.create(JSSESocketFactory.java:92)
at org.apache.axis.transport.http.HTTPSender.getSocket(HTTPSender.java:191)
at org.apache.axis.transport.http.HTTPSender.writeToSocket(HTTPSender.java:404)
at org.apache.axis.transport.http.HTTPSender.invoke(HTTPSender.java:138)
... 45 more


This was annoying, but not a huge surprise as it is well known and documented that GAE has its own JVM implementation which prevents things like the creation of socket connections and threads. I did some investigation to see if others had solved this particular problem, calling a SOAP Web Service from GAE, but didn't really find any good answers.

In any case, I understood that HTTP communication in GAE must occur through either the URL Fetch Service or java.net.URL class. So, I needed to find a way to make Axis use one of these methods instead of opening sockets directly.

After learning more about how Axis works and then doing more searching, I came across SimpleHTTPSender. This class is an Axis Handler that uses only java.net.URL and friends for HTTP communication.

Problem solved.

Now, on to solving the next one... :)

Saturday, April 25, 2009

Public <=> Private DNS names in EC2

Every instance in Amazon's EC2 cloud has both a public and a private DNS name. When dealing with my instances, I often find that I have one of these two names, but really want the other. For example, one of our tools lists the status of our servers by their private DNS names (domU-11-22-33-44-55-66.compute-1.internal). If I need to log on to one of these machines to do something, I need to convert this to a public DNS name (ec2-111-222-333-444.compute-1.amazonaws.com). My solution so far to mapping between public and private DNS names has been to simply write down this information on a glass panel by my desk whenever I start an instance.

Needless to say, this solution works fine whenever I'm dealing with my developer instances from my desk, but it doesn't work so well when I'm working from home or dealing with instances other people have started. We have a few general-purpose tools that will pull down instance metadata, but they're a bit of a pain to use. So, I finally gave in and wrote a simple Ruby script that pulls down all instance metadata and does a simple regex pattern match against the public and private DNS names. Just pass it an initial substring of the private DNS name (e.g., "domU-11-22-33-44-55-66", "domU-11-22-33-44-55-66.compute-1.internal", or simply "domU"), and it will spit out the public DNS name for all matching instances. It works in reverse, too, if you want the private DNS name for a given public DNS name.

Note that you'll need to have the amazon-ec2 gem installed, which you can get by simply running the command "gem install amazon-ec2".
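The matching itself is just a prefix test over the name pairs pulled from the EC2 API. A rough Java equivalent of the Ruby script's core (hypothetical names; the real script fetches the pairs via the amazon-ec2 gem):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the name-mapping logic: given private->public DNS pairs from
// the instance metadata, return the "other" name for every instance whose
// DNS name (public or private) starts with the given prefix.
public class DnsMapper {
    private final Map<String, String> privateToPublic =
        new LinkedHashMap<String, String>();

    public void add(String privateDns, String publicDns) {
        privateToPublic.put(privateDns, publicDns);
    }

    public List<String> lookup(String prefix) {
        List<String> matches = new ArrayList<String>();
        for (Map.Entry<String, String> e : privateToPublic.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                matches.add(e.getValue());       // private -> public
            } else if (e.getValue().startsWith(prefix)) {
                matches.add(e.getKey());         // public -> private
            }
        }
        return matches;
    }
}
```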

Thursday, April 9, 2009

IvyDE: Eclipse plug-in for Ivy

If you're using Eclipse and Ivy (as we do here at Bizo), you definitely want the IvyDE plug-in. IvyDE will create a classpath container containing all of your Ivy dependencies, making it trivial to keep your Eclipse classpath synchronized with your ivy file.

For automatic installation, point Eclipse's software updates at the following URL:

http://www.apache.org/dist/ant/ivyde/updatesite

Once installed, you should go to Eclipse Preferences->Ivy and, in the first text field under "Global settings" point the plug-in at your ivyconf.xml file. While you're there, you can add your organization info at the bottom of the dialog, too. Now, it's simply a matter of right-clicking on your ivy.xml file and selecting "Add Ivy Library" to include all of your Ivy dependencies in the project. Alternatively, you can edit your Java build path and (in the Libraries section) select "IvyDE Managed Dependencies" from the "Add Library..." dialog.

Prior to joining Bizo, I used Maven and m2eclipse for the one-two punch of dependency management and Eclipse integration. With the addition of this plug-in, I now feel like Ant/Ivy/IvyDE gives me all of the convenience of Maven/m2eclipse with a great deal more flexibility. Of course, some of that may be a result of the common-build system we use for all of our projects. Hopefully, we'll have some time to talk about common-build in the near future....

Wednesday, April 8, 2009

'On Demand' javascript reloading with jQuery

So I was writing some "exploratory" javascript the other day. What I mean by "exploratory" is that I wanted to pull some data out of the DOM, but I didn't know the exact location of the data I wanted to retrieve.

I used Firebug to examine the DOM, figure out what I wanted to retrieve, and then wrote some javascript to actually retrieve the data I needed. However, I wanted to be able to perform this cycle without re-loading the page. In other words, I needed a way to be able to reload my javascript file on demand.

One really simple way to do this is to create a button on your page and then call the jQuery.getScript function in the 'onclick' handler. For example:

<input value="Reload Me" onclick="jQuery.getScript('http://localhost/my.js');" type="button">

This function will asynchronously load your javascript file and execute it. The only issue I had in doing this is that it broke Firebug's ability to debug the code. However, it was nice to be able to modify my javascript code in Eclipse and then immediately run it in my browser.

If anyone has other/better ways to do this I'd love to hear about them.

Tuesday, April 7, 2009

Another win for findbugs: is Math.abs broken?

I was wrapping up some code today, and ran into a weirdly interesting findbugs warning: RV_ABSOLUTE_VALUE_OF_HASHCODE.

Bill Pugh has a more in-depth post on his blog from 2006 about this. But, if you've ever written code like:

Object x = bucket[Math.abs(y.hashCode()) % bucket.length];

you have a bug. Or, maybe it's a bug in Math.abs, depending on your point of view.

You would probably be surprised to learn that Math.abs(Integer.MIN_VALUE) == Integer.MIN_VALUE, but it's true (go ahead and try it out).

So, about one time out of 4 billion, you'll get an ArrayIndexOutOfBoundsException, when your object's hashCode returns Integer.MIN_VALUE, and you end up with a negative array index.

This seems like a fairly common bit of code to write. In fact, back when Bill wrote this post, he uncovered this occurring 7 times in eclipse, 6 times in jboss, and twice in tomcat, among others.

This behavior sort of makes sense when you think about it. There's really no other value Math.abs could return for Integer.MIN_VALUE, and, it is, in fact, documented. Still, I'd bet that this bug exists somewhere in code you are using.

Anyway, it's worth reading Bill's post to check out some suggested fixes, and dig a little deeper into this.

Friday, April 3, 2009

stupid simple xml parsing

In the last few months, I've been doing a lot of XML parsing. It's been mostly small little parsers, in both java and python, trying to get some stats on different data sets. This is all on huge data sets, so streaming/stax type parsers. It's also kind of 3rd party data files, so of course the structure is crazy and weird :), or at least, I can't change it...

Anyway, last night I was working on probably my 5th or 6th parser, and man, was it getting repetitive. With these types of parsers, it's a lot of just keeping track of what state you're in, and then shoving it into your model.

Some pseudocode illustrating the repetitive nature of this:

case START_ELEMENT:
  if (name == "company") { inCompany = true; }
  if (name == "domains") { inDomains = true; }
  if (name == "value" && inDomains) { inDomainValue = true; }
case CHARACTERS:
  if (inDomainValue) { company.addDomain(characters); }
case END_ELEMENT:
  if (name == "company") { inCompany = false; }
  if (name == "domains" && inCompany) { inDomains = false; }
  if (name == "value" && inDomains) { inDomainValue = false; }


Anyway, really repetitive... and just a pain to keep testing these similar parsers over and over again.

So, last night, I tried to come up with a super simple generic parser. In fact, I decided that all I really needed was some way to get the Strings I'm interested in into java objects. Initially, my model object had Integers, enums, sub objects. I decided that's too much -- no complex binding, just get the strings out of the XML into Java, and then from there I can form more complex objects if I need to, do filtering, etc.

I've also been wanting to take the time to learn more about java annotations. Of course, I've used them before, but I'd never created my own, or parsed them.

Here's the new model:

@XmlPath("company")
public class Company {
  @XmlPath("name")
  private String name;
  @XmlPath("domains/value")
  private List<String> domains;
  ....
}


Now, you can just write

new SimpleParser(consumer, Company.class).parse(in);

The parser will inspect the Company class' annotations to figure out which paths it cares about, then internally keep a simple queue of its state. Each time it comes across a path the class cares about, it will either set the value or add to the List of values, depending on the field type. Again, only strings! Once it reaches the end element, it will pass the last constructed object off to the consumer, then create a new bean.
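The annotation inspection itself is plain reflection. A minimal, self-contained sketch (my names throughout, apart from @XmlPath which mirrors the model above) of how a parser can discover which paths a class cares about:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

public class AnnotationDemo {
    // The annotation: retained at runtime so the parser can read it.
    @Retention(RetentionPolicy.RUNTIME)
    @Target({ElementType.TYPE, ElementType.FIELD})
    public @interface XmlPath {
        String value();
    }

    // A model class like the one above.
    @XmlPath("company")
    public static class Company {
        @XmlPath("name")
        private String name;
        @XmlPath("domains/value")
        private java.util.List<String> domains;
    }

    /** Maps each annotated field name to the XML path it is bound to. */
    public static Map<String, String> pathsFor(Class<?> clazz) {
        Map<String, String> paths = new LinkedHashMap<String, String>();
        for (Field f : clazz.getDeclaredFields()) {
            XmlPath p = f.getAnnotation(XmlPath.class);
            if (p != null) {
                paths.put(f.getName(), p.value());
            }
        }
        return paths;
    }
}
```

From there it's just bookkeeping: push/pop element names as the stax events arrive, and when the current path matches one of these values, capture the characters into the right field.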

Anyway, surprisingly, it worked pretty well! I realize this isn't going to solve all XML parsing needs, but it seems like for 90% of my use cases, it's "good enough."

I did a few searches, and didn't really find anything as simple, although StaXMate looks pretty interesting...

I'd love to get some feedback on this. I wonder if there's already something else out there? Or a better way to do this?

I put the code up on github, as stupid-xml. Quick warning: this is just a late night experiment. It's not production code. I did almost no checking of anything, there are no comments, it's probably crazy inefficient, and I'm probably doing a ton of dumb things. If so, please feel free to fork the code and submit changes, or send feedback, etc.

Monday, March 30, 2009

Exposing SVN version and build time with Ant

As a developer, I find that an application's most useful version number is its SVN version; once I have that number, I can head over to the repository and figure out exactly what code is running. This is extremely useful both during development (Did I upload the correct .war to my staging server?) and for ongoing maintenance (What version is running on this server that needs to be updated?)

Fortunately, it's pretty easy to configure Ant to automatically inject the SVN version into your application every time you build it. Let's say that you have a particular file (/WEB-INF/jsp/admin.jsp) that should contain the SVN version number.

We take advantage of svntask, a nice simple library that lets Ant use SVN. Download/unzip the project files, place them somewhere convenient in your project, and add the following to your Ant build file: (modifying the fileset directory to where you put the svntask libraries)
<target name="version">
<typedef resource="com/googlecode/svntask/svntask.xml">
<classpath>
<fileset dir="${lib.dir}/buildtime">
<include name="svnkit.jar" />
<include name="svntask.jar"/>
</fileset>
</classpath>
</typedef>

<svn>
<info path="." revisionProperty="revisionVersion" />
</svn>

<property name="svn.version" value="${revisionVersion}" />
</target>

This stores the SVN version number of your project root in the "svn.version" Ant property. Now, it's just a matter of copying the correct file from the source web directory to the target web directory before you package your .war file: (The following will require some modification depending on how you normally package your wars.) We use Ant's copy/filter tasks to replace all occurrences of the token "@version@" in admin.jsp with the SVN version number.
<target name="war.pre" depends="version">
<copy file="${src.web.dir}/WEB-INF/jsp/admin.jsp"
tofile="${target.war.expanded.dir}/WEB-INF/jsp/admin.jsp"
overwrite="true">
<filterset>
<filter token="version" value="${svn.version}" />
</filterset>
</copy>
</target>

Now, the following snippet in admin.jsp
<div id="footer">
<p>Build Number @version@</p>
</div>

becomes
<div id="footer">
<p>Build Number 1692</p>
</div>

This is good enough for production releases, where the packaged application will (or should) be compiled straight from the repository. However, during development, I often won't be checking in changes between builds. So, to tell those builds apart, I find it useful to include the build timestamp as well, which I can get using Ant's tstamp task. This isn't a perfect solution, but it's enough to give me some reassurance that I didn't accidentally deploy an old build to my local server.
<target name="war.pre" depends="version">
  <tstamp>
    <format property="war.tstamp" pattern="yyyy-MM-dd HH:mm:ss z" />
  </tstamp>
  <copy file="${src.web.dir}/WEB-INF/jsp/admin.jsp"
        tofile="${target.war.expanded.dir}/WEB-INF/jsp/admin.jsp"
        overwrite="true">
    <filterset>
      <filter token="version" value="${svn.version}" />
      <filter token="time" value="${war.tstamp}" />
    </filterset>
  </copy>
</target>

Now, inside of admin.jsp, I can include the following snippet
<div id="footer">
  <p>Build Number @version@ (@time@)</p>
</div>

and Ant will transform this into
<div id="footer">
  <p>Build Number 1692 (2009-03-30 11:20:15 PDT)</p>
</div>

which is exactly what I want.

Wednesday, March 25, 2009

Dude. Sweet.

Larry tipped me off to the new AWS Toolkit for Eclipse. This looks like it could be pretty awesome for integrating/viewing our EC2 instances.

Friday, March 20, 2009

super lightweight cms

We wanted to roll something out that would allow us to update some areas of our site more easily, and make the content a little more dynamic. Specifically, the News and Announcements page and the Behind the Scenes page.

We've all had experiences with big, crazy CMS solutions that do a million things, none of them all that well, and never exactly what you want.

Also, we're a really small team, so we didn't want to set something up that would require a lot of maintenance, or even require a new service or machine.

For "Behind the Scenes", we're already using twitter for company updates. I think this is a great solution for posting short news items, or links, for any site.

The next step was to integrate with Flickr for our "Around the Office" photos. We set up a public, invite-only Bizo group. Anyone at Bizo can post photos to this group, and they'll show up on our site. Flickr handles the authentication/authorization and content storage and hosting, and provides a decent API for getting content onto the site.

For "News and Announcements", we wanted a way to be able to easily update any news or press releases, quickly, and without requiring a site release.

We ended up creating a simple blogspot blog to host the content. Blogs/RSS seemed like a pretty close fit. You already have the concept of multiple published items, with dates, titles, and content. Using the excellent rome java library, we pull the content in on the backend, do a little content parsing, then render the results.

The "content parsing" is a little hacky. We wanted slightly more structure around our entries than you get from an RSS feed. So, we expect all of the posts to be formatted like "image (optional), text, read more link". Blogger lets you set up Post Templates, so, it's actually not that bad. When you go "super lightweight", you have to expect to make some tradeoffs, and this seemed like one worth making.
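Rome does the heavy lifting of fetching and parsing the feed for us, but to make the approach concrete, here's a minimal sketch using only the JDK's DOM parser. The feed content and item data here are made up and hardcoded; in the real setup you'd pull the blogspot feed URL instead:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class FeedSketch {
    // Hardcoded stand-in for the feed we'd fetch in production.
    static final String RSS =
        "<rss version=\"2.0\"><channel>"
      + "<item><title>Bizo launches new product</title>"
      + "<link>http://example.com/news/1</link></item>"
      + "<item><title>Press release</title>"
      + "<link>http://example.com/news/2</link></item>"
      + "</channel></rss>";

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(RSS.getBytes(StandardCharsets.UTF_8)));
        // Each <item> becomes one news entry on the page.
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String title = item.getElementsByTagName("title").item(0).getTextContent();
            String link = item.getElementsByTagName("link").item(0).getTextContent();
            System.out.println(title + " -> " + link);
        }
    }
}
```

Rome gives you the same loop over typed SyndEntry objects, plus format autodetection (RSS 0.9x/1.0/2.0, Atom), which is why it's worth the dependency.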

The other nice thing about externalizing your content like this is that you can start pulling it into other places. Last night we launched a new version of our homepage with a flash ticker, that pulls in these news items from the same place.

So there you go: a free, lightweight, zero infrastructure CMS recipe.

Thursday, February 26, 2009

Contradictions

So I've got a need to do some work in Rails. Since I've not really used it before, I decided to go through the nice tutorial at Rubyonrails.org. I'm running merrily along when:

% rake db:create
Rails requires RubyGems >= 1.3.1 (you have 1.2.0). Please `gem update --system` and try again.


Okay, fine. I try it...

% sudo gem update --system
Updating RubyGems
Nothing to update


Huh. Really?

I found this helpful tidbit about updating RubyGems on Mac OS X.

% sudo gem install rubygems-update
Successfully installed rubygems-update-1.3.1
1 gem installed
% sudo update_rubygems
Installing RubyGems 1.3.1
[... a bunch of stuff ...]
------------------------------------------------------------------------------
RubyGems installed the following executables:
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/gem

If `gem` was installed by a previous RubyGems installation, you may need
to remove it by hand.


Well, that definitely looks more promising!

% gem --version
1.3.1

Sweet.

Tuesday, February 17, 2009

java keyword challenge

Found this C++ coding exercise on twitter this morning: write a standards-conforming C++ program containing a sequence of at least ten different consecutive keywords not separated by identifiers, operators, punctuation characters, etc.

Pretty fun read.

I thought it would be interesting to try and come up with something similar in Java. It seems like I'm constantly adding "final" everywhere, so it can't be that hard, right?

Here's the list of the Java Language Keywords.

Unfortunately, in Java, we don't have anything we can repeat, like sizeof. Also, some Java keywords (do, while) require non-keyword tokens ({, }) between them, where their C++ equivalents do not.

Anyway, let's give it a shot:

private static transient volatile boolean b;
private static transient final boolean b2 = true;

5 sequential keywords in a member declaration.

Okay, method declarations:

public static synchronized final native void a();
public static synchronized final strictfp void b() { }

I can't say I've ever used native or strictfp, but that gets us to 6 sequential keywords.
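For anyone who wants to verify the count, here's a complete class (the class and method names are made up) that compiles and runs with both the 5-keyword field and the 6-keyword strictfp method in place. The native variant is left out since it would need a JNI implementation to actually run:

```java
public class KeywordRun {
    // 5 consecutive keywords: private static transient volatile boolean
    private static transient volatile boolean b;

    // 6 consecutive keywords: public static synchronized final strictfp void
    public static synchronized final strictfp void flip() { b = true; }

    public static void main(String[] args) {
        flip();
        System.out.println("b = " + b);
    }
}
```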

There aren't as many sequential keywords we can use for class declarations:

private static final class c { }
private abstract static class d { }

That's only 4 sequential keywords.

So 6 sequential java keywords. That's my max. Can anyone do better?

Wednesday, January 28, 2009

hadoop job visualization

Last week we had our first hack day at bizo.

We run a ton of hadoop jobs here, many of them multi-stage and completely automated, including machine startup/shutdown, copying data to/from s3, etc.

I wanted to put something together that would help us to visualize the job workflow. I was hoping that by doing this it would give us some insight into how the jobs could be optimized in terms of time and cost.

Here's a generated graph (scaled down a bit, click for full size) for one of our daily jobs:



The program runs through our historical job data and constructs a workflow based on job dependencies. It also calculates cost based on the machine type, number of machines, and runtime. Amazon's pricing is worth mentioning: each instance type has a per-hour price, and there is no partial-hour billing. So if you use a small machine for 1 hour, it's 10 cents; if you use it for 5 minutes, it's also 10 cents. If you use it for 61 minutes, that's 20 cents (2 hours).
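The round-up-to-whole-hours rule is the whole game here, so as a quick sketch of the billing math (the $0.10/hour rate is the 2009 small-instance price mentioned above; the helper name is made up):

```java
public class Ec2Cost {
    /** Amazon bills whole hours: any partial hour is rounded up. */
    static double cost(int minutes, double hourlyRate) {
        int billedHours = (minutes + 59) / 60;  // integer ceiling of minutes/60
        return billedHours * hourlyRate;
    }

    public static void main(String[] args) {
        System.out.printf("%.2f%n", cost(5, 0.10));   // 5 minutes still bills a full hour
        System.out.printf("%.2f%n", cost(60, 0.10));  // exactly one hour
        System.out.printf("%.2f%n", cost(61, 0.10));  // 61 minutes bills as 2 hours
    }
}
```

This is why shaving a job from 61 minutes down to 59 halves its cost, while shaving it from 59 to 30 saves nothing.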

As we can see above, a run of this entire workflow cost us $5.40. You can also see that there are a number of steps where we're spending considerably less than an hour, so if we don't care about total runtime, we can save money by reducing the machine size (here we're using all medium machines), or running with fewer machines.

I think the workflow visualization is interesting because it shows that total machine time isn't really what matters. Since a number of the tasks can run in parallel, it's the runtime of the longest path through the graph that determines how long you wait for the job to complete. In this example, even though we're using 176 minutes of machine runtime, we're only really waiting for 137 minutes.

This means that there are cases where we can spend less money and not affect the overall runtime. You can see that pretty clearly in the example above. "ta_dailyReport" only takes 16 minutes, but on the other branch we're spending 90 minutes on "ta_monthlyAgg" and "ta_monthlyReport." So, if we can spend less money on ta_dailyReport by using fewer machines or smaller machines, as long as we finish in less than 90 minutes we're not slowing down the process.
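The "longest path" idea is just a critical-path computation over the dependency DAG. Here's a sketch: the 16-minute ta_dailyReport and the 90-minute monthly branch come from the post, but the other durations and the exact dependency edges are made up for illustration:

```java
import java.util.*;

public class CriticalPath {
    // Task durations in minutes (partly hypothetical, see above).
    static Map<String, Integer> duration = new HashMap<>();
    // deps.get(t) = tasks that must finish before t can start.
    static Map<String, List<String>> deps = new HashMap<>();
    static Map<String, Integer> memo = new HashMap<>();

    /** Earliest finish time of t = duration(t) + latest finish among its deps. */
    static int finish(String t) {
        Integer cached = memo.get(t);
        if (cached != null) return cached;
        int start = 0;
        for (String d : deps.getOrDefault(t, List.of())) {
            start = Math.max(start, finish(d));
        }
        int f = start + duration.get(t);
        memo.put(t, f);
        return f;
    }

    public static void main(String[] args) {
        duration.put("ta_dailyAgg", 30);
        duration.put("ta_dailyReport", 16);
        duration.put("ta_monthlyAgg", 45);
        duration.put("ta_monthlyReport", 45);
        deps.put("ta_dailyReport", List.of("ta_dailyAgg"));
        deps.put("ta_monthlyAgg", List.of("ta_dailyAgg"));
        deps.put("ta_monthlyReport", List.of("ta_monthlyAgg"));

        int wallClock = 0, machineTime = 0;
        for (String t : duration.keySet()) {
            wallClock = Math.max(wallClock, finish(t));
            machineTime += duration.get(t);
        }
        System.out.println("machine time: " + machineTime);
        System.out.println("wall clock: " + wallClock);
    }
}
```

With these numbers you pay for 136 machine-minutes but only wait 120: ta_dailyReport finishes at minute 46, well inside the 120-minute monthly branch, so it has 74 minutes of slack to absorb a cheaper, slower configuration.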

Yesterday, I decided to play around with the machine configurations based on what I learned from the above graph. Here are the results:



For this job we don't care so much about total runtime, within reason, so I made a few small machine changes at a few different steps.

Even though we're spending an extra 45 minutes of machine time, the longest path is only 13 minutes longer. So, the entire job only takes an extra 13 minutes to run, and we save $1.00 (an 18% savings!).

As you can see, there are still a number of tasks using less than a full hour, so we could cut costs even further if we were willing to let the overall runtime slip a bit more.

All in all, a pretty successful hackday. My co-workers also worked on some pretty cool stuff, which hopefully they will post about soon.

A quick note on the graphs, for those who are interested. My graphing program simply generated Graphviz files as its output. Graphviz is a totally awesome, simple text-based graph description language for constructing graphs like these. For a really quick introduction, check out this article. Finally, if you're interested, you can check out my raw .gv files here and here.

Tuesday, January 13, 2009

Is a Square a Rectangle?

Is a Square a Rectangle?

Really great breakdown of thinking about "is a square a rectangle" in the OO sense, and thinking about designing with Inheritance.