Thursday, October 13, 2011

Advanced Amazon Web Services Meetup


Bizo Engineering, with the help of Amazon Web Services, has started a new meetup in San Francisco that focuses on advanced Amazon Web Services topics. The main idea is that there are many companies who have been operating on AWS for a number of years and have significant experience, but don't have a forum to discuss more advanced issues and architectures.

The next meetup is on Tuesday October 18th at 7pm in San Francisco (more details). We have speakers from Bizo, Twilio and Amazon Web Services.

We look forward to meeting other AWS companies and sharing stories from the trenches. Hope to see you there!

Monday, October 10, 2011

the story of HOD : ahead of its time, obsolete at launch

Last week we shut down an early part of the bizo infrastructure : HOD (Hadoop on Demand). I thought it might be fun to look back on this project a bit.

We've been using AWS as long as bizo has been around, since early 2008, and Hadoop has always been a big part of that. When we first started, we were mostly using a shared hadoop cluster. This was kind of a pain for job scheduling, but was also mostly wasteful during off-peak hours… Thus, HOD was born.

From its documentation, "The goal of HOD is to provide an on demand, scalable, sandboxed infrastructure to run Hadoop jobs." Sound familiar? HOD was developed over late September and October of 2008, and launched for internal use on December 12th, 2008. Amazon announced EMR in April of 2009. It's amazing how similar they ended up being… especially since we had no knowledge of EMR at the time.

Even though HOD had a few nice features missing from EMR, the writing was on the wall. We wrote new tasks for EMR, and slowly migrated old reports over as they needed changes or as we had the time.

Architecture

HOD borrowed quite liberally from the design of Alexa's GrepTheWeb.

Users submitted job requests to a controller, which managed starting an exclusive hadoop cluster (master, slaves, hdfs), retrieving job input from S3 into HDFS, executing the job (hadoop map/reduce), monitoring it, storing the job results, and shutting down the cluster on completion. Job information and status were stored in SimpleDB, S3 held job inputs and outputs, and SQS managed the job workflow.
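
Just to make the moving pieces concrete, here's a rough sketch of what the controller loop amounted to. This is purely illustrative: the class, queue URL, and SimpleDB domain names are made up, and it uses today's AWS SDK for Java rather than the client libraries HOD was actually built with:

import java.util.Arrays;

import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
import com.amazonaws.services.simpledb.model.PutAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.DeleteMessageRequest;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

// Illustrative controller loop: poll SQS for job requests, track status in
// SimpleDB, and run each job on its own short-lived hadoop cluster.
public class HodControllerSketch {
  private final AmazonSQSClient sqs = new AmazonSQSClient();
  private final AmazonSimpleDBClient sdb = new AmazonSimpleDBClient();
  private final String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/hod-job-requests"; // made up
  private final String domain = "hod-jobs"; // made-up SimpleDB domain

  public void pollOnce() {
    for (Message m : sqs.receiveMessage(new ReceiveMessageRequest(queueUrl)).getMessages()) {
      String jobId = m.getBody(); // in reality this carried a serialized JobRequest
      setStatus(jobId, "STARTING_CLUSTER");
      // 1. start master + slaves, 2. pull S3 input into HDFS, 3. run the
      // map/reduce job, 4. push output back to S3, 5. shut the cluster down
      setStatus(jobId, "DONE");
      sqs.deleteMessage(new DeleteMessageRequest(queueUrl, m.getReceiptHandle()));
    }
  }

  private void setStatus(String jobId, String status) {
    sdb.putAttributes(new PutAttributesRequest(domain, jobId,
        Arrays.asList(new ReplaceableAttribute("status", status, true))));
  }
}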

Job Definition

Jobs were defined as thrift structures:
 struct JobRequest {
   1: JobConf job = {},
   2: i32 requested_nodes = 4, // requested number of hadoop slaves
   3: string node_type = "m1.large", // machine size
   4: i32 requested_priority, // job priority
   5: string hadoop_dist, // hadoop version e.g. "0.18.3"
   6: set<string> depends_on = [], // ids of job dependencies

   7: list<string> on_success = [], // success notification
   8: list<string> on_failure = [], // failure notification
   9: set<string> flags = [] // everyone loves flags
 }

 struct JobConf {
   1: string job_name,
   2: string job_jar,  // s3 jar path
   3: string mapper_class,
   4: string combiner_class,
   5: string reducer_class,

   6: set<string> job_input,  // s3 input paths
   7: string job_output,  // s3 output path (optional)

   8: string input_format_class,
   9: string output_format_class,

   10: string input_key_class,
   11: string input_value_class,

   12: string output_key_class,
   13: string output_value_class,

   // list of files for hadoop distributed cache
   14: set<string> user_data = [],

   15: map<string, string> other_config = {}, // passed directly to JobConf.set(k, v)
 }
You'll notice that dependencies could be specified. HOD could hook the output of one or more jobs up as the input to another job, and wouldn't run that job until all of its dependencies had successfully completed.

User Interaction

We had a user program, similar to emr-client, that helped construct and submit jobs, e.g.:
JOB="\
 -m com.bizo.blah.SplitUDCMap \
 -r com.bizo.blah.UDCReduce \
 -jar com-bizo-release:blah/blah/blah.jar \
 -jobName blah \
 -i com-bizo-data:blah/blah/blah/${MONTH} \
 -outputKeyClass org.apache.hadoop.io.Text \
 -outputValueClass org.apache.hadoop.io.Text \
 -nodes 10 \
 -nodeType c1.medium \
 -dist 0.18.3 \
 -emailSuccess larry@bizo.com \
 -emailFailure larry@bizo.com \
"

$HOD_HOME/bin/hod_submit $JOB $@
There was also some nice support for looking at jobs, either by status or by name, as well as for viewing job output, logs, counters, etc.
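
Since job state lived in SimpleDB, listing jobs by status was essentially just a SimpleDB select under the hood. A minimal sketch, again with today's AWS SDK for Java (the hod-jobs domain and status attribute are hypothetical names):

import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
import com.amazonaws.services.simpledb.model.Item;
import com.amazonaws.services.simpledb.model.SelectRequest;

// Hypothetical "list jobs by status" lookup; the hod-jobs domain and the
// status attribute are made-up names.
public class ListJobsByStatus {
  public static void main(String[] args) {
    String status = args.length > 0 ? args[0] : "RUNNING";
    String query = "select * from `hod-jobs` where status = '" + status + "'";
    for (Item item : new AmazonSimpleDBClient().select(new SelectRequest(query)).getItems()) {
      System.out.println(item.getName()); // job id
    }
  }
}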

Nice features

We've been very happy users of Amazon's EMR since it launched in 2009. There's nothing better than systems you don't need to support/maintain yourself! And they've been really busy making EMR easier to use and adding great features. Still, there are a few things I miss from HOD.

Workflow support

As mentioned, HOD had support for constructing job workflows. You could wire up dependencies among multiple jobs. E.g., here's an example workflow (also mentioned previously on this blog under hadoop-job-visualization).

It would be nice to see something like this in EMR. For really simple workflows, you can sometimes squeeze them into a single EMR job as multiple steps, but that doesn't always make sense and isn't always convenient.
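
For what it's worth, the "multiple steps in one EMR job" approach looks roughly like this with the AWS SDK for Java. The jar paths, class names, and instance settings below are placeholders, not our actual jobs:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

// Rough sketch: two dependent map/reduce steps packed into a single EMR job flow.
public class MultiStepJobFlow {
  public static void main(String[] args) {
    StepConfig stepOne = new StepConfig()
        .withName("split-udc")
        .withActionOnFailure("TERMINATE_JOB_FLOW")
        .withHadoopJarStep(new HadoopJarStepConfig()
            .withJar("s3://com-bizo-release/jobs.jar")          // placeholder
            .withMainClass("com.bizo.blah.SplitUDCJob")          // placeholder
            .withArgs("-o", "s3://com-bizo-data/tmp/split/"));   // placeholder

    StepConfig stepTwo = new StepConfig()
        .withName("udc-rollup")
        .withActionOnFailure("TERMINATE_JOB_FLOW")
        .withHadoopJarStep(new HadoopJarStepConfig()
            .withJar("s3://com-bizo-release/jobs.jar")
            .withMainClass("com.bizo.blah.UDCRollupJob")
            .withArgs("-i", "s3://com-bizo-data/tmp/split/"));   // consumes step one's output

    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("udc-monthly")
        .withLogUri("s3://com-bizo-logs/emr/")                   // placeholder
        .withSteps(stepOne, stepTwo)
        .withInstances(new JobFlowInstancesConfig()
            .withInstanceCount(10)
            .withMasterInstanceType("m1.large")
            .withSlaveInstanceType("c1.medium")
            .withHadoopVersion("0.20"));

    new AmazonElasticMapReduceClient().runJobFlow(request);
  }
}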

Notification support

HOD supported notifications directly. Initially it was just email notifications, but there was a plugin structure in place with an eye towards supporting HTTP endpoint and SQS notifications.

Yes, this is possible by adding a custom EMR job step at the end that checks the status of the job flow and sends an email success/failure notification… But, c'mon, why not just build in easy SNS support? Please?
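
The do-it-yourself version can be as small as a final job step whose main class publishes to an SNS topic. A sketch (the topic ARN is a placeholder, and it relies on the earlier steps terminating the job flow on failure, so this step only runs if everything before it succeeded):

import com.amazonaws.services.sns.AmazonSNSClient;
import com.amazonaws.services.sns.model.PublishRequest;

// Hypothetical "notify" step: run as the last step of a job flow. With
// ActionOnFailure=TERMINATE_JOB_FLOW on earlier steps, a missing
// notification means the job flow failed somewhere along the way.
public class NotifySuccessStep {
  public static void main(String[] args) {
    String topicArn = "arn:aws:sns:us-east-1:123456789012:report-notifications"; // placeholder
    String jobName = args.length > 0 ? args[0] : "unknown-job";
    new AmazonSNSClient().publish(new PublishRequest(
        topicArn,
        "Job flow completed successfully: " + jobName,   // message body
        "[EMR] " + jobName + " succeeded"));             // email subject
  }
}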

Counter support

Building on that, HOD had direct support for (and understanding of) hadoop counters. When processing large volumes of data, they become really critical for tracking the health of your reports over time. This is something I really miss, although it's less obvious how to fold this in with Hive jobs, which is how most of our reports are written these days.
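
For anyone who hasn't used them, bumping a counter is a one-liner inside a mapper or reducer. A minimal example against the old 0.18-era "mapred" API, which is what HOD-era jobs were written against (the group and counter names are made up):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Minimal mapper that increments hadoop counters; the counts show up in the
// job's counter output, which is what HOD surfaced per job.
public class CountingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    if (value.toString().trim().isEmpty()) {
      reporter.incrCounter("bizo", "empty.records", 1); // made-up group/counter names
      return;
    }
    reporter.incrCounter("bizo", "good.records", 1);
    output.collect(new Text("all"), value);
  }
}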

Arbitrary Hadoop versions

HOD operated with straight hadoop, so it was possible to have it install an arbitrary version/package just by pointing it to the right distribution in S3.

Since Amazon isn't directly using a distribution from the hadoop/hive teams, you need to wait for them to apply their patches/changes and can only run with versions they directly support. This has mostly been a problem with Hive, which moves pretty quickly.

It would be really great if they could get to a point where their changes have been folded back into the main distribution.

Of course, this is probably something you can do yourself, again with a custom job step to install your own version of Hive… Still, they have some nice improvements, and again, it would be nice if it were just a simple option to the job.

The Past / The Future

Of course, HOD wasn't without its problems :). It had become a bear to manage, especially since we pretty much stopped development/maintenance (aside from rebooting it) back in 2009. It was definitely with a sigh of relief that I pulled the plug.

Still, HOD was a really fun project! It was an early project for me at Bizo, and it was really amazing how easy it was to write a program that starts up machines! and gets other programs installed and running! Part of me wonders if there isn't a place for an open source EMR-like infrastructure somewhere? Maybe for private clouds? Maybe for people who want/need more control? Or for cheapskates? :) (EMR does cost money).

Or maybe HOD v2 is just some wrappers around EMR that provide some of the things I miss : workflow support, notifications, easier job configuration…

Something to think about for that next hack day :).

Wednesday, October 5, 2011

Writing better 3rd party Javascript with Coffeescript, Jasmine, PhantomJS and Dependence.js

Here at Bizo we recently underwent a major change to the Javascript tags that are used in our free analytics product (http://www.bizo.com/marketer/audience_analytics). Our code gets loaded by millions of visitors each month across thousands of web sites, so our Javascript has to run reliably in just about any browser on any page.

The Old Javascript:

Unfortunately our codebase had accumulated about three years of cruft,
resulting in a single monolithic Javascript file. The file contained
a single closure, over 600 lines long, with all kinds of edge cases,
some of which no longer existed. Since you can’t access anything in a
closure from the outside, writing unit tests was nearly impossible and
the code had suffered as a result.

That’s not to say we had no tests – there were several selenium tests that exercised functionality at a high level. The problem, however, was that making the required changes was going to be a time-consuming (and somewhat terrifying) process. The selenium tests provided a very slow testing feedback loop, and debugging a large closure comes with its own challenges. If it’s scary changing your production code, then you’re doing something wrong.

Modularity and Dependency Management


So we decided to do a complete overhaul and rewrite our Javascript tags in CoffeeScript (a smaller code base with clearer code). The biggest problem with the original code was that it wasn’t modular and was thus difficult to test. Ideally we wanted to split our project into multiple files that we could unit test. To do this we needed some kind of dependency management system for Javascript, which in 2011, surprisingly, still isn’t standardized. We looked at several projects but none of them really met our needs. Our users are quite sensitive about the number of http requests 3rd party Javascript makes, so solutions that load several scripts in parallel weren’t an option (e.g. RequireJS). Others like Sprockets were close but didn’t quite support everything we needed.

We ended up writing Dependence.js, a gem to manage our Javascript dependencies. Dependence compiles all of the files in a module into a single file. Features include Javascript and/or Coffeescript compilation, dependency resolution (via a topological sort of your dependency graph), support for using an “exports” object as your module's interface, and optional compression using Google’s Closure compiler. Check it out on github:
(http://github.com/jcarver989/dependence.js)

Fast Unit testing with Phantom.js

Another way we were looking to improve our Javascript setup was to
have a comprehensive suite of unit tests. After looking at several
possibilities we settled on using the Jasmine test framework
(http://pivotal.github.com/jasmine/) in conjunction with PhantomJS (a
headless webkit browser). So far using Jasmine and PhantomJs together
has been awesome. As our Javascript is inherently DOM coupled, each of
our unit tests executes in a separate iframe (so each test has its own
separate document object). 126 unit tests later the entire suite runs
locally in about 0.1 seconds!

Functional Testing with Selenium

Our functional tests are still executed with Selenium webdriver.
Although there are alternative options such as HtmlUnit, we wanted to
test our code in real browsers and for this Selenium is still the best
option around. A combination of capybara and rspec makes for writing functional tests with a nicer api than raw selenium. A bonus is that capybara allows you to swap out selenium in favor of another driver should we ever want to switch to something else. Lastly, a custom gem for creating static html fixtures allows us to programmatically generate test pages for each possible configuration option found in our Javascript module. You can find that here:
(http://github.com/jcarver989/js-fixtures).

Wrapping up

The new code is far more modular, comprehensively tested and way
easier to extend. Overall working with Dependence.js, CoffeeScript,
PhantomJs, Capybara, Rspec and Selenium has been a workflow that works
great for us. If you have a different workflow that you like for
Javascript projects, let us know!