Friday, February 19, 2010

Triggering post-Elastic MapReduce steps as parameterized jobs in Hudson

Here at Bizo, the combination of Hudson for cron management, Hive for report generation, and Elastic MapReduce for provisioning compute power has greatly simplified our data processing. Periodically and automatically, our Hudson cron instance generates Hive scripts for us and launches them in EC2.

The main inconvenience with this process is that the results of our Hive jobs are left as one or more obscurely named files in S3, which usually need some post-processing to put them into a friendlier form. Unfortunately, EMR doesn't have an easy hook for launching these post-processing tasks -- we could implement them as additional MapReduce steps, but then we'd have to write our own workflows and lose the simplicity of EMR's "--hive-script" flag.

Our solution is to use SimpleDB to store some basic metadata about jobs. Using this metadata, a Hudson job periodically checks the EMR API to determine whether tasks have completed. If so, it then triggers other Hudson jobs that are responsible for processing the results.

Here are some tools that make this process work:

  • A simple script to put data into SimpleDB. Our metadata scheme uses the jobflow ID as the item name and the name and parameters of the job to trigger as attributes.

  • The Hudson parameterized build feature. It's not really feasible to create a new Hudson job for each individual report that runs, so we pass parameters to Hudson so the post-processing step can figure out where the results are in S3 and what to do with them. It's not well documented how to do this programmatically (as opposed to from the web interface); the solution is to send some JSON to the build URL, as shown in the sketch after this list.

  • The Trigger Script. This is the script that periodically runs on our cron server to check whether a post-processing step should be triggered. The JSON format for parameterized jobs is described in the comments of this file.
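
Since the JSON POST isn't well documented, here's a minimal sketch of triggering a parameterized build from the command line with curl. The Hudson host and job name are placeholders; the JSON payload is the same format we store in SimpleDB:

# Sketch: trigger a parameterized Hudson build over HTTP.
# "hudson.example.com" and the job name are placeholders.
curl --data-urlencode json='{"parameter": [{"name": "PARAM1", "value": "VALUE1"}]}' "http://hudson.example.com/job/post-processing-step/build"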



The end result is that a Hudson job can launch an EMR job flow and configure a post-processing step for itself with the following two commands:


JOB_ID=`elastic-mapreduce --create --hive-script --arg ${s3.location} | grep "Created job flow" | awk '{ print $4 }'`

simpledb-put.rb -d ${metadata.domain} -i $JOB_ID "next_on_cron_server_job_name=post-processing-step" "next_on_cron_server_job_params={\"parameter\": [{\"name\":\"PARAM1\", \"value\":\"VALUE1\" }]}" "next_on_cron_server_triggered=false"


This launches the Hive script at the specified s3.location and configures the "post-processing-step" job on the cron server to run with the parameter "PARAM1=VALUE1".
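
To round out the picture, the check-and-trigger logic for a single jobflow looks roughly like the following. This is only a sketch: it assumes the EMR CLI's --describe output contains a jobflow-level "State" field, the Hudson host is a placeholder, and the real trigger script reads the job name and parameters back out of SimpleDB rather than hard-coding them.

# Sketch of the trigger check for one jobflow (assumes the jobflow-level
# "State" is the first one in the --describe output; host names and
# parameter values are placeholders).
STATE=`elastic-mapreduce --describe --jobflow $JOB_ID | grep '"State"' | head -1 | cut -d'"' -f4`
if [ "$STATE" = "COMPLETED" ]; then
  # Fire the configured Hudson job with the stored parameters...
  curl --data-urlencode json='{"parameter": [{"name": "PARAM1", "value": "VALUE1"}]}' "http://hudson.example.com/job/post-processing-step/build"
  # ...and mark the jobflow as triggered so it isn't fired twice.
  simpledb-put.rb -d ${metadata.domain} -i $JOB_ID "next_on_cron_server_triggered=true"
fi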