Monday, September 20, 2010

quick script: emr-mailer

We write a lot of hive reports. Frequently we want to email the resulting report to a list. In the past I've usually done this with one-off post-processing scripts, but I thought it would be nice to write a reusable emr job step that executes as part of the hive job.

The script downloads files from an s3 url, concatenates them, zips up the result and sends it as an attachment to a specified email address. The email is sent using account credentials you specify.
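
Here is a rough Ruby sketch of the concatenate, compress and attach steps. It is not the actual send-report.rb. It assumes the report's part files have already been downloaded from s3 to local paths. The function names and the boundary string are made up for illustration, and gzip stands in for zip because Ruby's standard library has no zip writer.

```ruby
require "zlib"

# Join the report's part files (hypothetical local paths, assumed to be
# already downloaded from s3) in sorted order, so part-00000 comes first.
def concatenate_parts(paths)
  paths.sort.map { |path| File.binread(path) }.join
end

# Compress the combined report. Assumption: gzip replaces the zip step,
# since writing zip archives would need a third-party gem.
def compress_report(data)
  Zlib.gzip(data)
end

# Build a multipart MIME message that carries the compressed report as a
# base64-encoded attachment. All argument names here are illustrative.
def build_message(from:, to:, subject:, filename:, attachment:)
  boundary = "emr-mailer-sketch-boundary"
  # pack("m") produces base64 wrapped at 60 characters per line.
  encoded = [attachment].pack("m").strip.gsub("\n", "\r\n")
  [
    "From: #{from}",
    "To: #{to}",
    "Subject: #{subject}",
    "MIME-Version: 1.0",
    "Content-Type: multipart/mixed; boundary=\"#{boundary}\"",
    "",
    "--#{boundary}",
    "Content-Type: text/plain",
    "",
    "Report attached.",
    "--#{boundary}",
    "Content-Type: application/gzip; name=\"#{filename}\"",
    "Content-Transfer-Encoding: base64",
    "Content-Disposition: attachment; filename=\"#{filename}\"",
    "",
    encoded,
    "--#{boundary}--",
    ""
  ].join("\r\n")
end
```

The resulting string could then be handed to an SMTP client along with your account credentials. That part is left out so the sketch runs without a network connection.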

I wanted to make it easy to just append an additional step to any existing job, without requiring any additional machine setup or dependencies. I was able to do this by making use of amazon's script-runner (s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar). The script-runner.jar step lets you execute an arbitrary script from a location in s3 as an emr job step.

As I mentioned, the intended usage is to run it as a job step alongside your hive script, passing in the location of the resulting report.


elastic-mapreduce --create --name "my awesome report ${MONTH}" \
--num-instances 10 --instance-type c1.medium --hadoop-version 0.20 \
--hive-script --arg s3://path/to/hive/script.sql \
--args -d,MONTH=${MONTH} --args -d,START=${START} --args -d,END=${END} \
--jar s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar \
--args s3://path/to/emr-mailer/send-report.rb \
--args -n,report_${MONTH} --args -s,"my awesome report ${MONTH}" \
--args -e, \
--args -r,s3://path/to/report/results

Above you can see I'm starting a hive report as normal, then simply appending the script-runner step, which calls the emr-mailer send-report.rb and tells it where the report will end up along with the details of the email.

The full source code is available on github as emr-mailer.

The script is pretty simple, but let me know if you have any suggestions for improvements or other feedback.

