Wednesday, April 6, 2011

Hive 0.7 no longer auto-downloads transform scripts

I ran into a bit of a surprise moving a Hive 0.5 script to Hive 0.7 the other day.

Previously, in Hive 0.5, we called our Java transform code like:

insert overwrite table the_table
using 'java -cp s3://bucket-name/code.jar MapperClassName'

Behind the scenes, before actually calling the "java" executable, Hive would inspect each of the arguments and, if it found an "s3://..." URL, download that file from S3 to a local copy, and then pass the path to the local copy to your program.

This was convenient because your external "java" executable didn't have to know anything about S3, how to authenticate with it, etc.

However, in Hive 0.7, this no longer works. Perhaps the reason is that if you did want to pass the literal string "s3://..." to your mapper class, Hive implicitly stepping in on your behalf might not be what you wanted and, AFAIK, there was no way to avoid it.

So, now an explicit "add file" command is required, e.g.:

add file s3://bucket-name/code.jar
insert overwrite table the_table
using 'java -cp code.jar MapperClassName'

The add file command downloads code.jar to the local execution directory under its plain file name (without the bucket name/path mangling that Hive 0.5 applied), so your transform arguments can reference the local file directly.
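
For reference, here is roughly what a complete version of that script might look like. The snippets above leave out the select transform ... from part of the query, and the table and column names below (source_table, input_col, output_col) are made up for illustration, so treat this as a sketch under those assumptions rather than the exact script we ran:

```sql
-- Sketch only: source_table, input_col and output_col are hypothetical names.
-- Fetch the jar from S3 into the local execution directory.
add file s3://bucket-name/code.jar;

-- Stream each input row through the Java mapper over stdin/stdout.
insert overwrite table the_table
select transform (input_col)
using 'java -cp code.jar MapperClassName'
as (output_col)
from source_table;
```

The important part is that the using clause now names code.jar by its local file name, not by its s3:// URL.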

All in all, a pretty easy fix, but rather frustrating to figure out given the long cycle time of EMR jobs.

Also, kudos to a post in the AWS developer forums that describes the same problem and solution.
