Wednesday, January 26, 2011

EMR/Hive: recovering a large number of partitions

If you try to run "alter table ... recover partitions" on a table with a large number of partitions, you may run into this error:

FAILED: Error in metadata: org.jets3t.service.S3ServiceException: Failed to sanitize XML document destined for handler class org.jets3t.service.impl.rest.XmlResponsesSaxParser$ListBucketHandler null 'null' -- ResponseCode: -1, ResponseStatus: null, RequestId: null, HostId: null
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

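For reference, the statement that triggers this looks like the following (the table name here is hypothetical):

alter table logs recover partitions;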
There's some discussion of this in the AWS forums. The underlying cause is that Hive runs out of memory while building the partition list.

A workaround is to increase HADOOP_HEAPSIZE, which can be done with an EMR bootstrap action. On an m1.large instance, 2G seems to do the trick for us.

Upload a script like the following somewhere in S3:
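(A minimal sketch, assuming the standard EMR Hadoop AMI, which sources /home/hadoop/conf/hadoop-user-env.sh when launching Hadoop processes:)

#!/bin/bash
# Bootstrap action: raise the Hadoop heap so Hive can build the
# full partition list without exhausting memory.
echo "export HADOOP_HEAPSIZE=2048" >> /home/hadoop/conf/hadoop-user-env.sh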

You can now run this bootstrap action as part of your job:

elastic-mapreduce --create --alive \
--name "large partitions..." --hive-interactive \
--num-instances 1 --instance-type m1.large \
--hadoop-version 0.20 \
--bootstrap-action s3://<bucket/path>/
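To double-check that the bootstrap action took effect, you can SSH to the master node and look for the setting (assuming the hadoop-user-env.sh approach sketched above):

grep HADOOP_HEAPSIZE /home/hadoop/conf/hadoop-user-env.sh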

You should now be able to load your partitions.
