Wednesday, January 26, 2011

EMR/Hive: recovering a large number of partitions

If you try to run "alter table ... recover partitions" on a table with a large number of partitions, you may run into this error:


FAILED: Error in metadata: org.jets3t.service.S3ServiceException: Failed to sanitize XML document destined for handler class org.jets3t.service.impl.rest.XmlResponsesSaxParser$ListBucketHandler null 'null' -- ResponseCode: -1, ResponseStatus: null, RequestId: null, HostId: null
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask


There's some discussion of this in the AWS forums. The underlying cause is that the Hive client runs out of memory while building the partition list from the S3 bucket listing.
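
For reference, "recover partitions" is EMR Hive's extension that scans the table's S3 location and registers a partition for each directory it finds. A sketch, with a hypothetical table name:

hive> ALTER TABLE logs RECOVER PARTITIONS;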

A workaround is to increase HADOOP_HEAPSIZE, the heap size (in megabytes) for the Hadoop client JVM. This can be done by modifying hadoop-user-env.sh with an EMR bootstrap action. On an m1.large instance, 2G seems to do the trick for us.

Upload a script like the following somewhere in S3:
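
A minimal sketch of such a script, assuming the stock EMR AMI layout where Hadoop's per-user environment overrides live in /home/hadoop/conf/hadoop-user-env.sh:

#!/bin/bash
# set-hadoop-heap.sh: raise the Hadoop client heap so Hive can build
# the full partition list. 2048 MB is the 2G that worked for us on m1.large.
echo "export HADOOP_HEAPSIZE=2048" >> /home/hadoop/conf/hadoop-user-env.sh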



You can now run this bootstrap action as part of your job:

elastic-mapreduce --create --alive \
--name "large partitions..." --hive-interactive \
--num-instances 1 --instance-type m1.large \
--hadoop-version 0.20 \
--bootstrap-action s3://<bucket/path>/set-hadoop-heap.sh
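
Once the job flow is up, attach to the interactive session on the master node. With the Ruby CLI that looks something like this (the job flow ID is whatever --create printed):

elastic-mapreduce --jobflow <jobflow-id> --ssh
hive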


You should now be able to recover your partitions without hitting the out-of-memory error.
