Wednesday, January 26, 2011

EMR/Hive: recovering a large number of partitions

If you try to run "alter table ... recover partitions" on a table with a large number of partitions, you may run into this error:


FAILED: Error in metadata: org.jets3t.service.S3ServiceException: Failed to sanitize XML document destined for handler class org.jets3t.service.impl.rest.XmlResponsesSaxParser$ListBucketHandler null 'null' -- ResponseCode: -1, ResponseStatus: null, RequestId: null, HostId: null
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask


There's some discussion of this in the AWS forums. The underlying cause is that the Hive client runs out of memory while building the partition list from the S3 bucket listing.
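
For reference, "recover partitions" is EMR Hive's extension that scans the table's S3 location and registers a partition for each directory it finds. A sketch, with a hypothetical table name:

hive> ALTER TABLE logs RECOVER PARTITIONS;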

A workaround is to increase HADOOP_HEAPSIZE, the heap size (in megabytes) for the Hadoop client JVM. This can be done by modifying hadoop-user-env.sh with an EMR bootstrap action. On an m1.large instance, 2G seems to do the trick for us.

Upload a script like the following somewhere in S3:
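
A minimal sketch of such a script, assuming the stock EMR AMI layout where Hadoop's per-user environment overrides live in /home/hadoop/conf/hadoop-user-env.sh:

#!/bin/bash
# set-hadoop-heap.sh: raise the Hadoop client heap so Hive can build
# the full partition list. 2048 MB is the 2G that worked for us on m1.large.
echo "export HADOOP_HEAPSIZE=2048" >> /home/hadoop/conf/hadoop-user-env.sh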



You can now run this bootstrap action as part of your job:

elastic-mapreduce --create --alive \
--name "large partitions..." --hive-interactive \
--num-instances 1 --instance-type m1.large \
--hadoop-version 0.20 \
--bootstrap-action s3://<bucket/path>/set-hadoop-heap.sh
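
Once the job flow is up, attach to the interactive session on the master node. With the Ruby CLI that looks something like this (the job flow ID is whatever --create printed):

elastic-mapreduce --jobflow <jobflow-id> --ssh
hive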


You should now be able to recover your partitions without hitting the out-of-memory error.
