Wednesday, September 19, 2012

Using GROUP BYs or multiple INSERTs with complex data types in Hive.

In any sort of ad hoc data analysis, the first step is often to extract a specific subset of log lines from our files.  For example, when looking at a single partner’s web traffic, I often use an initial query to copy that partner’s data into a new table.  In addition to segregating only the data relevant to my analysis, this copies the data from S3 into HDFS, which makes later queries more efficient.  (We store each log line as a map, which is how we support dynamic columns.)

create external table if not exists
original_logs(fields map<string,string>) location '...' ;

create table if not exists
extracted_logs(fields map<string,string>) ;

insert overwrite table extracted_logs
select * from original_logs where fields['partnerId'] = 123 ;

If I’m doing this for multiple partners, it’s tempting to use a multiple-insert so Hadoop only needs to make one pass of the original data.

create external table if not exists
original_logs(fields map<string,string>) location '...' ;

create table if not exists
extracted_logs(fields map<string,string>)
partitioned by (partnerId int);

from original_logs
insert overwrite table extracted_logs partition (partnerId = 123)
select * where fields['partnerId'] = 123
insert overwrite table extracted_logs partition (partnerId = 234)
select * where fields['partnerId'] = 234 ;

Unfortunately, in Hive 0.7.x, this query fails with the error message “Hash code on complex types not supported yet.”  A multiple-insert statement uses an implicit group by, and Hive 0.7.x does not support grouping by complex types.  This bug was partially addressed in 0.8, which added support for arrays and maps, but structs and unions are still not supported.
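
The same limitation applies if you group by the map column directly, which is worth keeping in mind for aggregate queries on these tables.  As a minimal illustration (using the tables defined above; the count is only there for demonstration), a query like this would presumably be rejected on Hive 0.7.x as well:

-- Grouping requires hashing the group-by keys, so using the map column
-- as a key hits the same complex-type hashing limitation on 0.7.x.
select fields, count(*)
from extracted_logs
group by fields ;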

At first glance, adding this support looks straightforward.  This could be a good candidate for our next open source day.
