Tuesday, June 23, 2009

custom UDFs and hive

We just started playing around with Hive. Basically, it lets you write your hadoop map/reduce jobs using a SQL-like language. This is pretty powerful. Hive also seems to be pretty extendable -- custom data/serialization formats, custom functions, etc.

It turns out that writing your own UDF (user defined function) for use in hive is actually pretty simple.

All you need to do is extend UDF, and write one or more evaluate methods with a hadoop Writable return type. Here's an example of a complete implementation for a lower case function:

package com.bizo.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
public Text evaluate(final Text s) {
if (s == null) { return null; }
return new Text(s.toString().toLowerCase());
}
}


(Note that there's already a built-in function for this, it's just an easy example).

As you've probably noticed from the import statements, you'll need to add buildtime dependencies for hadoop and hive_exec.

The next step is to add the jar with your UDF code to the hive claspath. The easiest way I've found to do this is to set HIVE_AUX_JARS_PATH to a directory containing any jars you need to add before starting hive. Alternatively you can edit $HIVE_HOME/conf/hive-site.xml with a hive.aux.jars.path property. Either way you need to do this before starting hive. It looks like there's a patch out there to dynamically add/remove jars to the classpath, so, hopefully this will be easier soon.

example:
# directory containing any additional jars you want in the classpath
export HIVE_AUX_JARS_PATH=/tmp/hive_aux

# start hive normally
/opt/hive/bin/hive

Once you have hive running, the last step is to register your function:
create temporary function my_lower as 'com.bizo.hive.udf.Lower';

Now, you can use it:
hive> select my_lower(title), sum(freq) from titles group by my_lower(title);

...

Ended Job = job_200906231019_0006
OK
cmo 13.0
vp 7.0


Although it's pretty simple, I didn't see this documented anywhere so I thought I would write it up. I also added it to the wiki.

No comments: