Tuesday, June 23, 2009

custom UDFs and hive

We just started playing around with Hive. Basically, it lets you write your hadoop map/reduce jobs using a SQL-like language. This is pretty powerful. Hive also seems to be pretty extendable -- custom data/serialization formats, custom functions, etc.

It turns out that writing your own UDF (user defined function) for use in hive is actually pretty simple.

All you need to do is extend UDF, and write one or more evaluate methods with a hadoop Writable return type. Here's an example of a complete implementation for a lower case function:

package com.bizo.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
public Text evaluate(final Text s) {
if (s == null) { return null; }
return new Text(s.toString().toLowerCase());

(Note that there's already a built-in function for this, it's just an easy example).

As you've probably noticed from the import statements, you'll need to add buildtime dependencies for hadoop and hive_exec.

The next step is to add the jar with your UDF code to the hive claspath. The easiest way I've found to do this is to set HIVE_AUX_JARS_PATH to a directory containing any jars you need to add before starting hive. Alternatively you can edit $HIVE_HOME/conf/hive-site.xml with a hive.aux.jars.path property. Either way you need to do this before starting hive. It looks like there's a patch out there to dynamically add/remove jars to the classpath, so, hopefully this will be easier soon.

# directory containing any additional jars you want in the classpath
export HIVE_AUX_JARS_PATH=/tmp/hive_aux

# start hive normally

Once you have hive running, the last step is to register your function:
create temporary function my_lower as 'com.bizo.hive.udf.Lower';

Now, you can use it:
hive> select my_lower(title), sum(freq) from titles group by my_lower(title);


Ended Job = job_200906231019_0006
cmo 13.0
vp 7.0

Although it's pretty simple, I didn't see this documented anywhere so I thought I would write it up. I also added it to the wiki.

Thursday, June 11, 2009

Force.com's SOAP/REST library for Google App Engine/Java

As long as I'm reflecting on our Google I/O experiences, I also want to point out what looks like a very useful library from Salesforce. The Force.com Web Services Connector is a toolkit designed to simplify calling WSDL-defined SOAP and REST services. The best part is that they have a version that works on Google App Engine for Java! (Make sure that you use wsc-gae-16_0.jar, not the regular version.)

I haven't had the chance to do a lot of development on GAE/J, but my colleagues have definitely had some headaches getting SOAP and REST calls working around the GAE/J whitelist. Maybe one of them can comment after we give this toolkit a whirl.

Google Visualizations Java Data Source Library

As with any data-oriented company, most of our projects revolve around collecting data, processing data, and exposing data to users. In that third category, we've been moving towards Google Visualizations to draw our pretty graphs and charts. So, while the free Android phone and Google Wave were attracting a lot of attention at Google I/O, from a practical standpoint, I was actually most excited about Google's new Data Source Java Library. We had previously written something similar to this in-house, but we were still working on some of the optional parts of the specification when this library was released.

In a nutshell, Google Visualizations is a Javascript library that draws charts and graphs. The data is inserted in one of three ways: programatically in Javascript, via a JSON object, or by pointing the Javascript at a Data Source URL. For example, Google spreadsheets have built-in functionality to expose their contents as a Data Source, so you can just point the Javascript at a special URL, and a graph of your spreadsheet's data will pop up on your webpage. If you use the last method, you can use Gadgets to easily create custom dashboards displaying your data.

The Data Source Java Library makes it very easy to implement a Data Source backed by whatever internal data store you might be using -- it's just a matter of creating a DataTable object and populating it with data. The library provides everything else, up to and including the servlet to drop into your web container. (We ended up implementing a Spring controller instead. The library provides helper code for this; I estimate using a Spring conroller instead of a servlet cost us four lines of code.)

The best part is that it also implements a SQL-like query language for you, so you can expose your data in different forms (which are required by different visualizations) based on the parameters to the URL you call. Dumping data into JSON objects is very straightforward. Writing a parser and interpreter for queries is a real pain.

The library lets you specify how much of the query language you want to implement and which parts you want to make the library worry about. The only (small) complaint I have about this is that this configuration is rather coarsely defined -- we wanted to support basic column SELECTs (to improve performance on our backend) but have the library handle the aggregation functions (which our backend does not support). It wasn't too tough working around this restriction, although it does cost us a bit of extra parsing (so we can get a copy of the complete query) and column filtering (because both our code and the library processes the SELECT phrase).