Thursday, February 24, 2011

"dynamic" columns in Hive

One of the presentations at the HBase meetup the other night was on building a query language on top of HBase. No less than 3 people asked "Why not use Hive?". The main reason given was that Hive is too slow for doing simple selects. But, the other thing they really liked about using HBase was that your columns were dynamic -- it's easy to add new fields to your data.

Most of the data we log is in a simple log file format:
  • One record per line, separated by newline.
  • Each record can have one or more fields. Fields are separated by ^A (\001).
  • Each field is a key/value pair separated by ^B (\002). Field order is not specified.
In practice this bascially looks like:
ts=1298598378404/code=403/message=bad referrer: bizo.com/...
(Where / is ^A and = is ^B).

This is a format we chose pretty early on, way before we ever looked at Hive. It turns out to be a great format:

  • Human readable.
  • Trivial to parse in any language.
  • Dynamic -- easy to add/remove fields from your data.
It also turns out that it works really well with Hive. Our typical Hive table looks something like:

create external table api_logs(d map<string,string>)
partitioned by (...)
row format delimited
fields terminated by '\004'
collection items terminated by '\001'
map keys terminated by '\002'
stored as textfile
;
That is, each row is just a single column, which is a map. At first this seemed a little degenerate to me, but it actually models our data perfectly. There are no guarantees about which fields are available, and it's easy to add/remove fields in the data over time. I should mention that this is really just for our report input, typically our report output will be in a fixed format.

If you're using Hive 0.6 or greater, with Hive View support it's also easy to get the best of both worlds.

create view api_errors(ts, code, message) as
select d["ts"], d["code"], d["message"]
from api_logs
where d["code"] >= 400
;
You can even change the type information or transform the data by including a cast or a UDF as part of your view. Creating a view doesn't cause anything to run, or create any additional storage. Its query conditions are basically just merged with subsequent queries on that view.