Friday, November 12, 2010

CSV and Hive

CSV

Anyone who's ever dealt with CSV files knows how much of a pain the format actually is to parse. It's not as simple as splitting on commas -- the fields might have commas embedded in them, so, okay you put quotes around the field... but what if the field had quotes in it? Then you double up the quotes... "okay, ""great""" -- that was a single CSV field.

We normally use the excellent opencsv (apache2 licensed) library to deal with CSV files.

Hive

We love Hive. Almost all of our reporting is written as Hive scripts. How do you deal with CSV files with Hive? If you know for sure your fields don't have any commas in them, you can get away with the delimited format. There's the RegexSerDe, but as mentioned the format is non-trivial, and you need to change the regex string depending on how many columns you are expecting.

CSVSerde

Enter the CSVSerde. It's a Hive SerDe that uses the opencsv parser to serialize and deserialize tables properly in the CSV format.

Using it is pretty simple:


add jar path/to/csv-serde.jar;

create table my_table(a string, b string, ...)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
stored as textfile
;

This is my first time writing a Hive SerDe. There were a couple of road bumps, but overall I was surprised with how easy it was. I mostly just followed along with RegexSerDe.

I'm sure there are a lot of ways it could be improved, so I'd appreciate any feedback or comments on how to make it better.

Source.
Binary (jar packaged with opencsv).

4 comments:

igorsf said...

Guys,
The CSVSerde handles all the simple cases well. Thank you for your work.
However, it does not handle some more complicated cases. Unfortunately, it prevents me from using the CSVSerDe. Particularly,the SerDe breaks on the strings including end of line character (multi-line string values). The SerDe treats such values as multiple rows. Another breaking case includes a string value with a slash-double-quote ( \") combination inside. Both of these cases might probably be rather opencsv's problem that CSVSerDe's. However, if you are looking for a way to improve the SerDe, perhaps you could look if these cases might be handled? Yet another possibility for improvement might be enabling custom delimiter and quote characters, which is allowed in opencsv as I understand.
Regards
Igor

larry ogrodnek said...

Igor,

Thanks for the feedback. I've added issues to address them here: https://github.com/ogrodnek/csv-serde/issues

The multi-line string values might be a little more involved because the TextInputFormat automatically splits records on newlines... it would require a new input format in addition to the serde...

It's great feedback though, so thank you.

Max said...

Is there any licence for this SerDe?

larry ogrodnek said...

Yes, it's licensed under apache2. I just updated the git repo with license information: https://github.com/ogrodnek/csv-serde