Wednesday, November 17, 2010

Spring NamespaceHandler debugging

While updating a library that used Spring yesterday, I began suffering the dreaded "Unable to locate Spring NamespaceHandler for XML schema namespace" exception.

The net is littered with threads about this exception, all of which end with "put the spring jars in your WEB-INF/lib directory".

I was fairly determined not to do this, as I frequently start a webapp with an embedded instance of Jetty, and use the Eclipse project's classpath for all of the dependencies. Since all of the jars are from the project's classpath, the webapp is always using exactly the same jars that Eclipse is for compiling your code, so you never have the two drift apart.

Too many times, after copying jars to WEB-INF/lib and forgetting about them, I'll upgrade a library, everything compiles fine, but spend an embarrassing amount of time wondering why it's not working in the webapp, before remembering the stale jar in the WEB-INF/lib directory.

Anyway, the real cause of the NamespaceHandler exception in my case was a buggy ClassLoader.getResources implementation.

The way spring works, when doing it's XML parsing/whatever magic, is when it sees "xmlns:tx=...", it wants a new NamespaceHandler that knows how to handle those tags.

To allow extensible NamespaceHandlers, Spring uses ClassLoader.getResources("META-INF/spring.handers") to get a list of all of the spring.handlers files across all of the jar files in the classloader. So if spring-core, spring-tx, spring-etc. all have spring.handlers with NamespaceHandlers in them, each file gets found and loaded.

Here's the rub: whatever Eclipse project classloader that had spring-tx on it only returned spring.handlers files from jars that already had classes loaded from them. The ClassLoader.getResources implementation would not look into jar files that was on its classpath, but had not yet been opened for loading classes from.

Of all things, adding:

Class.forName("org.springframework.transaction.support.TransactionTemplate");

Before spring was initialized fixed the NamespaceHandler error. Everything boots up correctly now.

While it took way too long to figure out, I'm pleased that I can continue using the Eclipse project classloader for the webapp I'm starting and avoid the annoying "copy jars to WEB-INF/lib" solution.

I'd like to know which classloader had the buggy getResources implementation, but I've already spent too much time on this so far.

Friday, November 12, 2010

CSV and Hive

CSV

Anyone who's ever dealt with CSV files knows how much of a pain the format actually is to parse. It's not as simple as splitting on commas -- the fields might have commas embedded in them, so, okay you put quotes around the field... but what if the field had quotes in it? Then you double up the quotes... "okay, ""great""" -- that was a single CSV field.

We normally use the excellent opencsv (apache2 licensed) library to deal with CSV files.

Hive

We love Hive. Almost all of our reporting is written as Hive scripts. How do you deal with CSV files with Hive? If you know for sure your fields don't have any commas in them, you can get away with the delimited format. There's the RegexSerDe, but as mentioned the format is non-trivial, and you need to change the regex string depending on how many columns you are expecting.

CSVSerde

Enter the CSVSerde. It's a Hive SerDe that uses the opencsv parser to serialize and deserialize tables properly in the CSV format.

Using it is pretty simple:


add jar path/to/csv-serde.jar;

create table my_table(a string, b string, ...)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
stored as textfile
;

This is my first time writing a Hive SerDe. There were a couple of road bumps, but overall I was surprised with how easy it was. I mostly just followed along with RegexSerDe.

I'm sure there are a lot of ways it could be improved, so I'd appreciate any feedback or comments on how to make it better.

Source.
Binary (jar packaged with opencsv).