Friday, April 3, 2009

stupid simple xml parsing

In the last few months, I've been doing a lot of XML parsing. It's been mostly small parsers, in both Java and Python, trying to get some stats on different data sets. These are all huge data sets, so I've been using streaming (StAX-style) parsers. They're also third-party data files, so of course the structure is crazy and weird :), or at least, I can't change it...

Anyway, last night I was working on probably my 5th or 6th parser, and man, was it getting repetitive. With these types of parsers, it's a lot of just keeping track of what state you're in, and then shoving it into your model.

Some pseudocode illustrating the repetitive nature of this:

switch (event) {
    case START_ELEMENT:
        if (name.equals("company")) { inCompany = true; }
        if (name.equals("domains")) { inDomains = true; }
        if (name.equals("value") && inDomains) { inDomainValue = true; }
        break;
    case CHARACTERS:
        if (inDomainValue) { company.addDomain(characters); }
        break;
    case END_ELEMENT:
        if (name.equals("company")) { inCompany = false; }
        if (name.equals("domains") && inCompany) { inDomains = false; }
        if (name.equals("value") && inDomains) { inDomainValue = false; }
        break;
}
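
For anyone who wants to run the state-flag approach, here is a minimal sketch of it using the JDK's built-in StAX reader (javax.xml.stream). The class name HandRolledParser, the parseDomains method and the sample XML are made up for illustration; none of this is from the parsers I actually wrote, and it only collects the domain values rather than building a full Company.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical hand-rolled parser: one boolean per bit of state.
class HandRolledParser {

    // Collects the text of every company/domains/value element.
    static List<String> parseDomains(String xml) {
        List<String> domains = new ArrayList<>();
        boolean inCompany = false, inDomains = false, inDomainValue = false;
        StringBuilder text = new StringBuilder();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                switch (r.next()) {
                    case XMLStreamConstants.START_ELEMENT: {
                        String name = r.getLocalName();
                        if (name.equals("company")) {
                            inCompany = true;
                        } else if (name.equals("domains") && inCompany) {
                            inDomains = true;
                        } else if (name.equals("value") && inDomains) {
                            inDomainValue = true;
                            text.setLength(0);   // start collecting a new value
                        }
                        break;
                    }
                    case XMLStreamConstants.CHARACTERS:
                        // Text may arrive in several chunks, so append.
                        if (inDomainValue) { text.append(r.getText()); }
                        break;
                    case XMLStreamConstants.END_ELEMENT: {
                        String name = r.getLocalName();
                        if (name.equals("value") && inDomainValue) {
                            domains.add(text.toString().trim());
                            inDomainValue = false;
                        } else if (name.equals("domains")) {
                            inDomains = false;
                        } else if (name.equals("company")) {
                            inCompany = false;
                        }
                        break;
                    }
                    default:
                        break;
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("malformed XML", e);
        }
        return domains;
    }

    public static void main(String[] args) {
        String xml = "<companies><company><name>Acme</name><domains>"
                + "<value>acme.com</value><value>acme.org</value>"
                + "</domains></company></companies>";
        System.out.println(parseDomains(xml)); // prints [acme.com, acme.org]
    }
}
```

Every new field means another flag and another branch in each case, which is exactly the part that gets old.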


Anyway, really repetitive... and just a pain to keep testing these similar parsers over and over again.

So, last night, I tried to come up with a super simple generic parser. In fact, I decided that all I really needed was some way to get the Strings I'm interested in into Java objects. Initially, my model object had Integers, enums, sub objects. I decided that's too much -- No complex binding, just get the strings out of XML into Java, and then from there I can form more complex objects if I need to, do filtering, etc.

I've also been wanting to take the time to learn more about Java annotations. Of course, I've used them before, but I'd never created my own, or parsed them.

Here's the new model:

@XmlPath("company")
public class Company {
    @XmlPath("name")
    private String name;
    @XmlPath("domains/value")
    private List<String> domains;
    ....
}
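
Since creating and reading a custom annotation was half the point, here is a rough sketch of what that can look like. The XmlPath declaration below is my guess at a minimal version (a single String value, kept at runtime), and XmlPathDemo / pathsFor are hypothetical names for illustration; the code in stupid-xml may well differ.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Assumed declaration: RUNTIME retention is what makes the annotation
// visible to reflection; without it getAnnotation() returns null.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.FIELD})
@interface XmlPath {
    String value();
}

@XmlPath("company")
class Company {
    @XmlPath("name")
    private String name;
    @XmlPath("domains/value")
    private List<String> domains;
}

// Hypothetical helper showing how the annotations can be read back.
class XmlPathDemo {

    // Maps "classPath/fieldPath" to the annotated field, sorted by path.
    static Map<String, Field> pathsFor(Class<?> type) {
        String root = type.getAnnotation(XmlPath.class).value();
        Map<String, Field> paths = new TreeMap<>();
        for (Field f : type.getDeclaredFields()) {
            XmlPath p = f.getAnnotation(XmlPath.class);
            if (p != null) {               // skip fields without the annotation
                paths.put(root + "/" + p.value(), f);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // Prints the paths the Company model cares about.
        System.out.println(pathsFor(Company.class).keySet());
    }
}
```

The TreeMap is only there so the paths come out in a predictable order; reflection does not promise to return fields in declaration order.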


Now, you can just write

new SimpleParser(consumer, Company.class).parse(in);

The parser will inspect the Company class's annotations to figure out what paths it cares about, then internally keep a simple queue of its state. Each time it comes across a path the class cares about, it will either set the value or add to the List of values, depending on the field type. Again, only strings! Once it reaches the end element for the class, it will pass off the bean it just constructed to the consumer, then create a new bean.
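
To make that description concrete, here is a stripped-down sketch of the idea. It is not the code from the github repo: the class name SimpleParserSketch, the use of java.util.function.Consumer, the Reader argument and the getters on Company are all assumptions for illustration. The annotation and model are declared inline so the sketch compiles on its own, and the "queue of state" is modelled as a Deque of element names below the class's root element.

```java
import java.io.Reader;
import java.io.StringReader;
import java.lang.annotation.*;
import java.lang.reflect.Field;
import java.util.*;
import java.util.function.Consumer;
import javax.xml.stream.*;

@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.FIELD})
@interface XmlPath { String value(); }

@XmlPath("company")
class Company {
    @XmlPath("name") private String name;
    @XmlPath("domains/value") private List<String> domains;
    String getName() { return name; }
    List<String> getDomains() { return domains; }
}

// Hypothetical generic parser: Strings and Lists of Strings only.
class SimpleParserSketch<T> {
    private final Consumer<T> consumer;
    private final Class<T> type;
    private final String root;                                // element that starts a bean
    private final Map<String, Field> fields = new HashMap<>(); // relative path -> field

    SimpleParserSketch(Consumer<T> consumer, Class<T> type) {
        this.consumer = consumer;
        this.type = type;
        this.root = type.getAnnotation(XmlPath.class).value();
        for (Field f : type.getDeclaredFields()) {
            XmlPath p = f.getAnnotation(XmlPath.class);
            if (p != null) { f.setAccessible(true); fields.put(p.value(), f); }
        }
    }

    void parse(Reader in) {
        Deque<String> path = new ArrayDeque<>();  // element names below the root
        StringBuilder text = new StringBuilder();
        T bean = null;
        try {
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (bean == null) {
                        if (r.getLocalName().equals(root)) {
                            bean = type.getDeclaredConstructor().newInstance();
                        }
                    } else {
                        path.addLast(r.getLocalName());
                    }
                    text.setLength(0);
                } else if (event == XMLStreamConstants.CHARACTERS && bean != null) {
                    text.append(r.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT && bean != null) {
                    if (path.isEmpty()) {         // closing the root: bean is complete
                        consumer.accept(bean);
                        bean = null;
                    } else {
                        store(bean, fields.get(String.join("/", path)), text.toString().trim());
                        path.removeLast();
                    }
                    text.setLength(0);
                }
            }
        } catch (XMLStreamException | ReflectiveOperationException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    @SuppressWarnings("unchecked")
    private static void store(Object bean, Field f, String value) throws IllegalAccessException {
        if (f == null) { return; }                // a path the model does not care about
        if (List.class.isAssignableFrom(f.getType())) {
            List<String> list = (List<String>) f.get(bean);
            if (list == null) { list = new ArrayList<>(); f.set(bean, list); }
            list.add(value);                      // List field: append
        } else {
            f.set(bean, value);                   // String field: overwrite
        }
    }

    public static void main(String[] args) {
        String xml = "<companies><company><name>Acme</name><domains>"
                + "<value>acme.com</value><value>acme.org</value>"
                + "</domains></company></companies>";
        new SimpleParserSketch<Company>(
                c -> System.out.println(c.getName() + " " + c.getDomains()),
                Company.class).parse(new StringReader(xml));
    }
}
```

Handing each finished bean to a consumer, rather than collecting them all in a list, keeps memory flat on huge files, which is the reason for using a streaming parser in the first place.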

Anyway, surprisingly, it worked pretty well! I realize this isn't going to solve all XML parsing needs, but it seems like for 90% of my use cases, it's "good enough."

I did a few searches, and didn't really find anything as simple, although StaXMate looks pretty interesting...

I'd love to get some feedback on this. I wonder if there's already something else out there? Or a better way to do this?

I put the code up on github, as stupid-xml. Quick warning: this is just a late night experiment. It's not production code. I did almost no checking of anything, there are no comments, it's probably crazy inefficient, and I'm probably doing a ton of dumb things. If so, please feel free to fork the code and submit changes, or send feedback, etc.

1 comment:

cowtowncoder said...

In theory, SXC [http://sxc.codehaus.org/] was to do something similar: you define XPath, it'd build you binder classes.

This is a good idea, I think; streaming parsing, annotations, to get an efficient data binder.