Thursday, June 14, 2012

AWS Billing Info in Hive

Amazon recently (finally!) launched programmatic access to your AWS billing data.

Once you turn it on, select a bucket, grant access to the AWS system user, you'll get a .csv file with your estimated billing for the month. The files are delivered daily, but they contain month-to-date information, and will replace the file from the previous day.

It's easy enough to view this information in excel (or similar), but I thought it would be fun to take a look in hive, especially once we start having data for a few months to aggregate over.

Amazon delivers the data to the root of your bucket. I decided to start moving it to a hive-partitioned path, to make it easier to query once we start have more data. I wrote a simple scala script to move the data to [bucket]/partioned/year=[year]/month=[month]/[file]. Here's some example code.

Ok, now we're ready to read the data in Hive.

Here's a hive schema for the AWS billing information. It uses the csv-serde (make sure you add that jar before running the create table statement). Run alter table aws_billing recover partitions; to load in the partitions (one per year/month), and you're ready to query.

Like I said, it's overkill to use hive to read this data for a month or so, but it's just so addictive having a SQL interface to arbitrary S3 data :).

Here are some example queries to get you started.

Costs by Service

select ProductCode, UsageType, Operation, sum(TotalCost)
  from aws_billing
 where RecordType in ("PayerLineItem", "LinkedLineItem")
    by ProductCode, UsageType, Operation

EC2 usage, by size (across EC2/EMR)

select ProductCode, UsageType,sum(TotalCost)
  from aws_billing
 where RecordType in ("PayerLineItem", "LinkedLineItem")
   and UsageType like "BoxUsage%"
    by ProductCode, UsageType

Wednesday, June 13, 2012

the golden rule of programming style

There's an interesting page on the subject of compilation units per file over at the scala style guide.

The guideline, is, delightfully vague, which I will paraphrase as: Mostly use single files, unless you can't, or unless it's better if you don't.

The author(s) go on to expand on the reasoning behind breaking the guideline:

Another case is when multiple classes logically form a single, cohesive group, sharing concepts to the point where maintenance is greatly served by containing them within a single file. These situations are harder to predict… Generally speaking, if it is easier to perform long-term maintenance and development on several units in a single file rather than spread across multiple, then such an organizational strategy should be preferred for these classes.

This touches on what I consider to be the golden rule of programming style: Make your intent clear and the code easy to read.

Software spends most of its life in maintenance, which is why we have style guides and coding standards. It's valuable to have consistent looking code to promote a shared vocabulary, improve readability, and steer away from confusing or error-prone constructs.

It is just as important to be able to understand, both as an author and as a reviewer, that in certain cases following the letter of the law goes against the main goal of improving readability and maintenance. A one-size-fits-all rule does not always work, and as the authors of this particular guideline mention, "these situations are harder to predict."

Make your intent clear and the code easy to read.