Wednesday, August 24, 2011

CloudWatch metrics revisited

In a previous post, I discussed our initial usage of CloudWatch custom metrics. Since then, we've added more metrics and changed how we're recording them, so I thought it might be helpful to revisit the topic.

Metric Namespaces

Initially we had a single namespace per application. We've since decided that the stage should be included in the namespace as well, e.g. api-web-prod and api-web-dev. It makes sense to keep metrics from different stages completely separate, especially if you are using them for alerting or scaling events.

Metric Regions

When we started, we were logging all metrics to us-east (this may have been a requirement of the beta?). Going forward, it made sense to log to the specific region where the events occurred. It's a little more work if you want to aggregate across regions, but it matches the rest of our infrastructure layout better. It's also a requirement if you want to use metrics for auto-scaling, since the alarms that drive scaling can only act on metrics in their own region.
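
As a rough sketch of what this looks like in practice (using the AWS Java SDK from Scala; the credentials, endpoint, and helper names below are made up for illustration), each region gets its own client pointed at that region's CloudWatch endpoint, and the namespace carries the application and stage:

  import com.amazonaws.auth.BasicAWSCredentials
  import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient
  import com.amazonaws.services.cloudwatch.model.{MetricDatum, PutMetricDataRequest, StandardUnit}

  object MetricLogger {
    // hypothetical credentials, for illustration only
    private val credentials = new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")

    // one client per region, pointed at that region's CloudWatch endpoint
    val client = new AmazonCloudWatchClient(credentials)
    client.setEndpoint("monitoring.us-west-1.amazonaws.com")

    // the namespace combines application and stage, e.g. "api-web-prod"
    def log(application: String, stage: String, name: String, value: Double): Unit = {
      val datum = new MetricDatum()
        .withMetricName(name)
        .withValue(value)
        .withUnit(StandardUnit.Count)
      client.putMetricData(new PutMetricDataRequest()
        .withNamespace(application + "-" + stage)
        .withMetricData(datum))
    }
  }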

Dropping InstanceId dimensions (by default)

This is something we are currently working on rolling out. When we first started logging events, we would include a metric update tagged with the InstanceId. This mirrors how the built-in AWS metrics work. It seemed like it would be useful to be able to "drill down" when investigating an issue, e.g. the Maximum CPU utilization in this group is at 100%; okay, which instance is it?

In practice, we have started to question the utility versus cost, especially for custom metrics. When you run large services with auto-scaling, you end up generating a lot of metrics for very transient instances. Since the cost structure is based on the number of unique metrics used, this can really add up.

To put some numbers on it: looking at the output of mon-list-metrics in us-east-1 alone, we have 31,888 metrics with an InstanceId dimension, and that's just for the last 2 weeks. If we were paying for all of those (luckily, most of them are built-in metrics), then at $0.50 per custom metric per month it would cost us roughly $15k for those 2 weeks of metrics on very transient instances.
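
(For the curious, the same count can be pulled programmatically instead of via mon-list-metrics. A rough sketch, reusing the hypothetical client from the earlier snippet and paging through ListMetrics results filtered on the InstanceId dimension:)

  import com.amazonaws.services.cloudwatch.model.{DimensionFilter, ListMetricsRequest}

  object MetricCounter {
    // count every metric that carries an InstanceId dimension, following NextToken pages
    def countInstanceMetrics(): Int = {
      val filter = new DimensionFilter().withName("InstanceId")
      var token: String = null
      var count = 0
      do {
        val result = MetricLogger.client.listMetrics(
          new ListMetricsRequest().withDimensions(filter).withNextToken(token))
        count += result.getMetrics.size
        token = result.getNextToken
      } while (token != null)
      count
    }
  }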

It has been useful to have InstanceId granularity metrics in the past, and in a perfect world maybe we'd still be collecting them, but with the current price structure it's just too expensive for most of our (auto-scaled) services.

Metric Dimensions revisited

When we first started using CloudWatch custom metrics, we would log the following dimensions for each event:
  • Version, e.g. 124021 (svn revision number)
  • Stage, e.g. prod
  • Region, e.g. us-west-1
  • Application, e.g. api-web
  • InstanceId, e.g. i-201345a
We can drop Stage and Region thanks to the namespace and region changes above. As mentioned, we've also decided to drop InstanceId for most of our services. That makes our current list of default dimensions:
  • Version, e.g. 124021 (svn revision number)
  • Application, e.g. api-web
We're still tracking stage and region; they're just captured by the namespace and the region the metric is logged to, so they don't need to be expressed as dimensions.
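
To make that concrete, here's roughly what a datum looks like under the new scheme, carrying only the two default dimensions (the metric name and values are just examples, and the builder calls assume the AWS Java SDK as in the earlier sketch):

  import com.amazonaws.services.cloudwatch.model.{Dimension, MetricDatum, StandardUnit}

  // default dimensions: Version (svn revision) and Application
  val defaultDimensions = Seq(
    new Dimension().withName("Version").withValue("124021"),
    new Dimension().withName("Application").withValue("api-web"))

  val datum = new MetricDatum()
    .withMetricName("ErrorCount")
    .withValue(1.0)
    .withUnit(StandardUnit.Count)
    .withDimensions(defaultDimensions: _*)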

More Metrics!

One of our developers, Darren, put together a JMX->CloudWatch bridge. Each application can express which JMX stats it would like to export via a JSON config file. Here's a short excerpt that will send HeapMemoryUsage to CloudWatch every 60 seconds:
  {
    "objectName" : "java.lang:type=Memory",
    "attribute" : "HeapMemoryUsage",
    "compositeDataKey" : "used",
    "metricName" : "HeapMemoryUsage",
    "unit" : "Bytes",
    "frequency" : 60
  },
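
I won't pretend this is how Darren's bridge is actually implemented, but conceptually the core of it is: look up the configured MBean attribute, pull out the composite data key if there is one, and hand the number to CloudWatch on the configured frequency. A minimal sketch (the JmxExport case class and the MetricLogger helper are my own made-up names):

  import java.lang.management.ManagementFactory
  import javax.management.ObjectName
  import javax.management.openmbean.CompositeData

  object JmxSampler {
    // one configured export, mirroring the fields in the JSON above
    case class JmxExport(objectName: String, attribute: String, compositeDataKey: String,
                         metricName: String, unit: String, frequency: Int)

    private val server = ManagementFactory.getPlatformMBeanServer

    // read the MBean attribute; for composite values like HeapMemoryUsage, pick out one key
    def sample(export: JmxExport): Double =
      server.getAttribute(new ObjectName(export.objectName), export.attribute) match {
        case c: CompositeData => c.get(export.compositeDataKey).toString.toDouble
        case n: Number        => n.doubleValue
      }

    // a scheduler would then call something like
    //   MetricLogger.log("api-web", "prod", export.metricName, sample(export))
    // every `frequency` seconds
  }
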
I'm sure the list will grow, but here are some of the metrics we've found most useful so far:
  • NonHeapMemoryUsage
  • HeapMemoryUsage
  • OpenFileDescriptorCount
  • SystemLoadAverage
  • ThreadCount

I'm hoping Darren will describe the bridge in more detail in a future post. It's made it really easy for applications to push system metrics to CloudWatch.

Of course, we're also sending a lot of application-specific event metrics.

Homegrown interfaces

The AWS CloudWatch console is really slow. It also seems like it will only load 5,000 metrics, while our us-east "AWS/EC2" namespace alone has 28k metrics. Additionally, you can only view metrics for a single region at a time. We just haven't had a lot of success with the web console.

We've been relying pretty heavily on the command line tools for investigation, which can be a little tedious.

We've also written some scripts that aggregate daily metrics for each app and insert them into a Google Docs spreadsheet to help track trends.

For our last hack day, I started working on a (very rough!) prototype for a custom cloudwatch console.

The app is written using Play (Scala) with Flot for the graphs.

It heavily caches the namespace/metric/dimension/value hierarchies, and queries all regions simultaneously. It certainly feels much faster than the built-in console.
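
There's no magic behind the simultaneous region queries; it's just the same CloudWatch call fired at each region's endpoint in parallel and the results merged. Something along these lines (a sketch only, with a hypothetical endpoint list and only the first page of results):

  import java.util.concurrent.{Callable, Executors}
  import scala.collection.JavaConverters._
  import com.amazonaws.auth.AWSCredentials
  import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient
  import com.amazonaws.services.cloudwatch.model.{ListMetricsRequest, Metric}

  object AllRegions {
    // the CloudWatch endpoints we care about (illustrative list)
    val endpoints = Seq(
      "monitoring.us-east-1.amazonaws.com",
      "monitoring.us-west-1.amazonaws.com",
      "monitoring.eu-west-1.amazonaws.com")

    private val pool = Executors.newFixedThreadPool(endpoints.size)

    // query every region in parallel and merge the metric lists (first page only, for brevity)
    def listAllMetrics(credentials: AWSCredentials): Seq[Metric] = {
      val futures = endpoints.map { endpoint =>
        pool.submit(new Callable[Seq[Metric]] {
          def call(): Seq[Metric] = {
            val client = new AmazonCloudWatchClient(credentials)
            client.setEndpoint(endpoint)
            client.listMetrics(new ListMetricsRequest()).getMetrics.asScala.toSeq
          }
        })
      }
      futures.flatMap(_.get)
    }
  }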

It's great just being able to quickly graph metrics by name, but my main motivation for this console was to provide a place where we could start to inject some intelligence about our metrics. The CloudWatch interface has to be really generic to support a wide range of uses and metrics. For our own metrics, we have a better understanding of what they mean and how they're related. For example, if the ErrorCount metric is high, we know which other metrics/dimensions can help us drill down and find the cause. I'm hoping to build those kinds of relationships into this dashboard.

Summary

So that's how we're currently using CloudWatch at Bizo. There are still some rough edges, but we've been pretty happy with it. It's really easy to log and aggregate metric data with hardly any infrastructure.

I'd love to hear about other experiences, comments, or uses people have had with CloudWatch.

Friday, August 19, 2011

Report delivery from Hive via Google Spreadsheets

At Bizo, we run a number of periodically scheduled Hive jobs that produce a high-level summary as just a few rows (often just one) of data. In the past, we've simply used the same delivery mechanism as with larger reports; the output is emailed as a CSV file to the appropriate distribution list. This was less than ideal for a number of reasons:


  1. Managing the distribution lists is difficult. We either needed to create a new list for each type of report, giving us a lot of lists to manage, or just send reports to a generic distribution list, resulting in a lot of unnecessary emails to people who weren’t necessarily interested in the report.

  2. Handling the historical context is manual; either the report needs to pull in past results to include in the output, or recipients need to dig through older emails to see trends.

  3. Report delivery required an additional step in the job workflow outside of the Hive script.

With the GData storage handler, we now just create a Google Spreadsheet, add appropriate column headers, and do something like this in our script:



add jar gdata-storagehandler.jar ;

create external table gdata_output(
day string, cnt int, source_class string, source_method string, thrown_class string
)
stored by 'com.bizo.hive.gdata.GDataStorageHandler'
with serdeproperties (
"gdata.user" = "user@bizo.com",
"gdata.consumer.key" = "bizo.com",
"gdata.consumer.secret" = "...",
"gdata.spreadsheet.name" = "Daily Exception Summary",
"gdata.worksheet.name" = "My Application",
"gdata.columns.mapping" = "day,count,class,method,thrown"
)
;

Any rows written to the table (e.g. with an INSERT OVERWRITE TABLE gdata_output SELECT ... at the end of the job) are appended to the specified spreadsheet.


The source code is available here. If you’re running your jobs on Amazon’s Elastic MapReduce, you can access the storage handler by adding the following line to your Hive script:



add jar s3://com-bizo-public/hive/storagehandler/gdata-storagehandler-0.1.jar ;

Note that the library only supports 2-legged OAuth access to Google Apps for Domains, which needs to be enabled in your Google Apps control panel.

Friday, August 12, 2011

Bizo dev team @ TechShopSF

[Photo: IMG_0887]

Every quarter we have an "all hands" week, where the entire company comes to SF (the Bizo team is spread out across the country).

As part of this, we typically spend a day as a development team going over previous accomplishments and upcoming projects, as well as discussing our development process, architecture, etc.

We also spend some time making cool stuff! Last time around we had an internal Arduino workshop. Each developer got an Arduino and various components, and we went through a bunch of exercises from Getting Started with Arduino. We ended the day getting Wii controllers hooked up to our Arduinos (can't beat that).

This time around, we decided to head over to the SF TechShop and learn how to screen print.

We ended up with some great shirts:

[Photo: IMG_0898]

They use a really cool process there, where you use a vinyl cutter to create a stencil for your artwork, which you can then just apply to your screen.

It was a lot of fun, and I think we all learned a lot. Special thanks to our instructor, Liz, as well as Devon at TechShop for helping us get this set up.

Check out some more shirts in this photo set.