Wednesday, August 24, 2011

CloudWatch metrics revisited

In a previous post, I discussed our initial usage of CloudWatch custom metrics. Since then, we've added more metrics and changed how we're recording them, so I thought it might be helpful to revisit the topic.

Metric Namespaces

Initially, we had a single namespace per application. We've since decided that the stage should be included in the namespace, e.g. api-web-prod, api-web-dev. It makes sense to keep metrics from different stages completely separate, especially if you're using them for alerting or scaling events.
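In code, the only change is that the namespace now carries the stage. A minimal sketch using the AWS Java SDK (the putErrorCount helper and the ErrorCount metric are just for illustration):
  import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient
  import com.amazonaws.services.cloudwatch.model.{MetricDatum, PutMetricDataRequest, StandardUnit}

  // Sketch: build the namespace from application + stage so prod and dev
  // metrics never mix.
  def putErrorCount(client: AmazonCloudWatchClient, application: String, stage: String): Unit = {
    val datum = new MetricDatum()
      .withMetricName("ErrorCount")
      .withUnit(StandardUnit.Count)
      .withValue(1.0)
    client.putMetricData(new PutMetricDataRequest()
      .withNamespace(application + "-" + stage) // e.g. "api-web-prod"
      .withMetricData(datum))
  }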

Metric Regions

When we started, we were logging all metrics to us-east (it may have been a requirement of the beta?). Going forward, it made sense to log to the specific region where the events occurred. It's a little more work if you want to aggregate across regions, but it matches the rest of our infrastructure layout better. Also, if you want to use metrics for auto-scaling events, it's a requirement, since the alarms have to live in the same region as the Auto Scaling group.
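With the Java SDK, this just means pointing the client at the regional endpoint (a sketch; reading credentials from environment variables is an assumption):
  import com.amazonaws.auth.BasicAWSCredentials
  import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient

  // Sketch: log to the CloudWatch endpoint for the region where the
  // event occurred, instead of always sending everything to us-east.
  val client = new AmazonCloudWatchClient(
    new BasicAWSCredentials(sys.env("AWS_ACCESS_KEY_ID"), sys.env("AWS_SECRET_ACCESS_KEY")))
  client.setEndpoint("monitoring.us-west-1.amazonaws.com")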

Dropping InstanceId dimensions (by default)

This is something we're currently rolling out. When we first started logging events, we would include a metric update tagged with the InstanceId, mirroring how the built-in AWS metrics work. It seemed like it would be useful to be able to "drill down" when investigating an issue: the Maximum CPU utilization in this group is at 100%; okay, which instance is it?
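Concretely, the old scheme published every value twice: once aggregated, and once tagged with the instance. A sketch (the helper is hypothetical, not our actual logging code):
  import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient
  import com.amazonaws.services.cloudwatch.model.{Dimension, MetricDatum, PutMetricDataRequest, StandardUnit}

  // Sketch: one aggregate datum plus one per-instance datum, mirroring
  // how the built-in AWS metrics are reported.
  def putWithInstance(client: AmazonCloudWatchClient, namespace: String,
                      name: String, value: Double, instanceId: String): Unit = {
    val aggregate = new MetricDatum().withMetricName(name)
      .withUnit(StandardUnit.Count).withValue(value)
    val perInstance = new MetricDatum().withMetricName(name)
      .withUnit(StandardUnit.Count).withValue(value)
      .withDimensions(new Dimension().withName("InstanceId").withValue(instanceId))
    client.putMetricData(new PutMetricDataRequest()
      .withNamespace(namespace).withMetricData(aggregate, perInstance))
  }
Since every unique metric-name/dimension combination counts as a separate metric, it's those per-instance copies that drive the cost discussed below.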

In practice, we have started to question the utility versus cost, especially for custom metrics. When you run large services with auto-scaling, you end up generating a lot of metrics for very transient instances. Since the cost structure is based on the number of unique metrics used, this can really add up.

For some numbers: looking at the output of mon-list-metrics in us-east-1 alone, we have 31,888 metrics with an InstanceId dimension! That's just for the last 2 weeks. If we were paying for all of those (luckily, most of them are built-in metrics), then at $0.50 per custom metric per month it would cost us around $15k for those 2 weeks of metrics on very transient instances.

It has been useful to have InstanceId granularity metrics in the past, and in a perfect world maybe we'd still be collecting them, but with the current price structure it's just too expensive for most of our (auto-scaled) services.

Metric Dimensions revisited

When we first started using CloudWatch custom metrics, we would log the following dimensions for each event:
  • Version, e.g. 124021 (svn revision number)
  • Stage, e.g. prod
  • Region, e.g. us-west-1
  • Application, e.g. api-web
  • InstanceId, e.g. i-201345a
We can drop Stage and Region thanks to the namespace and region changes above. As mentioned, we've also decided to drop InstanceId for most of our services. This leaves our current list of default dimensions:
  • Version, e.g. 124021 (svn revision number)
  • Application, e.g. api-web
We're still tracking stage and region via the namespace and region themselves; they just don't need to be expressed as dimensions.
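A typical datum now carries just those two defaults (a sketch; RequestCount is an example name, and the version would really come from our build info):
  import com.amazonaws.services.cloudwatch.model.{Dimension, MetricDatum, StandardUnit}

  // Sketch: the current default dimension set, attached to every datum.
  val defaults = Seq(
    new Dimension().withName("Version").withValue("124021"), // svn revision
    new Dimension().withName("Application").withValue("api-web"))
  val datum = new MetricDatum().withMetricName("RequestCount")
    .withUnit(StandardUnit.Count).withValue(1.0)
    .withDimensions(defaults: _*)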

More Metrics!

One of our developers, Darren, put together a JMX->CloudWatch bridge. Each application can express which JMX stats it would like to export via a JSON config file. Here's a short excerpt that will send HeapMemoryUsage to CloudWatch every 60 seconds:
  {
    "objectName" : "java.lang:type=Memory",
    "attribute" : "HeapMemoryUsage",
    "compositeDataKey" : "used",
    "metricName" : "HeapMemoryUsage",
    "unit" : "Bytes",
    "frequency" : 60
  }
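Under the hood, a bridge like this doesn't need much code. Here's a rough sketch of the read-and-publish step for the entry above (my illustration, not Darren's actual implementation; scheduling it every "frequency" seconds is left out):
  import java.lang.management.ManagementFactory
  import javax.management.ObjectName
  import javax.management.openmbean.CompositeData
  import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient
  import com.amazonaws.services.cloudwatch.model.{MetricDatum, PutMetricDataRequest, StandardUnit}

  // Sketch: read the configured JMX attribute, pull out the composite
  // data key ("used"), and push the value to CloudWatch.
  def publishHeapUsed(client: AmazonCloudWatchClient, namespace: String): Unit = {
    val server = ManagementFactory.getPlatformMBeanServer
    val heap = server.getAttribute(new ObjectName("java.lang:type=Memory"), "HeapMemoryUsage")
      .asInstanceOf[CompositeData]
    val used = heap.get("used").asInstanceOf[java.lang.Long]
    client.putMetricData(new PutMetricDataRequest()
      .withNamespace(namespace)
      .withMetricData(new MetricDatum()
        .withMetricName("HeapMemoryUsage")
        .withUnit(StandardUnit.Bytes)
        .withValue(used.doubleValue)))
  }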
I'm sure the list will grow, but here are some of the metrics we've found most useful so far:
  • NonHeapMemoryUsage
  • HeapMemoryUsage
  • OpenFileDescriptorCount
  • SystemLoadAverage
  • ThreadCount

I'm hoping Darren will describe the bridge in more detail in a future post. It's made it really easy for applications to push system metrics to CloudWatch.

Of course, we're also sending a lot of application-specific event metrics.

Homegrown interfaces

The AWS CloudWatch console is really slow, and it seems like it will only load 5,000 metrics; our us-east "AWS/EC2" namespace alone has 28k. Additionally, you can only view metrics for a single region at a time. We just haven't had a lot of success with the web console.

We've been relying pretty heavily on the command line tools for investigation, which can be a little tedious.
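For example, pulling a day of heap stats for a single app means getting a handful of flags right every time (illustrative, using the namespace and dimensions from our setup above):
  mon-get-stats HeapMemoryUsage \
      --namespace "api-web-prod" \
      --dimensions "Application=api-web" \
      --statistics "Average,Maximum" \
      --period 300 \
      --start-time 2011-08-23T00:00:00 \
      --region us-west-1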

We've also written some scripts that will aggregate daily metrics for each app and insert them into a Google Docs spreadsheet to help track trends.

For our last hack day, I started working on a (very rough!) prototype for a custom CloudWatch console.

The app is written using Play (Scala), with Flot for the graphs.

It heavily caches the namespace/metric/dimension/value hierarchies, and queries all regions simultaneously. It certainly feels much faster than the built-in console.
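The fan-out itself is simple with futures. A sketch of the idea (the region list and timeout are assumptions, and real code would follow NextToken to page through results):
  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration._
  import scala.jdk.CollectionConverters._
  import com.amazonaws.auth.BasicAWSCredentials
  import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient
  import com.amazonaws.services.cloudwatch.model.{ListMetricsRequest, Metric}

  // Sketch: issue ListMetrics against every region in parallel and merge
  // the results, instead of querying one region at a time.
  val regions = Seq("us-east-1", "us-west-1", "eu-west-1", "ap-southeast-1")

  def listAllMetrics(credentials: BasicAWSCredentials, namespace: String): Seq[Metric] = {
    val perRegion = regions.map { region =>
      Future {
        val client = new AmazonCloudWatchClient(credentials)
        client.setEndpoint("monitoring." + region + ".amazonaws.com")
        // first page only; a real version would loop on getNextToken
        client.listMetrics(new ListMetricsRequest().withNamespace(namespace))
          .getMetrics.asScala.toSeq
      }
    }
    Await.result(Future.sequence(perRegion), 30.seconds).flatten
  }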

It's great just being able to quickly graph metrics by name, but my main motivation for this console was to provide a place where we could start to inject some intelligence about our metrics. The CloudWatch interface has to be really generic to support a wide range of uses and metrics. For our metrics, we have a better understanding of what they mean and how they're related: e.g., if the ErrorCount metric is high, we know which other metrics/dimensions can help us drill down and find the cause. I'm hoping to build those kinds of relationships into this dashboard.

Summary

So that's how we're currently using CloudWatch at Bizo. There are still some rough edges, but we've been pretty happy with it. It's really easy to log and aggregate metric data with hardly any infrastructure.

I'd love to hear about any other experiences, comments, or uses people have had with CloudWatch.
