Monday, May 16, 2011

Synchronizing Stashboard with Pingdom alerts

First, what's Stashboard? It's an open-source status page for cloud services and APIs.
Alright, now what's Pingdom? It's a commercial service for monitoring cloud services and APIs. You define how to "ping" a service, and Pingdom periodically checks whether the service responds to the ping request. If it doesn't, Pingdom sends email or SMS alerts.



See the connection? At Bizo, we've had Stashboard deployed on Google's AppEngine for a while but we were updating the status of services manually -- only when major outages happened.

Recently, we've been wanting something more automated, so we decided to synchronize Stashboard status with Pingdom's notification history and came up with the following requirements:
  1. Synchronize Stashboard within 15 minutes of Pingdom's alert.
  2. "Roll up" several Pingdom alerts into a single Stashboard status (i.e., for a given service, we have several Pingdom alerts covering different regions around the world, but we only want to show a single service status in Stashboard).
  3. If any of the related Pingdom alerts indicate a service is currently unavailable, show "Service is currently down" status.
  4. If the service is available but there have been any alerts in the past 24 hours, show "Service is experiencing intermittent problems" status.
  5. Otherwise, display "Service is up" status.
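
Requirements 3 to 5 boil down to a small decision rule. Here is a minimal sketch of it in isolation; the `Check` struct and the `rolled_up_status` helper are hypothetical names used only for illustration and are not part of either gem:

```ruby
# Hypothetical stand-in for a Pingdom check: its current status and the
# end times of its recorded outages.
Check = Struct.new(:status, :outage_end_times)

DAY_IN_SECONDS = 24 * 60 * 60

# Roll several checks for one service up into a single status symbol.
def rolled_up_status(checks, now = Time.now)
  # Requirement 3: any check that is currently down means the service is down.
  return :down if checks.any? { |check| check.status == "down" }

  # Requirement 4: up right now, but an outage ended within the past 24 hours.
  cutoff = now - DAY_IN_SECONDS
  recent = checks.any? do |check|
    check.outage_end_times.any? { |ended_at| ended_at > cutoff }
  end

  # Requirement 5: otherwise the service is simply up.
  recent ? :warning : :up
end
```

Passing `now` in as an argument keeps the rule easy to try out with fixed timestamps.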
There are several ways we could have implemented this. We initially thought about using AppEngine's Python Mail API but decided against it since we're not familiar enough with Python and we didn't want to customize Stashboard from the inside. We ended up doing an integration "from the outside" using a cron job and a Ruby script that uses the stashboard and the pingdom-client gems.

It was actually pretty simple. To connect to both services,

require 'pingdom-client'
require 'stashboard'

pingdom = Pingdom::Client.new pingdom_auth.merge(:logger => logger)

stashboard = Stashboard::Stashboard.new(
  stashboard_auth[:url],
  stashboard_auth[:oauth_token],
  stashboard_auth[:oauth_secret]
)

then define the mappings between our Pingdom alerts and Stashboard services using a hash of regular expressions,

# Stashboard service id => Regex matching pingdom check name(s)
services = {
  'api' => /api/i,
  'analyze' => /analyze/i,
  'self-service' => /bizads/i,
  'data-collector' => /data collector/i
}
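
To make the mapping concrete, here is a small sketch of how a check name could be resolved to a Stashboard service id. The `service_for` helper and the check names in the usage note are made up for illustration; only the hash shape comes from the script:

```ruby
# Same shape as the mapping above: Stashboard service id => regex.
SERVICE_PATTERNS = {
  'api' => /api/i,
  'data-collector' => /data collector/i
}

# Return the id of the first service whose pattern matches the check name,
# or nil when no pattern matches.
def service_for(check_name, patterns = SERVICE_PATTERNS)
  match = patterns.find { |_id, regex| check_name =~ regex }
  match && match.first
end
```

With these patterns, a check named "API (us-east)" would map to `'api'`, and a check that matches nothing returns `nil` and can be skipped. The first matching pattern wins, so the order of the hash matters if two patterns could match the same name.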

and iterate over all Pingdom alerts and, for each mapping, determine whether the service is up or has had alerts in the past 24 hours,

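# Work on a copy so that deleting from up_services doesn't also remove
# entries from the services mapping used for lookups below.
up_services = services.dup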
warning_services = {}

# Synchronize recent pingdom outages over to stashboard
# and determine which services are currently up.
pingdom.checks.each do |check|
  service = services.keys.find do |service|
    regex = services[service]
    check.name =~ regex
  end
  next unless service
  
  # check if any outages in past 24 hours
  yesterday = Time.now - 24 * 60 * 60  # 24 hours, in seconds
  recent_outages = check.summary.outages.select do |outage|
    outage.timefrom > yesterday || outage.timeto > yesterday
  end
  
  # synchronize outage if necessary
  recent_events = stashboard.events(service, "start" => yesterday.strftime("%Y-%m-%d"))
  recent_outages.each do |outage|
    # TIME_FORMAT is a strftime format string defined elsewhere in the full script.
    msg = "Service #{check.name} unavailable: " +
      "#{outage.timefrom.strftime(TIME_FORMAT)} - #{outage.timeto.strftime(TIME_FORMAT)}"
    unless recent_events.any? { |event| event["message"] == msg }
      stashboard.create_event(service, "down", msg)
    end
  end
  
  # if service has recent outages, display warning
  unless recent_outages.empty?
    up_services.delete(service)
    warning_services[service] = true
  end

  # if any pingdom check fails for a given service, consider the service down.
  up_services.delete(service) if check.status == "down"
end
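
Since the cron job runs over and over, the same outage is seen on many runs; the loop above avoids duplicate events by using the message text itself as the key. A stripped-down sketch of that idea follows, where the `Outage` struct, the `outages_to_post` helper and the `TIME_FORMAT` value are assumptions for illustration (the real format is defined in the full script):

```ruby
# Assumed format string; the full script defines its own TIME_FORMAT.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S %Z"

# Hypothetical stand-in for a Pingdom outage record.
Outage = Struct.new(:timefrom, :timeto)

# Build the event message for one outage of one check.
def outage_message(check_name, outage)
  "Service #{check_name} unavailable: " \
    "#{outage.timefrom.strftime(TIME_FORMAT)} - #{outage.timeto.strftime(TIME_FORMAT)}"
end

# Keep only the outages whose message is not already among the existing
# events, so repeated runs never post the same outage twice.
def outages_to_post(check_name, outages, existing_events)
  posted = existing_events.map { |event| event["message"] }
  outages.reject { |outage| posted.include?(outage_message(check_name, outage)) }
end
```

One consequence of keying on the message is that changing the message wording or the time format would cause already-synchronized outages to be posted again.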

Lastly, if any services are up or should indicate a warning then we update their status accordingly,

up_services.each_key do |service|
  current = stashboard.current_event(service)
  if current["message"] =~ /(Service .* unavailable)|(Service operational but has experienced outage)/i
    stashboard.create_event(service, "up", "Service operating normally.")
  end
end

warning_services.each_key do |service|
  current = stashboard.current_event(service)
  if current["message"] =~ /Service .* unavailable/i
    stashboard.create_event(service, "warning", "Service operational but has experienced outage(s) in past 24 hours.")
  end
end

Note that manually entered Stashboard status messages are not changed unless they match one of the automated messages or Pingdom reports a new outage. This is intentional: it lets us override the automated updates if, for any reason, a failure isn't accurately reported.
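
The override rule can be read straight off the two regular expressions used above. As a rough sketch, with hypothetical helper and constant names, the decision of whether the script may replace the current message looks like this:

```ruby
# Patterns that identify messages written by the script itself.
OUTAGE_MESSAGE  = /Service .* unavailable/i
WARNING_MESSAGE = /Service operational but has experienced outage/i

# An "up" event may replace an automated outage or warning message.
def overwrite_with_up?(current_message)
  !!(current_message =~ OUTAGE_MESSAGE || current_message =~ WARNING_MESSAGE)
end

# A "warning" event may only replace an automated outage message.
def overwrite_with_warning?(current_message)
  !!(current_message =~ OUTAGE_MESSAGE)
end
```

A hand-written message such as "Scheduled maintenance tonight" matches neither pattern, so both helpers return `false` and the manual status stays in place.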

Curious about what the end result looks like? Take a look at Bizo's status dashboard.

When you click on a specific service, you can see individual outages.

We hope this is useful to somebody out there... and big thanks to the Stashboard authors at Twilio, Matt Todd for creating the pingdom-client gem and Sam Mulube for the stashboard gem. You guys rule!

PS: You can download the full Ruby script from https://gist.github.com/975141.
