Thursday, April 21, 2011

How Bizo survived the Great AWS Outage of 2011 relatively unscathed...

The twittersphere, tech blogs and even some business sites are abuzz with the news that the US East region of AWS has been experiencing a major outage. This outage has taken down some of the most well-known names on the web. Bizo's infrastructure is 100% AWS and we support 1000s of publisher sites (including some very well-known business sites) doing billions of impressions a month. Sure, we had a few bruises early yesterday morning when the outage first began, but since then we've been operating our core, high-volume services on top of AWS, just without the East region.


Here is how we have remained up despite not having a single ops person on our engineering team:


1) Our services are well monitored
We rely on Pingdom for external verification of site availability on a worldwide basis. Additionally, we have our own internal alarms and dashboards that give us up-to-the-minute metrics such as request rate, CPU utilization, etc. Most of this data comes from AWS CloudWatch monitoring, but we also track error rates and have alarms set up to alert us when these rates change or go over a certain threshold.
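The post doesn't show how these alarms are wired up, so here is a minimal sketch, using the boto3 SDK purely for illustration (it postdates this post): it publishes a custom error-rate metric and creates an alarm that fires when the rate crosses a threshold. The namespace, metric name, threshold and SNS topic are all assumptions, not Bizo's actual setup.

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    # Publish a custom error-rate data point (hypothetical namespace/metric name).
    cloudwatch.put_metric_data(
        Namespace="ExampleApp",
        MetricData=[{"MetricName": "ErrorRate", "Value": 0.4, "Unit": "Percent"}],
    )

    # Alarm when the average error rate stays above 5% for three 60-second periods.
    cloudwatch.put_metric_alarm(
        AlarmName="example-app-error-rate-high",
        Namespace="ExampleApp",
        MetricName="ErrorRate",
        Statistic="Average",
        Period=60,
        EvaluationPeriods=3,
        Threshold=5.0,
        ComparisonOperator="GreaterThanThreshold",
        # Hypothetical SNS topic that pages the on-call engineer.
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
    )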


2) Our services have circuit breakers that trip when remote services become unavailable, and we cache data heavily
When building our services, we always assume that remote services will fail at some point. We've spent a good deal of time investing in minimizing the domino effect of a failing remote service. When a remote service becomes unavailable, the caller detects this and goes into a tripped mode, occasionally retrying with backoff. Of course, we also rely heavily on caching read-only data and are able to take advantage of the fact that the data needed for most of our services does not change very often.
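The post doesn't include the breaker code itself; the sketch below is just one common way to implement the pattern described: count failures, trip after a threshold, short-circuit to cached or fallback data while tripped, and retry with a growing backoff. All class and function names are illustrative.

    import time

    class CircuitBreaker:
        """Trips after repeated failures and retries the remote call with backoff."""

        def __init__(self, failure_threshold=5, initial_backoff=1.0, max_backoff=60.0):
            self.failure_threshold = failure_threshold
            self.initial_backoff = initial_backoff
            self.max_backoff = max_backoff
            self.failures = 0
            self.backoff = initial_backoff
            self.open_until = 0.0  # calls are short-circuited until this time

        def call(self, remote_fn, fallback_fn):
            # While tripped, skip the remote call entirely and serve cached/fallback data.
            if time.time() < self.open_until:
                return fallback_fn()
            try:
                result = remote_fn()
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    # Trip the breaker; allow an occasional retry after a backoff window
                    # that doubles (up to max_backoff) while the remote service stays down.
                    self.open_until = time.time() + self.backoff
                    self.backoff = min(self.backoff * 2, self.max_backoff)
                return fallback_fn()
            # Success: reset the failure count and the backoff window.
            self.failures = 0
            self.backoff = self.initial_backoff
            return result

    # Usage: serve from a local cache when the remote lookup service is down.
    # breaker = CircuitBreaker()
    # data = breaker.call(lambda: fetch_from_remote_service(key),
    #                     lambda: local_cache.get(key))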


3) We utilize autoscaling
One of the promises of AWS is the ability to start and stop servers based on traffic and load. We've been using autoscaling since it was launched, and it worked like a charm. In the graph below you can see instances starting up in the US West region based on the new load as traffic was diverted over from US East.


[Chart: instance count in US West ramping up as traffic shifted over; all times UTC]
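The exact autoscaling configuration isn't described in the post; as a rough sketch (again using boto3 for illustration, with group names, policy names and thresholds that are assumptions), scaling out on load typically looks like a scaling policy tied to a CloudWatch alarm on CPU utilization:

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-west-1")
    cloudwatch = boto3.client("cloudwatch", region_name="us-west-1")

    # Add 2 instances to a (hypothetical) existing group whenever the policy fires.
    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName="example-adserver-group",
        PolicyName="scale-out-on-load",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=2,
        Cooldown=300,
    )

    # Fire the policy when average CPU across the group exceeds 70% for 5 minutes.
    cloudwatch.put_metric_alarm(
        AlarmName="example-adserver-cpu-high",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "example-adserver-group"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=1,
        Threshold=70.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy["PolicyARN"]],
    )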


4) Our architecture is designed to let us funnel traffic around an entire region if necessary
We utilize global server load balancing (GSLB) to direct traffic to the closest region based on the end user's location. For instance, if a user is in California, we direct their traffic to the US West region. This was extremely valuable in keeping us fully functioning in the face of a regional outage. When we finally decided that the US East region was going to cause major issues, switching all traffic to US West was as easy as clicking a few buttons. In the graph below you can see how quickly the requests transitioned over after we made the decision. (By the way, a quick shout-out to Dynect, our GSLB service provider. Thanks!)


[Chart: request volume shifting from US East to US West; all times UTC]
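Dynect does the actual geo-based DNS routing, and the failover was a matter of a few clicks in their console; the sketch below is purely conceptual, showing the kind of decision a GSLB makes: send users to the closest region, and fail over to the next healthy region when their preferred one is marked down. Region names and the geo mapping are illustrative.

    # Conceptual sketch only: the real routing is done by the DNS/GSLB provider.
    REGION_BY_GEO = {
        "us-west-coast": "us-west-1",
        "us-east-coast": "us-east-1",
        "europe": "eu-west-1",
        "asia": "ap-southeast-1",
    }

    # Operators flip a region to unhealthy ("a few clicks") to drain traffic from it.
    HEALTHY = {"us-west-1": True, "us-east-1": False,
               "eu-west-1": True, "ap-southeast-1": True}

    FALLBACK_ORDER = ["us-west-1", "eu-west-1", "ap-southeast-1", "us-east-1"]

    def region_for(user_geo):
        """Return the closest healthy region for a user."""
        preferred = REGION_BY_GEO.get(user_geo, "us-east-1")
        if HEALTHY.get(preferred):
            return preferred
        # Preferred region is down: fail over to the next healthy region.
        for region in FALLBACK_ORDER:
            if HEALTHY.get(region):
                return region
        raise RuntimeError("no healthy regions available")

    # A Californian user lands in US West as usual; with US East marked down,
    # an East-coast user is routed to US West as well.
    print(region_for("us-west-coast"))  # -> us-west-1
    print(region_for("us-east-coast"))  # -> us-west-1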


Bumps and Bruises
Of course, we didn't escape without sustaining some issues. We'll do another blog post on the problems we did run into, but they were relatively minor.


Conclusion
After 3 years of running full time on AWS across 4 regions and 8 availability zones, we design our systems with the assumption that failure will happen, and that assumption helped us come through this outage relatively unscathed.

3 comments:

DZONEMVB said...

I like the post.
I had trouble finding contact info on the blog so I thought I'd comment and ask you if you'd be interested in a syndication program on the developer network that I work for. email me at mitch at dzone dot com and I can tell you more details.

Unknown said...

Great article, really excellent to see how a well structured architecture can avoid the pitfalls of using third party services.

pstehlik said...

great post! it seems many folks at other sites and services can learn a lot from your infrastructure design!


If you are in SF next week (for Google I/O?) I would love to invite you to the 'T1000 Gathering' meetup on May 9th to discuss what strategies we can promote in 'startup land' (and in general) to build new things with technical failure in mind and how to prevent issues like the ones that the EBS outage generated for so many companies.

It is an informal meetup with no set speakers but a big topic to discuss - would love your input
details also at http://bit.ly/mpU5tT