A couple of days ago it happened again. On June 14, around 9 pm PDT, Amazon Web Services (AWS) suffered a power outage in its Northern Virginia data center, affecting EC2, RDS, Elastic Beanstalk and other services in the US-EAST region.
The AWS status page reported:
"Some Cache Clusters in a single AZ in the US-EAST-1 region are currently unavailable. We are also experiencing increased error rates and latencies for the ElastiCache APIs in the US-EAST-1 Region. We are investigating the issue."
This outage affected major sites such as Quora, Foursquare, Pinterest, Heroku and Dropbox. I followed the outage reports, the tweets and the blog posts, and it all sounded too familiar. A year ago AWS faced a mega-outage that lasted over 3 days, when another data center (in Virginia, no less!) went down and took major sites down with it (Quora, Foursquare… ring a bell?).
Back during last year’s outage I analyzed the reports of the sites that managed to survive it, and compiled a list of field-proven guidelines and best practices to apply in your architecture to make it resilient when deployed on AWS and other IaaS providers. I find these guidelines and best practices highly useful in my own architectures. In this blog post I’d like to address one specific guideline in greater depth: architecting for disaster recovery, along with its characteristics and challenges.
Read the rest of this article on CloudCow.com