Virtualization Technology News and Information
Amazon's Cloud Storage Hiccups
Several companies lost access to their own files when Amazon.com Inc.'s pay-as-you-go data storage system went down Friday morning.

Amazon said computers that power its Simple Storage Service were unreachable at one of three data centers for about two hours. By 7 a.m. Pacific Time, most users' problems were resolved.

The two-year-old storage service is one of several "cloud computing" offerings from Amazon.

Web startups and others pay to store and crunch data on Amazon's servers rather than running their own. By the end of 2007, about 330,000 people had registered to use the services.

Simple Storage Service customers flocked to Amazon's support discussion board Friday to report problems, seek updates and vent frustrations.

"S3 service has stopped working about 2 hours ago. This is really a severe blow to confidence in trusting AWS services," wrote one, under the name Andrea Barbieri.

Several startups that use Amazon Web Services, including digital photo sharing site SmugMug Inc. and Web e-mail provider Mailtrust, said Friday they were not affected.

Asheville, N.C.-based DigitalChalk Inc., which delivers multimedia training over the Web, said some of its content was inaccessible as a result of the outage.

"While we are very concerned about the potential impact this had on, we were glad to see that the recovery was fairly rapid and we had no loss of data or files," Tony McCune, DigitalChalk's vice president of sales and marketing, wrote in an e-mail to The Associated Press.

"Our biggest concern going forward will be how well Amazon communicates with their customers about the incident," he said, echoing the online comments of several people affected by the outage.

In an e-mail, Amazon spokesman Drew Herdener wrote, "Any amount of downtime is unacceptable and we won't be satisfied until it's perfect."


Here’s some additional detail about the problem we experienced earlier today.

Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations.  While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests.  Importantly, these cryptographic requests consume more resources per call than other request types.

Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls.  The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place.  In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles.  This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST.  By 6:48am PST, we had moved enough capacity online to resolve the issue.
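To illustrate why overall request volume can stay within normal ranges while the authentication service is still pushed past capacity, here is a minimal sketch of the reasoning in the statement above. It is not Amazon's model: the cost units, the capacity figure, and the names `Traffic` and `auth_service_load` are hypothetical, chosen only to show the arithmetic. The one grounded assumption is that every request pays for account validation and authenticated requests pay an additional cryptographic cost.

```python
from dataclasses import dataclass

# Hypothetical cost units; the statement gives no real figures.
VALIDATION_COST = 1.0  # account validation, paid by every request
SIGNATURE_COST = 4.0   # extra cryptographic work for authenticated requests
CAPACITY = 3000.0      # hypothetical maximum load per interval


@dataclass
class Traffic:
    total_requests: int         # all requests seen in the interval
    authenticated_share: float  # fraction of requests that are authenticated


def auth_service_load(traffic: Traffic) -> float:
    """Estimate authentication-service load in arbitrary cost units."""
    authenticated = traffic.total_requests * traffic.authenticated_share
    return traffic.total_requests * VALIDATION_COST + authenticated * SIGNATURE_COST


# Same total volume in both cases; only the mix of request types changes.
normal = Traffic(total_requests=1000, authenticated_share=0.25)
surge = Traffic(total_requests=1000, authenticated_share=0.75)

for name, traffic in (("normal", normal), ("surge", surge)):
    load = auth_service_load(traffic)
    print(f"{name}: load={load:.0f} over_capacity={load > CAPACITY}")
```

With these made-up numbers the total request count is identical in both cases, yet only the second mix exceeds the assumed capacity, which is the gap that volume-only monitoring would miss.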

As we said earlier today, though we're proud of our uptime track record over the past two years with this service, any amount of downtime is unacceptable.  As part of the post mortem for this event, we have identified a set of short-term actions as well as longer term improvements.  We are taking immediate action on the following:  (a) improving our monitoring of the proportion of authenticated requests; (b) further increasing our authentication service capacity; and (c) adding additional defensive measures around the authenticated calls.  Additionally, we’ve begun work on a service health dashboard, and expect to release that shortly.
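Action (a) above, monitoring the proportion of authenticated requests, could be sketched roughly as follows. This is an illustration only, not a description of what Amazon built: the class name `AuthenticatedShareMonitor`, the window size, and the alert threshold are all assumptions made for the example.

```python
from collections import deque


class AuthenticatedShareMonitor:
    """Track the share of authenticated requests over the last N requests."""

    def __init__(self, window: int, alert_share: float) -> None:
        # Each entry is True for an authenticated request, False otherwise.
        self._recent = deque(maxlen=window)
        self._alert_share = alert_share

    def record(self, authenticated: bool) -> None:
        self._recent.append(authenticated)

    def share(self) -> float:
        if not self._recent:
            return 0.0
        return sum(self._recent) / len(self._recent)

    def should_alert(self) -> bool:
        # Wait until the window is full so a handful of early requests
        # cannot trigger an alert on their own.
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and self.share() > self._alert_share


# Hypothetical threshold: alert when more than half of recent requests
# are authenticated.
monitor = AuthenticatedShareMonitor(window=100, alert_share=0.5)

for i in range(100):
    monitor.record(i % 4 == 0)  # steady state: one request in four is authenticated
print(monitor.share(), monitor.should_alert())

for _ in range(60):
    monitor.record(True)  # a burst of authenticated calls
print(monitor.share(), monitor.should_alert())
```

The monitor looks at the mix of request types rather than the raw count, so it can flag a shift like the one described in the statement even when total volume is unchanged.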

The Amazon Web Services Team

Published Saturday, February 16, 2008 4:35 PM by David Marshall