parts of the Internet came to a grinding halt when the servers that powered them
suddenly vanished. The servers that disappeared
were part of Amazon S3, Amazon's
popular cloud storage service.
The incident disrupted several large, popular services and Web sites,
including DraftKings, Gizmodo, IFTTT, Quora, Slack and Trello.
According to the Web site monitoring firm Apica, 54 of the largest online retailers
experienced performance impairments on their Web sites, with some slowing down
by more than 20 percent; three sites went down completely (Express, Lululemon, One Kings Lane); and for affected Web sites, the average slowdown in load time was 29.7 to 42.7 seconds.
"At 9:37 a.m. PST, an authorized S3 team member using an
established playbook executed a command which was intended to remove a small
number of servers for one of the S3 subsystems that is used by the S3 billing
process," Amazon said. "Unfortunately,
one of the inputs to the command was entered incorrectly and a larger set of
servers was removed than intended. The
servers that were inadvertently removed supported two other S3 subsystems."
Those subsystems are important. One of them, the index subsystem, "manages the metadata and location
information of all S3 objects in the region," according to Amazon. Without it, services that depend on S3
couldn't perform basic data retrieval and storage tasks. The second subsystem, the placement
subsystem, "manages allocation of new storage and requires the index subsystem
to be functioning properly to correctly operate." The placement subsystem is used to allocate
storage for new objects.
While S3 was down, a variety of other Amazon Web Services offerings stopped
functioning, including the S3 console, Amazon Elastic Compute Cloud (EC2) new
instance launches, Amazon Elastic Block Store (EBS) volumes and AWS Lambda.
To address the problems, Amazon staff had to restart the affected S3 subsystems, and during the restart period those subsystems were
unable to service requests. As part of its official response, the company said that it would immediately begin
implementing changes to its internal systems to prevent similar cascading
problems from happening again.
For organizations that made the move to the public cloud, the move may have been made with a "set
it and forget it" mentality. After all,
migrating to a public cloud is supposed to inherently make things disaster-proof,
right? Not so fast.
If anything, last week's event should shine a light on the need to design for failure,
whether on-premises with a private cloud, all in with a public cloud or using some
type of hybrid or multi-cloud setup.
"Outages like what happened to AWS and Amazon S3 are bound to happen," said Manoj
Chaudhary, CTO and VP of Engineering at Loggly. "But during these outages, logs are more
important than ever for companies and customers, as they can capture data that
would otherwise be lost and pinpoint the root cause of a service interruption."
To help address a problem like last week's outage, Chaudhary
told VMblog that, in the end, you want to make sure you're monitoring your
risk. Adopting a multi-cloud solution, and hosting monitoring applications in
a different environment than the apps they are monitoring, ensures you have the
ability to access and search data when it is needed most, even during an outage.
Keeping Cloud Private
For some organizations, the recent Amazon outage could be a call to return data to on-premises control.
"The turmoil caused by the AWS S3 outage shows just how vital reliable data access
is," said Geoff Barrall, Chief Operating Officer, Nexsan.
He continued, "With
so many businesses utilizing a connected workforce, constant access to data is
necessary to keep operating. Any amount
of downtime costs businesses time and money and can be more easily managed if
data is kept within an organization's own IT infrastructure. With sophisticated file, sync and share
capabilities, private cloud solutions can offer the flexibility that a
connected workforce needs, with the security and control of on premises data
Organizations and users took to social media and the Internet (those that were still online, anyway) to express similar judgments. But the question many are now asking is whether this single event should cause someone to blow up a public cloud migration. In some cases, it will. In other cases, a complete public cloud migration may go back to the drawing board. And yes, the outage will keep some organizations from going to the public cloud; they will instead stay where they believe they can better maintain control of their future, with an on-premises private cloud. The right answer is, and always has been, whatever is best for your organization.
Public Cloud - Manage Data by Region
It's OK to put all your data in one public cloud, according to Don
Foster, Senior Director of Solutions Marketing and Technical Alliances at Commvault, but you need visibility
into where the data lives across regions. If
a region has an outage, your data management platform should give you a clear
view of data across multiple regions.
"If your data lives in the East, ensure you have a complete data
backup in the West or a region on another continent," said Foster. "If an outage happens, you can recover
quickly in the other region and keep your business running during the service
The important part here is backup.
Foster explained, "Critical data and services native to the cloud
should ensure backups are scheduled in/across/from clouds so your data is
available. Automated backups - and the
ability to verify those backups - make your life a lot less stressful."
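Foster's point about verifying backups can be illustrated with a small check. What follows is only a minimal sketch, not Commvault's or Amazon's tooling: it assumes the primary copy and the backup copy in another region can each be read into a dictionary that maps an object key to its bytes. The function names `sha256_hex` and `verify_backup` are hypothetical, chosen for illustration.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_backup(primary: Dict[str, bytes],
                  backup: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Compare a backup against the primary copy, key by key.

    Both arguments are assumed (for illustration) to map an object key
    to that object's contents. Returns the keys that are missing from
    the backup and the keys whose contents differ.
    """
    missing = sorted(k for k in primary if k not in backup)
    mismatched = sorted(
        k for k in primary
        if k in backup and sha256_hex(primary[k]) != sha256_hex(backup[k])
    )
    return {"missing": missing, "mismatched": mismatched}


if __name__ == "__main__":
    east = {"orders.csv": b"id,total\n1,20\n", "users.csv": b"id\n1\n"}
    west = {"orders.csv": b"id,total\n1,20\n"}
    print(verify_backup(east, west))
```

A scheduled job could run a check like this after each backup and raise an alert whenever either list is non-empty.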
"In financial markets, investors protect themselves from volatility by diversifying,"
explained Chuck Dubuque, VP of product and solution marketing, Tintri. "The same might hold true for companies and
organizations that rely on the cloud."
"The S3 outage demonstrates the risks of putting all your eggs into one cart or
cloud. Moreover, it's difficult to
engineer even cloud native applications for public cloud SLAs as seen by these events.
It's even more complex to deploy and
manage enterprise applications that weren't designed for the cloud to begin
with. If nothing else, the S3 outages
will cause some businesses to reconsider a diversified environment - that
includes enterprise cloud - to reduce their risks."
"For the near foreseeable future, we're going to hear commentary and see
various business impact estimates related to the effects of the S3 outage,"
said Paul Zeiter, President, Zerto. "Still,
many IT professionals will be wondering what they should be doing differently
to protect their organizations for when, not if, something like this happens."
Zeiter went on to say, "The
growing frequency of major headline-creating outages across every industry
points to a systemic issue as IT environments become increasingly complex:
Disaster recovery is just as essential as cyber security to protect enterprises
from the mundane erroneous keystroke or power outages to natural catastrophes,
but often under invested in. Business
and IT leaders are getting ahead of the curve by carefully crafting their
hybrid cloud strategies - one that gives them multiple layers of infrastructure
redundancy protection - to achieve IT resilience that keeps critical business
operations seamlessly moving forward. This
is possible using a combination of multiple cloud types for recovery including
public, private, and managed to ensure any disruption is quickly remediated in
a manner that is imperceptible to customers."
As a result of this operational event, Amazon said it is making several
changes to the way its systems are managed.
"While removal of capacity is a key operational practice, in
this instance, the tool used allowed too much capacity to be removed too
quickly," the company said.
Amazon has already modified the tool that was used to pull down
the intended servers. The tool has been
updated to remove servers more slowly, and Amazon has also added
safeguards that prevent servers from being removed when doing so would bring the system
below a minimum level of capacity.
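The two safeguards described above, slower removal and a minimum capacity floor, can be sketched in a few lines. This is an illustration under stated assumptions, not Amazon's actual tool: the `plan_removal` function, the `RemovalPlan` type, and the `min_capacity` and `max_batch_size` parameters are all hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RemovalPlan:
    """A validated removal request, split into small batches (hypothetical type)."""
    batches: List[List[str]]


def plan_removal(active: List[str], to_remove: List[str],
                 min_capacity: int, max_batch_size: int) -> RemovalPlan:
    """Validate a removal request before any server is touched.

    Raises ValueError if the request would leave fewer than
    min_capacity servers active.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")

    active_set = set(active)
    # Drop duplicates (keeping order) and ignore servers that are not active.
    targets = [s for s in dict.fromkeys(to_remove) if s in active_set]

    # Safeguard 1: refuse the whole request if it would breach the floor.
    remaining = len(active_set) - len(targets)
    if remaining < min_capacity:
        raise ValueError(
            f"refusing to remove {len(targets)} servers: {remaining} would "
            f"remain, below the minimum of {min_capacity}"
        )

    # Safeguard 2: remove slowly, only a few servers per batch.
    batches = [targets[i:i + max_batch_size]
               for i in range(0, len(targets), max_batch_size)]
    return RemovalPlan(batches=batches)


if __name__ == "__main__":
    fleet = [f"server-{i}" for i in range(10)]
    plan = plan_removal(fleet, fleet[:3], min_capacity=6, max_batch_size=2)
    print(plan.batches)
```

With a check like this, a mistyped input that selects too many servers is rejected before anything is removed. An operator or script would then work through the batches one at a time, pausing to confirm the system is healthy between them.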
Amazon also promised to make changes to improve the recovery time
of key S3 subsystems and to audit its other operational tools to ensure they
also have similar safety checks.
Finally, Amazon is also changing the AWS Service Health
Dashboard. During the outage, the dashboard showed all services as running with
a "green" status because the dashboard itself was dependent on S3. To keep false status updates from embarrassing
the company in the future, Amazon has made changes so that the next time S3 goes
down, dashboard status updates should function properly, i.e., show affected services as down
or marked as "red."
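One general way to keep a dashboard from showing a stale "green" is to treat a failed probe as "red" and an out-of-date report as "unknown." The sketch below illustrates that idea only; it is not how the AWS Service Health Dashboard works, and the function names and the `max_age_seconds` threshold are assumptions made for the example.

```python
from typing import Callable, Optional

GREEN, RED, UNKNOWN = "green", "red", "unknown"


def check_service(probe: Callable[[], bool]) -> str:
    """Run a health probe; a probe that raises counts as the service being down."""
    try:
        return GREEN if probe() else RED
    except Exception:
        return RED


def displayed_status(last_status: Optional[str], last_updated: float,
                     now: float, max_age_seconds: float = 300) -> str:
    """Return the status to show; a stale or missing report is never shown as green."""
    if last_status is None or now - last_updated > max_age_seconds:
        return UNKNOWN
    return last_status


if __name__ == "__main__":
    def failing_probe() -> bool:
        raise ConnectionError("storage backend unreachable")

    print(check_service(failing_probe))
    print(displayed_status(GREEN, last_updated=0, now=1000))
```

The design choice is that the absence of fresh information is itself reported, so a dashboard that cannot reach its own data store does not keep repeating the last good result.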
Beyond the post-mortem and the system corrections being made, Amazon
offered an apology to those who were affected by the outage, stating:
"We want to apologize for the impact this event caused for our
customers. While we are proud of our long track record of availability with
Amazon S3, we know how critical this service is to our customers, their
applications and end users, and their businesses. We will do everything we can
to learn from this event and use it to improve our availability even further."