Catchpoint released Preventing Outages in
2023, a new white paper comprising six critical lessons and ten in-depth
analyses of major and hidden outages from the last 18
months. IT Operations teams,
Network Engineers, and SREs, along with VPs of Infrastructure, CIOs, and CTOs,
will be able to draw on the Catchpoint team's expertise to learn from the
failures of the past and inform future approaches to incident management. The
full white paper, Preventing Outages in 2023: What We Can Learn From Recent
Failures, is available for download here (no
registration required).
"What the recent
failures from Internet giants demonstrate is that the question of the next
outage is not if, but when," says Dritan Suljoti, Chief Product and
Technology Officer of Catchpoint. "Moreover, the downstream effect of major
outages to essential Internet infrastructure, such as cloud platforms, CDNs or
DNS providers, means that no company is immune, no matter how well prepared
they think they are. The white paper demonstrates why it's so important for all
of us to be proactive to reduce Mean Time to Repair (MTTR) when the next outage
occurs."
Key lessons from the past include:
- Develop an Internet Performance Monitoring strategy
that allows you to monitor precisely what customers, workforce, and other
users expect and build an Experience Score.
- Monitor more than what is under your direct control: map
your Internet stack to ensure you are monitoring every component
relied on to deliver your content (including DNS, CDN, ISP,
BGP, TCP configuration, SSL, and other cloud services).
- Automate intelligently - design and test automation to
ensure there are no bugs hiding in the code.
- Be prepared to take fast action to remediate outages as
they occur, for example, by switching to a backup solution or dropping the
third party causing the issue. Develop runbooks and practice recovery.
- Whenever change is scheduled, ensure your team is ready
for any outages that may occur (intentionally or not) with a crisis call
plan that includes a communication plan and templates, a plan to mitigate
failures from third parties, and a best-practice monitoring and
observability plan.
"Given
the impact of serious outages to the bottom line, not to mention the long-tail
impact to brand and reputation, amidst a landscape of increased Internet
reliance alongside ever-growing Internet fragility and ever-greater
complexity, the need for community learnings from past failures to be shared
and practical advice disseminated around stemming future major incidents and
ensuring Internet Resilience is imperative," says Gerardo Dada, CMO at
Catchpoint. "We believe this white paper offers an invaluable deep dive into
recent outages and key lessons that all of us can learn from to
prevent (or mitigate the consequences of) the next major outage."
The in-depth
outage analysis of ten major recent incidents includes:
- Amazon's
Search issue from December 5-7, 2022, that impacted at least 20% of all
global users for 22 hours - Catchpoint's IPM platform traced the root
cause to an HTTP 503 being returned by Amazon CloudFront.
- A
$B eCommerce company suffering issues around the DNS authoritative name
servers they were using to resolve a critical page on their website in
August 2022 - by monitoring the entire DNS resolution chain, Catchpoint
was able to identify precisely where the DNS resolution failure was
occurring. Learn three best practices for monitoring DNS.
- The
AWS December 2021 trifecta of outages - Catchpoint observed all three
outages well before they hit the AWS status page and, unlike many of its
competitors, was unaffected by them. Learn four key lessons for working
with a hosting provider.
- The
downstream effect of the Google Cloud outage in November 2021 - "a latent
bug in a network configuration service" led to outages across multiple
Google Cloud products and failures across many other non-Google companies,
from Home Depot to Spotify, whose websites were knocked offline for a
prolonged period.
- The
BGP misconfiguration at the heart of the Telia outage in October 2021,
which affected many other companies, including Cloudflare, Equinix Metal
and Fastly.
- The
now notorious mega outage in October 2021 that took down Facebook,
WhatsApp, Messenger, Instagram, and Oculus VR for five hours, and what
Catchpoint's deep dive into the BGP data revealed.