Catchpoint conducted a study with VMware
Tanzu and DevOps Institute of nearly 300 site reliability
engineers (SREs). The SRE
Report is one of the most data-backed studies of its kind and has played a
critical role in defining what it means to be an SRE since it
launched four years ago. This year's report underscores the challenges of
multi-cloud, calls out the underutilization of AIOps, and
shows
a systemic shift in core baselining data. The report concludes by
offering an actionable path for SREs to consistently deliver customer value.
Download the report here.
"SREs deal with a very broad set of challenges that span across
transformational and operational activities," says Mehdi Daoudi, CEO of
Catchpoint. "This report arms them with the insights they need to help address
these challenges - to balance the need for agility against the need for
stability when
building and operating massive, distributed, and reliable systems."
Levels of Toil Fall Around the World
Toil is the work tied to a
production service that tends to be manual, repetitive, automatable, and devoid
of enduring value. Google
suggests that SREs should spend no more than 50% of their time on ops work (including toil)
and at least 50% on dev work. This year, the SRE Report notes an average year-over-year drop in toil
of 15%.
"The reason this is such an impactful
insight is that the drop in toil was across all geographies," says Tony
Ferrelli, Vice President of Technical Operations at Catchpoint. "If this drop
in toil was because work felt more meaningful since COVID-19 led to SREs
working-from-home, then will reported toil levels rise next year as people
return to the office or a hybrid work environment?"
The Accelerating Use of Multiple Providers Warns of a Looming Scalability Ceiling
If the cloud is your new datacenter,
then third-party services like DNS and CDN are your new racks
and cabinets. Given the rising use of multiple same-service
platforms (e.g., multi-cloud) and the increase in the volume, velocity,
and variety of monitoring data, it is little wonder that lack of
visibility across the stack (53%) was the most cited cloud-app monitoring
challenge, or that SREs continually refine service level objectives (50%).
The survey responses raise a critical question: how can companies most
effectively scale SRE implementations?
"Spanning
the gaps between the interfaces and the data that each provider offers
increases the difficulty for SRE teams to automate across those multiple
providers. These integrations are rarely simple except for the most superficial
aspects. Effectively mapping disparate data models together may be the next
frontier for SRE in a multi-vendor environment," says Kurt Andersen, SRE
Architect at Blameless.
The Shift Toward AIOps Is Slow
AIOps has been widely touted
as a way to reduce laborious ops work and to intelligently sift through the
ever-increasing volumes of data that organizations are continually presented
with. However,
the report shows that many SREs have never used AIOps and that ratings of
its received value were spread evenly across the 1-9 scale.
According to J. Bobby Dorlus, Staff SRE at
Twitter, "Most SREs working at this scale are already
leveraging machine learning, especially when it comes to efficiencies around
data centers (locations, cooling, and all the things that happen inside it) for
networks and building out infrastructure ... Evolving that into AIOps is the next
logical step."
Observability Should Include Digital Experience Metrics and Business KPIs
SREs who fail to deliver customer
value run the risk of being stuck in an operational toil rut. Conversely,
businesses that fail to recognize the importance of SRE activities risk losing
talented employees and their competitive edge.
The highest-ranked driver for
successful SRE implementations was incident resolution (60%), while expanding
the business was fifth lowest (33%). These findings show that SREs are still
focusing inwardly on IT operations rather than outwardly on the business
results that deliver customer value. To close this IT-to-business gap, SRE
teams must expand observability boundaries to include digital experience
metrics and business KPIs.
"The
balancing work of innovation while providing operational excellence has forced
many IT teams to put heavy emphasis on improving reliability and stability of
services and applications," says Eveline Oehrlich, Chief Research and Content
Officer at the DevOps Institute. "What SREs now need to do is make sure the
value of these reliable services and applications are understood by the
customer."
RECOMMENDATIONS
- Businesses and SREs need to establish a baselining program around core SRE tenets and
business-level metrics to know whether things are getting better or worse.
- Platform Operations teams should be implemented to achieve higher levels of scale
and efficiency. Platform Ops should develop normalized capabilities for
SREs across the organization to draw on (even though underlying platforms
will have different interfaces) and treat those capabilities as a product
to sell and market to other teams within the business.
- To achieve the promise of AIOps, SREs and managers must break down AIOps into
smaller components and incrementally develop from there, in addition to
investing in training in AI and ML for SRE teams.
- It is crucial to find ways to bridge the gap between SRE and business goals.
Start conversations around capabilities, for instance, rather than focusing
only on low-level monitoring metrics or high-level business outcomes.