By Ashley Stirrup, COO, Shoreline.io
Production environments are becoming larger
and significantly more complex, leading to an explosion in on-call incidents. In our 2022 Benchmarking Production Operations Report,
conducted by Dimensional Research, we polled over 300 on-call
practitioners, managers, and executives responsible for incident response in
cloud production environments.
Our data revealed that while reliability is a
high priority among production operations executives, incidents are
happening more frequently, and taking far more time to resolve, than many
realize.
Download the full 2022 Benchmarking Production Operations
Report here.
Reliability is a high priority (and a $2.5M investment) for production operations
Based on our survey data, we estimate that
companies spend $2.5M per year on on-call operations. And it's no surprise that
production operations is a huge area of investment - our survey found that 97%
of companies report their leadership teams have reliability priorities for their
cloud infrastructure.
The top priority (48%) is to reduce the number
of incidents overall. Other high-priority areas include:
- Shorten time to recover from
incidents (41%)
- Increase user or customer
satisfaction (39%)
Although incident reduction is
the top priority for almost half of our survey respondents,
frequent incidents still consistently plague IT teams.
Almost 300 cloud infrastructure incidents occur monthly
Our survey found that, on average, organizations
deal with a total of 278 incidents per month. These issues could be as small as
a slow-loading page within an app, or as large as widespread downtime that
lasts hours. As a result, companies spend a whopping 2,084 hours tending to
incidents every month - it's an incredibly time-consuming and expensive issue.
In addition to the hundreds of incidents that
occur each month, our survey respondents reported an average of 8.7 major incidents per year. A
major incident is defined as any incident that directly impacts business
outcomes such as customer experience or employee productivity.
Now that we understand the prevalence of
incidents, it's time to focus on solutions that help prevent incidents and manage them
when they do occur. This represents a huge area of opportunity for production
operations teams.
How can we fix production operations issues?
While many organizations and IT leaders realize
that incidents occur regularly, it's often tough to quantify their impact and
proactively implement solutions for the future.
One key is to shift our thinking around
ticketing data. Ticketing data, if used correctly, can identify which issues
are hurting the customer experience most and where your teams are spending the
most time. Too often, operations teams
use their incident management tool only for assigning and routing work when
issues occur. These teams do not
require that all incidents be tracked through a single source of truth, and
they do not ensure that incidents are tagged correctly so they can measure:
- Frequency of similar incidents
- Number of customers impacted per
incident
- Number of engineers assigned to an
incident
Without this data, it's unlikely that
operations and engineering teams are spending their time where it can reduce
the most risk, increase customer satisfaction the most, or reduce the most
engineering toil.
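To make the idea concrete, here is a minimal sketch of how consistently tagged ticket data could be rolled up into the three measures listed above. It is illustrative only: the `Ticket` fields, the tag names, and the sample numbers are all hypothetical and do not come from our survey or from any particular incident management tool. The `hours_to_resolve` field is an extra assumption, added so the tags can be ranked by where time is going.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Ticket:
    """One incident record. Every field name here is hypothetical."""
    tag: str                  # category assigned when the incident is filed
    customers_impacted: int
    engineers_assigned: int
    hours_to_resolve: float   # assumed field, not one of the three measures


def summarize_by_tag(tickets):
    """Group tickets by tag and total the measures for each group."""
    summary = defaultdict(
        lambda: {"count": 0, "customers": 0, "engineers": 0, "hours": 0.0}
    )
    for t in tickets:
        row = summary[t.tag]
        row["count"] += 1                       # frequency of similar incidents
        row["customers"] += t.customers_impacted
        row["engineers"] += t.engineers_assigned
        row["hours"] += t.hours_to_resolve
    return dict(summary)


def rank_by_hours(summary):
    """Order tags by total hours spent, highest first."""
    return sorted(summary, key=lambda tag: summary[tag]["hours"], reverse=True)


# Made-up sample data, for illustration only.
tickets = [
    Ticket("disk-full", 120, 2, 1.5),
    Ticket("disk-full", 80, 1, 2.0),
    Ticket("cert-expired", 900, 3, 4.0),
    Ticket("slow-page", 40, 1, 0.5),
]

summary = summarize_by_tag(tickets)
ranking = rank_by_hours(summary)

for tag in ranking:
    print(tag, summary[tag])
```

A roll-up like this only works if every incident goes through one system and carries a consistent tag, which is why the single source of truth matters.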
In a world where it's simply not sustainable
to scale your on-call and DevOps team to match the complexity of your
environment, productivity improvements are the only logical solution. With a
more productive and future-focused team, organizations can then look to proactively
eliminate the root causes that lead to major incidents and create tools that
shorten time to resolve both major and minor incidents.
At Shoreline, our goal is to help production
operations teams improve productivity and tackle these very tasks. We've
created a Cloud
Reliability Platform that makes it easy to search across your entire
infrastructure to find, diagnose, and automate the repair of issues to reduce
the risk of major outages and improve team productivity.
For more tips to improve on-call reliability, download the full 2022 Benchmarking Production Operations
Report here.
To hear more about cloud native topics,
join the Cloud Native Computing
Foundation and the cloud native community at KubeCon + CloudNativeCon North America 2022 in Detroit
(and virtual) from October 24-28.
ABOUT THE AUTHOR
Ashley Stirrup, COO, Shoreline.io
Ashley is the chief operating officer for
Shoreline.io. Before Shoreline, Ashley was the chief marketing officer for
Algolia, a search engine used by e-commerce, media and technology companies to
power more than 1 trillion search requests per year. Prior to Algolia, Ashley
was the chief marketing officer for Talend, a leading data integration company
helping some of the world's largest companies turn data into insight. Ashley
has held a number of senior leadership positions in marketing and products at
leading cloud and software companies, including ServiceSource, Taleo, Citrix
and Siebel Systems.