Transposit announced results from its third annual State of
DevOps Automation and AI research study about the intricate challenges faced by
organizations in managing incidents effectively.
Findings uncovered an incident
management paradox: although a majority of respondents have a defined incident
management process in place (59.4%) and a level of automation that meets their
needs (71.1%), organizations grapple with a surge in service
incidents and still struggle to quickly resolve them. Nearly two-thirds of
organizations (66.5%) reported an increase in the frequency of service
incidents that have affected their customers over the past 12 months, a 3.6%
increase from the 2022 survey.
These downtime-producing incidents (i.e., application outages,
service degradation) are putting organizations at risk of losing up to $499,999
per hour on average, according to 63% of respondents - a nearly 5% increase
from 2022. Almost half (46.6%) also said downtime can cost anywhere from $100K
to $2M. The research points to generative AI as a means to resolve the incident
management paradox: 84.5% of respondents either believe AI can significantly
streamline their incident management processes and improve overall efficiency
or are excited about the opportunities AI presents for automating certain
aspects of incident management.
Transposit surveyed more than 1,000 U.S.-based
IT operations, DevOps, site reliability engineering (SRE), and platform
engineering professionals in VP, director, manager, and engineer roles.
"The
insights unearthed in our research underscore the pressing need for adaptive,
LLM-based automation that transcends mere task repetition and, instead,
dynamically adapts to evolving circumstances by assimilating cues and context
in real-time," said Divanny Lamas, CEO of Transposit. "Traditional, rule-based
automation tools are no longer sufficient for the demands of modern operations
teams. Despite robust incident management processes within numerous
organizations, the relentless surge in service incidents - with its
consequential impact on customers and financial ramifications - mandates a
transformative approach. The path forward lies in harnessing innovative
solutions like generative AI, augmented by automation and guided by human
judgment, to not only expedite incident resolution but also proactively detect
and preempt potential issues before they escalate."
Time Lags and Knowledge Gaps Lead to Inefficient Incident Management
In
the realm of incident management, reliability engineering teams face
significant hurdles. Nearly three-quarters (73.9%) of those responsible for
reliability engineering experience challenges while trying to resolve incidents,
including brittle automation scripts (59.7%), too many manual processes
(47.8%), and difficulty accessing specialized knowledge (47.2%). Moreover, more
than four in 10 organizations (42.5%) said their current incident management
process is not effective or is only being used by some team members due to
confusing documentation (41.3%), limited access to tools (40.4%), and reliance
on institutional knowledge (39.7%).
In addition, 61.5% of organizations reported an increase in the time it takes to
resolve incidents over the course of the last year, with nearly eight in 10
respondents (79.8%) saying it takes up to six hours on average to resolve
incidents from the first alert to mitigating the issue. Beyond the extended
incident resolution time, there's an added layer of complexity in assembling
the right team members: 71.3% reported that this process can
take up to 30 minutes. Adding to this, a significant portion of team members
find it challenging to grasp and routinely apply the organization's defined
procedures. Over one-third of organizations (37.4%) report that only select
team members have a comprehensive understanding of the defined incident management
process and adhere to it consistently.
Automation Hurdles Add to Service Incident Complexity
Organizations not only grapple with inefficiencies in incident resolution but
also encounter hurdles in implementing automation. One-third of respondents
(33.3%) said that only 11-25% of their incident management tasks or workflows
are automated, showcasing an opportunity for more automation in organizations' incident
management processes. Delving deeper, respondents expressed keen interest in
automating pivotal aspects of the incident lifecycle, such as incident setup
(50.0%), communication protocols (44.2%), investigative processes (30%), and
remediation (29%).
Despite
the interest in implementing automation, respondents cited these top four
barriers to achieving it:
- Not enough buy-in from leadership or management (57.1%)
- Insufficient knowledge sharing (54.3%)
- Inadequate documentation of institutional knowledge and existing processes (54%)
- Lack of clarity about what to automate (52.4%)
When using SaaS tools, organizations are able to create automations more quickly.
Nearly three in four respondents (74.6%) embraced SaaS tools, with 82.0%
confirming their ability to create automations without coding. In addition,
84.3% reported spending just 11 minutes to an hour doing so, underscoring the
efficiency of SaaS solutions in incident management.
Organizations Enhance Tech Stack with AI-Based Applications and Automation Tools, and Strategically Increase SRE and Platform Engineering Initiatives
Over
the next 12 months, 72.1% of teams expect to expand their tech stack. To
strengthen their incident management process and decrease mean time to
resolution/repair (MTTR), organizations plan to implement new tools, including:
- AI- or ML-based tools or applications (60.0%)
- Automation tools or applications (53.1%)
- Communication/collaboration tools or applications (48.1%)
SRE
and platform engineering play a vital role in implementing AI and automation.
Over the past year, 61.5% increased their focus on SRE practices, intending to
hire more site reliability engineers, while 57.5% enhanced platform engineering
efforts, planning to bring in more platform engineers. These strategic moves
highlight organizations' dedication to fortifying their incident management
capabilities.
Operations Teams Embrace SaaS Tools that Harness Generative AI and Human-in-the-Loop Automation for Rapid MTTR Reduction
Findings
illuminate a clear path forward for the incident response lifecycle,
emphasizing the need for a SaaS tool or platform that seamlessly integrates all
of the incident management tools organizations use, leverages human data
insights, and harnesses generative AI to bolster operational efficiency and
decision-making.
An
overwhelming majority (90.4%) of respondents believe that systematically mining
insights from human data (such as archived Slack communications, retrospective
interviews, and group feedback) could improve future incident response and
operational excellence. However, 90.2% agree automation should let
humans use their judgment at critical decision points to be more reliable and
effective, a nearly 10% (9.8%) increase from the 2022 study.
Meanwhile, 89.8% of respondents see integrating generative AI capabilities into
incident management tools or platforms as a way to decrease the time it takes
to create new automations, freeing time for other high-value work. Almost all (96.3%) believe
it would be beneficial if all of the tools their organization used during an
incident were integrated through one tool or platform.
For
the 79.5% of organizations that have embraced AI in their tech stack, the
impact is significant:
- More than half (51%) feel AI is making their job better, a sign of improving work life
- 63.5% use it to improve the accuracy and quality of data
- 50.7% report faster time to incident resolution
- 49.4% use it to more quickly and easily identify the root cause of issues, potential threats, and vulnerabilities
- 48% use it to automate repetitive tasks or processes, streamlining their operations
Lamas
concluded, "In light of the evolving demands placed on modern ops teams, it
becomes evident that what these teams require is an adaptive, LLM-based
automation and incident management solution. This unified, intelligent approach
goes beyond streamlining processes; it empowers teams to leverage automation
and AI to enhance their organization's incident management processes and
develop more efficient automated workflows. By ensuring that humans remain
actively engaged in the process, this approach becomes increasingly vital for
seamless incident resolution and a reduction in MTTR. Ultimately, it enables
teams to concentrate their efforts on what truly matters - delivering efficient
and effective solutions to complex problems."