Virtualization Technology News and Information
Production Operations Survey Finds Companies Suffer 8.7 Major Cloud Infrastructure Incidents Annually

By Ashley Stirrup, COO, Shoreline

Production environments are becoming larger and significantly more complex, leading to an explosion in on-call incidents. In our 2022 Benchmarking Production Operations Report, conducted by Dimensional Research, we polled over 300 on-call practitioners, managers, and executives responsible for incident response in cloud production environments.

Our data revealed that while reliability is a high priority among production operations executives, incidents happen more frequently, and take far more time to resolve, than many realize.

Download the full 2022 Benchmarking Production Operations Report here.

Reliability is a high priority (and a $2.5M investment) for production operations

Based on our survey data, we estimate that companies spend $2.5M per year on on-call operations. And it's no surprise that production operations is a huge area of investment - our survey found that 97% of companies report their leadership teams have reliability priorities for their cloud infrastructure.

The top priority (48%) is to reduce the number of incidents overall. Other high-priority areas include:

  • Decrease costs (42%)
  • Shorten time to recover from incidents (41%)
  • Increase user or customer satisfaction (39%)

Although incident reduction is the top priority for almost half of our survey respondents, frequent incidents still consistently plague IT teams.

Almost 300 cloud infrastructure incidents occur monthly

Our survey found, on average, organizations deal with a total of 278 incidents per month. These issues could be as small as a slow-loading page within an app, or as large as widespread downtime that lasts hours. As a result, companies spend a whopping 2,084 hours tending to incidents every month - it's an incredibly time-consuming and expensive issue.

In addition to the hundreds of incidents that occur each month, our survey respondents reported an average of 8.7 major incidents per year. A major incident is defined as any incident that directly impacts business outcomes such as customer experience or employee productivity.
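
As a rough illustration, the survey averages above can be combined into a few derived figures. The short Python sketch below is not part of the report. It assumes the monthly averages can simply be divided and annualized, which ignores variation between respondents, so the results should be read as ballpark numbers only.

```python
# Averages quoted in the survey summary above.
INCIDENTS_PER_MONTH = 278
HOURS_PER_MONTH = 2_084
MAJOR_INCIDENTS_PER_YEAR = 8.7

# Derived figures (assumption: the averages describe the same "typical" company).
hours_per_incident = HOURS_PER_MONTH / INCIDENTS_PER_MONTH
incidents_per_year = INCIDENTS_PER_MONTH * 12
hours_per_year = HOURS_PER_MONTH * 12
major_share = MAJOR_INCIDENTS_PER_YEAR / incidents_per_year

print(f"Hours per incident: {hours_per_incident:.1f}")
print(f"Incidents per year: {incidents_per_year:,}")
print(f"Hours per year: {hours_per_year:,}")
print(f"Major incidents as a share of all incidents: {major_share:.2%}")
```

Under these assumptions, a typical incident consumes about 7.5 hours of staff time, and major incidents are well under 1% of the yearly total.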

Now that we understand the prevalence of incidents, it's time to focus on solutions to help prevent and manage incidents when they do occur. This represents a huge area of opportunity for production operations teams.

How can we fix production operations issues?

While many organizations and IT leaders realize that incidents occur regularly, it's often tough to quantify their impact and proactively implement solutions for the future.

One key is to shift our thinking around ticketing data. Used correctly, ticketing data can identify which issues hurt the customer experience most and where your teams spend the most time. Too often, operations teams use their incident management tool only for assigning and routing work when issues occur. These teams do not insist that all incidents be tracked through a single source of truth, and they do not ensure that incidents are tagged correctly so they can measure:

  • Frequency of similar incidents
  • Number of customers impacted per incident
  • Severity of the incident
  • Time to resolve
  • Number of engineers assigned to an incident

Without this data, it's unlikely that operations and engineering teams are spending their time where it can reduce the most risk, increase customer satisfaction the most or reduce engineering toil. 
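
To make the measurements listed above concrete, here is a minimal Python sketch of how tagged incident records could be rolled up by category. Everything in it is hypothetical: the `Incident` fields, the `summarize` helper, and the sample records are illustrative and do not reflect any particular ticketing tool or the Shoreline product. It also assumes that lower severity numbers are more serious and that every assigned engineer worked for the full time to resolve.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """One ticket, tagged with the fields the article says to measure."""
    category: str            # groups similar incidents together
    severity: int            # assumption: 1 is the most severe
    customers_impacted: int
    hours_to_resolve: float
    engineers_assigned: int


def summarize(incidents):
    """Roll incidents up by category, costliest category first."""
    by_category = {}
    for inc in incidents:
        by_category.setdefault(inc.category, []).append(inc)

    rows = []
    for category, group in by_category.items():
        rows.append({
            "category": category,
            "count": len(group),  # frequency of similar incidents
            "customers_impacted": sum(i.customers_impacted for i in group),
            "worst_severity": min(i.severity for i in group),
            "mean_hours_to_resolve": mean(i.hours_to_resolve for i in group),
            # Rough toil estimate: engineers x hours, summed per category.
            "engineer_hours": sum(
                i.hours_to_resolve * i.engineers_assigned for i in group
            ),
        })
    return sorted(rows, key=lambda r: r["engineer_hours"], reverse=True)


# Made-up sample data for illustration only.
incidents = [
    Incident("disk-full", 3, 0, 1.0, 1),
    Incident("disk-full", 3, 0, 1.5, 1),
    Incident("disk-full", 2, 40, 2.0, 2),
    Incident("cert-expiry", 1, 1200, 4.0, 3),
]

summary = summarize(incidents)
for row in summary:
    print(row)
```

Sorting by engineer hours is only one possible choice. Ranking by customers impacted or by frequency would surface different priorities, which is why consistent tagging of every field matters.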

In a world where it's simply not sustainable to scale your on-call and DevOps team to match the complexity of your environment, productivity improvements are the only logical solution. With a more productive and future-focused team, organizations can then look to proactively eliminate the root causes that lead to major incidents and create tools that shorten the time to resolve both major and minor incidents.

At Shoreline, our goal is to help production operations teams improve productivity and tackle these very tasks. We've created a Cloud Reliability Platform that makes it easy to search across your entire infrastructure to find, diagnose, and automate the repair of issues to reduce the risk of major outages and improve team productivity. 

For more tips to improve on-call reliability, download the full 2022 Benchmarking Production Operations Report here.


Ashley Stirrup, COO, Shoreline

Ashley is the chief operating officer for Shoreline. Before Shoreline, Ashley was the chief marketing officer for Algolia, a search engine used by e-commerce, media and technology companies to power more than 1 trillion search requests per year. Prior to Algolia, Ashley was the chief marketing officer for Talend, a leading data integration company helping some of the world's largest companies turn data into insight. Ashley has held a number of senior leadership positions in marketing and products at leading cloud and software companies, including ServiceSource, Taleo, Citrix and Siebel Systems.

Published Thursday, October 20, 2022 7:33 AM by David Marshall