Virtualization Technology News and Information
2022 Predictions: More Outages and Less SREs to Fix Them


Industry executives and experts share their predictions for 2022.  Read them in this 14th annual series exclusive.

More Outages and Less SREs to Fix Them

By Ashley Stirrup, COO at Shoreline, the incident automation company

This past year has taught us many lessons while keeping us on our toes and learning. As we head into 2022, we will carry many of these lessons with us. Looking at the DevOps industry, and Shoreline's area of expertise specifically, here are a few predictions for the year ahead:

Outages will continue to cost companies hundreds of billions of dollars.

Affecting more than 3.5 billion people globally and disrupting what has become one of the world's primary communications and business platforms, the five-hour-plus disappearance of Facebook and its family of apps on Oct. 4, 2021 was a technology outage for the ages.

Outages of varying scope and duration will continue to happen. The right question for every company has always been and remains not whether an outage could occur - of course it could - but what can be done to reduce the risk, duration, and impact.

We watched the episodes - which, on Oct. 4 specifically, cost Facebook between $60 and $100 million in advertising, according to various estimates - unfold from the unique perspective of industry insiders who manage outages. In assorted ways, this and other outages will serve as a wake-up call for organizations to look within and make sure they have created the right technical and cultural atmosphere to prevent or mitigate a Facebook-like disaster.

According to LinkedIn, there are 1.1M SREs, and according to IDC, each SRE manages almost 400 incidents per year, spending on average 2.25 hours to resolve each issue. That means the industry at large will spend roughly one billion hours on incidents in 2022 alone (1.1M x 400 x 2.25 is about 990 million). Even with a conservative price tag of $100 per hour, that is a cost of about $100B for 2022. As a side note, an average SRE costs $53.19 per hour according to IDC, which leaves only about $47 of that $100 per hour for customer impact; the real impact of downtime is likely to be much more than that.
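The estimate above can be reproduced with a short back-of-the-envelope sketch. It uses only the figures quoted in the text and treats the $100 price tag as an hourly rate, which is the reading that yields a total near $100B. The variable names are illustrative, and "almost 400" is rounded up to 400, so the results are upper bounds.

```python
# Back-of-the-envelope incident cost, using the figures quoted in the text.
SRE_COUNT = 1_100_000          # SREs on LinkedIn, per the text
INCIDENTS_PER_SRE = 400        # "almost 400" per year (IDC), rounded up
HOURS_PER_INCIDENT = 2.25      # average hours to resolve one incident (IDC)
ASSUMED_COST_PER_HOUR = 100.0  # assumption: the $100 price tag read as an hourly rate
SRE_COST_PER_HOUR = 53.19      # average SRE hourly cost (IDC)

total_incidents = SRE_COUNT * INCIDENTS_PER_SRE
total_hours = total_incidents * HOURS_PER_INCIDENT
total_cost = total_hours * ASSUMED_COST_PER_HOUR

# Whatever is left of the hourly price tag after paying for SRE time.
non_labor_per_hour = ASSUMED_COST_PER_HOUR - SRE_COST_PER_HOUR

print(f"Incidents per year: {total_incidents:,}")
print(f"Hours on incidents: {total_hours:,.0f}")
print(f"Total cost:         ${total_cost / 1e9:,.0f}B")
print(f"Non-labor share:    ${non_labor_per_hour:.2f} per hour")
```

With these inputs the totals come to 440 million incidents, 990 million hours, and $99 billion, with $46.81 per hour left over after SRE labor.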

Thousands of SREs will quit their jobs

We are continuing to see massive turnover, and it is hard to find and retain folks for a variety of reasons. On any given day there are roughly 400K software programming jobs open, and almost 300K SRE jobs open. A theory I have is that this ratio has gone from 10 to 1 down to 2 to 1 over the last ten years. SREs are becoming a critical bottleneck in the software development life cycle.

According to LinkedIn, there are over 1.1M people with SRE, Site Reliability, DevOps Engineer or Cloud Operations Engineering in their title. The average tenure of an SRE is slightly less than two years. That implies that over 500,000 SREs will quit their jobs in 2022. There are also almost 300K of these jobs open on LinkedIn right now. If you assume the average job opening is available for 3 months, that implies we will hire 1.2M SREs in 2022, more than there are people with this job in the market. If you assume half of the jobs are filled with people who already have the title, that is a lot of shuffling of deck chairs on the Titanic.
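The turnover and hiring figures follow from simple steady-state arithmetic, sketched below. The inputs are the numbers quoted in the text, rounded (tenure to exactly two years, open jobs to exactly 300K). The variable names and the assumption that openings are refilled continuously throughout the year are illustrative, not part of the original estimate.

```python
# Steady-state turnover and hiring estimate from the figures in the text.
SRE_COUNT = 1_100_000           # people with SRE-type titles on LinkedIn
AVG_TENURE_YEARS = 2.0          # "slightly less than two years", so departures are a floor
OPEN_JOBS = 300_000             # "almost 300K" open roles, rounded up
MONTHS_TO_FILL = 3              # assumed lifetime of an average opening
SHARE_FILLED_BY_EXISTING = 0.5  # assumed share of hires who already hold the title

# If everyone stays about two years, about half the population leaves each year.
annual_departures = SRE_COUNT / AVG_TENURE_YEARS

# Each opening is filled and replaced by a new one four times a year.
fills_per_year = 12 / MONTHS_TO_FILL
annual_hires = OPEN_JOBS * fills_per_year

# Hires that only move an existing SRE from one employer to another.
reshuffled_hires = annual_hires * SHARE_FILLED_BY_EXISTING

print(f"Departures per year: {annual_departures:,.0f}")
print(f"Hires per year:      {annual_hires:,.0f}")
print(f"Reshuffled hires:    {reshuffled_hires:,.0f}")
```

Under these assumptions the sketch gives 550,000 departures and 1,200,000 hires per year, of which 600,000 are existing SREs changing employers.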

There is no doubt the SRE job is getting harder and harder. Production environments get more and more complex with more cloud services, more features, more managed services, containers, and microservices. On top of that, agile software development and CI/CD tools lead to more and more frequent software updates.

The same trends that are making development faster (cloud, managed services, containers, microservices) are making operations harder (higher change rate, more resources to manage, higher expectations).

The death of the runbook

In the coming year, we will see runbooks phase out. Our team, specifically, has been surprised in the last few months by how infrequently companies have had true runbooks.

Runbooks are a great place for centralizing operational information, but they are a terrible way to empower on-call teams. When a customer is down, the last thing anyone wants to do is read the manual. A major reason why runbooks don't work is that they are both too big and too general. No one cares to read "The 15 things to check if CPU is high." They instead want a checklist of "The three things to do if CPU is high and the JVM heap is maxed out."

This is the exact reason we built Shoreline Notebooks, which can be tied to very specific alarms. This allows notebooks to be very simple, concise, and targeted. It also pays off on the maintenance side. To put it in perspective, when you have a giant runbook, it's much harder to determine if a certain section is out of date because it could have been written to handle multiple issues.

Furthermore, most companies will tell you they have a bunch of runbooks. When you push them, though, you'll find there are really only one or two. And when you really start to dig in, much of what is actually done is not even in the runbook. This is because it's too hard to build a runbook and things change too quickly, so runbooks quickly become out of date.

The New Year will shine a light on many issues within this space, though we can hope through continuous innovation and purposeful acknowledgement, teams will embrace all that's to come.


Ashley Stirrup is Chief Operating Officer at Shoreline, the incident automation company, where he is responsible for driving growth initiatives and scaling business operations to support Shoreline’s rapid customer adoption.
Stirrup is an industry veteran with 28 years of experience at technology companies, leading them through product-market fit to explosive growth to IPO and beyond. With a strong background in sales, marketing and channel programs, Stirrup excels at bringing developer technologies to market. Prior to Shoreline, Stirrup was CMO of Algolia, where he helped the business double in two years through both product-led, self-service channels and high-touch sales programs. Before Algolia, he was CMO at Talend, where, in 5 years, the company grew from $60M to $200M, went public in 2016 on the Nasdaq (TLND), tripled its awareness scores, and established itself as an industry leader in multiple Gartner and Forrester analyst reports.

Published Monday, January 17, 2022 7:33 AM by David Marshall