Industry executives and experts share their predictions for 2022. Read them in this 14th annual VMblog.com series exclusive.
More Outages and Fewer SREs to Fix Them
By Ashley Stirrup, COO at Shoreline, the incident automation company
This past year taught us many lessons and, at the same time, kept us on our
toes and learning. As we head into 2022, many of these lessons will be carried
with us. Looking at the DevOps industry, and Shoreline.io's area of expertise
specifically, here are a few predictions for the year ahead:
Outages will continue to cost
companies hundreds of billions of dollars.
Affecting
more than 3.5 billion people globally and disrupting what has become one of the
world's primary communications and business platforms, the five-hour-plus
disappearance of Facebook and its family of apps on Oct. 4, 2021 was a
technology outage for the ages.
Outages
of varying scope and duration will continue to happen. The right question for
every company has always been and remains not whether an outage could occur -
of course it could - but what can be done to reduce the risk, duration, and
impact.
We watched the episode, which on Oct. 4 alone cost Facebook between $60 million
and $100 million in advertising according to various estimates, unfold from
the unique perspective of industry insiders when it comes to managing outages.
In assorted ways, this and other outages will serve as a wake-up call for
organizations to look within and make sure they have created the right
technical and cultural atmosphere to prevent or mitigate a Facebook-like
disaster.
According to LinkedIn there are 1.1M SREs, and according to IDC each SRE
manages almost 400 incidents per year, spending on average 2.25 hours to
resolve each issue. That means the industry at large will spend roughly one
billion hours on incidents in 2022 alone. Even with a conservative price tag of
$100 per hour, that is a cost of about $100B for 2022. As a side note, an
average SRE costs $53.19 per hour according to IDC, which leaves only about $47
per hour for customer impact, and the real cost of downtime to customers is
likely to be much more than that.
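The arithmetic behind this estimate can be checked with a short sketch. The headcount, incident rate, resolution time, and hourly SRE cost are the figures quoted above; reading the $100 price tag as a per-hour rate is an assumption, chosen because it is the reading that lands near $100B and leaves about $47 per hour after SRE labor. The function and variable names are illustrative only.

```python
# Figures quoted in the text (LinkedIn and IDC, as cited by the author).
SRE_COUNT = 1_100_000
INCIDENTS_PER_SRE = 400        # the text says "almost 400", so results are upper bounds
HOURS_PER_INCIDENT = 2.25
SRE_COST_PER_HOUR = 53.19

# Assumption: the $100 price tag is applied per hour of incident work.
ASSUMED_COST_PER_HOUR = 100.0


def incident_estimate(sre_count, incidents_per_sre, hours_per_incident, cost_per_hour):
    """Return total incidents, total hours, and total cost for one year."""
    incidents = sre_count * incidents_per_sre
    hours = incidents * hours_per_incident
    return {
        "incidents": incidents,
        "hours": hours,
        "total_cost": hours * cost_per_hour,
    }


estimate = incident_estimate(
    SRE_COUNT, INCIDENTS_PER_SRE, HOURS_PER_INCIDENT, ASSUMED_COST_PER_HOUR
)

# What is left of the assumed hourly rate once SRE labor is paid for.
remainder_per_hour = ASSUMED_COST_PER_HOUR - SRE_COST_PER_HOUR

print(f"Incidents per year: {estimate['incidents']:,}")
print(f"Hours on incidents: {estimate['hours']:,.0f}")
print(f"Cost at assumed rate: ${estimate['total_cost'] / 1e9:,.1f}B")
print(f"Left for customer impact: ${remainder_per_hour:,.2f} per hour")
```

With these inputs the sketch gives 440 million incidents, 990 million hours, and $99B, which is where the roughly-one-billion-hours and roughly-$100B figures come from. Applying $100 per incident instead would give $44B.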
Thousands of SREs will quit their
jobs
We are continuing to see massive turnover, and it is hard to find and retain
people for a variety of reasons. On any given day there are roughly 400K
software programming jobs open, and almost 300K SRE jobs open. My theory is
that this ratio has gone from 10 to 1 to 2 to 1 over the last ten years. SREs
are becoming a critical bottleneck in the software development life cycle.
According
to LinkedIn there are over 1.1M people with SRE, Site Reliability, DevOps
Engineer or Cloud Operations Engineering in their title. The average tenure of
an SRE is slightly less than two years. That implies that over 500,000 SREs will
quit their jobs in 2022. There are also almost 300K of these jobs open on
LinkedIn right now. If you assume the
average job opening is available for 3 months, that implies we will hire 1.2M
SREs in 2022, more than there are people with this job in the market. If you
assume half of the jobs are filled with people who already have the title, that
is a lot of shuffling of deck chairs on the Titanic.
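The turnover figures follow from two simple rates, sketched below. The inputs are the numbers quoted above; the tenure is rounded to exactly two years, so the departure count is a lower bound, and the steady-state model (a constant headcount and a constant pool of openings that each take three months to fill) is an assumption made for illustration.

```python
# Figures quoted in the text (LinkedIn, as cited by the author).
HEADCOUNT = 1_100_000          # people with an SRE-type title
AVG_TENURE_YEARS = 2.0         # text says "slightly less than two years"
OPEN_JOBS = 300_000            # text says "almost 300K"
MONTHS_TO_FILL = 3             # the author's stated assumption
SHARE_ALREADY_SRE = 0.5        # the author's stated assumption


def annual_departures(headcount, avg_tenure_years):
    """If people stay avg_tenure_years on average, 1/tenure of them leave each year."""
    return headcount / avg_tenure_years


def annual_hires(open_jobs, months_to_fill):
    """A steady pool of openings that turns over every months_to_fill months."""
    return open_jobs * 12 / months_to_fill


departures = annual_departures(HEADCOUNT, AVG_TENURE_YEARS)
hires = annual_hires(OPEN_JOBS, MONTHS_TO_FILL)
# Hires that go to people who already hold the title: pure reshuffling.
reshuffled = hires * SHARE_ALREADY_SRE

print(f"Departures per year: {departures:,.0f}")
print(f"Hires per year: {hires:,.0f}")
print(f"Hires filled by existing SREs: {reshuffled:,.0f}")
```

With these inputs the sketch gives 550,000 departures and 1.2M hires per year, of which 600,000 would be existing SREs moving between employers.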
There is no doubt the SRE job is getting harder and harder. Production
environments get more and more complex with more cloud services, more features,
more managed services, containers, and microservices. On top of that, agile
software development and CI/CD tools lead to more and more frequent software
updates.
The same trends that are making development faster (cloud, managed services,
containers, microservices) are making operations harder (higher change rate,
more resources to manage, higher expectations).
The death of the runbook
In the coming year, we will see runbooks fade out. Our team, specifically, has
been surprised in the last few months by how infrequently companies have had
true runbooks.
Runbooks
are a great place for centralizing operational information, but they are a
terrible way to empower on-call teams. When a customer is down, the last thing
anyone wants to do is read the manual. A major reason why runbooks don't work
is that they are both too big and too general. No one cares to read "The 15
things to check if CPU is high." They instead want a checklist of "The three
things to do if CPU is high, and the JVM heap dump is maxed out."
This is the exact reason we built Shoreline Notebooks, which can be tied to
very specific alarms. This allows notebooks to be very simple, concise, and
targeted. It also pays off on the maintenance side. To put it in perspective,
when you have a giant runbook, it's much harder to determine whether a certain
section is out of date, because it could have been written to handle multiple
issues.
Furthermore, most companies will tell you they do in fact have a bunch of
runbooks. When you push them, though, you'll find there are really only one or
two. And when you really start to dig in, much of what is actually done is not
even in the runbook. This is because runbooks are too hard to build and things
change too quickly, so runbooks quickly become out of date.
The New Year will shine a light on many issues within
this space, though we can hope through continuous innovation and purposeful
acknowledgement, teams will embrace all that's to come.
##
ABOUT THE AUTHOR
Ashley Stirrup is Chief Operating Officer at Shoreline, the incident automation company, where he is responsible for driving growth initiatives and scaling business operations to support Shoreline’s rapid customer adoption.
Stirrup is an industry veteran with 28 years of experience at technology companies, leading them through product-market fit to explosive growth to IPO and beyond. With a strong background in sales, marketing and channel programs, Stirrup excels at bringing developer technologies to market. Prior to Shoreline, Stirrup was CMO of Algolia, where he helped the business double in two years through both product-led, self-service channels and high-touch sales programs. Before Algolia, he was CMO at Talend, where over 5 years he helped grow the company from $60M to $200M, taking Talend public in 2016 on the Nasdaq (TLND), tripling its awareness scores, and establishing Talend as an industry leader in multiple Gartner and Forrester analyst reports.