Blameless 2022 Predictions: The world of Reliability Engineering in the new year

vmblog predictions 2022 

Industry executives and experts share their predictions for 2022. Read them in this 14th annual series exclusive.

By Kurt Andersen, SRE Architect, Blameless

As the new year approaches, we at Blameless like to ponder the future of Reliability Engineering. For 2021, we predicted that adoption of site reliability engineering (SRE) would continue to grow, that it would grow faster among smaller organizations, and that organizations would focus more on adopting SRE practices than on hiring dedicated SREs. We're sure you'll agree that these trends have indeed strengthened over the last year. Now we're turning our crystal ball to 2022 to see where reliability is headed!

A growing sense of urgency

News of outages has been unavoidable over the last year. The world is learning that failure is inevitable, whether you're a startup or an enterprise giant. At the same time, the pandemic has made the online experience all the more essential. User expectations have shifted so that features that once marked an online service as best-in-class are now the baseline. We predict these trends will push businesses to invest in reliability capabilities with even greater urgency. As Mindy Stevenson, Blameless SRE Manager, emphasizes: "Waiting until an outage hits home isn't a strategy. You have to be proactive."

A broadening scope for SREs

As reliability becomes fundamental to a company's ability to operate, we predict the SRE role will come into its true potential instead of being limited by partial implementation. Jake Englund, Sr. SRE at Blameless, put it this way: "If SREs are currently like mechanics, fixing cars when they crash, SREs will become more like civil engineers, focusing more on designing the roads for cars." Reliability starts in design, and we expect reliability engineers to become increasingly involved in the earliest stages of a project, such as architecture and prototyping. We also see the knowledge base for this role shifting toward learning on the job rather than arriving with established, specific expertise. As toolstacks become more complex and more specialized to each team and purpose, excelling as an SRE depends more on the ability to learn continuously than on what you already know.

A holistic understanding of users

SRE has always been about alignment based on user expectations. We predict that in 2022, organizations will develop a deeper, more holistic understanding of who their users are. Rather than treating users as a single entity, organizations will dig into the specific experiences different users have. How does each group use your service, and what matters to them when they do? Analysis of end-user cohorts will expand to include internal roles: what is the experience of someone hired to administer or manage your service, and what does reliability mean to them? Because this requires a view of reliability that spans the entire organization, Craig Peters, Blameless Head of Product, sees it as potentially leading to the creation of a Chief Reliability Officer, a role that already exists in manufacturing industries. Software reliability practices have often followed the lead of manufacturers like Toyota; could recognition of reliability in the C-suite come next?

Finding the true potential of SLOs

Service level objectives (SLOs) have moved into the spotlight as one of the most recognized aspects of SRE. While the concept is popular, orgs are still figuring out how to put it into practice. Kurt Andersen, Blameless Head of Strategy and former SRE at LinkedIn, thinks SLOs are still on the "upswing" of the hype cycle. Teams expect too much and commit too little, or they call any metric an SLO and end up with unexpected results. Despite this, we predict organizations will come to understand SLOs better in the new year. Orgs are investing more in this initiative, for example by researching user journeys (another area where we anticipate orgs will grow) and by adopting tools that help them track and measure user happiness. As more orgs start to envision the full potential of SLOs, they'll also come to value the other aspects of SRE that support and inform them. Mindy Stevenson explains that many organizations think SLOs are the first and most important step in SRE, when in reality the sequence for implementing SRE can vary significantly from one org to another.
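
To make the difference between "any metric" and an SLO a little more concrete, here is a minimal Python sketch. It assumes a request-based service level indicator (the share of "good" requests in a measurement window) compared against a target such as 99.9%, with the leftover tolerance expressed as an error budget. The `Window` type and the function names are hypothetical illustrations, not part of any Blameless product or other real API.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Hypothetical event counts for one measurement window (e.g. 30 days)."""
    good_events: int   # requests that met the user-facing criterion
    total_events: int  # all eligible requests in the window


def sli(window: Window) -> float:
    """Service level indicator: fraction of good events in the window."""
    if window.total_events == 0:
        return 1.0  # assumption: a window with no traffic counts as meeting the objective
    return window.good_events / window.total_events


def meets_slo(window: Window, slo_target: float) -> bool:
    """An SLO is a target on the SLI over the window, not just a raw metric."""
    return sli(window) >= slo_target


def error_budget_remaining(window: Window, slo_target: float) -> float:
    """Fraction of the error budget left; a negative value means the SLO is breached."""
    allowed_bad = (1.0 - slo_target) * window.total_events
    actual_bad = window.total_events - window.good_events
    if allowed_bad == 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 0.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


w = Window(good_events=999_400, total_events=1_000_000)
print(f"SLI={sli(w):.4%} met={meets_slo(w, 0.999)} budget_left={error_budget_remaining(w, 0.999):.0%}")
```

With 999,400 good requests out of 1,000,000 and a 99.9% target, 1,000 bad requests are allowed and 600 occurred, so the SLO is met with about 40% of the error budget remaining. A real SLO would also depend on choosing an SLI that reflects what users actually care about, which is where the user-journey research mentioned above comes in.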

Kurt Andersen is a practitioner and an active thought leader in the SRE community. He speaks at major DevOps & SRE conferences and publishes his work through O'Reilly in quintessential SRE books such as Seeking SRE, What is SRE?, and 97 Things Every SRE Should Know. Before joining Blameless, Kurt was a Sr. Staff SRE at LinkedIn, implementing SLOs (reliability metrics) at scale for thousands of independently deployable services. Kurt is a member of the USENIX Board of Directors and part of the steering committee for the worldwide SREcon conferences.

Published Thursday, December 30, 2021 7:32 AM by David Marshall