VMblog Expert Interview: Blameless Talks Site Reliability, DevOps, Communications During Incidents, Culture & More


With everything happening in the reliability landscape in 2021, VMblog reached out to industry expert Kurt Andersen, SRE Architect at Blameless, to learn more about the latest in reliability challenges that companies face, moving from reactive to proactive, and how to handle incidents, major or minor.

VMblog: What are the biggest challenges companies are facing regarding incidents?

Kurt Andersen: We continue to see many organizations, and specifically senior leaders, who don't understand that incidents are inevitable. Unless incidents are viewed as an opportunity to learn, it is difficult for teams to keep building both tribal and system-level knowledge.

Even with the best preventative, proactive engineering, today's highly interdependent, complex landscape means that your product or service will have outages or incidents. Those experiences provide fertile ground for teams and organizations to learn and adapt, but only if the organizational culture is willing to embrace the learning.

The other big challenge that groups face when trying to learn from incidents is the difficulty of collecting the relevant data across a wide span of different tools and various communications channels, such as Slack or Microsoft Teams. Effectively capturing the moving parts among the responding teams really helps with post-incident analysis.

We plan to announce a new feature, CommsFlow, that streamlines all communication during and after incidents, significantly speeding up the important task of delivering information-rich, timely updates to customers and internal teams. Using customizable templates, it sends updates automatically at workflow transitions or as task reminders for on-call engineers.
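To make the idea of template-driven updates at workflow transitions concrete, here is a minimal sketch in Python. It is not the CommsFlow implementation or any Blameless API; the `Incident` class, the state names, and the `TEMPLATES` mapping are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from string import Template

# Hypothetical message templates, keyed by the workflow state being entered.
TEMPLATES = {
    "investigating": Template("[$severity] $title: we are investigating."),
    "mitigated": Template("[$severity] $title: a mitigation is in place."),
    "resolved": Template("[$severity] $title: the incident is resolved."),
}


@dataclass
class Incident:
    """Hypothetical incident record that logs an update on each transition."""
    title: str
    severity: str
    state: str = "detected"
    sent_updates: list = field(default_factory=list)

    def transition(self, new_state: str) -> None:
        """Move to new_state and record an update if a template exists for it."""
        self.state = new_state
        template = TEMPLATES.get(new_state)
        if template is not None:
            message = template.substitute(title=self.title, severity=self.severity)
            # A real system would post to a chat channel or status page here;
            # this sketch only keeps the message in memory.
            self.sent_updates.append(message)


incident = Incident(title="Checkout latency", severity="SEV2")
incident.transition("investigating")
incident.transition("resolved")
for update in incident.sent_updates:
    print(update)
```

Keeping the templates in a plain mapping means the wording of updates can change without touching the transition logic, which is the property the customizable templates described above are meant to provide.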

VMblog: What is the number one thing that makes companies vulnerable to incidents?

Andersen: There is no one thing that makes companies vulnerable to incidents, but rather many, many things. Complex systems, especially cloud-native ones, combined with what Jeff Bezos calls the "divine discontent" of customers, create multiple opportunities for incidents. We often refer to root cause analysis when troubleshooting an issue. However, the term is misleading because there is generally no single root cause.

Even if your product or service perfectly fulfills the needs of your users at this moment, those needs will be different with every passing moment. David Woods expresses this as the Law of Stretched Systems, where "every system is stretched to operate at its capacity. . .as soon as there is some improvement, some new technology, we exploit it to achieve a new intensity and a new tempo of activity" (Woods and Hollnagel, 2006, Joint Cognitive Systems: Patterns in Cognitive Systems Engineering, p. 171).

VMblog: What can companies do to embrace the blameless culture mindset?

Andersen: The biggest cultural and psychological barrier to learning is blame. To overcome it, it is important to decouple actions and their consequences from people's sense of who they are. If a team makes a mistake, it does not mean that the team is a mistake. Some companies have implemented "failure awards", such as Etsy's three-armed sweater award. These help shift attitudes so that failures are seen as learning opportunities.

VMblog: What should SREs be doing to protect their company?

Andersen:  The keys to reliability include:

  • Effective monitoring and observability, because you need to measure, track, and iterate
  • Good architectural practices, so the system scales more easily over time
  • Well-crafted, customer-centric service level objectives (SLOs), to stay focused on the most important parts of your service
  • Effective processes for incident response, with the learnings fed back to improve service and system reliability
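As an illustration of how an SLO turns into something a team can track, the sketch below applies the commonly used error-budget arithmetic to a request-based SLO. The interview does not say how Blameless computes SLOs; the function names, the 99.9% target, and the request counts are assumptions chosen for the example.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Requests that may fail in the window without missing the SLO."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the interval (0, 1]")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves no budget to spend.
        return 0.0
    return (budget - failed_requests) / budget


# Assumed example: a 99.9% availability SLO over one million requests.
budget = error_budget(0.999, 1_000_000)
remaining = budget_remaining(0.999, 1_000_000, failed_requests=250)
print(f"budget: {budget:.0f} failed requests, remaining: {remaining:.0%}")
```

A shrinking remaining budget gives the team a shared signal for when to slow feature work and invest in reliability, which keeps attention on the parts of the service customers care about most.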

VMblog: Give us an update on Blameless and its work with SRE and DevOps.

Andersen: We generally start working with the DevOps teams inside an organization, usually the teams who are on call. Most modern engineering teams rotate on-call duty across a unified group, so it's not only operations. If a subject matter expert (SME) is required, that person can easily be added to a Slack or Microsoft Teams channel once you have identified who is needed. Blameless makes this process efficient: we auto-create the Slack channel for the incident and add all relevant team members so they can quickly start troubleshooting. Troubleshooting requires focused time from engineers, so Blameless orchestrates the entire process, collecting critical data along the path to resolution.

Often organizations want to improve the efficiency of their on-call process from beginning to end. The most important goals are:

  1. Get away from manual work and point solutions. On-call engineers don't want to spend time manually stitching together the playbook or its steps. When teams use multiple tools that don't talk to each other, there is no single thread or centralized place to manage the incident to completion.
  2. Streamline who is involved in triaging an incident. Invite the right team members at the right time. Avoid noisy alerts that wake up everyone; it's simply not necessary, and it causes toil and burnout.
  3. Learn from incidents. As teams learn, everyone levels up and systems improve, so reliability gets better over time and the entire team can focus on innovation by delivering new releases. It's critical to close out incidents in the right way and document exactly what happened.
  4. Share data insights. They are critical not only for helping the team learn but also for keeping the entire company informed as service status changes. Product teams learn how reliable the service is, especially for critical or new features. Go-to-market (GTM) teams and executive staff need insights to make top-line decisions. And of course, end customers need to know as they continue to invest in the offering, whether they are expanding into new markets or buying add-ons to an existing product, which is critical for growth.

VMblog: Tell us about the 2022 SRE Predictions.

Andersen: Every year we take a step back and look at the big picture of the market to spot emerging trends. Often we experience new trends first-hand because our teams work directly with organizations, small to large, across multiple sectors and geographies.

Our process: we hold meetings with our internal engineering and SRE teams and with our customer success team members. Additionally, we talk to industry analysts and thought leaders, because we like to stay ahead of market developments and requests from our customers.

VMblog: You say that Blameless is the "backbone for modern engineering teams." Why?

Andersen: We use the metaphor of a backbone because Blameless is the centralized place where all team members communicate across the runbook steps, carrying out tasks and follow-up actions as they debug the problem. Blameless is the platform that orchestrates each step by integrating with the day-to-day tools DevOps teams use: PagerDuty, Jira, StatusPage, Microsoft Teams, Slack, Opsgenie, and so on. It's the golden thread that runs through all critical steps from beginning to end.

While the on-call team is doing the deep thinking, assessing and fixing the problem, Blameless runs in the background, continuously updating and communicating at the critical milestones.

VMblog: How does Blameless look at reliability?

Andersen: We really believe reliability starts with a shared mindset and discipline that extend beyond just the engineering team.

The company as a whole has to believe that meeting the level of reliability its customers expect is directly correlated with their investment and loyalty. That requires the company to take reliability seriously and to invest across multiple functions of the business, from product to engineering to support and success.

Operations teams shouldn't be left alone with the burden of owning reliability, hamster-wheeling their way through yet another issue or incident. Of course, not all incidents are severe or cause a total outage, but a bad experience is still not ideal for the end customer. Many inferior experiences can actually be worse than one Sev 1 incident or outage.

We tend to break down the reliability journey like this:

  1. Incident management, which covers the entire process flow and data from beginning to end, with a detailed retrospective report for learnings.
  2. SLOs, which are critical to align on. They capture what the team believes is the most critical part of the service that customers rely on. Focusing on those KPIs (so to speak) streamlines the team's focus and helps to avoid treating everything with equal weight, which, let's face it, is unscalable and unrealistic.
  3. Culture, which is equally important: all teams stay focused on maintaining these objectives and adopt a blameless mindset. It's about all teams working together, lifting together, learning, and growing. We believe it's the only way forward.
  4. Lastly, analysis of how teams are performing, so that everyone improves and learns. You need the right data sets to inform where to improve and where to invest going forward.

VMblog: What do companies get wrong about reliability?

Andersen: Reliability is not only about incident response. It's a business imperative. If your service is unreliable, customers will not adopt (or buy) it, and over time that damages the brand. Markets are competitive and customers have high expectations. In many cases the cost of switching is low, so it's very easy to lose customers to the next best offering.

The other thing that companies tend to get wrong is treating reliability as a quick fix. It really is a journey, and it constantly evolves as the business grows. For example, new team members are onboarded, and as services or products change, SLOs will need adjusting. It's an area that companies need to keep focusing on and investing in as the reliability program matures.

Finally, I would add that incidents are a way of learning and improving and so embracing how you do that is critical. Never expect incidents to go away. They may change shape but they will always take place.


Published Tuesday, January 25, 2022 7:33 AM by David Marshall