Fine-tuning Your Incident Response with Chaos Engineering

[ This article is part of a series promoting ChaosConf -- the world's largest chaos engineering event hosted by Gremlin on October 6-8, 2020. Register free! ]

Article by Julie Gunderson, DevOps Advocate, PagerDuty

Chaos. Monkeys. Failure. At first glance, Chaos Engineering sounds intimidating, but it's not as scary as it appears. With some planning and thought, experimenting on systems in the form of Chaos Engineering is something any company can (and should) do.

My first introduction to Chaos Engineering was in 2015, when I met some amazing folks from Netflix who gave me some cool stickers and a fun, yet remarkably aggressive-looking, squishy chaos monkey.


I proudly placed my stickers on my laptop and didn't think much about it until a few months later, when I was asked to explain the collage of pulp-fiction-like characters. That was the moment I went down the monkey hole of what Chaos Engineering is and why organizations practice it. Flash forward a few years to DevOpsDays Minneapolis, where I had the opportunity to see Chaos Engineering in action at a Gremlin workshop, and I was hooked.

Over the years I learned that Chaos Engineering is not about breaking things in production; it is really about creating and then testing a hypothesis. SCIENCE! It turns out Chaos Engineering isn't about chaos at all. It is about the science behind what we think our systems are going to do, how things might fail, and then learning from those failures to make our systems more reliable. The Principles of Chaos Engineering puts it rather simply:

"Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production."

It's completely understandable that folks get nervous when they hear "chaos," "injecting failure," "releasing the gremlins," and other similar terms. However, when people have the opportunity to really dive in and understand the process, from design to measurement, Chaos Engineering becomes much less daunting. In overly simplified terms, Chaos Engineering is about understanding the steady state of the environment, forming a hypothesis, planning the test, having measurements in place, running the experiment(s), and analyzing the results. One of my favorite quotes is from Bruce Wong, Director of Engineering at Stitch Fix:

"Saying you're getting your systems ready for Chaos Engineering is like saying you are getting in shape to go to the gym." 

This is not to say that you shouldn't prepare. Understanding the current state of your systems, designing the experiment, and planning out the parameters, including when to back out, are all important, but after some time you just have to jump right in and do it. While it is recommended to run your chaos experiments in production, not all experiments and systems are the same; it's OK to run them in test environments or, for those who are exceptionally wary, to start with tabletop thought exercises. The important thing is that you start and get some practice under your belt.
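The steps above (check the steady state, state a hypothesis, decide when to back out, measure while the fault is active, then analyze) can be sketched in a few lines of Python. This is only an illustrative sketch, not PagerDuty's or Gremlin's tooling: the function `run_experiment`, its parameters, and the idea of watching a single numeric metric such as an error rate are all assumptions made for the example. The callables `measure`, `inject_fault`, and `rollback` are placeholders you would wire up to your own monitoring and fault-injection tools.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    hypothesis_held: bool  # metric stayed within the steady-state bound
    aborted: bool          # the back-out threshold was crossed
    samples: List[float]   # measurements taken while the fault was active


def run_experiment(
    measure: Callable[[], float],       # hypothetical: returns e.g. an error rate
    inject_fault: Callable[[], None],   # hypothetical: starts the planned fault
    rollback: Callable[[], None],       # hypothetical: undoes the fault
    steady_state_max: float,            # hypothesis: metric stays at or below this
    abort_threshold: float,             # back out as soon as the metric exceeds this
    num_samples: int = 5,
) -> ExperimentResult:
    # 1. Confirm the system is in its steady state before touching anything.
    if measure() > steady_state_max:
        raise RuntimeError("system is not in steady state; not starting")

    samples: List[float] = []
    aborted = False

    # 2. Inject the fault, and always roll back, even if measuring fails.
    inject_fault()
    try:
        for _ in range(num_samples):
            value = measure()
            samples.append(value)
            # 3. Back out early if the planned abort condition is met.
            if value > abort_threshold:
                aborted = True
                break
    finally:
        rollback()

    # 4. Analyze: the hypothesis held only if every sample stayed in bounds.
    hypothesis_held = not aborted and all(s <= steady_state_max for s in samples)
    return ExperimentResult(hypothesis_held, aborted, samples)
```

The same skeleton works for a tabletop exercise or a test environment: swap in fake `measure` and `inject_fault` functions first, then point them at real systems once the team is comfortable with the flow.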

At PagerDuty, we started running Failure Fridays in 2013, and we have learned quite a few things over the years. Practicing Chaos Engineering helped us uncover implementation issues and discover deficiencies before they could become contributing factors in future incidents. Mostly, though, Chaos Engineering helped PagerDuty build a culture of trust and learning. The process reminded us that failure happens and is an opportunity to gain knowledge, that people will make mistakes, and that blame is not beneficial. Engineers also gained a better understanding of how their code performed in production.

Chaos Engineering goes beyond the systems, though. It has the added benefit of exercising your incident response process, fine-tuning alerts and monitoring systems, and training new folks to be better prepared to handle incidents in the future. At PagerDuty, Failure Fridays have become part of the culture; we strive to make them an experience that brings learning and people together. Process iteration is a key element of practicing Chaos Engineering. As you start, think about ways to automate manual commands, add formal documentation for planned faults, and implement checklists for future sessions. For your organization, it's important to remember that you don't have to do everything perfectly the first time. Chaos Engineering is about learning and iterating, building a blameless culture, and understanding and improving the resilience of your systems.

To learn more about the practice of Chaos Engineering, join us at Chaos Conf, where the PagerDuty team will be hosting a workshop on the incident response process. As always, we love conversations here at PagerDuty and will be continuing this one in the PagerDuty community forums. Let us know what you think.


About the Author

Julie Gunderson 

Julie Gunderson is a DevOps Advocate at PagerDuty who has advocated DevOps best practices for the last six years. In her previous role, Julie was also responsible for building partnerships with the major cloud providers. Julie loves working with people, advocating best practices, and building relationships. She is a founding member and organizer of DevOpsDays Boise and an organizer of DeliveryConf. When Julie isn't working, she is most likely making jewelry out of circuit boards or traipsing around the mountains in Idaho.

Published Thursday, September 03, 2020 7:38 AM by David Marshall