[ This article is part of a series
promoting ChaosConf -- the world's largest chaos engineering event hosted by Gremlin on October 6-8, 2020. Register free! ]
Article by Julie Gunderson, DevOps Advocate, PagerDuty
Chaos. Monkeys. Failure. At first glance, Chaos Engineering sounds intimidating,
but it's not as scary as it appears. With some planning and thought,
experimenting on systems in the form of Chaos Engineering is something any
company can (and should) do.
My first introduction to Chaos Engineering was in 2015, when I met some amazing
folks from Netflix who gave me some cool stickers and a fun, yet remarkably
aggressive-looking, squishy chaos monkey.
I proudly placed my stickers on my laptop and didn't think much about it until a
few months later, when I was asked to explain the collage of pulp-fiction-like
characters. That was the moment I went down the monkey hole of what Chaos
Engineering is and why organizations practice it. Flash forward a few years to
DevOpsDays Minneapolis, where I had the opportunity to experience Chaos
Engineering in action at a Gremlin workshop, and I was hooked.
Throughout the years I learned that Chaos Engineering is not about breaking
things in production; it is really about creating and then testing a
hypothesis. SCIENCE! It turns out Chaos Engineering isn't about chaos at all.
It is all about the science behind what we think our systems are going to do,
how things might fail, and then learning from those failures to make our
systems more reliable. The Principles of Chaos Engineering lays this out rather simply:
"Chaos Engineering is the
discipline of experimenting on a system in order to build confidence in the
system's capability to withstand turbulent conditions in production."
It's completely understandable that folks get nervous when they hear chaos,
injecting failure, releasing the gremlins, and other similar terms. However,
when people have the opportunity to really dive in and understand more about
the process, from design to measurement, Chaos Engineering becomes much less
daunting. In overly simplified terms, Chaos Engineering is about understanding
the steady state of the environment, forming a hypothesis, planning the test,
having measurements in place, running the experiment(s), and analyzing the
results. One of my favorite quotes is from Bruce Wong, Director of Engineering
at Stitch Fix:
"Saying you're getting your systems
ready for Chaos Engineering is like saying you are getting in shape to go to
the gym."
This is not to say that you shouldn't prepare. Understanding the current state
of your systems, designing the experiment, and planning out the parameters,
including when to back out, are all important, but after some time you just
have to jump right in and do it. While it is recommended to run your chaos
experiments in production, not all experiments and systems are the same; it's
OK to run them in test environments or, for those who are exceptionally wary,
to start with tabletop thought exercises. The important thing is that you start
and get some practice under your belt.
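
The preparation described above (steady state, hypothesis, measurements, and a
back-out condition) can be sketched in a few lines of Python. This is only a
minimal illustration, not how PagerDuty or Gremlin run experiments:
`run_experiment`, `ToyService`, and every threshold below are hypothetical names
and values made up for the example, and the "fault" just flips a flag on an
in-memory object.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    baseline: float          # steady-state error rate measured before the fault
    observed: List[float]    # error rates sampled while the fault was active
    aborted: bool            # True if the back-out condition was hit
    hypothesis_held: bool    # True if no sample drifted beyond the tolerance


def run_experiment(
    measure: Callable[[], float],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
    tolerance: float,
    abort_threshold: float,
    samples: int = 5,
) -> ExperimentResult:
    """Hypothesis: with the fault active, the metric stays within
    `tolerance` of its steady-state baseline."""
    baseline = measure()              # 1. understand the steady state
    observed: List[float] = []
    aborted = False
    inject_fault()                    # 2. run the experiment
    try:
        for _ in range(samples):
            value = measure()         # 3. measurements in place
            observed.append(value)
            if value >= abort_threshold:
                aborted = True        # 4. know when to back out
                break
    finally:
        rollback()                    # always undo the fault, even on errors
    held = not aborted and all(v - baseline <= tolerance for v in observed)
    return ExperimentResult(baseline, observed, aborted, held)


class ToyService:
    """Hypothetical stand-in for a real system: losing a replica raises
    the error rate from 1% to 4%."""

    def __init__(self) -> None:
        self.replica_down = False

    def error_rate(self) -> float:
        return 0.04 if self.replica_down else 0.01

    def kill_replica(self) -> None:
        self.replica_down = True

    def restore_replica(self) -> None:
        self.replica_down = False


service = ToyService()
result = run_experiment(
    measure=service.error_rate,
    inject_fault=service.kill_replica,
    rollback=service.restore_replica,
    tolerance=0.05,         # assumed acceptable drift from the baseline
    abort_threshold=0.20,   # assumed point at which we stop the experiment
)
print(result)               # 5. analyze the results
```

Swapping `ToyService` for a tabletop scenario, a test environment, or a real
metrics query changes the callables but not the shape of the experiment, which
is why starting small still counts as practice.
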
At PagerDuty, we started running Failure Fridays in 2013, and we have learned
quite a few things over the years. Practicing Chaos Engineering helped us
uncover implementation issues and discover deficiencies before they could
become contributing factors in future incidents. Mostly, though, Chaos
Engineering helped PagerDuty build a culture of trust and learning. We were
reminded during the process that failure happens and is an opportunity to gain
knowledge, that people will make mistakes, and that blame is not beneficial.
Engineers also gained a better understanding of how their code performed in
production.
Chaos Engineering goes beyond the systems, though. It has the added benefit of
letting you practice your incident response process, fine-tune alerts and
monitoring systems, and train new folks to be better prepared to handle
incidents in the future. At PagerDuty, Failure Fridays have become part of the
culture; we strive to make them an experience that brings learning and people
together. Process iteration is a key element of practicing Chaos Engineering,
so as you start out, think about ways to automate manual commands, add formal
documentation for planned faults, and implement checklists for future sessions.
For your organization, it's important to remember that you don't have to do
everything perfectly the first time. Chaos Engineering is about learning and
iterating, building a blameless culture, and understanding and improving the
resilience of your systems.
To learn more about the practice of Chaos Engineering, join us at ChaosConf,
where the PagerDuty team will be hosting a workshop on the Incident Response
process. As always, we love conversations here at PagerDuty and will be
continuing this one in the PagerDuty community forums. Let us know what you
think.
## About the Author
Julie Gunderson is a DevOps Advocate at PagerDuty who has
advocated for DevOps best practices over the last six years. Along
with advocacy, in her past role Julie was responsible for building partnerships
with the major clouds. Julie loves working with people, advocating best
practices, and building relationships. Julie is a founding member and organizer
of DevOpsDays Boise, and an organizer of DeliveryConf. When Julie isn't working
she is most likely making jewelry out of circuit boards, or traipsing around
the mountains in Idaho.