Gremlin is the world's
first fully hosted SaaS offering for Chaos Engineering. Their new feature
'Status Checks' makes doing Chaos Engineering much safer in production. In this
VMblog interview, we spoke with Matt Schillerstrom, former engineer at Target and
current product manager at Gremlin.
VMblog: Chaos
Engineering is a topic we are seeing discussed more and more. Why do you think
the interest is growing so fast in this topic that's really been around for
quite some time now?
Matt Schillerstrom: It's certainly true that Chaos Engineering has been talked about in the
tech world for some time, especially since Netflix open-sourced Chaos Monkey
back in 2012. And in a lot of ways, that's been a gift and a curse for the
practice. When we talk to potential customers, Chaos Monkey is often their first
point of reference, so it's certainly helpful that they have an existing
understanding of the concept.
At the same time, the idea of "randomly" breaking things -- which is what Chaos
Monkey offers AWS users -- is not the approach we advocate for at Gremlin.
There's a time and a place for randomness, but more often the value comes from
having a hypothesis and running a targeted experiment. We don't want to
introduce chaos -- we want to control it. Like a vaccine, the idea is to add
some controlled harm up front in order to build longer-term immunity.
So yes, the practice has been around, especially at large
organizations that may have internal tooling for doing Chaos Engineering. But
it's only been in the last couple of years that there's been a company dedicated to
helping other organizations perform Chaos Engineering safely and effectively.
VMblog: Do
you think safety concerns have held some organizations back from doing Chaos
Engineering?
Schillerstrom: Definitely. And that's understandable. At Gremlin, we
don't recommend going out and breaking production systems if you're not
comfortable doing that. Start on staging, build up that muscle memory, train
your teams to respond effectively, and then eventually move on to the
production systems where your customers live.
It's just as much about process as it is about product. We say things like
"control the blast radius" because the idea is that you want to run the
smallest possible experiment that will teach you something. But on the product
side, we know there's more we can do to ease the concerns of our customers.
That's why since launch we've had a big red HALT button, to roll back attacks
and return your infrastructure to a steady state. It's features like these that
make Gremlin stand apart from open-source solutions, and why we're
excited to launch Status Checks today.
VMblog: How
would you describe Status Checks?
Schillerstrom: It's essentially a proactive halt button. In other words,
our customers love being able to halt an attack at any time -- but what if the
attack should be prevented from running in the first place? Let's take an
obvious example of how we use Status Checks internally at Gremlin. We hook into
PagerDuty, and if it registers that there's an active incident, then no chaos
experiments can run on that infrastructure. Because what's the point of adding
chaos to a system that's already failing?
You want infrastructure that's healthy and ready. This allows you to create a
clear hypothesis, measure the results, understand the impact and ultimately
learn about your system. Experimenting in environments that may already be
compromised is not the right way to do Chaos Engineering. So it was sort of a
no-brainer to let our customers build Status Checks into their planned
experiments, and proactively set the conditions that would halt an attack.
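As an editorial illustration of the idea described above, here is a minimal sketch of a pre-flight gate that refuses to start an experiment while an incident is open. It is not Gremlin's API or implementation: `StatusCheck`, `failing_checks`, `run_experiment`, and the `open_incidents` list (a stand-in for an incident feed such as PagerDuty) are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StatusCheck:
    """A named health probe; is_healthy returns True when it is safe to proceed."""
    name: str
    is_healthy: Callable[[], bool]


def failing_checks(checks: List[StatusCheck]) -> List[str]:
    """Return the names of all checks that currently report unhealthy."""
    return [check.name for check in checks if not check.is_healthy()]


def run_experiment(attack: Callable[[], None], checks: List[StatusCheck]) -> str:
    """Run the attack only if every status check passes beforehand."""
    failed = failing_checks(checks)
    if failed:
        # Proactive halt: the attack never starts on unhealthy infrastructure.
        return "blocked: " + ", ".join(failed)
    attack()
    return "ran"


# Hypothetical stand-in for an incident feed such as PagerDuty.
open_incidents: List[str] = ["checkout latency"]
attack_log: List[str] = []

no_active_incident = StatusCheck(
    name="no-active-incident",
    is_healthy=lambda: len(open_incidents) == 0,
)

# Blocked while an incident is open, allowed once it is resolved.
print(run_experiment(lambda: attack_log.append("cpu attack"), [no_active_incident]))
open_incidents.clear()
print(run_experiment(lambda: attack_log.append("cpu attack"), [no_active_incident]))
```

The point of the design is that the check is evaluated before the attack begins, so a failing condition prevents the experiment rather than interrupting it.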
VMblog: What
kinds of organizations do you see benefiting the most from this?
Schillerstrom: The demand is highest from our larger customers, such as those in
the finance industry, who will really feel the pain if an attack has an
unwanted consequence. It's tricky because the companies that will get the most
value from Chaos Engineering are often the ones that will also feel the most
pain from an experiment gone wrong. But that's why it's important to pick the
right tool for the job, and Gremlin is working hard to minimize the possible
negative side effects from doing Chaos Engineering, while maximizing the
results and learnings.
##
Matt Schillerstrom
is a product manager at Gremlin. Before joining, he was a lead engineer focused
on Disaster Recovery and Prevention at Target. He also co-hosts the Chaos
Engineering meetup in the Twin Cities.
Try Gremlin for free: gremlin.com/free