Virtualization Technology News and Information
VMblog Expert Interview: Matt Schillerstrom of Gremlin Talks Chaos Engineering and Describes New Status Checks Feature

Gremlin is the world's first fully hosted SaaS offering for Chaos Engineering. Its new feature, "Status Checks," makes doing Chaos Engineering much safer in production. In this VMblog interview, we spoke with Matt Schillerstrom, former engineer at Target and current product manager at Gremlin.

VMblog:  Chaos Engineering is a topic we are seeing discussed more and more.  Why do you think the interest is growing so fast in this topic that's really been around for quite some time now?

Matt Schillerstrom:  It's certainly true that Chaos Engineering has been talked about in the tech world for some time, especially since Netflix open-sourced Chaos Monkey back in 2012. And in a lot of ways, that's been a gift and a curse for the practice. When we talk to potential customers, Chaos Monkey is often their first point of reference, so it's certainly helpful that they have an existing understanding of the concept.

At the same time, the idea of "randomly" breaking things -- which is what Chaos Monkey offers AWS users -- is not the approach we advocate for at Gremlin. There's a time and a place for randomness, but more often the value comes from having a hypothesis and running a targeted experiment. We don't want to introduce chaos -- we want to control it. Like a vaccine, the idea is to add some controlled harm upfront in order to build longer-term immunity.

So yes, the practice has been around, especially at large organizations that may have internal tooling for doing Chaos Engineering. But it's only in the last couple of years that there's been a company dedicated to helping other organizations perform Chaos Engineering safely and effectively.

VMblog:  Do you think safety concerns have held some organizations back from doing Chaos Engineering?

Schillerstrom:  Definitely. And that's understandable. At Gremlin, we don't recommend going out and breaking production systems if you're not comfortable doing that. Start on staging, build up that muscle memory, train your teams to respond effectively, and then eventually move on to the production systems where your customers live.

It's as much about process as it is about product. We say things like "control the blast radius" because the idea is that you want to run the smallest possible experiment that will teach you something. But on the product side, we know there's more we can do to ease the concerns of our customers. That's why since launch we've had a big red HALT button to roll back attacks and return your infrastructure to a steady state. It's features like these that set Gremlin apart from open-source solutions, and why we're excited to launch Status Checks today.

VMblog:  How would you describe Status Checks?

Schillerstrom:  It's essentially a proactive halt button. In other words, our customers love being able to halt an attack at any time -- but what if the attack should be prevented from running in the first place? Let's take an obvious example of how we use Status Checks internally at Gremlin. We hook into PagerDuty, and if it registers that there's an active incident, then no chaos experiments can run on that infrastructure. Because what's the point of adding chaos to a system that's already failing?

You want infrastructure that's healthy and ready. This allows you to create a clear hypothesis, measure the results, understand the impact and ultimately learn about your system. Experimenting in environments that may already be compromised is not the right way to do Chaos Engineering. So it was sort of a no-brainer to let our customers build Status Checks into their planned experiments, and proactively set the conditions that would halt an attack.
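
To make the idea of a proactive halt concrete, here is a minimal sketch in Python of a precondition gate of this kind. It is not Gremlin's API and it does not call PagerDuty; `StatusCheck`, `failing_checks`, `run_experiment`, and the in-memory `active_incidents` list are all hypothetical names chosen for illustration. The sketch assumes each check is a function that returns True when the system looks healthy, and that checks are evaluated before the experiment starts and again before each step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StatusCheck:
    """A named precondition (hypothetical type, not a Gremlin object)."""
    name: str
    is_healthy: Callable[[], bool]  # returns True when the system looks healthy


def failing_checks(checks: List[StatusCheck]) -> List[str]:
    """Return the names of all checks that report an unhealthy system."""
    failed = []
    for check in checks:
        try:
            healthy = check.is_healthy()
        except Exception:
            # Fail closed: if health cannot be determined, do not add chaos.
            healthy = False
        if not healthy:
            failed.append(check.name)
    return failed


def run_experiment(checks: List[StatusCheck],
                   steps: List[Callable[[], None]]) -> str:
    """Run steps in order, re-evaluating every check before each step."""
    for index, step in enumerate(steps):
        failed = failing_checks(checks)
        if failed:
            # Before the first step the experiment never starts ("blocked");
            # after that, the remaining steps are skipped ("halted").
            phase = "blocked" if index == 0 else "halted"
            return f"{phase}: {', '.join(failed)}"
        step()
    return "completed"


# Assumption: this list stands in for a query to an incident tracker.
active_incidents: List[str] = []
no_open_incident = StatusCheck("no-open-incident",
                               lambda: len(active_incidents) == 0)

log: List[str] = []
result = run_experiment([no_open_incident],
                        [lambda: log.append("inject latency")])
```

With no open incident the single step runs and `result` is `"completed"`. If `active_incidents` is non-empty before the call, the function returns a string starting with `"blocked"` and no step runs. Treating a check that raises an error as unhealthy is a design choice in this sketch: when health is unknown, the safer default is not to run the experiment.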

VMblog:  What kinds of organizations do you see benefiting the most from this?

Schillerstrom:  The demand is highest from our larger customers, such as those in the finance industry, who will really feel the pain if an attack has an unwanted consequence. It's tricky because the companies that will get the most value from Chaos Engineering are often the ones that will also feel the most pain from an experiment gone wrong. But that's why it's important to pick the right tool for the job, and Gremlin is working hard to minimize the possible negative side effects from doing Chaos Engineering, while maximizing the results and learnings.


Matt Schillerstrom is a product manager at Gremlin. Before joining, he was a lead engineer focused on Disaster Recovery and Prevention at Target. He also co-hosts the Chaos Engineering meetup in the Twin Cities.

Published Tuesday, June 23, 2020 10:05 AM by David Marshall