Virtualization Technology News and Information
Gremlin 2020 Predictions: Chaos is Embraced for Good

VMblog Predictions 2020 

Industry executives and experts share their predictions for 2020.  Read them in this 12th annual series exclusive.

By Matthew Helmke, technical writer for Gremlin

Chaos is Embraced for Good

Two years ago, when someone said "Chaos Engineering," most people would shrug in confusion or indifference. Well, unless they worked for a company like Netflix or Amazon that had been using Chaos Engineering internally for years to help make their systems more reliable.

One year ago, most people would say, "Oh, yeah, I think I've heard of that. It's like that Chaos Monkey thing, right? I'm curious what that is all about."

Gremlin's prediction for 2020 is that most enterprise-level companies will make definitive plans to begin incorporating Chaos Engineering into their site reliability planning and activities. We have moved past questions like "What is Chaos Engineering?" to questions like "How can we get started with Chaos Engineering?" and even "How can we make our Chaos Engineering practice more useful?"

We already have colleagues, community members, and customers using Chaos Engineering to find weaknesses in their systems early through safe, controlled, well-designed experimentation. This allows the flaws to be fixed before they lead to failures that customers notice, that hurt the business, or, worst of all, that cause expensive downtime.

Here are a few ways we're seeing people use Chaos Engineering today, plus a couple of thoughts on new ways we anticipate it will be used soon.

  • Ensuring your disaster recovery runbooks are accurate and up-to-date
  • Experimenting to learn what happens in your system when resources are exhausted, networking is unreliable, your datastore is saturated, or DNS is unavailable
  • Testing to make sure the system is resilient and reliable even when the clocks change, such as for Daylight Saving Time
  • Preparing for a heavy traffic event like Black Friday
  • Reproducing a disaster scenario in a limited and easy-to-halt way to confirm the mitigation put in place is adequate and works
  • Improving availability and reliability to better serve customers
  • Managing change in a world of rapid transformation
  • Demonstrating compliance with regulatory requirements

Chaos Engineering is not random, and it is not reckless. It is, however, still somewhat new. It acknowledges that today's complex architectures are often opaque and nearly impossible for one person to understand at any given moment. Chaos Engineering accepts the premise that the best way to enhance system reliability when faced with such complexity is to find out how the system actually behaves under stimulus. In a chaotic environment, the right way to figure out how the environment works is to take an action and observe the result.

Chaos Engineering involves careful planning and precise actions. Doing it well involves limiting the magnitude of your tests and their potential impact beyond what you are examining. Those who take the time to do so are already saying it results in dramatic improvements. And we've only scratched the surface.
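To make "limited magnitude" and "easy to halt" a little more concrete, here is a minimal sketch in Python of how a controlled experiment might be structured. It is an illustration only, not Gremlin's product or API: the `Experiment` class, its fields, and the simulated `inject`, `rollback`, and `error_rate` callables are all hypothetical names invented for this example. The idea is to touch only a small fraction of hosts, watch one health metric, stop as soon as it crosses a threshold, and always undo the fault afterward.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experiment:
    """Hypothetical in-memory model of one chaos experiment (not a real API)."""
    targets: List[str]                # every host that could be affected
    blast_radius: float               # fraction of targets to touch, 0..1
    inject: Callable[[str], None]     # starts the fault on one host
    rollback: Callable[[str], None]   # removes the fault from one host
    error_rate: Callable[[], float]   # the health metric being observed
    abort_threshold: float            # halt if the metric exceeds this
    log: List[str] = field(default_factory=list)

    def run(self, checks: int = 3) -> bool:
        """Return True if the metric stayed within the threshold."""
        # Limit the magnitude: only a small subset of hosts is touched.
        count = max(1, int(len(self.targets) * self.blast_radius))
        injected: List[str] = []
        try:
            for host in self.targets[:count]:
                self.inject(host)
                injected.append(host)
                self.log.append(f"inject {host}")
            # Take an action, then observe.
            for _ in range(checks):
                if self.error_rate() > self.abort_threshold:
                    self.log.append("abort")
                    return False
            return True
        finally:
            # Always undo the fault, whether the run passed or aborted.
            for host in injected:
                self.rollback(host)
                self.log.append(f"rollback {host}")


# Simulated system: each faulted host adds one point of error rate.
faulted = set()
experiment = Experiment(
    targets=[f"host-{i}" for i in range(10)],
    blast_radius=0.2,
    inject=faulted.add,
    rollback=faulted.discard,
    error_rate=lambda: 0.01 * len(faulted),
    abort_threshold=0.05,
)
passed = experiment.run()
print(passed, experiment.log)
```

In this simulation only two of the ten hosts are ever faulted, and the `finally` block guarantees the rollback runs even when the experiment aborts. A real practice would replace the simulated callables with actual fault injection and monitoring, which this sketch does not attempt to show.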


About the Author

Matthew Helmke

Matthew Helmke is a technical writer for Gremlin, which is dedicated to making the internet more reliable. He has been sitting at a keyboard since buying his first computer in 1981 and has enjoyed doing things with big computers since studying LISP on a VAX in 1985. He has written books about Linux, VMware, and other topics. Matthew always enjoys those moments when no one else notices something has failed because mitigation schemes work as designed.

Published Wednesday, December 11, 2019 7:36 AM by David Marshall