Thundra's VP of Product Emrah Samdan recently spoke at
the Chaos Conference, the world's
largest chaos engineering event. After he spoke, the attendees had some great
follow-up questions. Given the discussion, we figured other people might have
similar questions, so we've curated those questions and Emrah's answers.
VMblog: How confidently can we start experimenting in non-production environments if we
know for sure that test environments are not mirroring production? What are
some strategies you may apply to tackle this?
Emrah Samdan: You don't always have to test the production environment. You can still test
how you respond as a team. Running chaos experiments on production is the end goal,
not the way to start. This way, you can also "learn" how to run chaos
experiments.
VMblog: You talked about latency injections; what other types of anomalies are included
in chaos injections?
Samdan: You can inject different types of failures, play with the concurrent execution
limits of serverless functions, and play with IAM permissions.
VMblog: In your teams, how do you manage/keep track of what chaos experiments have been
performed across all your serverless functions?
Samdan: Well, my preferred way is to create a shared communication channel and to use
the retrospective templates provided by incident management platforms like Opsgenie and
PagerDuty.
VMblog: Could you give us some examples of chaos injections into web applications? And,
how effective is it? What are the metrics that we can use to measure those as
well?
Samdan: You can start really small instead of with "the day after tomorrow"
scenario. For example, you can simply inject latency into your API endpoints
and see whether problems appear in other parts of your system, such
as more items waiting in the queue or problems with DB connections. After that, you
can make it bigger by injecting more latency or injecting latency into more
places.
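As an illustration of the "start small" approach described above, here is a minimal Python sketch of latency injection around a single endpoint handler. The decorator name `inject_latency`, its parameters, and the `get_orders` handler are all hypothetical and are not part of Thundra's or any other product's API.

```python
import random
import time
from functools import wraps


def inject_latency(delay_seconds, probability=1.0, sleep=time.sleep, rng=random.random):
    """Hypothetical decorator: delay a call by delay_seconds with the given probability.

    sleep and rng are injectable so the behavior can be checked without waiting.
    """
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            # rng() returns a float in [0.0, 1.0), so probability=1.0 always
            # injects and probability=0.0 never does.
            if rng() < probability:
                sleep(delay_seconds)
            return handler(*args, **kwargs)
        return wrapper
    return decorator


# Start small: a short delay on half of the calls to one endpoint.
@inject_latency(delay_seconds=0.2, probability=0.5)
def get_orders(user_id):
    # Stand-in for a real API handler (assumption for illustration).
    return {"user_id": user_id, "orders": []}
```

To "make it bigger", you could raise `delay_seconds` or `probability`, or apply the same decorator to more handlers, while watching queue depth and database connections.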
VMblog: Do APM tools come under the chaos engineering umbrella?
Samdan: Not very frequently.
VMblog: What are some of the metrics to measure in chaos attacks?
Samdan: I'd say business-level metrics can be more
important than your system-level metrics. For example, your Apdex score is more
important than anything else, because it reflects the value to your customers. You
should also check application-level metrics such as latency, as well as
infrastructure-level metrics such as CPU usage.
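For readers unfamiliar with Apdex, the sketch below shows how the score is commonly computed and how it might be compared before and during an experiment. Under the standard definition, requests at or below a threshold T count as satisfied, requests between T and 4T count as tolerating at half weight, and slower requests count as frustrated. The sample response times and the 0.5 second threshold are made up for illustration.

```python
def apdex(response_times, threshold):
    """Apdex = (satisfied + tolerating / 2) / total samples."""
    if not response_times:
        raise ValueError("need at least one response time")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    # Anything slower than 4 * threshold is "frustrated" and contributes nothing.
    return (satisfied + tolerating / 2) / len(response_times)


# Illustrative response times in seconds (not real measurements).
baseline = [0.2, 0.3, 0.4, 0.45]
during_experiment = [0.2, 0.6, 1.9, 2.5]

print(apdex(baseline, threshold=0.5))           # 1.0
print(apdex(during_experiment, threshold=0.5))  # 0.5
```

A drop like this during an experiment indicates that the injected failure is reaching customers, even if CPU usage and other system metrics look healthy.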
VMblog: Can we inject chaos attacks on
the database layer?
Samdan: Yes, you can. But Thundra does
it at the application level. For more infrastructure-level chaos on the database,
you can use Gremlin.
VMblog: You said "recursion is deadly in
serverless." Why?
Samdan: You never know whether the base case
actually covers the problematic inputs. You can get stuck in infinite recursion,
and your function may time out.
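To make the base-case problem concrete, here is a small Python sketch of a self-invoking handler whose base case misses some inputs, together with one possible guard: a depth counter carried in the event. The names `handler`, `invoke_locally`, `MAX_DEPTH`, and the event shape are assumptions for illustration. In a real serverless setup, `invoke` would be an asynchronous call to the function itself rather than a local call.

```python
MAX_DEPTH = 5  # assumed recursion budget for this example


class RecursionBudgetExceeded(RuntimeError):
    """Raised when the handler has re-invoked itself too many times."""


def handler(event, invoke):
    n = event["n"]
    depth = event.get("depth", 0)
    # Guard: stop even when the base case below is never reached.
    if depth >= MAX_DEPTH:
        raise RecursionBudgetExceeded(f"stopped after {depth} self-invocations")
    # Base case: only reachable when the starting n is even and non-negative.
    if n == 0:
        return {"status": "done", "depth": depth}
    return invoke({"n": n - 2, "depth": depth + 1})


def invoke_locally(event):
    # Local stand-in for the function invoking itself.
    return handler(event, invoke_locally)


print(handler({"n": 4}, invoke_locally))  # {'status': 'done', 'depth': 2}

try:
    handler({"n": 5}, invoke_locally)  # odd input skips past n == 0
except RecursionBudgetExceeded as exc:
    print(exc)  # stopped after 5 self-invocations
```

Without the depth check, the odd input would keep triggering new invocations until something external stops it, such as a timeout.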
VMblog: Can chaos testing be part of the
CI/CD pipeline?
Samdan: Very, very good question. I think about this a
lot, but automated chaos doesn't seem like much chaos to me. Maybe we
can embed previous chaos experiments into our CI/CD process, just to make sure
everything is still working, but for new game days, there should be a new hypothesis
that you come up with.