Virtualization Technology News and Information
Left Shifting Your SLOs with Chaos

By Michael Friedrich, Senior Developer Evangelist at GitLab

Measuring the performance of an application isn't easy, especially as systems have grown increasingly complex over the last decade. Measuring a system's response to a production incident is harder still. Manually simulating incidents in an environment isn't always enough - it can be too predictable, and it doesn't cover the wide spectrum of possible incidents.

Application bugs may be introduced with new features, or with a small yet impactful Git commit. Without a testing environment, the problem can get deployed to production and, in some cases, released to customers and users. Developers may only fix the bug after long hours of debugging ("works on my machine and CI/CD is green too"). It doesn't stop there - future regressions can introduce similar problems, ultimately leading to developer burnout. What if there were a way to simulate a production incident by failing the application, monitoring the application metrics, and detecting the problems early in CI/CD pipelines?

Enter chaos engineering: breaking environments unpredictably in order to simulate potential incidents.

Traditional monitoring workflows with metrics, combined with observability practices that correlate events and signals, can help observe the environment. Service Level Objectives (SLOs) define the error budgets, and platforms help correlate and visualize the observability data. Developers and SREs instrument application code with metrics and distributed tracing to provide even more whitebox insights. Your SLOs need to be well understood and simulated early in the development process. With all of the new building blocks coming into play - Continuous Delivery, quality gates and chaos engineering - it may seem complex to shift SLOs left with chaos in your CI/CD pipelines.

Let's discuss the benefits of chaos engineering to test your applications and how service level objectives (SLOs) can add to the bigger picture.

Getting Started with SLOs and Chaos Engineering

SLOs are critical for SREs and DevOps engineers to determine what they want to achieve, and how teams can measure the reliability of an application and/or environment. For example, a team might guarantee customers 99.5% availability in an SLA (Service Level Agreement) while aiming for 99.9% as its internal SLO (Service Level Objective). A crashing application, or an undetected performance regression, can lead to on-call alerts and active incident handling that require all teams to collaborate. Such firefighting takes time away from the next development milestone and the features your users rely on.
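To make the gap between the two targets concrete, the following minimal sketch converts an availability target into an error budget, i.e. the downtime the target still allows. The 30-day window and the function name `error_budget` are assumptions chosen for illustration; they are not prescribed by the article.

```python
from datetime import timedelta


def error_budget(target_percent: float, window: timedelta) -> timedelta:
    """Return the downtime an availability target still allows in a window."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    allowed_fraction = 1 - target_percent / 100
    return timedelta(seconds=window.total_seconds() * allowed_fraction)


# Assumption: a rolling 30-day measurement window.
window = timedelta(days=30)
sla_budget = error_budget(99.5, window)  # the customer-facing promise
slo_budget = error_budget(99.9, window)  # the stricter internal objective

print(f"SLA 99.5% allows {sla_budget} of downtime per 30 days")
print(f"SLO 99.9% allows {slo_budget} of downtime per 30 days")
```

Over 30 days, 99.5% leaves about 3 hours 36 minutes of downtime, while 99.9% leaves only about 43 minutes. The difference between the two is the safety margin a team has before an internal SLO miss turns into a customer-facing SLA breach.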

Once you add chaos to your development process early on, you can test different failure scenarios to trigger alerts, and see how the application behaves. From there, you can determine whether your SLOs still hold, and identify actions and improvements depending on the results. All of this can happen behind quality gates before the code reaches production environments, allowing for analysis and fixing in a controlled environment and timeframe.

If the application connects to other instances over the network, making DNS resolution fail is a great way to see what happens. Another example is to implement a proxy that slows down the connections, and see if the data stream causes memory leaks. High availability and load balancing requests can be tested with chaos experiments that delete and/or restart pods at random times.
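As a small-scale illustration of the DNS scenario, the sketch below makes every lookup fail inside a single Python process and checks that the calling code degrades gracefully. This is an in-process stand-in, not how a chaos framework works: in a cluster the failure would be injected at the network level. The function names, the hostname, and the `"unavailable"` fallback are hypothetical.

```python
import socket
from unittest import mock


def fetch_backend_address(hostname: str, fallback: str = "unavailable") -> str:
    """Resolve a backend hostname, returning a fallback if DNS fails."""
    try:
        infos = socket.getaddrinfo(hostname, 443)
    except socket.gaierror:
        return fallback
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address.
    return infos[0][4][0]


def run_dns_chaos_experiment(hostname: str) -> str:
    """Call the application code while every DNS lookup raises an error."""
    failure = socket.gaierror("simulated DNS outage")
    with mock.patch("socket.getaddrinfo", side_effect=failure):
        return fetch_backend_address(hostname)


# Hypothetical hostname; no real lookup happens while the patch is active.
result = run_dns_chaos_experiment("backend.example.internal")
print(f"Result during simulated DNS outage: {result}")
```

If the application code did not catch `socket.gaierror`, the experiment would surface an unhandled exception instead of the fallback, which is exactly the kind of finding a chaos experiment is meant to produce before production does.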

The best way to gain confidence with chaos engineering is to implement the practice within cloud-native deployments. Chaos engineering can be overwhelming to learn - the goal when getting started should always be to document steps and results, and build better alerts and incident management processes based on your learnings.

Implementing Chaos into Workflows

Left shifting your SLOs means making instrumentation and observability part of your workflow, and using cloud-native tooling to scale and add chaos to container clusters in order to see if the SLOs still hold. A chaos workflow can follow these steps:

  • Start with an app deployment into a Kubernetes cluster, for example, podtato-head
  • Deploy Prometheus using the Operator
  • Define SLOs as alerts, e.g. for application uptime/probes, to get notified about later failures
  • Deploy a chaos framework into the cluster, e.g. Litmus Chaos or Chaos Mesh
  • Create a chaos experiment, e.g. one that randomly deletes pods or intercepts HTTP traffic
  • Run the chaos experiments and verify that the SLO alerts fire
  • Implement a quality gate that measures the SLOs and updates CI/CD platforms in merge/pull requests before they get merged to the main branch
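The final step, the quality gate, can be sketched as a small script that a CI/CD job runs after the chaos experiment. The sketch below assumes availability is read from the Prometheus HTTP API instant-query endpoint (`/api/v1/query`); the function names, the sample payload, the label values, and the 0.999 target are hypothetical. The demo uses a canned response so that it runs without a cluster.

```python
import json
import urllib.parse
import urllib.request


def query_prometheus(base_url: str, promql: str, timeout: float = 10.0) -> dict:
    """Run an instant query against the Prometheus HTTP API."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def availability_from_response(payload: dict) -> float:
    """Extract one availability ratio (0.0 to 1.0) from an instant-vector result."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    result = payload["data"]["result"]
    if len(result) != 1:
        raise ValueError(f"expected exactly one series, got {len(result)}")
    _timestamp, value = result[0]["value"]  # sample values arrive as strings
    return float(value)


def quality_gate(availability: float, slo_target: float) -> bool:
    """Return True when the measured availability meets the SLO target."""
    return availability >= slo_target


# Canned response in the shape of an instant-vector result, standing in for
# query_prometheus(...) so the sketch runs without a Prometheus server.
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "podtato-head"}, "value": [1651562000.0, "0.9975"]}
        ],
    },
}

measured = availability_from_response(sample_payload)
gate_passed = quality_gate(measured, slo_target=0.999)
print(f"availability={measured:.4f} gate_passed={gate_passed}")
```

In a pipeline, the job would call `query_prometheus` with a query such as `avg_over_time(up{job="podtato-head"}[5m])` (the job label is an assumption) and exit with a non-zero status when `gate_passed` is false, which marks the merge/pull request as failing before it reaches the main branch.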

In addition to the existing chaos experiments, frameworks provide SDKs and integrations to create your own. This is a great way to simulate, for example, TLS clients that do not close their connections correctly.

Chaos engineering will help unveil more observability data requirements: additional metrics, traces, and logs that provide application insights, so that everyone can quickly identify where in the environment an issue stems from - the application itself, or the deployment. Metrics, traces and logs should be added proactively to the code, and embedded into the development guidelines. Additionally, instrumentation frameworks such as OpenTelemetry provide a common specification for everyone to build upon.

From DIY Monitoring to Observability

Chaos engineering, when paired with observability practices, can provide a way to prevent production failures, and detect them early in the development process. Seeing the value in metrics, logs, traces, and events may also lead to collecting data that provides even more - yet unknown - insights. DevOps engineers, SREs, developers, etc. can create new SLOs based on existing observability data, and have them reviewed in merge/pull requests.

There are many building blocks, and they can be overwhelming. Start with boring solutions, and add metrics and tracing to your observability workflows. Define SLOs and test chaos engineering with the smallest experiments. Iterate on chaos experiments, adopt best practices from the community, and evaluate new observability approaches, including eBPF and auto-instrumentation. Educate your teams, and make everyone see the value in observability and chaos engineering.

To learn more about why SLOs are important, and the benefits of adding chaos engineering to your deployments, including real-world examples from GitLab Senior Developer Evangelist Michael Friedrich, join his talk, "From Monitoring to Observability: Left Shift your SLOs with Chaos", at KubeCon EU.


***To learn more about containerized infrastructure and cloud native technologies, consider joining us at KubeCon + CloudNativeCon Europe 2022, May 16-20.


Michael Friedrich Senior Developer Evangelist at GitLab


Michael Friedrich is a Senior Developer Evangelist at GitLab focussing on Observability, SRE, and Ops. He studied Hardware/Software Systems Engineering and moved into DNS and monitoring development at the University of Vienna. Michael was a maintainer of an OSS monitoring software for 11 years before joining GitLab. He loves to help educate everyone and regularly speaks at events and meetups. Michael co-founded the #EveryoneCanContribute cafe meetup group to learn cloud-native & DevOps. Michael is a Polynaut advisor at Polywork, created as a learning platform for Observability, and shares insights in the newsletter.

Published Tuesday, May 03, 2022 7:32 AM by David Marshall