Virtualization Technology News and Information
What the Kubernetes Ecosystem Can Learn From Control Theory

By Daniel "Spoons" Spoonhower, CTO and Co-founder, Lightstep

Innovation is often measured in terms of how fast we are going. But less often do we talk about innovation in terms of knowing how fast we are going. Of course, there isn't much use in knowing how fast you are going if you're stuck standing still, but that knowledge can play a critical role in dynamic environments like today's production software systems.

In the cloud native community, we've seen an explosion of tools that let you move faster, usually by raising the level of abstraction and offering new primitives to build with. But at the same time, we're falling behind in investing in tools that help us understand what we're building and how it's performing. It's like we've built a car with a really fast engine but forgot to include a speedometer. Kubernetes, its associated configuration and package management tools, a powerful collection of networking tools, and scalable storage systems all enable us to build and ship faster. But we still often ship code without confidence in our changes (and stand by, ready to roll them back), and when things do break, we often lack the tools to understand why.

While there are a number of projects hosted by the Cloud Native Computing Foundation (CNCF) that can help us understand our software systems, we've got a long way to go. We need tools that align better with the ways we've built our organizations and the use cases of the teams in those organizations.

Feedback Loops

To understand how these tools can fit together - and where the gaps are - let's talk about a feedback loop, a loop that has two halves. On one half of this loop are the tools and processes that let you control what's happening; on the other half are the tools and processes that let you observe what's happening. I wrote "tools and processes" because I want to emphasize that it's really a combination of these - along with the people who use them - that determines whether or not your software systems are able to meet your business goals.

I'm taking these two terms, "control" and "observe," from another field of study that is also about managing complicated systems: control theory. (Yet another case where "observe" missed the top billing!) Control theory is the study of how to build systems with an eye toward optimization and maximizing stability, usually in fields like mechanics or industrial production. In control theory, a system is said to be controllable if the behavior of the system is determined by a small number of inputs and observable if the internal state of that system is captured by a small number of outputs.
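To make those two definitions concrete, here is a minimal sketch of the standard rank tests for one textbook setting: a linear time-invariant system with state update x' = Ax + Bu and output y = Cx. That formulation, the matrices, and every function name below are illustrative assumptions, not something this article specifies. The system is controllable when the stacked matrix [B, AB, ..., A^(n-1)B] has rank n, and observable when the stacked matrix [C; CA; ...; CA^(n-1)] has rank n.

```python
from fractions import Fraction


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rank(m):
    """Matrix rank via Gaussian elimination over exact fractions."""
    rows = [[Fraction(v) for v in row] for row in m]
    r = 0
    ncols = len(rows[0]) if rows else 0
    for c in range(ncols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                f = rows[i][c] / rows[r][c]
                rows[i] = [x - f * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


def is_controllable(A, B):
    """True if [B, AB, ..., A^(n-1)B] has full rank n."""
    n = len(A)
    blocks, term = [], B
    for _ in range(n):
        blocks.append(term)
        term = matmul(A, term)
    # Place the blocks side by side, row by row.
    stacked = [sum((blk[i] for blk in blocks), []) for i in range(n)]
    return rank(stacked) == n


def is_observable(A, C):
    """True if [C; CA; ...; CA^(n-1)] has full rank n."""
    n = len(A)
    rows, term = [], C
    for _ in range(n):
        rows.extend(term)
        term = matmul(term, A)
    return rank(rows) == n


# Assumed toy system: a double integrator with state (position, velocity)
# and a single force input.
A = [[0, 1], [0, 0]]
B = [[0], [1]]
print(is_controllable(A, B))       # one input is enough to steer both states
print(is_observable(A, [[1, 0]]))  # measuring position also reveals velocity
print(is_observable(A, [[0, 1]]))  # measuring velocity alone hides position
```

The last two calls show why the choice of outputs matters: the same system is observable through one measurement and not through another.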

Applying the principles of control theory directly to software systems is a bit tenuous, as the inputs and outputs of software systems are far less constrained by natural law than those of an airplane or a chemical process, but there are some useful parallels to draw. In particular, the tools and processes we choose constrain the ways that we can make changes to - and the ways we observe what's happening in - those production systems.

Understanding Change

Control theory is usually applied in cases where the goal is to create stability for a static system within a dynamic environment, but in software systems, change is pervasive: both the environment and the system are constantly changing. Users demand new features, bug fixes must be applied, and cost-saving optimizations need to go out. Much of the Kubernetes ecosystem is focused on accelerating these changes: on how we control them. We should judge how well we can observe by how well we can understand the effects of those changes. For example, our tools need to be able to answer questions like, "If I deploy a new version of my service, what will change?" Or, more often, questions asked from the opposite perspective: "Performance of the application just changed... but why?"

It's true that we've had tools that let us measure the impact of these changes for a long time. However, two big shifts have occurred. First, our software has become more distributed, meaning that the connection between cause (or causes) and effect is often mediated by half a dozen (or more) network connections. Second, as part of adopting DevOps practices, we've also distributed responsibility more broadly across our organizations. It's not only six different services that may have contributed to a performance degradation, but six different teams.

You may think I'm making an extended pitch for a new "observability" tool - or maybe you're saying that we've had tools that let us "observe" software systems since the beginning - but really what I'm advocating for is a new way of thinking about these tools. Whether you are judging a tool that you've been using for twenty years or evaluating a new one, consider how that tool helps you understand change. When you push a new deployment, you should be looking not just at how your service's performance was affected, but how performance of the application as a whole was affected. When you are responding to a page at 3am, you'll need to understand not just what's changed in your service, but what's happening across the application.
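As a rough illustration of looking past your own service after a deployment, the sketch below compares 95th-percentile latency before and after a change for every service in an application and reports the ones that moved. The service names, the sample values, the 20% threshold, and the helper names are all hypothetical; in practice the samples would come from whatever telemetry backend you already use.

```python
from statistics import quantiles


def p95(samples):
    """95th percentile: the last of the 19 cut points that split data into 20 groups."""
    return quantiles(samples, n=20)[-1]


def latency_shift(before, after, threshold=0.20):
    """Return services whose p95 latency changed by more than `threshold` (relative)."""
    shifts = {}
    for service in before.keys() & after.keys():
        old, new = p95(before[service]), p95(after[service])
        change = (new - old) / old
        if abs(change) > threshold:
            shifts[service] = round(change, 2)
    return shifts


# Hypothetical latency samples in milliseconds, taken around a deployment of "checkout".
before = {"checkout": [100.0] * 20, "inventory": [50.0] * 20}
after = {"checkout": [100.0] * 20, "inventory": [80.0] * 20}

# The deployed service looks unchanged, but a service it depends on got slower.
print(latency_shift(before, after))
```

A check scoped to the deployed service alone would report nothing here, which is the gap the paragraph above describes.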

You might also have heard observability defined as "the combination of metrics, logs, and distributed traces." While there's some truth to that - those certainly are three important data sources - what's more important to understanding change is gathering all of that data from across services and analyzing it holistically. That is, we'll only be able to observe applications effectively if we break down the barriers between different data streams - both different types of data and different data sources.
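One way to picture breaking down those barriers is to join different data types from different services on a shared request identifier. The sketch below is only an assumption about how such data might look: the record fields (`trace_id`, `service`, `duration_ms`, `level`, `message`) and the `correlate` helper are hypothetical and do not come from any particular tool.

```python
from collections import defaultdict


def correlate(spans, logs):
    """Group spans and error logs from every service under their shared trace ID."""
    by_trace = defaultdict(
        lambda: {"services": set(), "duration_ms": 0.0, "errors": []}
    )
    for span in spans:
        entry = by_trace[span["trace_id"]]
        entry["services"].add(span["service"])
        # Simplification: treat the longest span as the end-to-end latency.
        entry["duration_ms"] = max(entry["duration_ms"], span["duration_ms"])
    for log in logs:
        if log["level"] == "ERROR" and log["trace_id"] in by_trace:
            by_trace[log["trace_id"]]["errors"].append(
                (log["service"], log["message"])
            )
    return dict(by_trace)


# Hypothetical records emitted by two services owned by two different teams.
spans = [
    {"trace_id": "t1", "service": "frontend", "duration_ms": 900.0},
    {"trace_id": "t1", "service": "inventory", "duration_ms": 850.0},
    {"trace_id": "t2", "service": "frontend", "duration_ms": 40.0},
]
logs = [
    {"trace_id": "t1", "service": "inventory", "level": "ERROR", "message": "db timeout"},
    {"trace_id": "t2", "service": "frontend", "level": "INFO", "message": "ok"},
]

for trace_id, summary in correlate(spans, logs).items():
    print(trace_id, sorted(summary["services"]), summary["duration_ms"], summary["errors"])
```

Viewed separately, the slow frontend span and the inventory error are two unrelated facts; joined on the trace ID, they become one explanation.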

Unifying Data

I'm happy to say that OpenTelemetry, a CNCF project which captures telemetry from software applications, is a great step in the right direction. By unifying metrics, logs, and traces behind a single set of APIs and SDKs, it makes combining these data streams easier than ever. And by providing a standard way of doing so, it makes it easier to rally teams across your organization to adopt a common way of collecting and managing them.

But having metrics, logs, and distributed traces is not the same as using them effectively. Helpfully, OpenTelemetry also makes it easy to evaluate different options. As you do so, consider how these tools - the tools you need to observe your software - complement the tools that you are using to control your software. For each kind of change that you plan to make, how will you measure and explain the impact of that change? Or maybe more importantly, how will one of the other teams in your organization do so?


To learn more about containerized infrastructure and cloud native technologies, consider joining us at KubeCon + CloudNativeCon NA Virtual, November 17-20.

About the Author

Daniel "Spoons" Spoonhower is CTO and a co-founder at Lightstep, where he's building performance management tools for deep software systems. He is an author of Distributed Tracing in Practice (O'Reilly Media, 2020). Previously, Spoons spent almost six years at Google where he worked as part of Google's infrastructure and Cloud Platform teams. He has published papers on the performance of parallel programs, garbage collection, and real-time programming. He has a PhD in programming languages from Carnegie Mellon University but still hasn't found one he loves.

Published Tuesday, November 03, 2020 7:35 AM by David Marshall