By Daniel "Spoons" Spoonhower, CTO and
Co-founder, Lightstep
Innovation is often measured in terms of how
fast we are going. But less often do we talk about innovation in terms of knowing how fast we are going. Of course
there isn't much use in knowing how fast you are going if you're stuck standing
still, but that knowledge can play a critical role in dynamic environments like
today's production software systems.
In the cloud native community, we've seen an
explosion of tools that let you move faster, usually by raising the level of
abstraction and offering new primitives to build with. But at the same time,
we're falling behind in investing in tools that help understand what we're
building and how it's performing. It's like we've built a car with a really
fast engine but forgot to include a speedometer. Kubernetes, its associated
configuration and package management tools, a powerful collection of networking
tools, and scalable storage systems all enable us to build and ship
faster. But we still often ship code without confidence in our changes (and
stand by, ready to roll them back); when things do break, we often lack the
tools to understand why.
While there are a number of projects hosted by
the Cloud Native Computing Foundation (CNCF) that can help us understand our
software systems, we've got a long way to go. We need tools that align better
with the ways we've built our organizations and the use cases of the teams in
those organizations.
Feedback Loops
To understand how these tools can fit together
- and where the gaps are - let's talk about a feedback loop, a loop that has
two halves. On one half of this loop are the tools and processes that let you control what's happening; on the other
half are the tools and processes that let you observe what's happening. I wrote "tools and processes" because I
want to emphasize that it's really a combination of these - along with the
people that use them - that determines whether or not your software systems are
able to meet your business goals.
I'm taking these two terms, "control" and
"observe," from another field of study that is also about managing complicated
systems: control theory. (Yet another case where "observe" missed the top
billing!) Control theory is the study of how to build systems with an eye
toward optimization and maximizing stability, usually in fields like mechanics
or industrial production. In control theory, a system is said to be controllable if the behavior of the
system is determined by a small number of inputs and observable if the internal state of that system is captured by a
small number of outputs.
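(For the curious: the textbook, linear-systems version of those definitions is sketched below. Nothing later in this article depends on it.)

```latex
% A minimal sketch of the standard (Kalman) definitions for a linear
% time-invariant system; nothing here is specific to software systems.
\[
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t)
\]
% Controllable: the controllability matrix has full rank n.
\[
\operatorname{rank}\begin{bmatrix} B & AB & \cdots & A^{n-1}B \end{bmatrix} = n
\]
% Observable: the observability matrix has full rank n.
\[
\operatorname{rank}\begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix} = n
\]
```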
Applying the principles of control theory
directly to software systems is a bit tenuous, as the inputs and outputs of
software systems are far less constrained by natural law than those of an
airplane or a chemical process, but there are some useful parallels to draw. In
particular, the tools and processes we choose constrain the ways that we can
make changes to - and the ways we observe what's happening in - those
production systems.
Understanding Change
Control theory is usually applied in cases
where the goal is to create stability for a static
system within a dynamic environment,
but in software systems, change is pervasive: both the environment and the system are constantly changing.
Users demand new features, bug fixes must be applied, and cost-saving
optimizations need to go out. Much of the Kubernetes ecosystem is focused on
accelerating these changes: on how we control
them. The efficacy of our ability to observe should be measured by how well we can understand the
effects of those changes. For example, our tools need to be able to answer
questions like, "If I deploy a new version of my service, what will change?" Or
more often questions asked from the opposite perspective: "Performance of the
application just changed... but why?"
It's true that we've had tools that let us
measure the impact of these changes for a long time. However, two big shifts
have occurred. First, our software has become more distributed, meaning that
the connection between cause (or causes) and effect is often mediated by half a
dozen (or more) network connections. Second, as part of adopting DevOps
practices, we've also distributed responsibility
more broadly across our organizations. It's not only that six different
services may have contributed to a performance degradation, but six different teams.
You may think I'm making an extended pitch for
a new "observability" tool - or maybe you're saying that we've had tools that
let us "observe" software systems since the beginning - but really what I'm
advocating for is a new way of thinking about these tools. Whether you are
judging a tool that you've been using for twenty years or evaluating a new one,
consider how that tool helps you understand change. When you push a new
deployment, you should be looking not just at how your service's performance
was affected, but how performance of the application as a whole was affected.
When you are responding to a page at 3am, you'll need to understand not just
what's changed in your service, but what's happening across the application.
You might also have heard observability
defined as "the combination of metrics, logs, and distributed traces." While
there's some truth to that - those certainly are three important data sources -
what's more important to understanding change is gathering all of that data
from across services and analyzing it holistically. That is, we'll only be able
to observe applications effectively if we break down the barriers between
different data streams - both different types of data and different data
sources.
Unifying Data
I'm happy to say that OpenTelemetry,
a CNCF project which captures telemetry from software applications, is a great
step in the right direction. By unifying metrics, logs, and traces behind a
single set of APIs and SDKs, it's never been easier to combine these data
streams. And by providing a standard way of doing so, it's easier to rally
teams across your organization to adopt a common way of collecting and managing
them.
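To make that concrete, here's a minimal sketch of what the unified API looks like in Python. The service name `checkout`, the span, and the counter are illustrative, and the logs API follows the same pattern in languages where it's available:

```python
from opentelemetry import trace, metrics

# One API surface for traces and metrics; without an SDK configured,
# these calls are safe no-ops, so instrumentation can ship independently.
tracer = trace.get_tracer("checkout")   # illustrative instrumentation name
meter = metrics.get_meter("checkout")

orders_processed = meter.create_counter(
    "orders_processed",
    description="Number of orders handled by the checkout service",
)

def process_order(order_id: str) -> None:
    # The span and the metric share the same attribute vocabulary,
    # which is what makes cross-stream analysis possible later.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        orders_processed.add(1, {"payment.method": "card"})

process_order("12345")
```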
But having metrics, logs, and distributed
traces is not the same as using them effectively. Helpfully, OpenTelemetry also
makes it easy to evaluate different options. As you do so, consider how these
tools - the tools you need to observe your software - complement the tools that
you are using to control your software. For each kind of change that you plan
to make, how will you measure and explain the impact of that change? Or maybe
more importantly, how will one of the other
teams in your organization do so?
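One pattern that helps answer that question (a sketch, not the only way): stamp every piece of telemetry with the version that produced it, so a deploy becomes a dimension you can compare across. In OpenTelemetry, resource attributes serve that purpose; the service name, version, and console exporter below are illustrative:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Describe the service once; every span it emits carries these attributes,
# so a backend can slice performance by version before and after a deploy.
resource = Resource.create({
    "service.name": "checkout",             # illustrative
    "service.version": "1.4.2",             # ideally injected by your release pipeline
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
```

With something like this in place, "what changed when we deployed?" becomes a query grouped by version rather than a guess.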
To learn more about containerized infrastructure and cloud native technologies, consider joining us at KubeCon + CloudNativeCon NA Virtual, November 17-20.
About the Author
Daniel "Spoons" Spoonhower, CTO and
Co-founder, Lightstep
Daniel "Spoons" Spoonhower is CTO and a co-founder
at Lightstep, where he's building performance management tools for deep
software systems. He is an author of Distributed Tracing in Practice (O'Reilly
Media, 2020). Previously, Spoons spent almost six years at Google where he
worked as part of Google's infrastructure and Cloud Platform teams. He has
published papers on the performance of parallel programs, garbage collection,
and real-time programming. He has a PhD in programming languages from Carnegie
Mellon University but still hasn't found one he loves.