By Austin Parker, Head of Developer Relations at Lightstep
The cloud-native community's adoption of
observability tools, techniques, and practices isn't showing any signs of
slowing down. It's been nearly two years since the Observability
Radar marked projects like OpenTelemetry as being in early assessment, but more
recent data shows adoption approaching fifty percent of organizations surveyed.
In that time, various OpenTelemetry signals have become
stable, including the Metrics API, bringing with it a truly
next-generation application and system telemetry library. OpenTelemetry
promises to unify observability signals across cloud-native applications, but
this raises the question: what's next?
The future of observability for cloud-native
organizations is to step beyond traditional "query-and-dashboard" monitoring
and towards a holistic view of their systems. OpenTelemetry is the
foundation of this, but it doesn't stand alone -- indeed, it's the very start
of the journey.
As the foundation for observability,
OpenTelemetry ensures that application and system telemetry are contextually
linked, where all data can be correlated and tied back to specific end-user
interactions and experiences with a system. This foundation allows us not only
to use telemetry and monitoring to pinpoint and remediate incidents, but also
to model and understand our systems. This is a must as our applications grow in
size and complexity, and as the organizations and teams that support and
maintain those applications become increasingly distributed.
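To make this concrete, here's a minimal sketch of contextually linked telemetry using the OpenTelemetry Python SDK; the service, span, and attribute names are illustrative, not prescriptive:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire the SDK to print finished spans to stdout for demonstration.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-example")  # hypothetical instrumentation name

# The parent span models the end-user interaction; the child span is
# linked to it through the active context, so both can be correlated
# back to this single request.
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("user.id", "u-123")  # hypothetical attribute
    with tracer.start_as_current_span("charge-card"):
        pass  # downstream work inherits the trace context
```

Because the child span inherits the parent's trace context, every downstream signal can be correlated back to the originating end-user interaction.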
Beyond this point, the future becomes murkier
-- in many ways, our current approach to observability is mostly obsessed
with figuring out how to slice and dice ever-larger amounts of telemetry data
being collected from more and more systems. It's not enough to just increase
the signal and reduce the noise -- we need to be more deliberate about what
we're using data for, how it's stored and retained, and how we analyze it. As
cloud-native organizations' observability practices mature, they begin to
transform their SDLC into a new, revolutionary model of software delivery.
This revolution hinges on observability as a true cross-cutting concern
throughout every aspect of development, deployment, operations, and iteration.
What does this mean in practice? First, we
recognize that it's not enough to simply slap a new signal on our pile and call
it observability. Adding another tool to the mix isn't the goal here.
Cloud-native observability requires fundamentally new visualizations and measures.
The first of these is SLO-First
Monitoring. SLOs allow us to monitor system reliability in the context
of how that reliability tracks towards business objectives and value; in other
words, they help us understand how our system is working from the perspective of
what our end users are doing. Monitoring SLOs rather than individual SLIs helps
us, as engineers, communicate expected reliability and quantify the value of
our performance and platform engineering in terms of business impact. Better
end-user experience, reduced time-to-value, and improved release velocity are
all quantifiable metrics under an SLO-based monitoring framework.
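As a hedged illustration of the arithmetic behind SLO-first monitoring, consider a 99.9% availability target over a 30-day window; the request counts below are hypothetical:

```python
# Hypothetical numbers for a 99.9% availability SLO over 30 days.
WINDOW_MINUTES = 30 * 24 * 60   # 43,200 minutes in the rolling window
SLO_TARGET = 0.999              # 99.9% of requests should succeed

good_requests = 9_985_000       # illustrative SLI measurements
total_requests = 10_000_000

sli = good_requests / total_requests           # observed reliability: 0.9985
error_budget = 1.0 - SLO_TARGET                # allowed unreliability: 0.001
budget_consumed = (1.0 - sli) / error_budget   # fraction of budget spent

allowed_downtime = WINDOW_MINUTES * error_budget  # ~43.2 minutes per window

print(f"SLI: {sli:.4%}")
print(f"Error budget consumed: {budget_consumed:.0%}")  # 150% -> SLO breached
print(f"Allowed downtime this window: {allowed_downtime:.1f} min")
```

The point is that an SLO turns raw SLIs into a budget that both engineers and the business can reason about.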
The other practice to adopt is Model-Based Workflows. This practice
suggests that we transform our observability practice from a mass of
aggregated signals that we interrogate to understand point failures into a
preemptive tool that helps engineers build and sustain models of a system and
how it works. The inspiration for this comes from safety culture --
specifically, "Safety I" vs. "Safety II". Safety I is concerned primarily with
how to reduce faults and errors (or "incidents"). This view holds that you can
remove error from a system and, in doing so, make that system safe. This
sounds appealing and even desirable, but there are flaws in this conception.
Errors are not primarily caused by human intervention or the lack thereof; in
cloud-native systems, errors and faults are mostly caused by unexpected and
emergent interactions between components and subsystems. Safety II aligns with
this view by asking practitioners to build a safe system through analyzing and
understanding the conditions that exist when things go right. If we understand
how the system works when it works well, we can work towards replicating those
conditions. For cloud-native systems, this means replacing our traditional
line and bar graphs with visualizations that reflect the actual underlying
system -- service maps, heat maps, and correlation maps, among others.
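As a sketch of what such a visualization is built from (not tied to any specific tool), the snippet below derives a service map from flattened trace data; the span and service names are hypothetical:

```python
from collections import defaultdict

# Hypothetical flattened spans: (span_id, parent_id, service_name).
spans = [
    ("a1", None, "frontend"),
    ("b2", "a1", "checkout"),
    ("c3", "b2", "payments"),
    ("d4", "b2", "inventory"),
]

# Each parent->child span pair that crosses a service boundary
# becomes an edge in the service map.
service_of = {span_id: svc for span_id, _, svc in spans}
edges = defaultdict(int)
for span_id, parent_id, svc in spans:
    if parent_id and service_of[parent_id] != svc:
        edges[(service_of[parent_id], svc)] += 1

for (caller, callee), calls in sorted(edges.items()):
    print(f"{caller} -> {callee} ({calls} call(s))")
```

The same trace data that answers point-in-time queries can, viewed this way, sustain a living model of how the system's components actually interact.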
While we're not yet at the point where the tools exist to make these journeys
easy, they're being built. The progress made by projects like OpenTelemetry
in a few short years should give the cloud-native world hope that we can
continue to innovate in the field of observability over the coming months and
years. Already, we're seeing projects like OpenSLO work
to define open source standards for parts of this transformation, and more are
sure to come. In the meantime, I'd like to ask you to join me in defining and
shaping the concept of cloud-native observability in this repository, where
practitioners are building on and expanding the concepts outlined above.
ABOUT THE AUTHOR
Austin Parker is the Head of Developer
Relations at Lightstep, and has been creating problems with computers for most
of his life. He's a maintainer of the OpenTelemetry project, the host of
several podcasts, organizer of Deserted Island DevOps, infrequent Twitch
streamer, conference speaker, and more. When he's not working, you can find him
posting on Twitter, cooking, and parenting. His most recent book is Distributed
Tracing in Practice, published by O'Reilly Media.