Virtualization Technology News and Information
Observability for Cloud-Native Organizations

By Austin Parker, Head of Developer Relations at Lightstep

The cloud-native community's adoption of observability tools, techniques, and practices shows no sign of slowing down. It's been nearly two years since the Observability Radar indicated early assessment of projects like OpenTelemetry, and more recent data found adoption approaching fifty percent of organizations surveyed. In that time, various OpenTelemetry signals have become stable, including the Metrics API, bringing with them a truly next-generation application and system telemetry library. OpenTelemetry promises to unify observability signals across cloud-native applications, but this raises the question: what's next?

The future of observability for cloud-native organizations is to step beyond traditional, "query-and-dashboard" style monitoring, and towards a holistic view of their systems. OpenTelemetry is the foundation of this, but it doesn't stand alone -- indeed, it's the very start of the journey.

As the foundation for observability, OpenTelemetry ensures that application and system telemetry are contextually linked, where all data can be correlated and tied back to specific end-user interactions and experiences with a system. This foundation allows us to not only use telemetry and monitoring to pinpoint incidents and remediate them, but to model and understand our systems. This is a must as our applications grow in size and complexity, and as the organizations and teams that support and maintain those applications become increasingly distributed.

Beyond this point, the future becomes murkier -- in many ways, our current understanding of observability is mostly obsessed with figuring out how to slice and dice ever larger amounts of telemetry data collected from more and more systems. It's not enough to just increase the signal and reduce the noise -- we need to be more deliberate about what we're using data for, how it's stored and retained, and how we analyze it. As a cloud-native organization's observability practice matures, the organization begins to transform its SDLC into a new, revolutionary model of software delivery. This revolution hinges on observability as a true cross-cutting concern throughout every aspect of development, deployment, operations, and iteration.

What does this mean in practice? First, we recognize that it's not enough to simply slap a new signal on our pile and call it observability. Adding another tool to the mix isn't the goal here. Cloud-native observability requires new fundamental visualizations and measures. The first of these is SLO-First Monitoring. SLOs allow us to monitor our system reliability in the context of how that reliability tracks towards business objectives and value; in other words, they help us understand how our system is working from the perspective of what our end-users are doing. Monitoring SLOs rather than individual SLIs helps us, as engineers, communicate expected reliability measures and quantify the value of our performance and platform engineering in terms of business impact. Better end-user experience, reduced time-to-value, and improved release velocity are all quantifiable metrics under an SLO-based monitoring framework.
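To make the SLO-versus-SLI distinction concrete, here is a minimal Python sketch. It assumes a simple request-based SLO over a single window; the names `SLO`, `sli`, and `error_budget_remaining` are hypothetical and chosen for illustration, not taken from OpenTelemetry, OpenSLO, or any vendor API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical request-based objective, e.g. target=0.99 means 99% good."""
    name: str
    target: float


def sli(good: int, total: int) -> float:
    """The raw indicator: fraction of good events in the window.

    An empty window is treated as fully compliant (an assumption).
    """
    return 1.0 if total == 0 else good / total


def error_budget_remaining(slo: SLO, good: int, total: int) -> float:
    """Share of the error budget left: 1.0 is untouched, below 0 is overspent."""
    allowed_bad = (1.0 - slo.target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        # No budget at all (100% target or empty window).
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


checkout = SLO(name="checkout-availability", target=0.99)
print(sli(9950, 10000))
print(error_budget_remaining(checkout, 9950, 10000))
```

The SLI alone (99.5% of requests succeeded) says little about whether to act. Framed against the objective, the same window shows that roughly half of the error budget has been spent, which is the kind of statement that can be discussed in terms of business impact.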

The other practice to adopt is Model-Based Workflows. This practice suggests that we reconsider our observability practice, moving from a mass of aggregated signals that we interrogate in order to understand point failures to a pre-emptive tool that helps engineers build and sustain models of a system and how it works. The inspiration for this comes from safety culture -- specifically, 'Safety I' vs. 'Safety II'. Safety I is concerned primarily with how to reduce faults and errors (or 'incidents'). It holds that you can remove error from a system and, in doing so, make that system safe. This sounds appealing and even desirable, but there are faults in this conception. Errors are not, primarily, caused by human intervention or lack thereof. In cloud-native systems, errors and faults are mostly caused by unexpected and emergent interactions between components and subsystems. 'Safety II' aligns with this by asking practitioners to build a safe system by analyzing and understanding the conditions that exist when things go right. If we understand how the system works well, we can work towards replicating those conditions. For cloud-native systems, this means we replace our traditional line and bar graphs with visualizations that reflect the actual underlying system -- service maps, heat maps, and correlation maps, among others.
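As a rough illustration of how a service map can fall out of contextually linked telemetry, the Python sketch below derives service-to-service edges from parent/child span relationships. The `Span` record and `service_map` function are hypothetical simplifications, not the OpenTelemetry data model or SDK; real spans carry many more fields.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class Span(NamedTuple):
    """Hypothetical, pared-down span: just enough to link callers to callees."""
    span_id: str
    parent_id: Optional[str]
    service: str


def service_map(spans: Iterable[Span]) -> Counter:
    """Count calls between services by following parent -> child span links."""
    spans = list(spans)
    by_id = {s.span_id: s for s in spans}
    edges: Counter = Counter()
    for s in spans:
        parent = by_id.get(s.parent_id)
        # Skip root spans, orphaned spans, and calls that stay inside one service.
        if parent is not None and parent.service != s.service:
            edges[(parent.service, s.service)] += 1
    return edges


trace = [
    Span("a", None, "frontend"),
    Span("b", "a", "checkout"),
    Span("c", "b", "checkout"),   # internal span, no cross-service edge
    Span("d", "c", "payments"),
    Span("e", "b", "inventory"),
]
for (caller, callee), calls in sorted(service_map(trace).items()):
    print(f"{caller} -> {callee}: {calls}")
```

The resulting edges describe how the system actually behaves when requests succeed, which is the raw material for the model-oriented visualizations described above.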

While we're not yet at the point where the tools exist to make these journeys easy, they're being built. The progress made by projects like OpenTelemetry in a few short years should give the cloud-native world hope that we can continue to innovate in the field of observability over the coming months and years. Already, we're seeing projects like OpenSLO work to define open source standards for parts of this transformation, and more are sure to come. In the meantime, I'd like to ask you to join me in defining and shaping the concept of cloud-native observability in this repository, where we're building and expanding on the concepts outlined above as practitioners.


***To learn more about containerized infrastructure and cloud native technologies, consider joining us at KubeCon + CloudNativeCon Europe 2022, May 16-20.


Austin Parker is the Head of Developer Relations at Lightstep, and has been creating problems with computers for most of his life. He's a maintainer of the OpenTelemetry project, the host of several podcasts, organizer of Deserted Island DevOps, infrequent Twitch streamer, conference speaker, and more. When he's not working, you can find him posting on Twitter, cooking, and parenting. His most recent book is Distributed Tracing in Practice, published by O'Reilly Media.

Published Monday, May 09, 2022 7:31 AM by David Marshall