By John Hayes, Senior Product Marketing
Manager, SquaredUp
Because we can't quite agree what it is
Many vendors I talk to are not even sure whether
their product is an observability product, while still others, whose products
almost certainly are, don't want to market them as
observability products because they think the space is becoming saturated.
We have now more or less reached a consensus on what observability does not mean
- i.e. that observability is not just about the three pillars of logs, metrics
and traces. At the same time, though, there is not much consensus about what it
does mean.
At the empirical end of the spectrum there are those who define observability as
processing telemetry, while at the more abstract end there are positions such
as "observability is about being able to ask questions of your
system". You can have your observability burger your way! Some of us want
to know if our K8S pods are overheating, and some of us want to understand our
sales pipeline. But that's actually ok. As we say, it's a big tent.
Because we want to know about the past
What happened? Why did that system go
down? If we want to answer that question, then we obviously need to have
historical data. The problem is that modern computer systems generate huge
volumes of telemetry. Since we never know in advance which system might
fail, we play it safe and cover all our bases. This gives us a warm feeling, but
it also means that we have the engineering problem of ingesting that data and
the economic problem of paying for it. To be honest, to an extent, the
engineering problem of mass ingestion has been solved. Unfortunately, the
problem of querying that mountain of data so that we can quickly make sense of
it is still hard.
Because we really want to know about the future
But it's still not really possible.
Ultimately, as an SRE or a DevOps engineer, I would rather not have to figure
out what went wrong. I would really like a system that would be able to help me
prevent the outage in the first place. Sure, it's great that my system can
alert me that a pod went down with an OOM Kill, but if I'm paying a six- or
seven-figure sum for a state-of-the-art system - can't it actually figure this
stuff out pre-emptively? I mean, I have fed it several petabytes of historical
data - why can't it be a bit more predictive? At the moment, it turns out that
this is still a hard problem.
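For contrast, here is what the reactive version of that OOM example looks like in practice - a minimal sketch, assuming the official Kubernetes Python client (the function name and print format are illustrative, not from any particular product). Everything it reports has, by definition, already gone wrong; the predictive counterpart is the part that remains hard.

from kubernetes import client, config

def find_oom_killed_pods():
    """Report containers whose most recent termination was an OOM kill."""
    config.load_kube_config()  # use config.load_incluster_config() inside a cluster
    v1 = client.CoreV1Api()
    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        for status in pod.status.container_statuses or []:
            # last_state records the previous termination, if any
            terminated = status.last_state.terminated if status.last_state else None
            if terminated and terminated.reason == "OOMKilled":
                print(f"{pod.metadata.namespace}/{pod.metadata.name} "
                      f"container {status.name} OOMKilled at {terminated.finished_at}")

if __name__ == "__main__":
    find_oom_killed_pods()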
Because we want to find root causes
Root cause analysis is a seductive phrase
but, in practice, it is something of a chimera. Unfortunately, the mechanics of
cause and effect are not always visible. Even more unfortunately, causes
themselves are not always mechanical. In loosely coupled, complex and highly
distributed systems, they can sometimes only be inferred. And, as a growing
body of theory tells us, failures in complex systems are often not mono-causal.
Maybe it is even the case that attempting to pinpoint a single cause for an
error condition is a particular prejudice of human inquiry rather than a
self-evidently correct approach.
Because the systems under observation are complex
Complexity is pretty much taken for granted
in modern IT landscapes. It is part of the wallpaper. And then we pile one
layer of complexity on top of another. Distributed services, interconnected
network topologies, enterprise messaging backbones. They are designed by clever
people and they do complex, high-tech stuff. The terabytes of logs, metrics,
traces, events, etc., generated by these systems don't tell a story by themselves.
Turning these huge heaps of data points into meaningful analyses requires deep
domain knowledge as well as heavy-duty engineering and some highly skilled UI
design. The complexity of observability systems is a function of the complexity
they observe.
Because we don't have enough specialist knowledge
Very few organisations actually have
dedicated observability specialists - at best they may have one or more DevOps
or Platform engineers who have some familiarity with some aspects of one or
two observability products. Most organisations don't have staff with experience
and expertise in instrumenting their systems optimally or in the best ways to
filter, forward, consolidate and configure that telemetry. Traversing a wide
and unfamiliar landscape without a map or a compass is not a simple endeavour.
Because we don't have enough time
Procuring a unified observability system is
actually a considerable undertaking. Effectively, you cannot really evaluate
the system without re-instrumenting some of your existing services and
infrastructure. It is not easy to do this without impacting an existing
environment or spinning up a new one. You will also have to coordinate across
multiple teams and set up a whole variety of testing scenarios. Most
organisations simply do not have the time to go through this process over and
over again so that they can compare vendor tools. Often this means that customers
don't end up with the best tool for the job.
Because the narrative is shaped by vendors
If you are a developer and you want to know
about OOP or the principles of RESTful APIs, the chances are that the
canonical texts are not written by a particular vendor. More likely they are
the product of collaborations by networks of subject matter experts or reflect
a coalescence of academic traditions.
In the observability realm, much of the
narrative-making tends to have a more vendor-led feel. Those vendors, though,
are often focused on the concerns of big-ticket clients. The "problems",
therefore, are often framed as petabyte-scale ingestion or
cardinality explosions or traces with tens of thousands of spans. This results
in heroic feats of engineering that captivate audiences at conferences (me
included!), but it may not reflect the actual day-to-day concerns of many
practitioners.
But there are positives...
Organisations that have adopted
OpenTelemetry will not have to go through the pain of re-instrumenting their
code in order to evaluate or switch to a new vendor. Equally, eBPF offers the
possibility of zero-code instrumentation for companies running compatible
workloads. And, of course, AI tooling holds out the prospect of automation to
help mitigate the observability skills gap. Will this lead us to a
technological nirvana of systems functioning in perfect harmony? Hopefully not;
the hard problems are the best ones.
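To illustrate the OpenTelemetry point, here is a rough sketch using the standard opentelemetry-sdk and OTLP exporter packages. The OTEL_ENDPOINT variable and the "checkout-service" name are hypothetical; the point is that the instrumentation never names a vendor, so evaluating or switching backends means repointing one exporter rather than rewriting code.

import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Vendor A today, vendor B tomorrow - only this endpoint changes.
exporter = OTLPSpanExporter(
    endpoint=os.getenv("OTEL_ENDPOINT", "localhost:4317"),
    insecure=True,  # plain gRPC for a local collector; use TLS in production
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

with tracer.start_as_current_span("process-order"):
    pass  # the instrumented business logic stays exactly the same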
To learn more about Kubernetes and the
cloud native ecosystem, join us at KubeCon + CloudNativeCon EU, in London, England, on April 1-4, 2025.
ABOUT THE AUTHOR
John Hayes has been
an IT professional for over 25 years. He spent many years as a software
developer before switching to DevOps and then specialising in observability. He
is a Senior Product Marketing Manager at SquaredUp and also publishes the
Observability 360 newsletter.