With the rise of cloud native development, observability and monitoring are
often used interchangeably. Monitoring has been around for ages and needs no
formal introduction. Observability, according to the CNCF Glossary, is the
property of a system that defines the degree to which it can generate
actionable insights. These insights help users understand the system's state
and enable them to take corrective action if needed.
As a natural evolution of traditional data collection methods, observability
leverages several specialized tools to track and measure low-level signals from
the system. Some of these tools are listed in the observability and analysis
section of the cloud native landscape.
However, for cloud native systems, relying solely on raw observability data to
troubleshoot and resolve issues is insufficient. Their architectures are not
only more complex and distributed but also ephemeral, unlike traditional
infrastructure, which is long-lived and mutable. This is where the capabilities
introduced by topology-based observability shine.
What is topology-based observability?
Topology-based
observability is an approach that not only tracks metrics, logs, and traces but
also maps how the components of a system interact with each other. This allows
users to visualize and analyze the relationships between different components
in a system, giving them a more contextual understanding of performance and
reliability issues.
Fundamentally, the approach answers the following three questions:
- How do the different services
within a workload interact with each other?
- What are their dependencies?
- Where are the potential failure
points or bottlenecks?
Answering these
questions empowers organizations to make more informed decisions about service
recovery during an outage and architectural improvements.
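To make the three questions concrete, here is a minimal sketch that models a workload as a directed call graph and queries it. The service names and the `CALLS` mapping are hypothetical, and ranking services by how many others depend on them is only one simple heuristic for spotting potential bottlenecks, not a definitive method.

```python
from collections import defaultdict

# Hypothetical call graph: each key calls the services in its list.
CALLS = {
    "frontend": ["cart", "catalog"],
    "cart": ["db"],
    "catalog": ["db"],
    "db": [],
}


def dependencies(service, calls):
    """Return every service that `service` depends on, directly or transitively."""
    seen = set()
    stack = list(calls.get(service, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(calls.get(current, []))
    return seen


def dependents_count(calls):
    """Count how many services transitively depend on each service."""
    counts = defaultdict(int)
    for service in calls:
        for dep in dependencies(service, calls):
            counts[dep] += 1
    return dict(counts)


def likely_bottlenecks(calls):
    """Rank services by how many others depend on them, most depended-on first."""
    counts = dependents_count(calls)
    return sorted(counts, key=lambda s: (-counts[s], s))


print(dependencies("frontend", CALLS))   # what does frontend depend on?
print(likely_bottlenecks(CALLS))         # which services are shared by many others?
```

In this toy graph, `db` ranks first because every other service reaches it, which is the kind of shared dependency a topology view is meant to surface.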
How does topology-based observability help?
In the world of cloud
native, where dynamic, microservices-based workloads are the norm,
topology-based observability offers several advantages that are in sync with
these modern practices.
Improved recovery times and root cause analyses
In cloud-native environments, failures can cascade quickly through
interconnected services, making it difficult to pinpoint the problem area. This
makes root cause analysis challenging and severely impacts service recovery
times. Topology-based observability allows teams to visualize service
dependencies and interconnections between components, helping them quickly
identify and isolate faults. What's more, it empowers them to adapt their
monitoring strategies dynamically by providing real-time updates on changes in
service interactions and dependencies as they occur. These updates can also
include insights about vulnerabilities and malicious actors in the threat
landscape.
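As a rough illustration of how a dependency map narrows down a cascading failure, the sketch below keeps only the alerting services whose own direct dependencies are healthy. The `CALLS` graph, the `alerts` set, and the function name are hypothetical, and real tooling would weigh far more signals than this.

```python
def root_cause_candidates(calls, unhealthy):
    """Return unhealthy services none of whose direct dependencies are unhealthy.

    A service that alerts while one of its dependencies is also alerting is
    more likely a victim of the cascade than its origin.
    """
    return sorted(
        service
        for service in unhealthy
        if not any(dep in unhealthy for dep in calls.get(service, []))
    )


# Hypothetical call graph: each key calls the services in its list.
CALLS = {
    "frontend": ["cart", "catalog"],
    "cart": ["db"],
    "catalog": ["db"],
    "db": [],
}

# Hypothetical set of services currently raising alerts.
alerts = {"frontend", "cart", "db"}

print(root_cause_candidates(CALLS, alerts))  # ['db']
```

Three services are alerting, but only `db` has no failing dependency beneath it, so it is the first place to look.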
Enhanced contextual information without reinventing the wheel
While projects like OpenTelemetry
provide the raw telemetry data required for observability, such as metrics,
logs, and traces, topology-based observability adds a layer of context to the
existing information by illustrating how the different components within a
workload are connected. This visualization helps teams discover where issues
occur and pinpoint the underlying problem by presenting the state of the
dependencies and interactions between various services.
Effective resource optimization
Cost and resource management are crucial to developing and maintaining
infrastructure efficiently, whether cloud-native or otherwise. By augmenting
traditional observability metrics with a clear understanding of service
dependencies, topology-based observability enables organizations to make
informed scaling decisions, ensuring that only the required resources are used
without falling into the trap of over-provisioning.
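One way to picture a dependency-aware scaling decision is to propagate the entry-point request rate through the call graph and size each service from its estimated load. The sketch below assumes an acyclic topology. The `FANOUT` figures, the capacity numbers, and both function names are made up for illustration.

```python
import math
from collections import defaultdict

# Hypothetical topology: FANOUT[a][b] is the average number of calls
# service a makes to service b for each request a handles.
FANOUT = {
    "frontend": {"cart": 1.0, "catalog": 2.0},
    "cart": {"db": 1.0},
    "catalog": {"db": 0.5},
    "db": {},
}


def estimate_load(fanout, entry, entry_rps):
    """Estimate requests per second per service; assumes the graph has no cycles."""
    load = defaultdict(float)

    def visit(service, rps):
        load[service] += rps
        for dep, calls_per_request in fanout.get(service, {}).items():
            visit(dep, rps * calls_per_request)

    visit(entry, entry_rps)
    return dict(load)


def replicas_needed(load, capacity_rps):
    """Return the replica count per service, given per-replica capacity in rps."""
    return {
        service: max(1, math.ceil(rps / capacity_rps[service]))
        for service, rps in load.items()
    }


load = estimate_load(FANOUT, "frontend", 100.0)
capacity = {"frontend": 150, "cart": 150, "catalog": 150, "db": 150}
print(load)
print(replicas_needed(load, capacity))
```

The point of the sketch is that `db` load is derived from its callers rather than guessed, so scaling follows actual dependency paths instead of a blanket margin.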
Implementing topology-based observability
As is evident,
leveraging topology-based observability can significantly enhance an
organization's ability to monitor and troubleshoot complex cloud-native
applications. Let's look at how you can leverage existing tools in the open
source landscape to implement topology-based observability for your workloads.
Define your observability goals
Before discussing any
tools or their implementation, you must ascertain what metrics, logs, and
traces are most relevant to your context. Always allow your goals to define
your tool choice and not vice versa.
Choose your core observability stack
Of course, one look at the number of options on the CNCF Landscape is enough to
intimidate even the steeliest of SREs. However, setting up a basic
observability stack involves fulfilling the following requirements:
- Collecting raw telemetry data with
projects like OpenTelemetry
- Storing the collected data
- Visualizing the data and creating
separate views with projects like Grafana
- Analyzing data
Create and visualize dependency maps
However, what about the added layer of context? This is where projects like
Jaeger and GUAC (Graph for Understanding Artifact Composition) shine. By
presenting information about service and application dependencies,
respectively, these projects enable you to create dependency maps that add a
rich layer of contextual information over and above the raw telemetry data you
collected with the basic observability stack. What's even better, you can also
plug these insights into a Grafana dashboard for a visual representation of
these dependencies.
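To show the general idea behind a service dependency map, the sketch below derives caller-to-callee edges from parent and child spans. The span records are hypothetical and heavily simplified, and this is not how Jaeger or GUAC compute their graphs. It only illustrates how trace data can yield topology.

```python
from collections import Counter

# Hypothetical, simplified span records; real tracing data carries many more fields.
SPANS = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a", "service": "cart"},
    {"span_id": "c", "parent_id": "b", "service": "db"},
    {"span_id": "d", "parent_id": "a", "service": "frontend"},  # internal span
    {"span_id": "e", "parent_id": "d", "service": "catalog"},
]


def dependency_edges(spans):
    """Count (caller, callee) pairs wherever a child span runs in a different service."""
    by_id = {span["span_id"]: span for span in spans}
    edges = Counter()
    for span in spans:
        parent = by_id.get(span["parent_id"])
        # Skip root spans and spans that stay inside the same service.
        if parent is not None and parent["service"] != span["service"]:
            edges[(parent["service"], span["service"])] += 1
    return edges


for (caller, callee), count in sorted(dependency_edges(SPANS).items()):
    print(f"{caller} -> {callee}: {count} call(s)")
```

The resulting edge list is the kind of data a dashboard could render as a graph, with call counts as edge weights.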
But building out this system isn't enough! Continually monitoring, maintaining,
and updating the entire stack is equally crucial to keep pace with changes in
service interactions and any new vulnerabilities. Encouraging and fostering a
collaborative culture around the stack you've built is also vital to reaping
the benefits of this approach.
We believe that
topology-based observability is the next step to improving your observability
game. It will enable you to gain valuable insights into your software
ecosystems, allow for proactive management of dependencies, and improve overall
system performance. Have you or your organization implemented it yet? Did we
miss something important? We'd love to learn about your approaches and swap
stories at the SUSE booth (D2) during the upcoming KubeCon +
CloudNativeCon in Salt Lake City.
To learn more
about Kubernetes and the cloud native ecosystem, join us at KubeCon + CloudNativeCon North
America, in Salt Lake City, Utah, on November 12-15, 2024.
## ABOUT THE AUTHOR
Divya Mohan, Principal Technology Advocate at SUSE
Divya is a Principal Technology Advocate at SUSE, where she contributes
to Rancher's cloud native open source projects. She co-chairs the documentation
for the Kubernetes and LitmusChaos projects and has previously worked
extensively in the systems engineering space during her tenure with HSBC and
IGate Global Solutions Pvt Ltd. A co-creator of the KCNA exam and a CNCF
ambassador, she is invested in making technical communities and technologies
more accessible and inclusive.