By Asaf Yigal, co-founder and CTO at Logz.io
Everywhere you look these days someone is
telling you that AI is about to transform the way you do things. Some of the
time it's even true.
Joking aside, it is pretty cool to see where
AI is starting to have a real impact. Integrating GenAI into chatbot
assistants is enabling users to move beyond traditional querying and
converse directly with their data in natural language.
Next up, we see this initial use case giving way to the widespread use of AI
agents that can self-learn, make informed decisions
and trigger automated workflows.
For certain, end users are still trying to
determine where, if anywhere, these capabilities can be trusted to automate
essential tasks. But, the promise is real and the proof of measurable value is
beginning to stack up. Moreover, we can already begin to see how the use of AI
agents should radically improve our ability to observe and improve complex,
microservices-based architectures - those environments that have arguably given
us the hardest time, given their constant state of change and evolution.
Improved monitoring and troubleshooting of
Kubernetes-based systems is obviously a leading example of where this next
phase of AI innovation could really help us out.
GenAI and LLMs - The Perfect Fit for Making K8s Improvements
As I covered in a previous webinar hosted by
the Linux Foundation, the major challenge we face today as it relates to
Kubernetes observability is the core requirement to surface, recognize and
investigate the endless trends, patterns, and anomalies present in our
containerized apps.
Beyond the sheer volume of monitoring data
that these systems generate, they're typically in a near-constant state of
change. Has anyone ever used the word "ephemeral" more often to describe a
particular technology? I don't think so. We've adopted K8s so widely because it
hugely simplifies the way that we build and deploy our cloud applications. Yet,
for the teams tasked with managing this infrastructure - simplicity is not the
current state.
This is why the established benefits of
integrating AI into our existing platforms provide tangible
value - because, despite their shortcomings, these capabilities are excellent
at cutting through the mountains of available data to help determine where we
want to focus next.
Take the practice of chasing alerts, for
example. Most observability users will likely tell you that the bane of their
existence involves deciding which alerts they need to focus on, and which they
don't. Using AI to significantly accelerate this process would
obviously be valuable. And, it's already happening. Even if the AI can't always
tell you where to look next with 100% certainty, it can immediately cut down
many of the repetitive tasks involved to get closer to resolution. This alone
represents huge progress.
In fact, based on what we are seeing with our
own use of GenAI at Logz.io, LLMs can prioritize alerts and assess their
severity with a fairly high degree of accuracy, and then help triage them
efficiently for further investigation. Teams gain a vastly improved ability to
analyze the relevant patterns and trends and understand the importance of
different events, improving the overall process. But providing this kind of
help is really just the beginning.
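To make the triage idea concrete, here is a minimal Python sketch of how open alerts might be packed into an LLM prompt and the model's severity rankings parsed back out. The alert fields, the P1/P2/P3 scale, and the JSON reply format are illustrative assumptions, not the Logz.io implementation, and the LLM call itself is simulated with a canned reply.

```python
import json

# Hypothetical alert records, shaped the way an observability backend
# might emit them (fields invented for this example).
ALERTS = [
    {"id": "a-101", "title": "Pod CrashLoopBackOff in checkout", "rate": 42},
    {"id": "a-102", "title": "Disk usage 71% on logging node", "rate": 3},
]

def build_triage_prompt(alerts):
    """Pack the open alerts into one prompt asking the model to rank severity."""
    lines = [f"- {a['id']}: {a['title']} (fired {a['rate']}x/hour)" for a in alerts]
    return (
        "You are an SRE assistant. For each alert below, assign a severity: "
        "P1 (act now), P2 (investigate today), or P3 (low priority). "
        'Reply as JSON: [{"id": ..., "severity": ...}]\n' + "\n".join(lines)
    )

def parse_triage(llm_response):
    """Parse the model's JSON reply into an id -> severity map."""
    return {item["id"]: item["severity"] for item in json.loads(llm_response)}

# In production the prompt would be sent to an LLM API; here we simulate a reply.
simulated_reply = '[{"id": "a-101", "severity": "P1"}, {"id": "a-102", "severity": "P3"}]'
triaged = parse_triage(simulated_reply)
print(triaged)  # {'a-101': 'P1', 'a-102': 'P3'}
```

The useful part of a pattern like this is that prompt construction and reply parsing stay deterministic and testable, with the model confined to the ranking decision in between.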
GenAI can also help manage and simplify the
vast array of systems and documentation involved in observability processes,
offering continuous learning and adaptation. This capability is crucial for
keeping up with the dynamic and complex nature of modern environments.
Acting as a virtual assistant, LLMs can help
solve problems collaboratively, recommend dashboards, and even answer specific
questions about how to best use the observability platform. These abilities
significantly enhance team efficiency and problem resolution.
Next Up - Agents Will Revolutionize Root Cause Analysis
As GenAI and LLM applications move beyond this
first phase of serving as a sort of virtual data analyst, the increased use of
AI agent frameworks will begin to have an even more remarkable impact on
critical processes including root cause analysis.
For starters, AI can help by analyzing sequences of
telemetry data across complex systems. The models have predictive
capabilities and provide valuable insights into potential issues, enabling
proactive measures. This is great - more help cutting through the noise and
eliminating complex, repetitive manual processes, such as pivoting
between multiple dashboards or running numerous queries to carry out
in-depth troubleshooting.
But consider that with AI agents, the
observability system will also be able to understand the impact of an alert,
immediately elevate the triggering issue - such as a failed deployment, or a
poorly configured K8s pod - and then tell you what needs to be done to remedy
the issue. This is an actual game changer - with the potential to return hours
if not days of productivity to engineering teams that can now be focused on
other efforts.
With AI-driven RCA, instead of relying on manual
processes distributed across multiple UIs and data silos, one can lean on the
AI-enabled platform to move immediately from issue detection into
automated investigation, dramatically simplifying the process and reducing the
time from discovery to response. The system can also pinpoint the specific
details of how the issue was introduced, and even generate conclusions
that summarize the relevant details and offer specific response steps.
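As a rough illustration of the agent-style workflow described above, the sketch below wires together a tiny investigation loop: gather evidence from a couple of signal sources, correlate it with the alerting service, and emit a conclusion with a suggested action. The data sources, field names, and rollback heuristic are all invented for the example; a real agent would query the observability platform's APIs rather than in-memory stubs.

```python
from dataclasses import dataclass

# Stub signal sources standing in for real platform APIs (invented data).
DEPLOY_EVENTS = [{"service": "checkout", "version": "v2.4.1", "minutes_ago": 6}]
POD_STATUS = {"checkout": "CrashLoopBackOff"}

@dataclass
class Finding:
    step: str
    detail: str

def investigate(alert_service: str) -> list:
    """Minimal agent loop: gather evidence from each source, keep what correlates."""
    findings = []
    status = POD_STATUS.get(alert_service)
    if status:
        findings.append(Finding("pod status", f"{alert_service} pods in {status}"))
    for ev in DEPLOY_EVENTS:
        # Correlate: a deploy to the same service within the last 30 minutes.
        if ev["service"] == alert_service and ev["minutes_ago"] < 30:
            findings.append(
                Finding("recent deploy", f"{ev['version']} rolled out {ev['minutes_ago']}m ago")
            )
    return findings

def summarize(findings) -> str:
    """Turn findings into an RCA conclusion with a suggested next step."""
    if any(f.step == "recent deploy" for f in findings):
        action = "roll back the most recent deployment"
    else:
        action = "escalate for manual investigation"
    evidence = "; ".join(f.detail for f in findings)
    return f"Evidence: {evidence}. Suggested action: {action}."

print(summarize(investigate("checkout")))
```

Even in this toy form, the shape matches the promise: the alert triggers the investigation automatically, and the engineer receives a summary with evidence and a proposed remedy instead of a raw alert to chase across dashboards.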
In the not-too-distant future, one can even
envision how these agents should be able to communicate with other systems in
the cloud stack to carry out automated remediation. This isn't just a pipe
dream either, as we see widely used ITSM platforms starting to pilot just those
sorts of capabilities.
For the record, at Logz.io, we are already
providing early versions of these specific agent and RCA capabilities to our
users, and seeing how some organizations are in fact transforming the way that
they build and troubleshoot their complex Kubernetes systems.
There's obviously a lot of hype around AI, and
even with AI for observability. However, we believe there will soon be plenty
of proof that this promised transformation is already happening.
To learn more about Kubernetes and the
cloud native ecosystem, join us at KubeCon + CloudNativeCon
North America, in Salt Lake City, Utah, on November 12-15,
2024.
##
ABOUT THE AUTHOR
Asaf Yigal, Co-Founder and CTO
Asaf Yigal is
co-founder and CTO at Logz.io, where he leads the company's overall product
vision and strategic direction. Prior to launching Logz.io in 2014, Asaf was
co-founder and VP of product development at forex trading network provider
Currensee, which was acquired by OANDA in 2013. At OANDA he served in the role
of VP product management. Asaf holds an electrical engineering degree from the
Israel Institute of Technology.