Virtualization Technology News and Information
Cutting Through the Hype: Using AI for Better K8s Observability

By Asaf Yigal, co-founder and CTO at Logz.io

Everywhere you look these days someone is telling you that AI is about to transform the way you do things. Some of the time it's even true.

Joking aside, it is pretty cool to see where AI is starting to have a real impact. Integrating GenAI into chatbot assistants adds context that lets users move beyond traditional querying and converse directly with their data in natural language. Next up, we see this initial use case giving way to widespread adoption of AI agents that can actually self-learn, make informed decisions and trigger automated workflows.
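To make "conversing with your data" concrete: under the hood, this kind of assistant translates a plain-English question into a structured query against the observability backend. The sketch below is purely illustrative - the query schema is an assumption rather than any real vendor API, and a deterministic keyword mapper stands in for the LLM translation step so the example runs on its own:

```python
# Hypothetical sketch of natural-language querying over logs. The query
# schema is an illustrative assumption (not a real vendor API), and a
# simple keyword mapper stands in for the LLM translation step.

def nl_to_query(question: str) -> dict:
    """Map a plain-English question to a toy structured log query."""
    q = question.lower()
    query = {"filters": [], "range": "last_15m"}  # default time window
    if "error" in q:
        query["filters"].append({"field": "level", "value": "error"})
    if "pod" in q:
        query["filters"].append({"field": "kubernetes.pod_name", "value": "*"})
    if "last hour" in q:
        query["range"] = "last_1h"
    return query

print(nl_to_query("Show me error logs from my pods over the last hour"))
```

In a real assistant, the LLM would handle paraphrase and ambiguity far better than keyword matching; the point here is only the shape of the translation step.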

To be sure, end users are still working out where, if anywhere, these capabilities can be trusted to automate essential tasks. But the promise is real, and proof of measurable value is beginning to stack up. Moreover, we can already see how AI agents should radically improve our ability to observe and improve complex, microservices-based architectures - the environments that have arguably given us the hardest time, given their constant state of change and evolution.

Improved monitoring and troubleshooting of Kubernetes-based systems is obviously a leading example of where this next phase of AI innovation could really help us out.

GenAI and LLMs - The Perfect Fit for Making K8s Improvements

As I covered in a previous webinar hosted by the Linux Foundation, the major challenge we face today in Kubernetes observability is the core requirement to surface, recognize and investigate the endless trends, patterns, and anomalies present in our containerized apps.

Beyond the sheer volume of monitoring data that these systems generate, they're typically in a near-constant state of change. Has anyone ever used the word "ephemeral" more often to describe a particular technology? I don't think so. We've adopted K8s so widely because it hugely simplifies the way that we build and deploy our cloud applications. Yet for the teams tasked with managing this infrastructure, simplicity is not the current state.

This is why integrating AI into our existing platforms provides tangible value: whatever their shortcomings, these capabilities are excellent at cutting through the mountains of available data to help determine where we should focus next.

Take the practice of chasing alerts, for example. Most observability users will likely tell you that the bane of their existence is deciding which alerts they need to focus on, and which they don't. Using AI to accelerate this process by any significant percentage would obviously be great. And it's already happening. Even if the AI can't always tell you where to look next with 100% certainty, it can immediately cut down on the repetitive tasks required to get closer to resolution. This alone represents huge progress.

In fact, based on what we are seeing with our own use of GenAI at Logz.io, LLMs can prioritize alerts and assess their severity with a fairly high degree of accuracy, and then help triage them efficiently for further investigation. You now have vastly improved ability to analyze the involved patterns and trends to understand the importance of different events, improving the overall process. But providing this kind of help is really just the beginning.
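The basic shape of that triage flow - score each alert, rank, surface the top few - can be sketched in a few lines. This is a hedged illustration, not any vendor's implementation: the alert fields and weights are assumptions, and a transparent rule-based scorer stands in for the LLM's severity assessment so the example is runnable.

```python
# Illustrative alert-triage sketch: score each alert, then surface the
# few most worth a human's attention. A rule-based scorer stands in for
# the LLM severity assessment; the alert fields are assumed, not a
# specific vendor's schema.
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    error_rate: float        # fraction of failed requests (0.0 - 1.0)
    customer_facing: bool
    repeat_count: int = 1    # times the alert fired recently

def severity(alert: Alert) -> float:
    """Stand-in for an LLM severity score: higher means more urgent."""
    score = alert.error_rate * 10
    if alert.customer_facing:
        score += 5
    score += min(alert.repeat_count, 10) * 0.5
    return score

def triage(alerts: list[Alert], top_n: int = 2) -> list[Alert]:
    """Rank alerts by severity and return the top candidates first."""
    return sorted(alerts, key=severity, reverse=True)[:top_n]

alerts = [
    Alert("disk-usage-warning", 0.0, customer_facing=False),
    Alert("checkout-5xx-spike", 0.35, customer_facing=True, repeat_count=6),
    Alert("batch-job-retry", 0.10, customer_facing=False, repeat_count=2),
]
for a in triage(alerts):
    print(a.name, round(severity(a), 1))
```

What an LLM adds over a fixed heuristic like this is context: it can weigh the alert's text, the affected service's history, and related signals without anyone hand-tuning the weights.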

GenAI can also help manage and simplify the vast array of systems and documentation involved in observability processes, offering continuous learning and adaptation. This capability is crucial for keeping up with the dynamic and complex nature of modern environments.

Acting as a virtual assistant, LLMs can help solve problems collaboratively, recommend dashboards, and even answer specific questions about how to best use the observability platform. These abilities significantly enhance team efficiency and problem resolution.

Next Up - Agents Will Revolutionize Root Cause Analysis

As GenAI and LLM applications move beyond this first phase of serving as a sort of virtual data analyst, the increased use of AI agent frameworks will begin to have an even more remarkable impact on critical processes including root cause analysis.

For starters, AI can help by running data sequences and analyzing complex systems. The models have predictive capabilities and provide valuable insights into potential issues, enabling proactive measures. This is great - more help cutting through the noise and eliminating complex, repetitive manual processes, such as removing the need to pivot between multiple dashboards or run numerous queries to carry out in-depth troubleshooting.

But consider that with AI agents, the observability system will also be able to understand the impact of an alert, immediately surface the triggering issue - such as a failed deployment or a poorly configured K8s pod - and then tell you what needs to be done to remedy it. This is an actual game changer, with the potential to return hours, if not days, of productivity to engineering teams that can then focus on other efforts.

With AI-driven RCA, instead of relying on manual processes distributed across multiple UIs and data silos, teams can lean on the AI-enabled platform to move immediately from issue detection into automated investigation, dramatically simplifying the workflow and reducing the time from discovery to response. The system can also pinpoint how the issue was introduced, and even generate conclusions that summarize the relevant details and offer specific response steps.
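One way to picture this detect-investigate-conclude flow is the sketch below. Everything in it is an assumption for illustration: the event shape loosely mimics Kubernetes Event objects, and a small reason-to-remedy lookup table stands in for the LLM-drafted conclusion an agent would actually produce.

```python
# Hedged sketch of one agent-style RCA step: correlate an alert with
# recent Kubernetes warning events on the same object and draft a
# conclusion. The event shape loosely mimics K8s Event objects; the
# remedy lookup stands in for an LLM-generated summary.

def investigate(alert: dict, events: list[dict]) -> dict:
    """Correlate an alert with warning events and suggest a next step."""
    suspects = [e for e in events
                if e["object"] == alert["object"] and e["type"] == "Warning"]
    root_cause = suspects[0]["reason"] if suspects else "unknown"
    next_step = {
        "FailedScheduling": "Check pod resource requests against node capacity.",
        "BackOff": "Inspect container logs for the crash-loop error.",
    }.get(root_cause, "Escalate for manual review.")
    return {"alert": alert["name"],
            "root_cause": root_cause,
            "next_step": next_step}

alert = {"name": "deployment-unavailable", "object": "pod/checkout-7f9c"}
events = [
    {"object": "pod/checkout-7f9c", "type": "Warning", "reason": "BackOff"},
    {"object": "pod/search-1a2b", "type": "Normal", "reason": "Pulled"},
]
print(investigate(alert, events))
```

A production agent would pull these events from the Kubernetes API, reason over far richer context, and write the conclusion itself - but the pipeline shape is the same.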

In the not too distant future, one can even envision how these agents should be able to communicate with other systems in the cloud stack to carry out automated remediation. This isn't just a pipe dream either, as we see widely used ITSM platforms starting to pilot just those sorts of capabilities.

For the record, at Logz.io, we are already providing early versions of these specific agent and RCA capabilities to our users, and seeing how some organizations are in fact transforming the way that they build and troubleshoot their complex Kubernetes systems.

There's obviously a lot of hype around AI, and even around AI for observability. However, we believe the proof is already mounting that this promised transformation is underway.

To learn more about Kubernetes and the cloud native ecosystem, join us at KubeCon + CloudNativeCon North America, in Salt Lake City, Utah, on November 12-15, 2024. 

##

ABOUT THE AUTHOR

Asaf Yigal, Co-Founder and CTO

Asaf Yigal is co-founder and CTO at Logz.io, where he leads the company's overall product vision and strategic direction. Prior to launching Logz.io in 2014, Asaf was co-founder and VP of product development at forex trading network provider Currensee, which was acquired by OANDA in 2013. At OANDA he served in the role of VP product management. Asaf holds an electrical engineering degree from the Israel Institute of Technology.

Published Friday, November 08, 2024 12:58 PM by David Marshall