Virtualization Technology News and Information
Using Machine Learning to Catch Problems in a Distributed Kubernetes Deployment


By Ajay Singh - CEO and Co-founder, Zebrium.

It is especially tricky to identify software problems in the kinds of distributed applications typically deployed in Kubernetes (k8s) environments. There's usually a mix of home-grown, third-party and open source components, which takes more effort to normalize, parse and filter log and metric data into a manageable state. In a more traditional world, tailing or grepping logs might have worked to track down problems, but that doesn't work in a Kubernetes deployment with a multitude of ephemeral containers. You need to centralize logs, but that comes with its own problems. The sheer volume can bog down the text indexes of traditional logging tools. Centralization also adds confusion by breaking up connected events (such as multi-line stack traces) in the interleaved output from multiple sources.
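To make the "broken up connected events" problem concrete, here is a minimal sketch of regrouping multi-line events (such as stack traces) after logs from many containers have been interleaved. It assumes each line arrives tagged with its originating pod, and the continuation heuristic (lines starting with whitespace or "Caused by:") is an illustrative assumption, not a universal rule:

```python
def regroup(tagged_lines):
    """tagged_lines: iterable of (pod, line) pairs, interleaved across pods.
    Returns a list of (pod, event) tuples, joining continuation lines
    (e.g. stack frames) onto the event that started them."""
    events = []
    open_event = {}  # pod -> index into events of that pod's current event
    for pod, line in tagged_lines:
        # Assumed heuristic: indented lines and "Caused by:" continue an event.
        is_continuation = line.startswith((" ", "\t", "Caused by:"))
        if is_continuation and pod in open_event:
            idx = open_event[pod]
            events[idx] = (pod, events[idx][1] + "\n" + line)
        else:
            open_event[pod] = len(events)
            events.append((pod, line))
    return events
```

Even with interleaving from other pods in between, the stack frames reattach to the error that produced them, so a later search or anomaly detector sees one event rather than several fragments.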

The biggest problem is that logs are still treated as a passive repository of semi-structured text. The typical workflow is to rely on some other tool (usually a monitoring tool) to detect a symptom, and then rely on human intuition to drill down and search logs until the problem is finally identified. This is unfortunate, because logs contain a rich and broad trail of events and embedded metrics, including the leading indicators of many problems before they become severe enough to impact users (and trip monitoring alerts). The challenge is that their large volume and lack of consistent structure make it hard to proactively extract insights, particularly in a fast-changing Kubernetes environment.

What is needed is a tool that automatically tells you if something anomalous (or known to be bad) is happening in the logs. When a bad problem ripples through multiple services, it should take you to the leading edge of the ripple - the root cause of the entire problem. And since you will always encounter new problems, all of this should work without having to manually instrument everything or pre-build queries.

This is where machine learning helps. First, machine learning can build the underlying event dictionary of all unique event types generated by a distributed app. Even for a large app that generates hundreds of millions of events a day, this foundational dictionary typically consists of only a few thousand unique event types. This makes it practical for a second layer of machine learning to learn the normal patterns and behavior of each event type: whether it has ever been seen before and, if so, its normal frequency, periodicity, severity, correlations and the typical values of metrics embedded in the events.
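The core idea behind an event dictionary is that millions of raw log lines collapse into a few thousand templates once the variable parts are stripped out. A minimal sketch, with the caveat that the masking patterns (IPs, hex IDs, plain numbers) are illustrative assumptions and a production system would learn the structure rather than hard-code it:

```python
import re
from collections import Counter

# Assumed patterns for variable tokens: IPv4 addresses, hex IDs, integers.
VARIABLE = re.compile(r"\b(?:\d+\.\d+\.\d+\.\d+|0x[0-9a-fA-F]+|\d+)\b")

def event_type(line):
    """Collapse a raw log line to its template by masking variables with <*>."""
    return VARIABLE.sub("<*>", line)

def build_dictionary(lines):
    """Map each unique event type to how often it occurs."""
    return Counter(event_type(line) for line in lines)
```

Two "connected from 10.0.0.1 in 23 ms" lines with different addresses and latencies map to the single template "connected from <*> in <*> ms", which is what makes per-event-type behavioral modeling tractable.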

This means that when something goes wrong, anomalous patterns based on event types are automatically detected with a much higher signal-to-noise ratio than brute-force approaches that rely on detecting spikes in keywords or errors. Detected patterns include a normally occurring event that has stopped (as in the heatmap example below), an extremely rare event being seen, or a sudden change in the frequency, periodicity or severity of some event types.

What if something bad happens to a foundational service, and the problem ripples through many dependent services, generating a plethora of rare errors or events? A simplistic approach might overwhelm a human with a swarm of alerts. This is where a third layer of machine learning comes into play - to understand the correlation between anomalies (including across multiple services), with the goal of catching the leading edge (the service that first went south and is the very likely culprit).
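The leading-edge idea reduces to a simple question: among the correlated anomalies in one burst, which service produced the earliest highly unusual one? A minimal sketch, where the score threshold and the assumption that the anomalies have already been grouped into one correlated burst are both simplifications:

```python
def leading_edge(anomalies, threshold=3.0):
    """anomalies: list of (timestamp, service, surprise_score) tuples,
    already grouped into one correlated burst.
    Returns the likely root-cause service: the one with the earliest
    anomaly whose score clears the threshold, or None if none do."""
    hot = [a for a in anomalies if a[2] >= threshold]
    if not hot:
        return None
    return min(hot, key=lambda a: a[0])[1]
```

With this rule, a database service whose severe anomaly precedes the flood of rare errors in its dependents is reported once as the culprit, instead of swamping the operator with one alert per downstream service.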

The graphic below shows the inner workings as a visual heatmap:

The horizontal dimension shows events happening in multiple different services, while the vertical dimension is time. The brightness of the dots represents the level of surprise: events occurring at a normal cadence show up as greys, while surprising events are blue and the most surprising of all are white. A vertical grey stripe that stops means an event that typically occurs in a healthy system stopped happening. And a horizontal stripe across services means a systemic problem cascading across the system. In the latter case, the leading edge of the problem is the service that first exhibited highly unusual anomalies.

In the (real) picture above, this leading edge is the two events near the left edge of the picture, which were related to an outage in the underlying database service of a multi-service application. This approach is remarkably successful in catching problems that a human missed (and that have not yet manifested in monitoring and APM tools). And it does an extremely good job of suppressing noise, typically alerting an operator about only one in several million events as a problem deserving attention.
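The mapping from surprise scores to the heatmap can be illustrated with a tiny text renderer: services across the horizontal axis, time steps down the vertical axis, brighter symbols for more surprising events. The score-to-symbol buckets here are illustrative assumptions standing in for the grey/blue/white color scale:

```python
def render_heatmap(grid):
    """grid: list of rows (one per time step), each a list of surprise
    scores per service. Returns a string with one character per cell."""
    def cell(score):
        if score >= 6.0:
            return "#"   # most surprising ("white")
        if score >= 3.0:
            return "+"   # surprising ("blue")
        if score > 0.0:
            return "."   # normal cadence ("grey")
        return " "       # no event in this cell
    return "\n".join("".join(cell(s) for s in row) for row in grid)
```

Reading such a rendering, a column of "." marks that suddenly ends is a stopped event, and a row full of "+" and "#" is a problem cascading across services.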

This technology has been tested with a wide spectrum of applications, catching 56% of critical problems autonomously (and improving with each release). In each case, it also significantly sped up root cause identification. A free beta version is available for Kubernetes users here.


About the Author

Ajay Singh 

Ajay Singh is a passionate advocate for creating products that address real-life market needs and deliver an exceptional customer experience. As CEO, he is committed to building a world class organization, focused on improving the way customers solve a common problem. Prior to Zebrium, Ajay was VP of Product at Nimble Storage, where he led product strategy from concept to annual revenue of more than $500M. Ajay started his career as a product development engineer and has also held senior product management roles at NetApp and Logitech.

Published Friday, October 25, 2019 9:42 AM by David Marshall