Scaling Kubernetes Monitoring without Blind Spots or Operations Burden

By Chris Goller and Daniella Pontes, InfluxData

Kubernetes has seized center stage in the DevOps world for building and migrating applications to cloud-native environments. In reality, Kubernetes is a means to an end: scaling the operation of containerized microservice architectures. Managing the deployment, scaling, upgrading and updating of the many microservices that make up a fragmented application running in containers is not a trivial task, and it is certainly beyond manual processes, so automation is the only way to go. Kubernetes took on the role of orchestrating containerized workloads.

InfluxDB Cloud 2.0 Kubernetes monitoring

At InfluxData, we embraced microservices and Kubernetes when building our InfluxDB Cloud 2.0. But we also know that what makes automation reliable is monitoring. Between the metrics exposed by SREs (to keep resources and microservices healthy) and the metrics exposed by developers (who instrument their code to provide as much information as possible for diagnosing a potential failure), there is an ever-growing number of endpoints exposed for scraping in our Kubernetes environment. The exact number of metric endpoints, and the endpoints themselves, are in a continuous state of change because of the ephemeral nature of Kubernetes' atomic unit, the Pod. So the first question is: how scalable is scraping metrics from these exposed endpoints in Kubernetes? The answer is: it depends on how scraping is implemented.

We know, from our own experience handling metrics from multiple teams in a large cloud deployment, that there is a turning point - a "label too far" - after which you get a blind spot: missing metrics. The scraper can no longer pull the growing number of exposed metrics within the polling interval. Increasing the polling interval means collecting metrics less frequently, which reduces the availability of critical information for troubleshooting, so lengthening the interval is no true solution. (For more details on our journey to master Kubernetes monitoring, see Chris Goller's session at InfluxDays San Francisco, October 2019.)

We then realized that we had to think beyond centralized scraping: it proved not to be scalable for large environments, not even with a federated implementation. Federation presses operations to keep the distribution of metrics across federated servers optimal, burdening Ops with yet another monitoring and balancing act. In other words, it is not easily implemented or maintained.

As a principle, a good scraping strategy should neither increase the burden on operations nor impose a barrier to democratizing metrics for anyone who needs them. If monitoring becomes more complex as your system grows, then fundamentally you don't have a scalable solution.

The answer to this apparent DevOps dilemma is actually quite simple and lies at the heart of Kubernetes' containerized structure: you just need to contain (isolate) the impact of each set of exposed metrics. The scraper should be contained inside the pod of the service or workload it is going to scrape. The mechanism to implement this approach is called sidecar deployment.

In InfluxDB Cloud 2.0, we have Telegraf, a lightweight, plugin-based metrics collection agent, deployed as a sidecar in every pod, so all the metrics exposed by the application, service or microservice are handled by that agent and do not impact the scraping of other workloads.
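Because the agent lives inside the pod, its scrape configuration can stay trivially small: it only ever pulls from localhost. As a minimal sketch (assuming the application exposes Prometheus-format metrics on port 8080 at /metrics; the port, path and interval are illustrative, not prescriptive), the input side of such a sidecar Telegraf configuration could look like this:

[agent]
  ## How often this sidecar scrapes its own pod's endpoint
  interval = "10s"

## Scrape only the local application's metrics endpoint
[[inputs.prometheus]]
  urls = ["http://localhost:8080/metrics"]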


Each Telegraf agent sends its metrics to InfluxDB in the cloud or on-premises, without creating any burden on IT Operations or a culture of pushing back on developers and data engineers about their need for metrics.
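The output side of that same sidecar configuration is just as small. The sketch below uses the influxdb_v2 output plugin to push metrics to InfluxDB Cloud 2.0 or an on-prem instance; the URL, token variable, organization and bucket names are placeholders rather than the values we actually use:

## Ship everything this sidecar collects to InfluxDB (cloud or on-prem)
[[outputs.influxdb_v2]]
  urls = ["https://influxdb.example.com"]
  token = "$INFLUX_TOKEN"
  organization = "example-org"
  bucket = "k8s-metrics"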

Watching the watcher

Sidecar deployment addressed the issue of scaling scraping, but our journey to properly monitor our Kubernetes environment did not stop there. A second question still needed to be answered: how do we guarantee that we don't have blind spots, missing metrics? We need to be able to watch the watcher.

The answer is Telegraf once more. It is actually a very good fit as a monitoring agent because it takes nothing for granted: Telegraf monitors how well it is doing its own job. It has an input plugin for its internal metrics (see the configuration sketch after the list below), collecting data such as:

  • gather_errors
  • metrics_dropped
  • metrics_gathered
  • metrics_written

The screenshot below shows monitored measurements detecting missing metrics. The next step is simply to dig into Telegraf's self-monitoring data to find which agents dropped metrics or reported gather errors.

[Screenshot: monitored measurements detecting missing metrics]

Other good reasons

Self-control is another important capability a monitoring solution should provide. Monitoring metrics is always a balance between cardinality and indexing: the more tags added for description and for grouping in visualizations, the more series are created, and this comes at a cost, since more series take their toll on resources. Telegraf has a very useful feature for keeping metrics-producing citizens well-behaved: guard-rails built into the agent itself.

In Telegraf, this is easily configured by limiting the number of tags and listing which ones to preferentially keep. Making the equivalent change in a central Prometheus server would not be as straightforward: it would require a restart, interrupting metrics gathering for all the microservices being monitored. With a Telegraf sidecar deployment, on the other hand, only the single pod running the service and its Telegraf agent is restarted, so there is no interruption to anyone else.

Configuring Telegraf can be done at runtime - with no need to recompile - and completely under the developer's control, freeing Ops of additional overhead:

[[processors.tag_limit]]
  ## Maximum number of tags to preserve on each metric
  limit = 3

  ## List of tags to preferentially preserve
  keep = ["handler", "method", "status"]

Telegraf has 200+ open-source plugins for full-stack monitoring (infrastructure, network and application), supporting pull, push and streaming metrics, as well as client libraries (C#, Go, Java, JavaScript/Node.js, Python...) for direct instrumentation. Telegraf also monitors the kubelet API for the metrics exposed on its /stats/summary endpoint, and its Kubernetes Inventory plugin monitors the state of system resources such as the following (see the configuration sketch after this list):

  • daemonsets
  • deployments
  • nodes
  • persistentvolumes
  • persistentvolumeclaims
  • pods (containers)
  • statefulsets
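A minimal sketch of enabling those two plugins is shown below; the kubelet URL, API endpoint and token path are illustrative assumptions, and in practice these inputs are often run once per node (for example in a DaemonSet) rather than in every sidecar:

## Kubelet summary API: node, pod and container metrics
[[inputs.kubernetes]]
  ## Placeholder: kubelet read-only port on the local node
  url = "http://127.0.0.1:10255"

## Kubernetes Inventory: state of daemonsets, deployments, nodes, pods, ...
[[inputs.kube_inventory]]
  ## Placeholder: in-cluster API endpoint and service-account token
  url = "https://kubernetes.default.svc"
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"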

Key lessons learned

Scaling monitoring is not about adding more manual processes and controls. Scaling cannot come at the cost of higher complexity, and it must embrace empowering developers - with observability, predictability and prescriptive means - to ensure that monitoring is doing its job.

Of course, Kubernetes is constantly evolving, and so are we. If you have an idea, we'd love to hear about it!


About the Authors

Chris Goller, Architect


With 20 years of experience as an architect, Chris Goller brings tremendous energy and expertise to InfluxData. In particular, Chris likes to try out new technologies and share his learnings with the staff at InfluxData as well as with the community.

Daniella Pontes, Senior Manager Product Marketing


Daniella Pontes is part of the product marketing team at InfluxData in San Francisco. She started her career in telecommunications, wireless technology and global Internet service provisioning. As security became a major concern for enterprises, she worked on enterprise policy management, SaaS and data encryption solutions. Prior to joining InfluxData, she spent some years living in Japan, Germany and Brazil, working for an online agency developing and managing the Brazilian market. Having worked in various market segments, from embedded smart antenna technology to Internet security and e-commerce, in product management, partnerships, marketing and business development roles, she has broad experience working cross-functionally and with customers and partners.

Daniella is passionate about nature, art and science. She dearly loves animals and plants and is a strong advocate for preservation and for open and green spaces. In her free time, she loves reading, writing and chatting with friends about social fairness while enjoying a cup of coffee.
Published Thursday, November 07, 2019 7:36 AM by David Marshall