In an era when cloud computing is growing at a remarkable pace, with Gartner projecting global end-user spending to reach $600 billion this year, the industry finds itself at a critical juncture.
As cloud usage skyrockets, so do concerns about its carbon footprint. In fact, recent estimates suggest that data centers
have a greater carbon footprint than the entire commercial aviation industry.
These two increasingly important costs, financial and environmental, point to a common goal: reducing resource overallocation. This convergence is giving rise to new tools designed to measure, monitor, and mitigate the energy consumption of cloud infrastructure.
The first step in optimizing cloud usage is gaining visibility into resource
allocation and utilization patterns.
At the heart of this optimization effort
lies Kubernetes. With 84% of organizations using or evaluating
Kubernetes, targeting sustainability in Kubernetes can make a significant
difference in cutting both costs and carbon emissions.
A common misconception is that managing carbon emissions and energy use is solely the cloud provider's responsibility. While cloud providers should continue to reduce their own environmental footprint, they ultimately supply the computing infrastructure that customers request. Under the shared responsibility model, once customers provision their Kubernetes infrastructure, they are also responsible for using those resources efficiently.
There are many incentives, both cost and environmental, to reduce waste in Kubernetes clusters. There are also many useful patterns for building green software that may or may not reduce cost, such as demand shifting, which moves compute to regions with lower carbon intensity. Here, the focus is on waste reduction. After all, the greenest energy is the energy we don't use.
## Kepler: Shedding Light on Cloud Energy Consumption
One of the most pressing challenges is
the measurement of energy consumption in cloud environments. Traditional
methods fall short in the complex, virtualized world of cloud computing. This
is where projects like Kepler (Kubernetes-based Efficient Power Level Exporter) come into play. Kepler, a CNCF sandbox project, uses eBPF technology
to attribute power to processes and pods, providing engineers with crucial data
to optimize their workloads for energy efficiency.
Kepler works by aggregating energy
metrics, using either RAPL (Running Average Power Limit) or an estimation model
based on machine learning when RAPL is not available. This allows for the
collection of both pod-level and node-level energy metrics, which can then be
exported to Prometheus for further analysis using the open source Kubernetes
Monitoring Helm Chart.
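For instance, once Kepler's metrics land in Prometheus, a recording rule can turn its cumulative joule counters into a per-pod power figure. Here is a minimal sketch, assuming a prometheus-operator setup; the metric and label names (such as kepler_container_joules_total) follow Kepler's conventions but may differ across versions, so verify them against your deployment:

```yaml
# A minimal recording-rule sketch for a prometheus-operator setup.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kepler-power
  namespace: monitoring
spec:
  groups:
    - name: kepler.power
      rules:
        # Kepler's joule counters are cumulative, so a rate() over them
        # yields average power in watts (joules per second).
        - record: namespace_pod:kepler_power_watts
          expr: |
            sum by (container_namespace, pod_name) (
              rate(kepler_container_joules_total[5m])
            )
```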
Once deployed, engineers can use tools like Grafana to visualize their cloud carbon footprint. You can track the energy consumption of Kubernetes components and, using the grid's carbon intensity (grams of CO2 per kilowatt-hour), convert it into grams of CO2 per day for everything from cluster to container. This allows you to monitor carbon emissions per cluster or power usage per pod or tenant, showing how much power individual applications or customers consume.
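As a sketch of that conversion, the per-pod power series recorded above can be turned into an estimated daily carbon figure. The grid carbon intensity used here (400 gCO2 per kilowatt-hour) is an illustrative assumption; substitute a value for your region or a live data source:

```yaml
# Extends the rule group above: average watts -> kWh per day -> estimated
# grams of CO2 per day.
- record: namespace_pod:kepler_co2_grams_per_day
  expr: |
    namespace_pod:kepler_power_watts
      * 24 / 1000   # watts sustained for 24h -> kWh per day
      * 400         # assumed grid carbon intensity in gCO2/kWh
```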
Metrics about cost, energy, and resource
utilization not only enable software engineers to contribute to environmental
sustainability but also offer potential cost savings. As global awareness and
environmental consciousness grow, monitoring the energy consumption and carbon
emissions of Kubernetes workloads is becoming an emerging practice. There are
many open source groups leading the innovation around this in a cloud native
context, such as the CNCF Environmental Sustainability Technical Advisory
Group.
Resource utilization metrics can be a proxy for optimizing around both carbon and cost. Such metrics can support Kubernetes optimizations by tracking the "idle ratio" of fleets of Kubernetes clusters, so that the impact of each optimization can be measured. On the Platform team at Grafana Labs, we monitor the idle ratio of our fleet of clusters for each cloud service provider. The idle ratio is calculated by dividing the cost of the cluster's unused CPU and memory by the cost of its full capacity. These metrics can help Kubernetes users identify and eliminate resource waste through techniques such as right-sizing and bin-packing.
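As a concrete sketch of that calculation, the rule below derives an idle ratio from kube-state-metrics capacity and cAdvisor usage metrics. The per-unit prices ($0.03 per core-hour, $0.004 per GiB-hour) are placeholders rather than real cloud rates, and this is a simplification of the cost model described above:

```yaml
# Simplified idle ratio: (cost of unused CPU + unused memory) divided by
# (cost of full capacity). Prices are illustrative placeholders.
- record: cluster:idle_ratio
  expr: |
    (
      (
        sum(kube_node_status_capacity{resource="cpu"})
          - sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
      ) * 0.03
      +
      (
        sum(kube_node_status_capacity{resource="memory"})
          - sum(container_memory_working_set_bytes{container!=""})
      ) / 2^30 * 0.004
    )
    /
    (
      sum(kube_node_status_capacity{resource="cpu"}) * 0.03
      + sum(kube_node_status_capacity{resource="memory"}) / 2^30 * 0.004
    )
```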
## Practical Steps Towards a More Sustainable Kubernetes
Beyond energy measurement, Kubernetes'
dynamic nature and ability to efficiently manage resources make it a powerful
tool for optimizing cloud usage. However, it can also introduce complexity that
requires sophisticated monitoring solutions. When done correctly, monitoring
Kubernetes clusters helps:
- Identify and eliminate resource waste: By monitoring CPU, memory, and energy utilization across pods and nodes, teams can spot overprovisioned resources and right-size their deployments, potentially leveraging tools such as the Kubernetes Vertical Pod Autoscaler or Descheduler.
- Optimize horizontal autoscaling: Proper monitoring allows for fine-tuning of horizontal scaling through the Horizontal Pod Autoscaler (HPA) and Kubernetes Event-driven Autoscaler (KEDA), ensuring that resources scale efficiently based on actual demand (see the sketch after this list).
- Detect and resolve performance bottlenecks: Quick identification of issues like CPU throttling, Out of Memory (OOM) errors, or memory leaks helps balance optimal performance with minimal resource usage.
- Implement intelligent workload
scheduling: With detailed metrics on node utilization and workload patterns at
hand, engineers can safely and reliably implement more energy-efficient
scheduling policies through bin-packing, potentially leveraging features like
the Kubernetes scheduler's MostAllocated scoring strategy or Karpenter to consolidate workloads onto fewer
nodes.
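For example, the horizontal-scaling item above might translate into an HPA such as the following. This is a minimal sketch: the Deployment name, replica bounds, and 70% CPU target are illustrative values to be tuned from observed utilization:

```yaml
# A minimal HPA sketch; names and targets are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web              # hypothetical workload
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out before nodes saturate
```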
This information enables organizations to
make informed decisions to optimize their infrastructure and reduce unnecessary
energy consumption. By fine-tuning resource allocation and identifying
inefficiencies, companies can significantly reduce their carbon footprint while
improving overall system performance.
## The Road Ahead: Platform Capacity Management
Platform Capacity Management optimizes
cloud infrastructure by enhancing resource utilization and cutting costs while
preventing incidents. Tools that help include those that right-size resources
at the workload level (e.g., VPA) and those that optimize Kubernetes scheduling
decisions through bin-packing at the cluster level (e.g., Karpenter, GKE
Autopilot, Kubernetes descheduler). These tools help to ensure resources align
with actual demand.
Right-sizing practices like setting CPU and memory requests and limits are crucial for effective resource use, impacting both reliability and efficiency. In fact, 37% of organizations have 50% or more of their workloads in need of container right-sizing. Simple adjustments, such as setting memory requests and limits, can reduce costs and environmental impact. When optimizing around CPU and memory, it is important to have monitoring and alerting in place to catch issues that may arise, such as OOM errors.
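As a minimal sketch of such an adjustment, the spec below sets requests from observed usage and caps memory with a limit; the names and values are illustrative. A CPU limit is deliberately omitted, since CPU limits can introduce the throttling mentioned earlier:

```yaml
# Illustrative right-sizing of a single container: requests sized from
# observed usage and a memory limit as an OOM guardrail.
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: api
      image: example.org/api:1.0   # hypothetical image
      resources:
        requests:
          cpu: 250m        # roughly p95 of observed CPU usage
          memory: 256Mi    # roughly p99 of observed working set
        limits:
          memory: 512Mi    # headroom above the request to bound waste
```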
Taking this a step further, engineering teams can benefit from tools that automate right-sizing, such as VPA. VPA provides built-in mechanisms to react when an optimization goes too far (for example, OOM errors), helping teams prevent incidents or respond to them faster.
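Here is a sketch of what that automation can look like with VPA; the target Deployment and the resource bounds are illustrative:

```yaml
# A VPA sketch. In "Auto" mode, VPA evicts pods and recreates them with
# updated requests based on observed usage.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api              # hypothetical workload
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          memory: 128Mi    # floor so recommendations never starve the app
        maxAllowed:
          cpu: "2"
          memory: 2Gi      # cap so a leak cannot inflate requests unbounded
```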
Don't forget that a resource utilization of 60-80% is the realistic optimal range for managing cloud costs. Reaching 80-90% (for both CPU and memory) requires considerable skill to avoid risk while maximizing efficiency. At the cluster level, bin-packing tools can help optimize the Kubernetes scheduler to "pack" as many workloads onto as few nodes as possible.
For our Platform team, the goal is to reduce this idle ratio across our fleet of Kubernetes clusters. Cluster-level configurations that can help include the Kubernetes descheduler's HighNodeUtilization strategy, Karpenter's disruption strategies, and the Kubernetes scheduler's MostAllocated scoring strategy.
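As an example of the last of these, the MostAllocated strategy is enabled through the NodeResourcesFit plugin in a KubeSchedulerConfiguration; the CPU and memory weights below are illustrative:

```yaml
# Biases scoring toward nodes that are already heavily allocated,
# packing workloads onto fewer nodes.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```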
For a deep dive into our journey with
bin-packing on EKS, make sure to read a previous post from our team: How Grafana Labs switched to Karpenter to reduce costs
and complexities in Amazon EKS.
## A Greener Cloud on the Horizon
Moving forward, we must embrace "stubborn optimism": acknowledging the enormity of the sustainability challenge while maintaining the belief that collective efforts can make a real difference. Stubborn optimism is the mindset we need to keep advocating for environmental awareness in our industry. The future of cloud computing must be sustainable, and it's up to the entire tech community to make it so.
Engineers and developers can take immediate steps to contribute to this mission by adopting monitoring tools, visualizing energy consumption, and leveraging platform capacity tools. Given the cost benefits, this is also a smart business decision.
To learn more about Kubernetes and the cloud native ecosystem, join us at KubeCon + CloudNativeCon North America in Salt Lake City, Utah, on November 12-15, 2024.
## ABOUT THE AUTHORS
Niki Manoledaki, Software Engineer at Grafana Labs
Niki Manoledaki is a Software Engineer at Grafana Labs. In the open source ecosystem, she advocates for cloud native environmental sustainability through the CNCF Environmental Sustainability Technical Advisory Group, OpenGitOps, Kepler, and SustainabilityCon.
Vasil Kaftandzhiev, Staff Product Manager at Grafana Labs
Vasil Kaftandzhiev is a Staff Product Manager at Grafana Labs, where he leads development of the company's Kubernetes and AWS monitoring solutions.