Virtualization Technology News and Information
Using machine learning to tune your HPA for optimal performance

By Thibaut Perol, PhD, Lead Machine Learning Scientist at Carbon Relay

Kubernetes users often rely on the Horizontal Pod Autoscaler (HPA) and cluster autoscaling to scale applications. We show how using Red Sky Ops to optimize the whole application alongside the HPA improves cost and performance, using the example of a web application.

What is the Kubernetes Horizontal Pod Autoscaler?

The Horizontal Pod Autoscaler (HPA) in Kubernetes scales up and down the number of replicas in a deployment or a stateful set based on metrics prescribed by the user. The most common metrics are CPU and memory utilization of the target pods.

To deploy the HPA, the user sets target metrics for all replicas in a deployment as well as the minimum and maximum number of replicas. The HPA is responsible for adding or deleting replicas to keep the observed metrics lower than the target values while keeping the number of replicas within the prescribed bounds.
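As a sketch, a minimal HPA manifest has this shape (the resource names here are hypothetical):

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: vote-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vote                # hypothetical target deployment
  minReplicas: 3              # lower bound on replicas
  maxReplicas: 7              # upper bound on replicas
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65   # target average CPU utilization across replicas
```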

When scaling based on CPU and memory utilization, HPA uses the API implemented by the metrics-server. The HPA can also use custom metrics or external metrics (e.g. number of requests per second on the ingress) implemented by a third-party or the user.

HPA tuning challenges

Optimizing the target metrics of the HPA for all applications and their specific workloads can be frustrating. While newer versions of Kubernetes support more in-depth configuration of the HPA through policies*, many users are left with a minimal set of configuration options: namely a CPU/memory utilization target and the maximum number of replicas.

If the application is a web server, for example, the speed at which the HPA adds replicas is critical to accommodate bursts in traffic. A simple fix would be to reduce the CPU utilization target to a small value (say 15%) so that the HPA adds replicas early on as traffic increases. However, this increases your cloud cost because many replicas are underutilized. To limit the cost, one could instead use a high CPU utilization target. But then, if traffic increases while the HPA is still creating replicas and waiting for them to become available, the current replicas experience CPU throttling and the HTTP request latency increases, impacting the client experience.
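To see the tension concretely, recall the replica-count rule from the HPA documentation: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). A small sketch, with illustrative utilization numbers:

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization):
    """Kubernetes HPA scaling rule:
    desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_utilization / target_utilization)

# A burst pushes observed CPU utilization to 90% across 3 replicas:
print(desired_replicas(3, 90, 15))  # low target (15%): scales aggressively to 18 replicas
print(desired_replicas(3, 90, 85))  # high target (85%): scales conservatively to 4 replicas
```

The low target reacts fast but keeps many underutilized replicas around; the high target is cheap at steady state but barely adds capacity during a burst.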

An ML-powered experimentation engine such as Red Sky Ops can be used to design a highly available, scalable, and cost-efficient application. Let's demonstrate this using an example web application.

Example web application

In the following example, we will optimize the Docker example voting app using Red Sky Ops. This app is a simple distributed application that allows the user to vote between two options - cats or dogs.


A Redis queue collects the votes. Workers consume the votes and insert them into a PostgreSQL database. Finally, a Node.js web app shows the results of the voting in real time. You can deploy this application in a dedicated namespace with:

kustomize build | kubectl apply -f -

Red Sky Ops experiments

An "experiment" is made of multiple trials in which the Red Sky Ops server patches the whole application to find the optimal configuration. In each trial, the application is run through a scalability test, and at the end of each test the Red Sky Ops controller measures the metrics to optimize for. In this experiment, we optimize for two competing metrics: the cost of running the application (in $/month) and the p95 latency (in ms). While performance increases with resources, so does cost; on the other hand, starving an application degrades the user experience. Machine learning helps find the best tradeoff. You can find detailed instructions to run the experiment yourself here with Locust.

We are using the HPA to scale up and down the number of replicas of the front-end deployment responsible for the user experience. For simplicity, we tune the HPA using a target utilization available via the metrics-server. The parameters we optimize are the minimum and maximum number of replicas, the target utilization used by the HPA, and the CPU requests for the voting-service pod. Note that every pod runs with a guaranteed QoS (limits=requests).
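As a reminder, a pod is assigned the Guaranteed QoS class when every container's resource limits equal its requests; an illustrative container spec fragment (values hypothetical):

```yaml
resources:
  requests:
    cpu: "2"          # CPU request; one of the tuned parameters
    memory: 512Mi
  limits:
    cpu: "2"          # limits == requests -> Guaranteed QoS
    memory: 512Mi
```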

We run the experiment for 400 trials, and consider a trial as failed if the response latency is greater than one second.

Scalability test

During each trial we load test the application with increasing requests per second (RPS): 100 RPS for one minute, 500 RPS for one minute, 1000 RPS for one minute, and 2000 RPS for one minute. This allows us to test the scalability of the application and make sure that the HPA is configured correctly. The application should minimize cost at a low level of traffic (100 RPS) but still be able to scale up fast enough to handle 2000 RPS.

Experiment results

We first set a baseline configuration with:

- Minimum replicas: 3
- Maximum replicas: 7
- CPU utilization target: 65%
- CPU per replica: 2

In this case, the cost of running the application is $481/month, with a p95 latency of 579 milliseconds.

Each dot represents the metric values measured at the end of a trial (see Figure 1). The red dots are the best trials found during the experiment, i.e., the Pareto-optimal configurations: neither metric can be improved without worsening the other. After some exploration of the parameter space, the algorithm converges toward optimal configurations. We find a sharp transition in latency around a cost of $330 per month, where the most satisfying performance is achieved. The best configuration is obtained with:
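The "best trials" are exactly the non-dominated points of the trial set; a minimal sketch of that selection rule over hypothetical (cost, latency) pairs:

```python
def pareto_front(points):
    """Keep the (cost, latency) points not dominated by any other point.

    q dominates p when q is no worse on both metrics (and is a different point).
    """
    def dominated(p):
        return any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points)
    return [p for p in points if not dominated(p)]

# Illustrative trial results, not the actual experiment data:
trials = [(481, 579), (400, 500), (365, 27.8), (344, 60)]
print(pareto_front(trials))  # -> [(365, 27.8), (344, 60)]
```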

- Minimum replicas: 10
- Maximum replicas: 15
- CPU utilization target: 80%
- CPU per replica: 0.855


The cost of running this application is $365/month (24% savings) while the p95 latency is 27.8 milliseconds (a 95% reduction). The advantage of providing multiple best configurations is that the user can pick one based on experience. For example, if an experienced DevOps engineer wants a more scalable application in case of larger traffic spikes than the ones created during the load test, the following configuration can be chosen:


- Minimum replicas: 3
- Maximum replicas: 9
- CPU utilization target: 10%
- CPU per replica: 0.849


The cost is $344/month (28% savings) while the p95 latency is 60 milliseconds (an 89% reduction). Because the CPU utilization target per replica is lower in this case, a sudden burst in traffic triggers a scale-up from the HPA early, allowing the newly added replicas to become available before the existing ones saturate.


Figure 1. Red Sky Ops experiment results


Using Red Sky Ops, we can deploy a web application using the HPA that efficiently scales and avoids overprovisioning for spikes in traffic.

We decided to tune the CPU target utilization of the HPA. This is an infrastructure-monitoring approach that the metrics-server makes available quite easily. A different approach would be to tune the HPA based on the number of requests per second on the ingress. Check out this great blog post on how to set up the external metrics server to work with the HPA.
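For reference, an ingress-based target in the autoscaling/v2beta2 API takes roughly this shape (resource names illustrative, following the example in the Kubernetes HPA walkthrough):

```yaml
metrics:
- type: Object
  object:
    metric:
      name: requests-per-second    # served by a custom metrics adapter
    describedObject:
      apiVersion: networking.k8s.io/v1beta1
      kind: Ingress
      name: main-route             # hypothetical ingress name
    target:
      type: Value
      value: 10k                   # scale so the ingress stays below 10k RPS
```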

Like the USE and RED methods for monitoring infrastructure and user experience, Red Sky Ops experiments can be written to optimize your infrastructure cost and usage, the user experience of your deployed application, or both.

Finally, Red Sky Ops allows you to efficiently find optimal configurations when the correlations are too complicated for a human to understand. For this example, we oversimplified the application to make the results easy to interpret; in production, one would tune the resources of all the deployments.

To try it for yourself, create a free Red Sky Ops account here.


*Note: From v1.18, the HPA API makes the scaling behavior configurable, allowing the user to design the scale-up and scale-down policies, for example:


    behavior:
      scaleDown:
        stabilizationWindowSeconds: 300
        policies:
        - type: Percent
          value: 100
          periodSeconds: 15
      scaleUp:
        stabilizationWindowSeconds: 0
        policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
        selectPolicy: Max


To check whether your Kubernetes cluster has the behavior field available, run `kubectl explain --api-version=autoscaling/v2beta2 hpa.spec.behavior`


***To learn more about containerized infrastructure and cloud native technologies, consider joining us at KubeCon + CloudNativeCon NA Virtual, November 17-20.

About the Author

Thibaut Perol, PhD, Lead Machine Learning Scientist at Carbon Relay


I am the Lead Machine Learning Scientist working on the optimization of applications deployed with Kubernetes. Before that, I did my PhD in Applied Mathematics at Harvard University.

Published Wednesday, November 11, 2020 7:35 AM by David Marshall