Kubernetes Automatic Scaling
DevOps, Containers, Kubernetes, Docker, kubernetesio, optimize, K8s, Automatic Scaling


What is Scaling?

Scaling is the practice of adapting your infrastructure to changing load conditions. When load increases, you scale up so the environment keeps responding promptly and nodes don't crash under pressure. When things cool down and there isn't much load, you scale down to optimize your costs. Scaling can be thought of in two ways:

  • Vertical scaling: increasing the resources of existing instances. For example, more memory, more CPU cores, or faster disks.
  • Horizontal scaling: adding more instances with the same hardware specs. For example, a web application might run two instances at normal times and four at busy ones.

Notice that, depending on your scenario, you can use either or both of the approaches.

However, sometimes the problem is when to scale. Traditionally, how many resources the cluster should have, or how many nodes should be spawned, were design-time decisions reached through a lot of trial and error. Once the application was launched, a human operator would watch the different metrics, particularly CPU, to decide whether a scaling action was required. With the advent of cloud computing, scaling became as easy as a mouse click or a command, but it still had to be done manually. Kubernetes can scale up or down automatically based on CPU utilization, as well as other custom application metrics that you define. In this article, we will discuss how you can optimize your application for autoscaling using the Horizontal Pod Autoscaler, and how you can use Kubernetes on a cloud provider to increase the number of worker nodes when necessary.

How Horizontal Pod Autoscaling (HPA) Works

Controllers like Deployments and ReplicaSets let you run more than one replica of the Pods they manage. This number can be adjusted automatically by the horizontal pod autoscaler controller, which you enable by creating a HorizontalPodAutoscaler (HPA) resource. Like other controllers, the HPA periodically checks the Pod metrics and the current number of replicas. If more (or fewer) Pods are needed, it changes the replica count on the target controller (a Deployment, ReplicaSet, or StatefulSet). Let's discuss this operation in a little more detail.
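For orientation, here is a minimal HorizontalPodAutoscaler manifest. This is a sketch that assumes your cluster serves the autoscaling/v2 API (older clusters expose autoscaling/v1 or v2beta2 instead), targeting a Deployment named frontend:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend
spec:
  # The controller whose replica count the HPA manages
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # keep average CPU utilization around 50% of requests
```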

Gathering The Metrics

To obtain the necessary metrics, the HPA does not contact each node and query it for data. Instead, an agent called cAdvisor, embedded in the kubelet, runs on each node and collects container and node metrics, which are then aggregated by a metrics layer. The HPA sends REST API calls to this aggregation layer to retrieve the metrics it needs.

Previously, the aggregation layer was implemented by the Heapster daemon. Heapster has since been deprecated and, as of Kubernetes 1.13, removed in favor of the metrics APIs served by components such as the metrics server. Hence, we refer to this role generically as the “aggregation layer”: any current or future service that handles metrics aggregation fits into this workflow. The following diagram depicts the metrics-gathering process:





Determining The Number Of Pods To Change

Once the HPA obtains the metrics, it must calculate how many replicas to add or remove so that the measured load remains as close as possible to the target value. The calculation uses the following formula:

Desired Replicas = ceil( Total Current Metric Value / Target Metric Value )



Let’s have a quick example to demonstrate:

If we have a single pod that has a CPU utilization of 90% while the target value is 60%, then the calculation goes like

90 / 60 = 1.5 ≅ 2



If we have two pods, one with 90% utilization and the other with 80%, the calculation goes as follows:

(80 + 90) / 60 = 2.83 ≅ 3



Notice that the resulting value is always rounded up to the nearest integer (a ceiling operation). It represents the number of replicas that the target controller needs to have.

The actual calculation is slightly more complicated, as more considerations come into play. For example, the controller must account for rapidly fluctuating metric values, which could otherwise lead to incorrect scaling decisions.

More often than not, there are multiple metrics the HPA needs to evaluate, for example, CPU plus other custom metrics. In this case, the calculation is performed against each individual metric, and the largest of the results is selected as the proposed number of replicas.
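The calculation above can be sketched in a few lines of Python. This is a simplified model for illustration only; the real controller also applies a tolerance band and damping logic:

```python
import math

def desired_replicas(current_values, target_value):
    """Single metric: ceil(sum of current metric values / target value)."""
    return math.ceil(sum(current_values) / target_value)

def desired_replicas_multi(metrics):
    """Multiple metrics: the largest individual result wins.
    `metrics` is a list of (current_values, target_value) pairs."""
    return max(desired_replicas(values, target) for values, target in metrics)

# The examples from the article:
print(desired_replicas([90], 60))      # one pod at 90% vs a 60% target -> 2
print(desired_replicas([80, 90], 60))  # two pods: ceil(170 / 60) -> 3
```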

Updating The Target Controller

The final step in the process is to update the replicas field of the target controller. Once that field changes, the controller handles spawning or killing Pods to conform to the desired replica count. The HPA is not (and should not be) aware of the type of the target controller it is working with. Hence, each controller exposes a unified endpoint called the Scale subresource, through which it can be scaled via an HTTP request. This ensures that the HPA performs its task in a controller-agnostic manner. Currently, the following controllers expose the Scale subresource:

  • Deployments
  • ReplicaSets
  • StatefulSets

In addition, you can configure your own custom resources to expose the Scale subresource. The whole process is depicted in the figure below:
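As a sketch of what that looks like (the CronTab resource here is hypothetical, borrowed from the Kubernetes documentation's standard example), a CustomResourceDefinition opts in to the Scale subresource like this:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: crontabs.stable.example.com
spec:
  group: stable.example.com
  names:
    kind: CronTab
    plural: crontabs
    singular: crontab
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              replicas:
                type: integer
          status:
            type: object
            properties:
              replicas:
                type: integer
    subresources:
      # Tells the API server where to read/write the replica count,
      # which is what makes the resource scalable by the HPA.
      scale:
        specReplicasPath: .spec.replicas
        statusReplicasPath: .status.replicas
```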




LAB: Creating An HPA To Respond To High CPU Utilization

Assume you have a web application that uses a number of Pods to serve its content, and you need it to respond normally to client requests at all times. You create an HPA that scales the application out when CPU load rises past a predefined threshold. But which CPU measurement does the HPA consider? CPU can be measured at various levels: the overall consumption on the node, the Pod's hard limit, and the Pod's request (for more information about the importance of setting requests and limits on your Pods, please refer to our article Kubernetes Patterns: Capacity Planning). When you set up an HPA, it measures utilization as a percentage of the Pod's CPU request.

Our demo application is a Python Flask API that displays a welcome message when its URL is hit. We'll create a Deployment and a Service for it. The definition file looks as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  labels:
    app: frontend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: app
        image: magalixcorp/flaskapp
        resources:
          requests:
            cpu: 50m
---
apiVersion: v1
kind: Service
metadata:
  name: flask-svc
spec:
  selector:
    app: frontend
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: LoadBalancer


This lab was applied and tested on Google Cloud Platform (GCP). However, the same procedure applies equally to any Kubernetes cluster.

The definition contains a deployment and a service. Apply the above definition to your cluster:

$ kubectl apply -f deployment.yml                                                                                                                                           
deployment.apps/frontend created
service/flask-svc created

This is a cluster that is hosted on a cloud provider, so we can access our application through the load balancer that is offered by the vendor:

$ kubectl get svc -o wide                                                                                                                                                   
NAME         TYPE           CLUSTER-IP    EXTERNAL-IP      PORT(S)        AGE     SELECTOR
flask-svc    LoadBalancer   80:32369/TCP   5m40s   app=frontend
kubernetes   ClusterIP                443/TCP        39m     

We can test our application by hitting the IP address (EXTERNAL-IP) of the load balancer:

$ curl http://<EXTERNAL-IP>
Welcome to Flask

The Deployment runs two replicas of the application, and we set the CPU request for those Pods to 50 millicores. So far, there is no autoscaling in place; we need to create an HPA. You can create the HPA through the command line as follows:

$ kubectl autoscale deployment frontend --cpu-percent=50 --min=1 --max=5

The command applies autoscaling to the target Deployment ‘frontend’. We want CPU utilization to stay around 50%: if load increases and utilization rises above 50%, more Pods are spawned, up to a maximum of five. If load falls below the threshold, the Deployment is scaled down, to as few as one Pod.
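The kubectl autoscale command is shorthand for creating an HPA object. A roughly equivalent declarative manifest, using the stable autoscaling/v1 API, would look like this:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 50
```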

We can confirm that the HPA is doing its job by running the following command:

$ kubectl get hpa
NAME       REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
frontend   Deployment/frontend   2%/50%    1         5         2          66s

The TARGETS column shows the current CPU utilization against the target threshold, while MINPODS and MAXPODS are the replica bounds within which the Autoscaler operates.

Our application is not receiving any hits, which means little or no load at all. Hence, the HPA scales the Deployment down to the specified minimum (notice that it may take a few minutes for the HPA to scale up or down):

$ kubectl get pods                                                                                                                                                          
NAME                        READY   STATUS    RESTARTS   AGE
frontend-6b488ff45b-2q955   1/1     Running   0          93m
frontend-6b488ff45b-pltld   1/1     Running   0          93m
$ kubectl get pods                                                                                                                                                          
NAME                        READY   STATUS    RESTARTS   AGE
frontend-6b488ff45b-pltld   1/1     Running   0          94m

Side Note: You Should Not Target The ReplicaSet

Notice that the Deployment controller creates a child ReplicaSet behind the scenes. If you specify the ReplicaSet as the target, the HPA ends up bound to that specific ReplicaSet: the next time the Deployment rolls out a change, it creates a new ReplicaSet (and may delete the old one), leaving the Autoscaler monitoring an object that no longer serves traffic.

Back to our lab, we need to watch the Deployment and the HPA while we are stress-testing it. Open a terminal window and run the following command:

$ watch -n 1 kubectl get hpa,deployment

Note: if you’re running this on macOS, watch is not installed by default. You’ll need to install it (for example, with brew install watch) or run the above command in a loop.

Open another terminal window and run the following command:

while true
do
  curl -s http://<EXTERNAL-IP> > /dev/null
done &

This is a while loop that hammers our load balancer with HTTP requests; curl directs its output to /dev/null so as not to pollute the terminal. Notice that you may need to run this command several times so that enough requests hit the Service in a short period to force CPU utilization to rise.

Now have a look at the first terminal window. After a short while, you’ll notice the TARGETS column changing, for example from 30%/50% to 48%/50% and then 100%/50%. Don’t forget that we’re currently running only one Pod.

In a few minutes, you’ll notice that the Deployment has spawned an extra Pod, then a second, perhaps a third (depending on how many times you executed the load-stress command). In my case, the eventual output looked like this:

Every 1.0s: kubectl get hpa,deployment                                                                                                                                   

NAME                                           REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/frontend   Deployment/frontend   30%/50%   1         5         3          66m

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.extensions/frontend   3/3     3            3           156m

As soon as the load dropped to 30%, the HPA stopped spawning new Pods. We are now within the acceptable limit. Let’s terminate all our load-stress commands and let the HPA scale down the Pods back to 1:

ps -ef | grep curl | grep -v grep | awk '{print $3}' | xargs kill

Notice that the above command was tested in a zsh shell on macOS. Your shell and environment may differ.

Within a few moments, you will notice the load drop from 30% to 2%. Accordingly, the HPA decreases the number of Pods back to 1.

A Few Points To Note On The Lab

  • Even if CPU spikes to 100% or more, the Autoscaler limits how many Pods it adds in a single run: it will at most double the number of existing replicas.
  • The Autoscaler also limits how frequently it scales. By default, it waits 3 minutes between successive scale-up operations and 5 minutes before scaling down.
  • The HPA is just another Kubernetes resource, so you can edit it after deployment using the kubectl edit subcommand. For example: kubectl edit hpa frontend

Using Other Metrics For Automatic Scaling

While CPU is an important indication of how much load an application is experiencing, it may not be the only signal for whether to scale out. Other resource metrics, such as memory, or even user-defined metrics, may be involved.

Making a scaling decision based on memory consumption is much more complex than basing it on CPU. The reason is that memory is an incompressible resource: unlike CPU, it cannot be throttled. Once an application instance has allocated memory, it must somehow be made to release it; otherwise, new Pods keep getting spawned until the upper limit is reached while per-Pod memory usage stays high. Whether scaling out actually relieves memory pressure depends largely on the application's design.
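If you do decide to scale on memory, a sketch (assuming the autoscaling/v2 API and a metrics server that reports memory usage) expresses the target as an average value per Pod:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-memory
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 100Mi   # scale out when average memory per Pod exceeds this
```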

If you need to scale based on metrics other than CPU, memory, or disk I/O, you can use one of the following metric types:

Custom metrics: you can use the Pod or the Object metric types. Let’s have a brief overview of both:

  • Pod metrics: metrics directly related to the Pod. For example, the number of key-value pairs if the Pod hosts a Redis instance, or the number of messages if it runs a message-queue consumer.
  • Object metrics: use these when you need the Autoscaler to act on a metric obtained from a source other than the Pods themselves. For example, a component may measure the load time of a number of important web pages in your app, and you want the HPA to spawn new Pods if that latency crosses a certain level.

Notice that to implement custom metrics, you’ll need to extend the API server. Custom metrics are served under the custom.metrics.k8s.io API path, which requires a custom metrics adapter such as those for Prometheus, Stackdriver, or Datadog.
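A Pods-type metric in an autoscaling/v2 HPA might look like the sketch below. The metric name http_requests_per_second is hypothetical and assumes a custom metrics adapter exposing it:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-rps
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # hypothetical, served by a metrics adapter
      target:
        type: AverageValue
        averageValue: "100"
```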

External metrics: sometimes you want the autoscaling decision to be based on a metric that originates outside the cluster. For example, if your Pods consume messages from a remote message-queueing system, you may want to scale the number of consumer Pods with the total number of queued messages. Again, you’ll need an external adapter that extends the API server to enable this functionality.
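An External-type metric follows the same pattern; in this sketch, both the metric name queue_messages_ready and the queue label are hypothetical and assume an adapter bridging your queueing system into the external metrics API:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-consumers
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: consumer
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: queue_messages_ready      # hypothetical external metric
        selector:
          matchLabels:
            queue: orders               # hypothetical label selecting one queue
      target:
        type: AverageValue
        averageValue: "30"              # aim for ~30 queued messages per Pod
```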


  • Scaling is an important operational practice that was done manually for a long time. With the introduction of the HorizontalPodAutoscaler (HPA), Kubernetes can make intelligent scaling decisions automatically.
  • There are two types of scaling: horizontal scaling, which increases the number of Pods serving the application, and vertical scaling, which expands the resources available to each Pod.
  • HPA is easy to apply to resources that can be throttled, like CPU. For non-throttleable resources like memory, the process is more complex: the application must be designed so that spawning more Pods actually relieves memory pressure on existing ones; otherwise, the HPA keeps spawning Pods until the upper limit is reached.
  • You can use custom metrics with the HPA. These metrics can be Pod-related, object-based, or external. In all cases, custom metrics require extending the API server with a third-party plugin or adapter.
  • Vertical Pod Autoscaling is still experimental at the time of this writing. When implemented, it can modify the requests and limits of Pods based on their consumption levels. However, it needs more consideration than HPA, because changing requests and limits requires deleting and recreating Pods. Additionally, the application must be carefully designed to utilize the new resources; for example, the JVM caps its minimum and maximum memory through command-line arguments, regardless of what the Pod is given.
