How to Save Up To 80% on Google Kubernetes Engine Using Magalix KubeAdvisor

 

 

Google Cloud Platform (GCP) recently announced the beta launch of Google Cloud Recommenders, and the features and functionality are pretty exciting. The short of it is that with Recommenders in GCP, you can now:

 

  • Automatically get analysis of usage patterns to help you determine if resources and policies within Google Cloud are optimally configured
  • Automatically detect if overly permissive access policies are present, and adjust them based on access patterns of similar users in your organization
  • Choose the optimal virtual machine size for your workload, because most GCP customers initially provision machines that are too small or too large.

 

This is all pretty awesome, but in actuality, this is old news.

 

The Magalix Agent (which can be found in the Google Cloud Marketplace) has been doing this since day 1. And with last week’s release of KubeAdvisor, developers and DevOps engineers can now get continuously generated Recommendations to save money on any cloud provider (including GCP), enact best practices for cluster configuration, and balance resource usage against application performance.

 

KubeAdvisor helps you select the right VM type, the right capacity limits and resources, and can even suggest how to optimize your K8s clusters based on the billing model of your cloud provider.

 

Just like Google Cloud, we’ve found that Kubernetes users are often heavily over- or under-provisioned, and can save as much as 80% on their cloud costs by using KubeAdvisor. In this article, we’ll learn how KubeAdvisor works, and how it can help save on cloud costs while optimizing for app performance on Kubernetes.

 

Step 1: Connect a Google Kubernetes Engine Cluster to Magalix

Make sure you’re logged in to Google Cloud Platform, and have a cluster enabled in Google Kubernetes Engine (GKE).

 

Navigate to the Magalix Agent listing in the Google Cloud Marketplace. Once there, click the blue “CONFIGURE” button and you’ll get to the Deployment screen.

 

Google Cloud Platform Deployment Screen

 

 

Select a cluster (don’t worry, all new Magalix accounts start on a 14-day free trial, with no credit card required), select a namespace, enter your account credentials (enter your existing password if you have a Magalix account), and click deploy. Easy peasy!

 

It will take just a few moments to deploy your application components - check your inbox in the meantime, and log in to the Magalix console from here on out.

 

 

The Magalix Agent starting up from within the Google Cloud Platform interface.

 

Step 2: Log in to The Magalix Console With Your New or Existing Account Credentials, and Check to See That Your Cluster is Connected

You should see messages about cluster analysis in your home screen dashboard. If you don’t, or if your cluster is showing as not connected or suspended, you may need to validate that the agent has deployed. You can do that by running kubectl get pods -n kube-system and, as shown below, looking for a pod named something like magalix-agent-7cf556c576-w246v in the list.

$ kubectl get pods -n kube-system                                                                                                           
NAME                                                        READY   STATUS    RESTARTS   AGE
event-exporter-v0.2.5-7df89f4b8f-fj77d                      2/2     Running   0          7m27s
fluentd-gcp-scaler-54ccb89d5-8r7g8                          1/1     Running   0          7m23s
fluentd-gcp-v3.1.1-p6zw4                                    2/2     Running   0          4m45s
fluentd-gcp-v3.1.1-pfmxk                                    2/2     Running   0          4m45s
fluentd-gcp-v3.1.1-vlqdb                                    2/2     Running   0          4m45s
heapster-7bb44859fc-76f2n                                   3/3     Running   0          7m26s
kube-dns-5877696fb4-6vr9r                                   4/4     Running   0          7m3s
kube-dns-5877696fb4-h4mm4                                   4/4     Running   0          7m28s
kube-dns-autoscaler-57d56b4f56-8vdnp                        1/1     Running   0          7m22s
kube-proxy-gke-flask-test-default-pool-cb20e758-bfg1        1/1     Running   0          7m17s
kube-proxy-gke-flask-test-default-pool-cb20e758-pnx8        1/1     Running   0          7m15s
kube-proxy-gke-flask-test-default-pool-cb20e758-vkf5        1/1     Running   0          7m15s
l7-default-backend-8f479dd9-ghwz9                           1/1     Running   0          7m28s
magalix-agent-7cf556c576-w246v                              1/1     Running   0          2m59s
metrics-server-v0.3.1-8d4c5db46-jf6r8                       2/2     Running   0          7m
prometheus-to-sd-vj9fv                                      1/1     Running   0          7m15s
prometheus-to-sd-zfg2x                                      1/1     Running   0          7m17s
prometheus-to-sd-zgckq                                      1/1     Running   0          7m15s
stackdriver-metadata-agent-cluster-level-6b467659d8-7qdnt   1/1     Running   0          7m27s
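
The same check can be scripted rather than read by eye. This is a minimal sketch, assuming the agent pod name starts with magalix-agent as in the listing above; the function name magalix_agent_status is hypothetical and not part of any Magalix tooling:

```shell
# Sketch: print the name and STATUS of the Magalix agent pod in kube-system.
# Assumption: the pod name starts with "magalix-agent", as in the listing above.
# Returns non-zero if no such pod is listed.
magalix_agent_status() {
  kubectl get pods -n kube-system --no-headers |
    awk '$1 ~ /^magalix-agent/ { print $1, $3; found = 1 }
         END { if (!found) { print "magalix-agent not found"; exit 1 } }'
}
```

Calling magalix_agent_status against the cluster above should print the agent pod name followed by its status, and return a non-zero exit code if the agent was never deployed.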

 

Step 3: Give KubeAdvisor 5 minutes to analyze your cluster and generate Recommendations

 


 

KubeAdvisor will run its models to analyze your cluster resource utilization, node configuration, container performance metrics, and cost. Give it just a few moments, and then it will display something like the following results:

 

 

 

Recommendations analysis, showing KubeAdvisor’s recommendations to improve Cost, Reliability, Performance, and Utilization metrics.

 

Step 4: Example - Selecting the right VM type and node configuration for your clusters

Clicking the Recommendations tab in the left navigation displays the entire list of KubeAdvisor’s recommendations. Here, we’ve clicked on a recommendation under cost analysis, showing how this user can save $798.04 this year on cloud costs.

 

 

Detail view of a Cost Recommendation from KubeAdvisor

 

 

 

 

Each recommendation states, in plain language:

 

  • What the expected impact of taking action is
  • What’s wrong
  • What the evidence for the task is (how we know it’s an issue)
  • How to resolve the issue
  • Further documentation for reading up on the issue, and how it relates to Kubernetes in general

 

Let’s take a closer look at the table under “What’s Wrong” to understand the analysis and next steps.

 

 

 

 

Here, KubeAdvisor has analyzed the user’s current VMs, nodes, and utilization benchmarks (CPU and memory, set with resource requests and limits in K8s), and modeled the cost on an annual basis.

 

KubeAdvisor has recommended moving from 3x n1-standard-1 machines to 2x g1-small, with 1 core and 1.7GB each. By changing your resource requests and limits, you can reach 85% CPU and 42% memory utilization, achieving equivalent performance with safe levels of compute and memory that keep you from being throttled.
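
To see how much raw capacity the recommendation removes, the two configurations can be totalled side by side. This is a rough sketch only: the g1-small figures (1 core, 1.7GB) come from the text above, while the n1-standard-1 figures (1 vCPU, 3.75GB) are an assumption taken from GCP’s published machine types, and pool_capacity is a hypothetical helper name:

```shell
# Sketch: total vCPUs and memory for COUNT identical nodes.
# Usage: pool_capacity COUNT VCPUS_PER_NODE MEM_GB_PER_NODE
# The per-node figures passed in are assumptions supplied by the caller.
pool_capacity() {
  awk -v n="$1" -v cpu="$2" -v mem="$3" \
    'BEGIN { printf "%d vCPU, %.2f GB\n", n * cpu, n * mem }'
}
```

With those inputs, pool_capacity 3 1 3.75 prints "3 vCPU, 11.25 GB" for the current pool and pool_capacity 2 1 1.7 prints "2 vCPU, 3.40 GB" for the recommended one.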

 

KubeAdvisor does not yet automatically apply changes to VM selection or nodes; users have to change those manually. But if you happen to enact it...

 

No under-provisioning, no over-provisioning - just right provisioning 👌.

 


Step 5: How KubeAdvisor can help you pick the right billing model given your cloud provider

 

Let’s revisit the table from Step 4, but this time, we’ll take a closer look at the Cost Analysis column.

 

 

 

While KubeAdvisor analyzes your stack on GCP, it can access billing model data, whether on-demand or preemptible, and model out the cost on an annual basis. Here in the screenshot above, if the user is on an on-demand billing model on GCP, they will save nearly $798 this year just by changing their VMs and cores, and putting their CPU limits and resources (utilization) on Autopilot with Magalix. If the user is on the preemptible billing model, they’ll save $140. Here, KubeAdvisor has helped the user make two choices - first, which billing model makes the most sense given their current cloud setup, and second, which VMs and cores to pick in order to minimize the costs in the billing model, whatever it is.

 

Now let’s address these recommendations directly, step-by-step.

Enacting KubeAdvisor Recommendations on GKE

Our application

First, let’s have a look at our application running on GKE. It’s a very simple Python Flask app that displays a message whenever it receives a GET request. The message is retrieved from a configmap. The following file contains all the combined resources required to create the application on the cluster:

apiVersion: v1
kind: ConfigMap
metadata:
  name: "frontend-config"
data:
  config.cfg: |
    MSG="Welcome to Kubernetes!"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  labels:
    app: frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: app
        image: magalixcorp/flask:cuscon
        volumeMounts:
        - name: config-vol
          mountPath: /config
      volumes:
      - name: config-vol
        configMap:
          name: frontend-config
---
apiVersion: v1
kind: Service
metadata:
  name: frontend-svc
spec:
  selector:
    app: frontend
  ports:
  - name: http
    port: 80
    targetPort: 80
    nodePort: 32000
    protocol: TCP
  type: NodePort

If you want to apply and test the above definition on your cluster running on GCP, don’t forget to open port 32000 on the firewall to allow NodePort access. This can be done with the following command:

gcloud compute firewall-rules create test-node-port --allow tcp:32000

Now you can access the application through any node as follows: http://node_ip:32000. You should see “Welcome to Kubernetes” as the response.
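
If you don’t have a node IP at hand, it can be looked up with kubectl. The following is a minimal sketch, assuming your nodes have external IP addresses and that the firewall rule above is in place; the helper names first_node_ip and check_frontend are hypothetical:

```shell
# Sketch: print the external IP of the first node in the cluster.
# Assumption: nodes expose an ExternalIP address (the GKE default for public clusters).
first_node_ip() {
  kubectl get nodes \
    -o jsonpath='{.items[0].status.addresses[?(@.type=="ExternalIP")].address}'
}

# Sketch: fetch the app through NodePort 32000 on that node.
check_frontend() {
  curl -s "http://$(first_node_ip):32000/"
}
```

Running check_frontend should return the same welcome message you would see in the browser.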

The machine type

The advisor is recommending that we change the machine type from n1-standard-1 to g1-small to save money. However, we want to avoid application downtime.

 

GCP uses node pools to provision Kubernetes workers. A node pool is a group of machines with the same specs. So, to avoid downtime, the procedure goes as follows:

 

  • Create new node pools with the new machine type (g1-small)
  • Mark the existing node pool as unschedulable. This prevents any new pods from being placed on this node pool
  • Drain the workloads that are running on the first node pool
  • Delete the first node pool

 

With this workflow, pods move gradually from the old node pool to the new ones without any application downtime.

 

** NOTE: the following procedure assumes that you’ve already enabled API access to your cluster and that you have gcloud and kubectl installed and configured to use your project and cluster. If you need help setting this up, please refer to Google Cloud Documentation.

Step 1: Create two new node pools, each containing one node of type g1-small

gcloud container node-pools create pool1 --cluster=flask-test --machine-type=g1-small   --num-nodes=1 --zone us-central1-a
gcloud container node-pools create pool2 --cluster=flask-test --machine-type=g1-small   --num-nodes=1 --zone us-central1-a

 

Notice that after running the above commands, GKE updates the master node. The process can take several minutes, during which any commands you issue against the cluster will be rejected. However, the running application is not affected.
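
Once the cluster accepts commands again, it can help to wait until the new nodes are Ready before touching the old pool. This is a minimal sketch using kubectl wait; the function name wait_for_pool and the 300-second timeout are assumptions, not part of the original procedure:

```shell
# Sketch: block until every node in the given GKE node pool reports Ready.
# Assumption: 300s is an arbitrary timeout; adjust it to your environment.
wait_for_pool() {
  kubectl wait --for=condition=Ready nodes \
    -l "cloud.google.com/gke-nodepool=$1" --timeout=300s
}
```

For the pools created above, that would be wait_for_pool pool1 followed by wait_for_pool pool2.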

Step 2: Avoid scheduling new pods to the node pool

Now we need to mark the current node pool as unschedulable so that Kubernetes avoids placing new pods on it. This can be done by listing the nodes in the default pool and cordoning them:

 

$ kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool
NAME                                        STATUS   ROLES    AGE   VERSION
gke-flask-test-default-pool-cb20e758-bfg1   Ready    <none>   45m   v1.14.6-gke.13
gke-flask-test-default-pool-cb20e758-pnx8   Ready    <none>   45m   v1.14.6-gke.13
gke-flask-test-default-pool-cb20e758-vkf5   Ready    <none>   45m   v1.14.6-gke.13

 

The above command lists all the nodes labeled cloud.google.com/gke-nodepool=default-pool. That is, only the nodes in the default (current) pool.

 

Now, let’s cordon each of them:

 

$ kubectl cordon gke-flask-test-default-pool-cb20e758-bfg1                                                                                    
node/gke-flask-test-default-pool-cb20e758-bfg1 cordoned
$ kubectl cordon gke-flask-test-default-pool-cb20e758-pnx8                                                                                    
node/gke-flask-test-default-pool-cb20e758-pnx8 cordoned
$ kubectl cordon gke-flask-test-default-pool-cb20e758-vkf5                                                                                    
node/gke-flask-test-default-pool-cb20e758-vkf5 cordoned
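
With more than a handful of nodes, cordoning them one by one gets tedious. The list and cordon steps can be combined in a loop; this is a minimal sketch, and the function name cordon_pool is hypothetical:

```shell
# Sketch: cordon every node in the given GKE node pool.
# "-o name" prints each node as node/<name>, which kubectl cordon accepts.
cordon_pool() {
  pool="$1"
  for node in $(kubectl get nodes -l "cloud.google.com/gke-nodepool=$pool" -o name); do
    kubectl cordon "$node"
  done
}
```

Here, cordon_pool default-pool would cordon the same three nodes as the commands above.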

 

We can confirm by running the following command:

 

$ kubectl get nodes
NAME                                        STATUS                     ROLES    AGE     VERSION
gke-flask-test-default-pool-cb20e758-bfg1   Ready,SchedulingDisabled   <none>   47m     v1.14.6-gke.13
gke-flask-test-default-pool-cb20e758-pnx8   Ready,SchedulingDisabled   <none>   47m     v1.14.6-gke.13
gke-flask-test-default-pool-cb20e758-vkf5   Ready,SchedulingDisabled   <none>   47m     v1.14.6-gke.13
gke-flask-test-pool1-725dd695-qxb3          Ready                      <none>   3m56s   v1.14.6-gke.13
gke-flask-test-pool2-81d9ceae-5x6w          Ready                      <none>   2m44s   v1.14.6-gke.13

 

We have five nodes, three of which cannot accept new pods. Those belong to the current pool.

Step 3: Drain pods from the existing nodes

The following commands gracefully evict pods from the default pool nodes. Notice that pods must be managed by a controller that respawns them, like a Deployment, ReplicaSet, StatefulSet, etc. Otherwise, you will have to run them again yourself, and application downtime can occur. Also, notice that we are giving a ten-second grace period for the application to shut down cleanly. Finally, we are adding the --force option to evict some of the GCP system pods:

 

$ kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 gke-flask-test-default-pool-cb20e758-bfg1
node/gke-flask-test-default-pool-cb20e758-bfg1 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/fluentd-gcp-v3.1.1-pfmxk, kube-system/prometheus-to-sd-zfg2x
evicting pod "heapster-7bb44859fc-76f2n"
evicting pod "l7-default-backend-8f479dd9-ghwz9"
evicting pod "event-exporter-v0.2.5-7df89f4b8f-fj77d"
evicting pod "stackdriver-metadata-agent-cluster-level-6b467659d8-7qdnt"
evicting pod "kube-dns-5877696fb4-h4mm4"
evicting pod "kube-dns-autoscaler-57d56b4f56-8vdnp"
evicting pod "fluentd-gcp-scaler-54ccb89d5-8r7g8"
pod/l7-default-backend-8f479dd9-ghwz9 evicted
pod/heapster-7bb44859fc-76f2n evicted
pod/stackdriver-metadata-agent-cluster-level-6b467659d8-7qdnt evicted
pod/kube-dns-autoscaler-57d56b4f56-8vdnp evicted
pod/event-exporter-v0.2.5-7df89f4b8f-fj77d evicted
pod/fluentd-gcp-scaler-54ccb89d5-8r7g8 evicted
pod/kube-dns-5877696fb4-h4mm4 evicted
node/gke-flask-test-default-pool-cb20e758-bfg1 evicted
$ kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 gke-flask-test-default-pool-cb20e758-pnx8
node/gke-flask-test-default-pool-cb20e758-pnx8 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/fluentd-gcp-v3.1.1-p6zw4, kube-system/prometheus-to-sd-vj9fv
evicting pod "kube-dns-5877696fb4-6vr9r"
evicting pod "frontend-5989794999-flpj8"
pod/frontend-5989794999-flpj8 evicted
pod/kube-dns-5877696fb4-6vr9r evicted
node/gke-flask-test-default-pool-cb20e758-pnx8 evicted
$ kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 gke-flask-test-default-pool-cb20e758-vkf5
node/gke-flask-test-default-pool-cb20e758-vkf5 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/fluentd-gcp-v3.1.1-vlqdb, kube-system/prometheus-to-sd-zgckq
evicting pod "metrics-server-v0.3.1-8d4c5db46-jf6r8"
evicting pod "frontend-5989794999-vsw2l"
evicting pod "frontend-5989794999-g29pq"
evicting pod "magalix-agent-7cf556c576-w246v"
pod/metrics-server-v0.3.1-8d4c5db46-jf6r8 evicted
pod/frontend-5989794999-vsw2l evicted
pod/magalix-agent-7cf556c576-w246v evicted
pod/frontend-5989794999-g29pq evicted
node/gke-flask-test-default-pool-cb20e758-vkf5 evicted
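
The drain commands can be looped over the pool in the same way, and afterwards you can check where the frontend pods landed. This is a minimal sketch with hypothetical function names; it reuses the flags from the commands above, and note that newer kubectl releases rename --delete-local-data to --delete-emptydir-data:

```shell
# Sketch: drain every node in the given GKE node pool, one at a time,
# with the same flags as the manual commands above.
drain_pool() {
  pool="$1"
  for node in $(kubectl get nodes -l "cloud.google.com/gke-nodepool=$pool" -o name); do
    kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 "$node"
  done
}

# Sketch: print each frontend pod with the node it is now scheduled on.
# Assumption: the pods carry the app=frontend label from the manifest above.
frontend_pod_nodes() {
  kubectl get pods -l app=frontend \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}'
}
```

After drain_pool default-pool completes, frontend_pod_nodes should list only nodes from pool1 and pool2.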

 

Step 4: Delete the old pool

Once all the pods have been reallocated to the new pool, you can safely delete the old one to reduce costs:

 

$ gcloud container node-pools delete default-pool --cluster flask-test --zone us-central1-a                                                   
The following node pool will be deleted.
[default-pool] in cluster [flask-test] in [us-central1-a]

Do you want to continue (Y/n)?  Y

Deleting node pool default-pool...done.
Deleted [https://container.googleapis.com/v1/projects/fakharany/zones/us-central1-a/clusters/flask-test/nodePools/default-pool]

 

Once you complete these steps, the cluster recommendation in Magalix will show that there are no longer any issues with the node sizes. This usually takes a few minutes, and occasionally up to an hour:

 

 

 

Save 80% or more on your Google Kubernetes Engine costs with Magalix KubeAdvisor

 

Thanks for reading! If you’re on Kubernetes and need help balancing between cost, app performance, reliability, and utilization, try Magalix KubeAdvisor today on a 14-day free trial, no credit card needed!

Mohamed Ahmed

Oct 30, 2019