Google Cloud Platform (GCP) recently announced the beta launch of Google Cloud Recommenders, and the features and functionality are pretty exciting. The short of it is that with Recommenders in GCP, you can now:
- Automatically get analysis of usage patterns to help you determine if resources and policies within Google Cloud are optimally configured
- Automatically detect if overly permissive access policies are present, and adjust them based on access patterns of similar users in your organization
- Choose the optimal virtual machine size for your workload, because most GCP customers initially provision machines that are too small or too large.
This is all pretty awesome, but in actuality, this is old news.
The Magalix Agent (which can be found in the Google Cloud Marketplace) has been doing this since day 1. Just last week with the release of KubeAdvisor, we now enable developers and DevOps engineers to get continuously generated Recommendations specifically to save money on any cloud provider (including GCP), enact best practices for cluster configuration, and optimize resource usage with application performance.
KubeAdvisor helps you select the right VM type and the right resource requests and limits, and can even suggest how to optimize your K8s clusters based on the billing model of your cloud provider.
Just like Google Cloud, we’ve found that Kubernetes users are often significantly over- or under-provisioned, and can save as much as 80% on their cloud costs by using KubeAdvisor. In this article, we’ll learn how KubeAdvisor works, and how it can help save on cloud costs while optimizing for app performance on Kubernetes.
Step 1: Connect a Google Kubernetes Engine Cluster to Magalix
Make sure you’re logged in to Google Cloud Platform, and have a cluster enabled in Google Kubernetes Engine (GKE).
Navigate to the Magalix Agent listing in the Google Cloud Marketplace. Once there, click the blue “CONFIGURE” button and you’ll get to the Deployment screen.
Google Cloud Platform Deployment Screen
Select a cluster (don’t worry, all new Magalix accounts start on a 14 day free trial, with no credit card required), select a namespace, enter your account credentials (enter your existing password if you have a Magalix account), and click deploy. Easy peasy!
It will take just a few moments to deploy your application components - check your inbox in the meantime, and log in to the Magalix console from here on out.
The Magalix Agent starting up from within the Google Cloud Platform interface.
Step 2: Log in to The Magalix Console With Your New or Existing Account Credentials, and Check to See That Your Cluster is Connected
You should see messages about cluster analysis on your home screen dashboard. If you don’t, or if your cluster shows as not connected or suspended, you may need to validate that the agent has deployed. To do that, run kubectl get pods -n kube-system and look for a pod named something like magalix-agent-7cf556c576-w246v in the listing, as shown below.
$ kubectl get pods -n kube-system
NAME                                                        READY   STATUS    RESTARTS   AGE
event-exporter-v0.2.5-7df89f4b8f-fj77d                      2/2     Running   0          7m27s
fluentd-gcp-scaler-54ccb89d5-8r7g8                          1/1     Running   0          7m23s
fluentd-gcp-v3.1.1-p6zw4                                    2/2     Running   0          4m45s
fluentd-gcp-v3.1.1-pfmxk                                    2/2     Running   0          4m45s
fluentd-gcp-v3.1.1-vlqdb                                    2/2     Running   0          4m45s
heapster-7bb44859fc-76f2n                                   3/3     Running   0          7m26s
kube-dns-5877696fb4-6vr9r                                   4/4     Running   0          7m3s
kube-dns-5877696fb4-h4mm4                                   4/4     Running   0          7m28s
kube-dns-autoscaler-57d56b4f56-8vdnp                        1/1     Running   0          7m22s
kube-proxy-gke-flask-test-default-pool-cb20e758-bfg1        1/1     Running   0          7m17s
kube-proxy-gke-flask-test-default-pool-cb20e758-pnx8        1/1     Running   0          7m15s
kube-proxy-gke-flask-test-default-pool-cb20e758-vkf5        1/1     Running   0          7m15s
l7-default-backend-8f479dd9-ghwz9                           1/1     Running   0          7m28s
magalix-agent-7cf556c576-w246v                              1/1     Running   0          2m59s
metrics-server-v0.3.1-8d4c5db46-jf6r8                       2/2     Running   0          7m
prometheus-to-sd-vj9fv                                      1/1     Running   0          7m15s
prometheus-to-sd-zfg2x                                      1/1     Running   0          7m17s
prometheus-to-sd-zgckq                                      1/1     Running   0          7m15s
stackdriver-metadata-agent-cluster-level-6b467659d8-7qdnt   1/1     Running   0          7m27s
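If you would rather not scan the listing by eye, the same check can be scripted. The following is a minimal sketch, not part of the Magalix tooling: agent_is_running is a hypothetical helper name, and it assumes the default kubectl get pods column layout, where STATUS is the third column.

```shell
# Succeeds (exit status 0) if the pod listing on stdin contains a
# magalix-agent pod whose STATUS column reads "Running".
# Assumes the default `kubectl get pods` layout: NAME READY STATUS ...
agent_is_running() {
  awk '$1 ~ /^magalix-agent-/ && $3 == "Running" { found = 1 }
       END { exit !found }'
}

# Usage (assumes kubectl already points at your cluster):
#   kubectl get pods -n kube-system | agent_is_running && echo "agent is up"
```

Reading from stdin keeps the helper independent of the namespace the agent was deployed to.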
Step 3: Give KubeAdvisor 5 minutes to analyze your cluster and generate Recommendations
KubeAdvisor will run its models to analyze your cluster resource utilization, node configuration, container performance metrics, and cost. Give it just a few moments, and then it will display something like the following results:
Recommendations analysis, showing KubeAdvisor’s recommendations to improve Cost, Reliability, Performance, and Utilization metrics.
Step 4: Example - Selecting the right VM type and node configuration for your clusters
Clicking the Recommendations tab in the left navigation displays the entire list of KubeAdvisor’s recommendations. Here, we’ve clicked on a recommendation under cost analysis, showing how this user can save $798.04 this year on cloud costs.
Detail view of a Cost Recommendation from KubeAdvisor
Each recommendation states, in plain language:
- What the expected impact of taking action is
- What’s wrong
- What the evidence for the task is (how we know it’s an issue)
- How to resolve the issue
- Further documentation for reading up on the issue, and how it relates to Kubernetes in general
Let’s take a closer look at the table under “What’s Wrong” to understand the analysis and next steps.
Here, KubeAdvisor has analyzed the user’s current VMs, nodes, and utilization benchmarks (CPU and memory, set with resource requests and limits in K8s), and modeled the cost on an annual basis.
KubeAdvisor recommends moving from three n1-standard-1 machines to two g1-small machines, with 1 core and 1.7 GB of memory each. By changing your resource requests and limits, you can reach 85% CPU and 42% memory utilization, achieving equivalent performance with safe levels of compute and memory that keep you from being throttled.
KubeAdvisor does not yet automatically apply changes to VM selection or nodes, so users have to make those changes manually. But if you happen to enact it...
No under-provisioning, no over-provisioning - just right provisioning 👌.
Step 5: How KubeAdvisor can help you pick the right billing model given your cloud provider
Let’s revisit the table from Step 4, but this time, we’ll take a closer look at the Cost Analysis column.
While KubeAdvisor analyzes your stack on GCP, it can access billing model data, whether on-demand or preemptible, and model out the cost on an annual basis. In the screenshot above, a user on the on-demand billing model on GCP will save nearly $798 this year just by changing their VMs and cores, and by putting their CPU requests and limits (utilization) on Autopilot with Magalix. A user on the preemptible billing model will save $140. KubeAdvisor has helped the user make two choices: first, which billing model makes the most sense given their current cloud setup, and second, which VMs and cores to pick in order to minimize costs under that billing model, whichever it is.
Now let’s address these recommendations directly, step-by-step.
Enacting KubeAdvisor Recommendations on GKE
First, let’s have a look at our application running on GKE. It’s a very simple Python Flask app that displays a message whenever it receives a GET request. The message is retrieved from a configmap. The following file contains all the combined resources required to create the application on the cluster:
apiVersion: v1
kind: ConfigMap
metadata:
  name: "frontend-config"
data:
  config.cfg: MSG="Welcome to Kubernetes!"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  labels:
    app: frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: app
        image: magalixcorp/flask:cuscon
        volumeMounts:
        - name: config-vol
          mountPath: /config
      volumes:
      - name: config-vol
        configMap:
          name: frontend-config
---
apiVersion: v1
kind: Service
metadata:
  name: frontend-svc
spec:
  selector:
    app: frontend
  ports:
  - name: http
    port: 80
    targetPort: 80
    nodePort: 32000
    protocol: TCP
  type: NodePort
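Creating these resources could look like the sketch below. It assumes the definition above has been saved to a file; frontend.yaml is an assumed filename and deploy_frontend is a hypothetical helper, neither of which comes from the original setup.

```shell
# Apply the combined manifest, then block until the frontend Deployment
# has finished rolling out (or give up after two minutes).
# "frontend.yaml" is an assumed default; pass your own path as $1.
deploy_frontend() {
  manifest="${1:-frontend.yaml}"
  kubectl apply -f "$manifest" &&
    kubectl rollout status deployment/frontend --timeout=120s
}

# Usage:
#   deploy_frontend frontend.yaml
```

Waiting on the rollout status means the function only returns successfully once all three replicas are available.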
If you want to apply and test the above definition on your cluster running on GCP, don’t forget to open port 32000 on the firewall to allow NodePort access. This can be done with the following command:
gcloud compute firewall-rules create test-node-port --allow tcp:32000
Now you can access the application through any node at http://node_ip:32000, where node_ip is the node’s external IP address. You should see “Welcome to Kubernetes!” as the response.
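The lookup of a node IP and the request itself can be sketched as follows. This assumes the nodes have external IP addresses, which may not hold on private clusters; first_node_ip and fetch_frontend are hypothetical helper names.

```shell
# Print the external IP of the first node in the cluster.
# Assumption: nodes expose an address of type ExternalIP.
first_node_ip() {
  kubectl get nodes \
    -o jsonpath='{.items[0].status.addresses[?(@.type=="ExternalIP")].address}'
}

# Request the app through the NodePort (32000 in the Service definition).
fetch_frontend() {
  curl --silent --max-time 5 "http://$1:32000/"
}

# Usage:
#   fetch_frontend "$(first_node_ip)"
```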
The machine type
The advisor is recommending that we change the machine type from n1-standard-1 to g1-small to save money. However, we want to avoid application downtime.
GCP uses node pools to provision Kubernetes workers. A node pool is a group of machines with the same specs. So, in order to avoid downtime, the procedure goes as follows:
- Create a new node pool with the new machine type (g1-small)
- Mark the existing node pool as unschedulable, which prevents any new pods from being placed on it
- Drain the workloads that are running on the existing node pool
- Delete the existing node pool
With this workflow, pods are transferred gradually from the old node pool to the new one without any application downtime.
NOTE: the following procedure assumes that you’ve already enabled API access to your cluster and that you have gcloud and kubectl installed and configured to use your project and cluster. If you need help setting this up, please refer to Google Cloud Documentation.
Step 1: Create two new node pools, each containing one node of type g1-small
gcloud container node-pools create pool1 --cluster=flask-test --machine-type=g1-small --num-nodes=1 --zone us-central1-a
gcloud container node-pools create pool2 --cluster=flask-test --machine-type=g1-small --num-nodes=1 --zone us-central1-a
Notice that after running the above commands, GKE updates the master node. The process can take several minutes, during which you will not be able to issue commands against the cluster. However, the running application is not affected.
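Rather than retrying commands by hand until the cluster responds again, you can wait for the new nodes to report Ready. This is a sketch: wait_for_pool is a hypothetical helper, and the ten-minute timeout is an arbitrary choice.

```shell
# Block until every node in the given pool reports the Ready condition,
# or fail after ten minutes. GKE labels each node with its pool name.
wait_for_pool() {
  kubectl wait --for=condition=Ready nodes \
    -l "cloud.google.com/gke-nodepool=$1" --timeout=600s
}

# Usage:
#   wait_for_pool pool1 && wait_for_pool pool2
```

Note that kubectl wait returns an error if no node matches the label yet, so you may need to rerun it if the pool is still being created.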
Step 2: Avoid scheduling new pods to the old node pool
Now we need to mark the current node pool as unschedulable so that Kubernetes avoids placing new pods on it. This can be done by listing the nodes in the default pool and cordoning them:
$ kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool
NAME                                        STATUS   ROLES    AGE   VERSION
gke-flask-test-default-pool-cb20e758-bfg1   Ready    <none>   45m   v1.14.6-gke.13
gke-flask-test-default-pool-cb20e758-pnx8   Ready    <none>   45m   v1.14.6-gke.13
gke-flask-test-default-pool-cb20e758-vkf5   Ready    <none>   45m   v1.14.6-gke.13
The above command lists all the nodes labeled cloud.google.com/gke-nodepool=default-pool, that is, only the nodes in the default (current) pool.
Now, let’s cordon each of them:
$ kubectl cordon gke-flask-test-default-pool-cb20e758-bfg1
node/gke-flask-test-default-pool-cb20e758-bfg1 cordoned
$ kubectl cordon gke-flask-test-default-pool-cb20e758-pnx8
node/gke-flask-test-default-pool-cb20e758-pnx8 cordoned
$ kubectl cordon gke-flask-test-default-pool-cb20e758-vkf5
node/gke-flask-test-default-pool-cb20e758-vkf5 cordoned
We can confirm by running the following command:
$ kubectl get nodes
NAME                                        STATUS                     ROLES    AGE     VERSION
gke-flask-test-default-pool-cb20e758-bfg1   Ready,SchedulingDisabled   <none>   47m     v1.14.6-gke.13
gke-flask-test-default-pool-cb20e758-pnx8   Ready,SchedulingDisabled   <none>   47m     v1.14.6-gke.13
gke-flask-test-default-pool-cb20e758-vkf5   Ready,SchedulingDisabled   <none>   47m     v1.14.6-gke.13
gke-flask-test-pool1-725dd695-qxb3          Ready                      <none>   3m56s   v1.14.6-gke.13
gke-flask-test-pool2-81d9ceae-5x6w          Ready                      <none>   2m44s   v1.14.6-gke.13
We have five nodes, three of which cannot have new pods. Those belong to the current pool.
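Cordoning nodes one at a time works for three nodes, but a larger pool is easier to handle with a loop over the pool label. A minimal sketch, where cordon_pool is a hypothetical helper name:

```shell
# Cordon every node that carries the given GKE node pool label.
# Stops at the first node that fails to cordon.
cordon_pool() {
  for node in $(kubectl get nodes -l "cloud.google.com/gke-nodepool=$1" \
      -o jsonpath='{.items[*].metadata.name}'); do
    kubectl cordon "$node" || return 1
  done
}

# Usage:
#   cordon_pool default-pool
```

The jsonpath output is a space-separated list of bare node names, which is why the unquoted command substitution can be iterated directly.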
Step 3: Drain pods from the existing nodes
The following commands gracefully evict pods from the default pool nodes. Notice that pods must be managed by a controller that respawns them, such as a Deployment, ReplicaSet, or StatefulSet. Otherwise, you will have to run them again yourself, and application downtime can occur. Also, notice that we are giving the application a ten-second grace period to shut down cleanly. Finally, we are adding the --force option to evict some of the GCP system pods. (On kubectl 1.20 and later, --delete-local-data is deprecated in favor of --delete-emptydir-data.)
$ kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 gke-flask-test-default-pool-cb20e758-bfg1
node/gke-flask-test-default-pool-cb20e758-bfg1 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/fluentd-gcp-v3.1.1-pfmxk, kube-system/prometheus-to-sd-zfg2x
evicting pod "heapster-7bb44859fc-76f2n"
evicting pod "l7-default-backend-8f479dd9-ghwz9"
evicting pod "event-exporter-v0.2.5-7df89f4b8f-fj77d"
evicting pod "stackdriver-metadata-agent-cluster-level-6b467659d8-7qdnt"
evicting pod "kube-dns-5877696fb4-h4mm4"
evicting pod "kube-dns-autoscaler-57d56b4f56-8vdnp"
evicting pod "fluentd-gcp-scaler-54ccb89d5-8r7g8"
pod/l7-default-backend-8f479dd9-ghwz9 evicted
pod/heapster-7bb44859fc-76f2n evicted
pod/stackdriver-metadata-agent-cluster-level-6b467659d8-7qdnt evicted
pod/kube-dns-autoscaler-57d56b4f56-8vdnp evicted
pod/event-exporter-v0.2.5-7df89f4b8f-fj77d evicted
pod/fluentd-gcp-scaler-54ccb89d5-8r7g8 evicted
pod/kube-dns-5877696fb4-h4mm4 evicted
node/gke-flask-test-default-pool-cb20e758-bfg1 evicted
$ kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 gke-flask-test-default-pool-cb20e758-pnx8
node/gke-flask-test-default-pool-cb20e758-pnx8 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/fluentd-gcp-v3.1.1-p6zw4, kube-system/prometheus-to-sd-vj9fv
evicting pod "kube-dns-5877696fb4-6vr9r"
evicting pod "frontend-5989794999-flpj8"
pod/frontend-5989794999-flpj8 evicted
pod/kube-dns-5877696fb4-6vr9r evicted
node/gke-flask-test-default-pool-cb20e758-pnx8 evicted
$ kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 gke-flask-test-default-pool-cb20e758-vkf5
node/gke-flask-test-default-pool-cb20e758-vkf5 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/fluentd-gcp-v3.1.1-vlqdb, kube-system/prometheus-to-sd-zgckq
evicting pod "metrics-server-v0.3.1-8d4c5db46-jf6r8"
evicting pod "frontend-5989794999-vsw2l"
evicting pod "frontend-5989794999-g29pq"
evicting pod "magalix-agent-7cf556c576-w246v"
pod/metrics-server-v0.3.1-8d4c5db46-jf6r8 evicted
pod/frontend-5989794999-vsw2l evicted
pod/magalix-agent-7cf556c576-w246v evicted
pod/frontend-5989794999-g29pq evicted
node/gke-flask-test-default-pool-cb20e758-vkf5 evicted
Step 4: Delete the old pool
Once all the pods have been reallocated to the new pool, you can safely delete the old one to reduce costs:
$ gcloud container node-pools delete default-pool --cluster flask-test --zone us-central1-a
The following node pool will be deleted.
[default-pool] in cluster [flask-test] in [us-central1-a]
Do you want to continue (Y/n)? Y
Deleting node pool default-pool...done.
Deleted [https://container.googleapis.com/v1/projects/fakharany/zones/us-central1-a/clusters/flask-test/nodePools/default-pool]
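To confirm where the frontend pods ended up, you can count how many of them sit on nodes of a given pool. The sketch below relies on two assumptions: GKE node names embed the pool name (as in gke-flask-test-pool1-725dd695-qxb3), and kubectl get pods -o wide prints NODE as the seventh column. pods_on_pool is a hypothetical helper name.

```shell
# Count the pods in a `kubectl get pods -o wide` listing (read from stdin)
# whose NODE column names a node from the given pool.
# Assumes the default -o wide layout, where NODE is the 7th column, and
# that node names contain "-<pool>-".
pods_on_pool() {
  awk -v pool="-$1-" 'NR > 1 && index($7, pool) { n++ } END { print n + 0 }'
}

# Usage:
#   kubectl get pods -l app=frontend -o wide | pods_on_pool default-pool
```

The same check is also useful just before deleting the old pool: a count of zero for default-pool means no frontend pod is still running there.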
Once you complete these steps, in a few minutes, occasionally up to an hour, the cluster recommendation in Magalix will show that there are no longer any issues with the node sizes:
Save 80% or more on your Google Kubernetes Engine costs with Magalix KubeAdvisor
Thanks for reading! If you’re on Kubernetes and need help balancing between cost, app performance, reliability, and utilization, try Magalix KubeAdvisor today on a 14-day free trial, no credit card needed!