Kubernetes capacity management is a core competency for teams shipping cloud-native applications. Proper capacity management delivers a great customer experience, lets teams innovate faster, and maximizes the return on your cloud infrastructure investment. Capacity management, however, is a challenge for many teams for three main reasons:
- Capacity management is impacted by many moving parts, such as user workloads, application architecture, and the underlying cloud infrastructure.
- You need the engagement of multiple team members to balance performance, resource usage, and the cost of running cloud-native applications.
- It is hard to build a common picture of what effective Kubernetes and application capacity management looks like.
Developers, DevOps engineers, and engineering managers are the three roles that most directly impact the effectiveness of capacity management. Getting them to agree on effective capacity management is a challenge. Each role has its own motivations to get their job done, and those requirements can conflict. For example, developers are motivated to ship features quickly; they have little time to analyze the resources they need or to improve the efficiency of their code. Let’s dig deeper into the motivations of each role.
Developers ship features and fix bugs. In my experience watching teams adopt Kubernetes, I’ve seen developers motivated by factors like these:
- I want our containers’ CI/CD pipeline to be fast and reliable. For example:
- In a few minutes, my code is deployed in our Kubernetes cluster.
- If my deployment fails, the system is still functional on the previous version.
- I can get a detailed and meaningful report about why my deployment failed.
- Our Kubernetes cluster is resilient enough to recover from any transient failures or resource issues. For example:
- I don’t have to tweak resource requests and limits too frequently.
- I don’t need to go through CPU and memory budgeting exercises.
- VM failures are expected; I want our cluster to recover quickly, before alarms go off.
- Our observability pipeline (Prometheus + Grafana) provides all the metrics I need to diagnose issues. For example:
- I can correlate user experience with the performance of my containers or microservices.
- I can see what’s taking place at the infrastructure level.
- I’m always improving how I ship cloud-native applications or services. For example:
- I can see, from one release to the next, improvements in performance and resource utilization.
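Several of the motivations above, such as not tweaking requests and limits too frequently and surviving transient failures, come down to declaring resources once in the workload manifest. Here is a minimal sketch of a Deployment; the service name, image, and values are illustrative, not from the original article:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout            # hypothetical service name
spec:
  replicas: 3               # more than one replica survives a node failure
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
      - name: checkout
        image: example.com/checkout:1.4.2   # placeholder image
        resources:
          requests:         # what the scheduler reserves for this container
            cpu: 250m
            memory: 256Mi
          limits:           # hard caps enforced at runtime
            cpu: 500m
            memory: 512Mi
        readinessProbe:     # keeps traffic off pods that aren't ready yet
          httpGet:
            path: /healthz
            port: 8080
```

With requests set sensibly once, the scheduler handles placement and the cluster can reschedule pods around failed VMs without manual intervention.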
DevOps or infrastructure engineers are at the core of making sure products deliver on their SLAs. They are in the middle of an ongoing storm of evolving infrastructure, application architectures, and business requirements. DevOps engineers are usually motivated by these requirements:
- I want my Kubernetes cluster to be stable and secure. Moving too fast may break our infrastructure or open security gaps. Having control of these is critical to the stability of our infrastructure:
- Network policies and exposure of endpoints to the internet
- Full audit logs
- Deployed version
- Monitoring and observability pipeline
- Cluster and containers logs
- Developers’ access to pods, containers, and volumes
- Cluster secrets and configurations
- Some worker nodes, or even master nodes, will fail at some point. I should have all the redundancy I need to avoid major interruptions.
- I want developers to deploy their pods and containers independently.
- I want to make sure that our infrastructure is properly utilized without jeopardizing the users’ experience.
- With predictable performance, I get the best out of our infrastructure and I know it will keep up with changes in user workloads.
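Two of the requirements above, letting developers deploy independently while keeping overall utilization in check, are commonly addressed with per-namespace quotas. A sketch, assuming a hypothetical `team-a` namespace (the numbers are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a         # each team deploys freely within its namespace
spec:
  hard:
    requests.cpu: "10"      # total CPU the namespace may request
    requests.memory: 20Gi
    limits.cpu: "20"        # ceiling on the namespace's resource limits
    limits.memory: 40Gi
    pods: "50"              # cap on pod count to bound scheduler pressure
```

A quota like this gives developers autonomy inside their namespace while the DevOps engineer keeps a predictable upper bound on what each team can consume.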
Engineering managers enable teams to innovate fast to meet business goals as efficiently as possible. They are usually motivated by these requirements:
- I want the adoption of Kubernetes to enable my team to be nimble and agile. Shipping features and handling any issues quickly is critical to our products or services.
- I want to measure and improve the team’s effectiveness and get the highest ROI out of our applications and infrastructure.
- I want my team to spend most of its time innovating in core business requirements/areas.
- I want to eliminate any friction between infrastructure maintenance/growth and application development.
Business owners and product managers also impact capacity management and planning, especially around significant business events. A marketing campaign, for example, may drive unusual traffic, and the corresponding business owner should warn developers and DevOps of the expected load. The challenge comes when there is only a rough estimate of the number of users or the traffic: it is hard to map such an estimate to specific system requirements, so many teams end up over-provisioning to be on the safe side.
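One way to absorb a roughly estimated traffic spike without blanket over-provisioning is to set an autoscaling range rather than a fixed replica count. A sketch using the standard HorizontalPodAutoscaler; the target Deployment name and thresholds are hypothetical:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa        # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout          # hypothetical Deployment to scale
  minReplicas: 3            # raise this floor before the campaign starts
  maxReplicas: 20           # ceiling instead of permanently over-provisioning
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # add pods when average CPU passes 70%
```

The business owner’s rough estimate then only needs to inform the `minReplicas`/`maxReplicas` range, not an exact capacity figure.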
When Does Contention Build Up?
Getting into the vicious cycle of poor capacity management. Teams quickly slip into this cycle when they become reactive most of the time. Reacting to bad performance, Live Site Incidents (LSIs), or the monthly cloud bill puts your team in constant firefighting mode. You have to break this cycle at some point: make sure your team has the right KPIs and priorities to proactively tackle each dimension of capacity management.
Lack of a common view of capacity management. Each team member sees the system from their own point of view. For example, developers look at microservices and ignore, or don’t fully understand, the limits of their infrastructure. Focusing on one set of metrics without considering the impact on the rest of the system is equally dangerous. We have seen DevOps engineers take applications down while trying to improve CPU utilization, overlooking the impact of those changes on the application’s performance and usage patterns.
Triggers and Indicators of Poor Capacity Management
So, how do you know if there is room to improve how your team manages the capacity of its Kubernetes clusters? The table below breaks this down into the three areas any team should keep an eye on. To assess your team’s effectiveness, answer these questions:
- How frequently does your team get these triggers?
- How much of your team’s time is spent reacting to these triggers?
- Are a few team members always reacting to these triggers, or is the work distributed across the whole team?
Can It Be More Collaborative?
We learned that capacity management inside Kubernetes is a collaborative effort. Kubernetes provides a good abstraction of the infrastructure, but your team still has a lot of interaction points. The team still needs to collaborate on capacity allocation, application performance tuning, and, of course, saving on the cost of cloud infrastructure. You can read more about this topic here.
If you are a Developer, you need to:
- Declare ownership of the parts of the Kubernetes cluster your application touches
- Own the observability of your application and microservices
- Know the resources your application or microservices actually need
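One way to know the resources you actually need is to compare what you request with what your containers really consume over time. If the Vertical Pod Autoscaler is installed in the cluster (it is an add-on, not part of core Kubernetes), it can run in recommendation-only mode; the target Deployment name below is hypothetical:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-vpa        # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout          # hypothetical Deployment to observe
  updatePolicy:
    updateMode: "Off"       # recommend only; never evict or resize pods
```

With `updateMode: "Off"`, the VPA only publishes recommended requests based on observed usage, which developers can read and fold back into their manifests on their own schedule.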
If you are a DevOps engineer, you need to:
- Establish a clear interaction workflow between you and the rest of the team.
- Own the observability of the infrastructure and identify how pods are utilizing available capacity.
- Understand different billing options in your public cloud provider to reduce the cost of infrastructure as much as possible.
If you are an engineering manager, you need to:
- Make sure that your software engineers and DevOps engineers perform the above-mentioned steps :)
- Keep an eye on the contention that may build up when you feel your capacity management is going sideways.
- Build clear KPIs to track the health of your capacity. You will find some tips in this article.
At Magalix, we can help you in your Kubernetes adoption journey. You can see in one dashboard the performance of your containers, your Kubernetes cluster’s utilization, and a detailed cost analysis. Connect your Kubernetes cluster for free today and get an in-depth analysis of it. You can also run your cluster in Autopilot mode to keep adjusting your capacity proactively based on anticipated workloads.