Magalix is about helping companies and developers find the right balance between performance and capacity inside their kubernetes clusters. So, we are big Kubernetes fans. We went through lots of pains and learning cycles to make Kubernetes work properly for our needs. Those experiences also helped us a lot to empathize with our customers. Building fully containerized and building fully elastic Kubernetes managed microservices is hard and still requires a lot of legwork.
We have been using Kubernetes for more than a year now and what I’m sharing here are some highlights. Feel free to ask questions about your specific situation. I’ll be happy to share how we solved similar problems in our infrastructure.
High-Level Goals and Architecture
Our service provides resources management and recommendations using a sophisticated AI. Our AI pipeline consists of time series predictions, scalability decision analysis, optimization, and a feedback loop to learn from these decisions. These are mostly offline systems, but we also have real-time systems that interact with our customers' clusters via installed pods. Our systems must have high availability with very low latency for a quick response to scalability needs. We also provide one-stop service where users can manage many distributed clusters at different geographies and cloud providers.
So, with those goals in mind, we decided to have multiple Kubernetes clusters organized in a hierarchical way with these goals in mind.
To achieve these goals, our infrastructure ended up having multiple Kubernetes clusters with only one secure endpoint publicly accessible to deployed agents.
Regional Backend (BE) cluster collects metrics data and sends back recommendations to agents for execution. These are distributed and highly available clusters across different regions and different providers.
Regional AI clusters are set close to the regional BE clusters to guarantee low latency of metrics predictions, and ability to generate scalability decisions fairly quickly. It imposes challenges for sharing training data and deploying back updated models. I will discuss these challenges it in a separate post.
Global Backend (BE) cluster is responsible for two main tasks: (1) aggregation of account level data across different customer clusters, which include performance, decisions accuracy, and billing data. (2) A single endpoint for our users to access APIs and the web console. This cluster is also geo-replicated with very low sync latency.
We had a decent start when our team consisted of only 3 engineers. But as we grew the last 12 months, we had a lot of surprises and learned how the delicate balance between pods, containers, VMs can be challenging to manage in a medium sized team.
Clean abstraction of infrastructure in Kubernetes world is a myth.Dependencies between workloads, containers, application architecture, capacity, and infrastructure will be always there but revealed and managed in different ways. You must be aware and have a clear mental model of how your team should manage these dependencies.
We thought at the beginning that containers and kubernetes will provide an easy and clean separation between infrastructure and backend teams. Unfortunately, we discovered that this is not the case. We had lots of contention between infrastructure engineers and BE engineers on resources management, access rights, and security management. I’ll share with you the roller coaster of resources management here and share the experience with access rights & security of Kubernetes in the next few posts.
We wanted to be super efficient from the get-go. So, we had nodes allocation done through a budgeting process to avoid having idle or underutilized instances. That was a big shock for most team members when we decided to take this approach. After applying restricted resources allocation, we started to have containers OOM terminated, inaccessible (now smaller) nodes due to containers highjacking their CPU, unexpected eviction of pods, failed deployments, and of course frustrated customers, and team members.
We thought we knew the resources needed for each container and microservice, but we got really surprised upon further investigation by how much our configurations were loose. We realized that we didn’t set the right request (minimum) and limit (maximum) of CPU and memory in many cases, which caused containers to unconditionally use nodes resources. We also discovered that we didn’t have a clear way to budget resources and manage trends of resources usage. For example, we were setting the requests of some pods much higher than what they use for most of the day. We wanted to be ready for the spikes that happen once or twice a day, based on users patterns and behaviors. Sound familiar?! We are back again to the VM allocation patterns, but at a more granular scale. It was a joke and not a funny one I might add. Part of our mission is, to help others efficiently manage their infrastructure, yet we found ourselves challenged to achieve that efficiency internally
We decided to start with the dummy way. We created a sheet to list how much resources each service would need, the minimum resources “request” and the max “limit” of CPU and memory. I was skeptical about it but it was an eye-opener on multiple fronts. First of all, we discovered that the majority of our pods and containers did not have any CPU requests or limits set. Engineers were focusing only on memory. That was expected since memory is the largest source of crashes and failures inside their containers. However, our infrastructure team and I were interested in the CPU limits. CPU misconfigurations where causing system-wide issues, the typical noisy neighbor issue, but the uncertainty of pods scheduling made it worse and random. We were having infrastructure live site incidents (LSIs) caused by one microservice monopolizing the CPU and suffocating other services sharing the node with it. This impacted our infrastructure services, such as CoreDNS and Calico, which exacerbated the problem.
We left it to our developers to estimate their CPU needs. So, they went back to our dashboards to estimate how much CPU they would need, but it wasn’t easy because the millicore concept inside Kubernetes does not consider the CPU type or power. 100 millicores on m5.large is much less powerful than 100 millicores on c5.large. This is part of a future work in Kubernetes resources management. We decided to get started by just setting arbitrary CPU requestand limits based on the history of the service — see below simplified budgeting snapshot.
We hit the same issue again when we deploy our services to production. Instance types and sizes are different in production env and across providers. We wanted to make sure that we double estimated the right amount of millicores and memory when we promote our services from dev to production environments. So, to normalize from what we have in dev and project it to workload needs and available capacity in prod we used AWS ECUs as the base to calculate millicores. So rather than expressing the needs in constant millicores value, we measured in the dev environment the reasonable millicores usage of each container and converted these to ECUs to know how many millicores we need as the container moves from one instance family to another. Remember, the goal is to have consistent performance and to make sure we don’t over or under provision resources. For example, as shown in the below snapshot the eventer container needed around 750 millicores on the m5.large instance. This is translated to around 3 ECUs. To have a relatively consistent performance as we move it to production we know we need to its memory, which is easy, but we don’t need to really increase the CPU it uses. The eventer’s instance group is m5.xlarge. To give the eventer the same number of ECUs, this gets translated to 50% fewer millicores on the m5.xlarge instance. It is tricky when it comes to moving containers to instances of a different family. For example, it's not straightforward to adjust millicores and memory if run your container on t2 instances inside your dev environment, but want to deploy it to c5 instance type in production. But if we use ECUs as a reference point, we can model it and create a simple calculator. The problem with such solution that we have to keep up manually with each microservice and update it everytime we either change instances types or resources allocated to pods.
Note: Drop me a line if you want me to share our millicore estimator sheet. But we are working on a more elegant solution to this problem that we will share publicly soon. I wanted to share to see if someone else solved it a different way.
The next issue we faced is the fluctuations in CPU and memory needs throughout the day, which is really a tricky problem. Changing the CPU may require pods to be restarted and rescheduled. We wanted to control two factors concurrently: (1) When vertical scalability takes place, and (2) whether the pod should be restarted or not. We also wanted to have more control over when a pod got rescheduled, and at which node it was going to run. The pod should be scheduled to the best node family based on the pod’s profile. For example, one of our pods/services is balanced in CPU and memory most of the day; however, it does a nightly batch job that requires a lot of memory. This particular pod needs to be moved once or twice a day from m5.large to r4.xlarge. Doing so will save us more than 30% of our EC2 spending for only this one pod.
The first thing that came to mind to fix this particular issue is, to use our own AI pipeline to get the future workloads predictions and estimated memory and CPU needs. We did some tweaks and started feeding in the CPU and memory metrics into our forecasting and decision maker components. We were actually happy with the predictions being generated in a fairly short period. Our AI was able to predict memory and CPU needs for next 2–4 hours for each pod with 80% accuracy. This is based on recurring usage patterns we saw for CPU and memory — see snapshot below of predicted memory and CPU of one of our pods.
A couple of hours after, we started getting decisions/recommendations to scale memory and CPU. Our AI models started to adapt to expected changes of the expected 2–4 hours workloads — see below snapshot showing multiple decisions to scale memory only. CPU allocation was fine at that point.
Note: the impact column is an estimated monthly saving if the decision is carried over and the agent change instance type. This is still an experimental feature. I’ll talk about it in a different blog post.
Note: VPA wasn’t really reliable enough to use at the time of this writing. Also, we wanted to have more control on when and how pods will be scaled, which is something VPA and HPA do not provide.
Magalix eliminates the complexity of balancing performance with infrastructure capacity, using AI. It is a low-touch service that makes infrastructure self-healing to deliver the maximum value.