DevOps is a set of practices that combines software development and IT operations. Previously, these responsibilities were split between two teams: one owned the development cycle, and the other managed operations.
The DevOps mindset encourages us to break large problems into smaller ones, and microservices fit perfectly here: small services build up a component, and these components make up an application. In a microservice architecture, small teams develop functional components one by one. This makes it easier to get to market faster and to scale individual services up or down without impacting the whole system. It also improves fault tolerance and is platform- and language-agnostic. As with everything, though, there are trade-offs: microservice architectures are harder to test and monitor.
In a microservice architecture, we deploy microservices as containers and rely heavily on orchestration tools like Kubernetes or Docker Swarm. Once deployed, a microservice architecture can have thousands of services talking to each other over the network, which makes it very challenging to monitor: we have to monitor not only each independent service but also service-to-service communication. Thanks to the large communities behind today's monitoring tools, monitoring a cluster is much more approachable.
The following are some of the most popular tools used to monitor Kubernetes clusters; we'll give an overview of each.
Prometheus is an open-source event monitoring tool for containers and microservices. Prometheus gathers time-series-based numerical data. The Prometheus server works by scraping data for you: it invokes the metrics endpoints of the various targets it has been configured to monitor, collects the metrics at regular intervals, and stores them locally with timestamps. Each monitored node exposes the endpoint that Prometheus scrapes.
By default, the data retention period is 15 days, and the lowest supported retention period is 2 hours. Bear in mind that the longer the retention period, the more storage is required. A short local retention period is typically used when remote storage is configured for Prometheus.
Prometheus has a central, main component called the Prometheus server. The Prometheus server monitors a particular target: an entire Linux server, a stand-alone Apache server, a single process, a database service, or any other system unit you want it to watch.
The Prometheus server monitors targets, and a target can refer to many things: a single server, or one or more endpoints to probe. Metrics such as the CPU and memory usage of these units can be collected. The Prometheus server gathers these metrics from the configured targets and stores them in its time-series database. The targets to scrape and the scrape interval are defined in a YAML configuration file.
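As a sketch, a minimal `prometheus.yml` defining one scrape job might look like this (the job name and target address are assumptions; point them at your own exporters):

```yaml
# Minimal Prometheus configuration sketch.
global:
  scrape_interval: 15s        # how often to scrape each target
scrape_configs:
  - job_name: node            # hypothetical job name
    static_configs:
      - targets: ["192.168.1.10:9100"]  # e.g. a Node Exporter endpoint
```

Prometheus reloads this file at startup (or on SIGHUP), then begins pulling `/metrics` from each listed target on the configured interval.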
Prometheus has its own built-in expression browser, but Grafana, the industry's most powerful visualization software, offers out-of-the-box integration with Prometheus and is the usual choice for dashboards (more on Grafana below).
Prometheus represents data as key-value pairs, similar to how Kubernetes organizes infrastructure metadata using labels. Metrics are published over HTTP in a human-readable, self-explanatory text format. You can check that metrics are correctly exposed just by visiting the endpoint in your web browser, or use Grafana for more powerful visualization.
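To make the exposition format concrete, here is a minimal stdlib-only sketch of a `/metrics` endpoint in Python. The metric name, label, and value are hypothetical; a real service would use a Prometheus client library rather than hand-rolling this:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics() -> str:
    # Hypothetical metric in the Prometheus text exposition format:
    # HELP/TYPE comments, then name{labels} value.
    return (
        "# HELP app_requests_total Total HTTP requests handled\n"
        "# TYPE app_requests_total counter\n"
        'app_requests_total{path="/api"} 42\n'
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # Serve metrics on port 8000 until interrupted.
    HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Because the format is plain text, opening `http://localhost:8000/metrics` in a browser shows exactly what Prometheus would scrape.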
Grafana is multi-platform visualization software that has been available since 2014. Grafana provides graphs and charts that are web-connected to a data source. It can query and visualize your data regardless of where it is stored.
Fast and extensible client-side graphs with a number of options. Many plugins expand your options further and help visualize any desired metrics and logs.
Create dynamic and reusable dashboards with template variables that appear as dropdowns at the top of your dashboard.
Explore your data through ad-hoc queries. Split view and compare different time ranges, queries, and data sources.
Experience the magic of switching from metrics to logs with preserved label filters. Quickly search through all your logs or stream them live.
Grafana has a built-in alerting engine that lets users attach conditions to a metric; when the metric meets those conditions, Grafana alerts you via chat tools (e.g., Slack), email, or a custom webhook.
You can use multiple data sources, even custom data sources, on a per-query basis. Grafana is commonly used for monitoring and analyzing CPU, storage, and memory metrics, among others.
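Data sources can be added through the UI or provisioned from files. As a sketch, a provisioning file wiring Grafana to an in-cluster Prometheus might look like this (the file path and URL are assumptions for a typical Kubernetes deployment):

```yaml
# e.g. /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy      # Grafana backend proxies queries to the data source
    url: http://prometheus-server.monitoring.svc:9090  # assumed service address
    isDefault: true
```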
Fluentd is an open-source project used as a unified logging layer and is a member project of the Cloud Native Computing Foundation (CNCF). Logs are important in a cluster: from logs you can understand what's happening inside your instances. Logs have to be collected from multiple sources, and Fluentd provides an easy solution for a centralized logging system. Fluentd runs in approximately 40 MB of memory and can process around 10,000 events per second.
Fluentd is the de facto standard log aggregator for Kubernetes. Fluentd publishes its own Docker image (with an edge tag for testing) and is the 8th most-used image on Docker Hub. Fluentd has to run on each node of the cluster, and Kubernetes provides a DaemonSet object that deploys one copy of a service on every node.
Centralizing Apache/Nginx logs: Fluentd can tail access or error logs and ship them to a remote server
Syslog Alerting: Fluentd can "grep" for events and send out alerts
Mobile/Web Application Logging: Fluentd can be used as middleware to enable asynchronous, scalable logging for user action events
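As a sketch, a minimal Fluentd configuration for the first use case, tailing container logs and forwarding them, might look like this (the paths, tag, and output are assumptions; a real deployment would match output to your log store):

```
<source>
  @type tail                        # follow files like `tail -f`
  path /var/log/containers/*.log    # where kubelet symlinks container logs
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  <parse>
    @type json
  </parse>
</source>

<match kubernetes.**>
  @type stdout                      # replace with elasticsearch, s3, etc.
</match>
```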
Jaeger is a distributed tracing platform that provides a Kubernetes Operator for deployment. An Operator is a method of packaging, deploying, and managing a Kubernetes application. The Jaeger Operator can be installed on Kubernetes-based clusters and watches for new Jaeger custom resources (CRs) in specific namespaces, or across the entire cluster. Typically there is only one Jaeger Operator per cluster, but there can be at most one Jaeger Operator per namespace in multi-tenant scenarios. When a new Jaeger CR is detected, the Operator tries to establish itself as the owner of the resource by setting a jaegertracing.io/operated-by label on the new CR, with the namespace and name of the Operator as the label's value. Jaeger components can also run as a sidecar container, or alternatively as a DaemonSet.
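A minimal Jaeger custom resource, assuming the Operator is already installed, can be as small as this (the name follows the upstream getting-started example; the Operator fills in sensible defaults such as all-in-one deployment):

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simplest
```

Applying this with `kubectl apply -f` is enough for the Operator to detect the CR and stand up a Jaeger instance.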
A microservice architecture is vast: many calls go outside the cluster, and many calls go to services inside it. Jaeger lets us trace calls from users down to individual services. It also enables us to track application latency, trace the lifecycle of network calls, and identify performance issues.
It's difficult to monitor distributed systems, and when working with Kubernetes or a microservices architecture, we're dealing with a large, distributed system. The system spans multiple nodes, and multiple services contribute to a single output, which makes it hard to monitor as a whole.
We can't monitor every node or pod by manually logging in and retrieving its metrics -- so, here are some practices to follow to make the most of monitoring:
A DaemonSet is the Kubernetes object used to deploy a pod on each node of the cluster. DaemonSets are used by many monitoring tools, such as Fluentd, Jaeger, and the AppDynamics agent. With Jaeger, for example, the jaeger-agent can be deployed as a DaemonSet to trace calls between services. This way, you can easily gather data from all the nodes in the cluster.
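As a sketch, deploying jaeger-agent as a DaemonSet might look like this (the namespace and image tag are assumptions; check the version your Jaeger backend expects):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: jaeger-agent
  namespace: observability     # assumed namespace
spec:
  selector:
    matchLabels:
      app: jaeger-agent
  template:
    metadata:
      labels:
        app: jaeger-agent
    spec:
      containers:
        - name: jaeger-agent
          image: jaegertracing/jaeger-agent:1.35   # assumed tag
          ports:
            - containerPort: 6831
              protocol: UDP    # agent accepts spans over UDP
```

Because it is a DaemonSet, the scheduler places exactly one agent pod on every node, so applications can send spans to the agent on their local node.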
Tags and labels are used for filtering and interacting with Kubernetes objects such as pods, jobs, or cron jobs. Consistent labeling can make your metrics far more useful for debugging.
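For example, a pod labeled like this (the label keys and values are hypothetical) can be selected with `kubectl get pods -l app=payments,tier=backend`, and the same labels carry through into metrics collected from it:

```yaml
metadata:
  labels:
    app: payments    # which application this pod belongs to
    tier: backend    # which layer of the stack
```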
Kubernetes schedules services according to its scheduling policies, so we don't know in advance which node our app will be deployed on. You'll want to use a monitoring system with service discovery, which automatically adapts metric collection as containers move around. This lets you monitor your applications continuously, without interruption.
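Prometheus supports this natively through Kubernetes service discovery. As a sketch, the following scrape config discovers every pod and keeps only those opting in via an annotation (assumes Prometheus runs in-cluster with RBAC permission to list pods):

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod              # discover all pods; relabeling filters them
    relabel_configs:
      # keep only pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```

When a pod is rescheduled to another node, discovery picks up its new address automatically, so no scrape targets need to be edited by hand.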
The most complex issues occur within the Kubernetes control plane -- DNS bottlenecks, network overload, and, the sum of all fears, etcd. It's critical to track the degradation of master nodes and identify issues before they escalate, particularly load average, memory, and disk usage. We need to monitor the kube-system namespace as closely as possible.
There is no automatic healing of full disks in a StatefulSet, so high disk usage always requires attention. Make sure to monitor all disks and root volume systems. The Prometheus Node Exporter provides highly recommended metrics for tracking these devices.
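Using Node Exporter metrics, a disk-usage alert can be expressed as a Prometheus alerting rule like the sketch below (the 10% threshold, mountpoint, and duration are assumptions; tune them for your environment):

```yaml
groups:
  - name: disk
    rules:
      - alert: HighDiskUsage
        # fires when less than 10% of the root filesystem is free
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 10m               # must hold for 10 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "Root volume has less than 10% free space"
```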