Kubernetes and Containers Best Practices - Health Probes

TL;DR

  • Logs and basic metrics are not enough to achieve high observability of your containers and microservices.
  • For faster recovery and higher resilience, applications must apply the High Observability Principle (HOP).
  • HOP at the application level requires: proper logging, detailed monitoring, health probes, and (performance/logical) tracing.
  • Use the Kubernetes readinessProbe and livenessProbe as part of the High Observability Principle.

What Is The Health Probe Pattern?

When you’re designing a mission-critical, highly available application, resiliency is one of the most important aspects to take into consideration. An application is resilient when it can quickly recover from failures. A cloud-native application is typically designed to use the microservices architecture, where each component lives in a container. To ensure that a Kubernetes-hosted application is highly available, there are specific patterns that you need to follow when designing the cluster. Among those patterns is the Health Probe Pattern, which defines how the application reports its health state to Kubernetes. The health state is not only about whether the pod is up and running, but also whether it is capable of receiving and responding to requests. As Kubernetes gains more insight into a pod’s health status, it can make more intelligent decisions about traffic routing and load balancing. Thus, applying the High Observability Principle (HOP) makes sure that every request your application receives finds a timely response.

The High Observability Principle (HOP)

The High Observability Principle is one of the container-based application design principles. The microservices architecture entails that each service does not (and should not) care how its request gets processed and responded to by the recipient services. For example, a container issuing an HTTP request to another container to authenticate a user expects the response in a specific format, and that’s it. The request could be coming from NodeJS and the response handled by Python Flask. Both containers treat each other like black boxes whose internals are hidden. However, HOP dictates that each service must expose several API endpoints that reveal its health state and its readiness and liveness statuses. Kubernetes calls those endpoints and, based on the responses, decides on the next routing and load-balancing steps.

A well-designed cloud-native application should also log essential events to the standard error (STDERR) and standard output (STDOUT) channels. A helper service like filebeat, logstash, or fluentd then ships those logs to a centralized monitoring system, such as Prometheus, and a log-aggregation system, such as the ELK stack. The following diagram illustrates how a cloud-native application abides by the Health Probe Pattern and the High Observability Principle.

 

High observability principle (HOP)

 

How to Apply the Health Probe Pattern in Kubernetes?

Out of the box, Kubernetes monitors the state of the pods using one of its controllers (Deployments, ReplicaSets, DaemonSets, StatefulSets, etc.). If the controller detects that a pod crashed for some reason, it attempts to restart it or reschedule it to another node. However, a pod may report that it is up and running while, in reality, it is not working. Consider an example: you have an application that uses Apache as its web server, and you deployed the component on several pods in your cluster. Because of a misconfigured library, all requests to this application are answered with HTTP code 500 (Internal Server Error). When the Deployment checks the status of the pod, it detects that it’s running; but that’s not what your clients think. Let’s depict this undesired situation as follows:

 

Health Probe pattern inside Kubernetes

In the scenario above, Kubernetes is performing a process health check. In this type of check, the kubelet continuously probes the container process, and if it detects that the process is down, it restarts it. If the error gets resolved by merely restarting the application, and if the program is designed to shut itself down when any failure occurs, then process health checks are all you need to follow the HOP and the Health Probe Pattern. Unfortunately, not all errors disappear after a restart. For this reason, Kubernetes offers two more thorough ways of detecting pod failures: the liveness probe and the readiness probe.
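
For context, this baseline process-level behaviour is governed by the Pod’s restartPolicy field, which defaults to Always. The snippet below is only a minimal sketch with placeholder names that makes the default explicit; it is not part of this article’s example:

apiVersion: v1
kind: Pod
metadata:
 name: process-check-only       # hypothetical name, for illustration only
spec:
 # Always is the default: the kubelet restarts the container whenever its
 # process exits, applying an exponential back-off between restarts.
 restartPolicy: Always
 containers:
   - name: app
     image: nginx               # placeholder image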


Liveness Probes

Through the Liveness Probe, the kubelet can do three types of checks to ensure that the pod is not only running but also ready to receive and adequately respond to requests:

  • Send an HTTP GET request to the pod. The response should have an HTTP response code ranging from 200 to 399. This way, 5xx and 4xx codes signal that the pod is having issues even though the process is running.
  • For pods that expose non-HTTP services (for example, a Postfix mail server), the check is establishing a successful TCP connection.
  • Execute an arbitrary command inside the pod’s container. The check is successful if the command’s exit code is 0. (A minimal sketch of the TCP and command variants follows this list.)
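
The example below demonstrates the HTTP variant. For completeness, here is a hypothetical sketch of the other two variants; the port, command, and delays are placeholders rather than values taken from this article, and each fragment belongs under a separate container definition:

# TCP variant (e.g. a mail server): the check passes if a TCP connection
# to the given port can be opened.
livenessProbe:
  tcpSocket:
    port: 25                            # placeholder port
  initialDelaySeconds: 10

# Command variant: the check passes if the command exits with code 0.
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]    # placeholder command
  initialDelaySeconds: 5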

An example helps explain how this works. The following pod definition hosts a NodeJS application that responds to HTTP requests with a 500 error code. We use the livenessProbe parameter to ensure that the container gets restarted when it responds with this error:

apiVersion: v1
kind: Pod
metadata:
 name: node500
spec:
 containers:
   - image: magalix/node500
     name: node500
     ports:
       - containerPort: 3000
         protocol: TCP
     livenessProbe:
       httpGet:
         path: /
         port: 3000
       initialDelaySeconds: 5

This is no different from any Pod definition except that we add the .spec.containers.livenessProbe object. The httpGet parameter accepts the path to which the HTTP GET request is sent (in our example it’s /, but in real-world scenarios it may be something like /api/v1/status). The livenessProbe also accepts an initialDelaySeconds parameter, which instructs the probe to wait a specified number of seconds before it starts. This is useful when your container needs some time to start and become fully functional, and restarting it prematurely would make it indefinitely unavailable.
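
Besides initialDelaySeconds, the probe accepts a few more tuning fields that this example does not use; the values below are illustrative, not recommendations from this article:

livenessProbe:
  httpGet:
    path: /
    port: 3000
  initialDelaySeconds: 5   # wait before the first check
  periodSeconds: 10        # how often the kubelet probes the container
  timeoutSeconds: 1        # how long to wait for a response before counting a failure
  failureThreshold: 3      # consecutive failures before the container is restarted
  successThreshold: 1      # must be 1 for liveness probes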

Apply this configuration to the cluster using:

kubectl apply -f pod.yaml

Wait for a few seconds then investigate what’s happening inside the pod using the kubectl describe subcommand:

kubectl describe pods node500

At the end of the output, you find the following part:


Events:
  Type 	Reason             	Age                	From                     	Message
  ---- 	------             	----               	----                     	-------
  Normal   Scheduled          	5m30s              	default-scheduler        	Successfully assigned node500 to docker-for-desktop
  Normal   SuccessfulMountVolume  5m29s              	kubelet, docker-for-desktop  MountVolume.SetUp succeeded for volume "default-token-ddpbc"
  Normal   Created            	3m35s (x3 over 5m24s)  kubelet, docker-for-desktop  Created container
  Normal   Started            	3m35s (x3 over 5m24s)  kubelet, docker-for-desktop  Started container
  Warning  Unhealthy          	3m18s (x7 over 5m18s)  kubelet, docker-for-desktop  Liveness probe failed: HTTP probe failed with statuscode: 500
  Normal   Pulling            	2m48s (x4 over 5m29s)  kubelet, docker-for-desktop  pulling image "afakharany/node500"
  Normal   Pulled             	2m46s (x4 over 5m24s)  kubelet, docker-for-desktop  Successfully pulled image "afakharany/node500"
  Normal   Killing            	18s (x6 over 4m28s)	kubelet, docker-for-desktop  Killing container with id docker://node500:Container failed liveness probe.. Container will be killed and recreated.

As you see, the Liveness Probe initiated the HTTP GET request. When the container responded with 500 (as it is programmed), the kubelet restarted the container.

If you are interested in knowing how the NodeJS application was programmed, here is the app.js file and the Dockerfile that was used:

app.js

var http = require('http');

var server = http.createServer(function(req, res) {
	res.writeHead(500, { "Content-type": "text/plain" });
	res.end("We have run into an error\n");
});

server.listen(3000, function() {
	console.log('Server is running at 3000')
})

Dockerfile

FROM node
COPY app.js /
EXPOSE 3000
ENTRYPOINT [ "node","/app.js" ]

An important thing to notice here is that the livenessProbe will only restart the container when the probe fails. If a restart does not clear the error, the kubelet cannot take any further corrective action to resolve the problem the container ran into.

Readiness Probes

Readiness Probes do the same kinds of checks as Liveness Probes (HTTP GET requests, TCP connections, and command executions). However, the corrective action differs: instead of restarting the failing container, Kubernetes temporarily isolates it from incoming traffic. Now, one of the containers may be doing massive calculations or undergoing heavy operations that increase its response latency. A Liveness Probe would detect (through the timeoutSeconds probe parameter) that the GET request is taking too long to be answered, and in response the kubelet would restart the container. The container would then become overloaded again as soon as it resumed its resource-intensive tasks. In time-critical applications, failing to send a quick response can have drastic effects. Imagine, for example, a self-driving car waiting for instructions from the server while on the road; if that response is delayed, the car may crash.

 

Let’s write a Readiness Probe definition that ensures the GET request gets a response in no more than two seconds, while the application is programmed to respond to GET requests only after five seconds. Our pod.yaml file should look as follows:

apiVersion: v1
kind: Pod
metadata:
 name: nodedelayed
spec:
 containers:
   - image: afakharany/node_delayed
     name: nodedelayed
     ports:
       - containerPort: 3000
         protocol: TCP
     readinessProbe:
       httpGet:
         path: /
         port: 3000
       timeoutSeconds: 2

Let’s deploy the pod using kubectl:

kubectl apply -f pod.yaml

Give it a few seconds, then see how the readiness probe worked:

kubectl describe pods nodedelayed

At the end of the output, you can see that the events part looks similar to the following:

Events:
  Type 	Reason             	Age           	From                     	Message
  ---- 	------             	----          	----                     	-------
  Normal   Scheduled          	58s           	default-scheduler        	Successfully assigned nodedelayed to docker-for-desktop
  Normal   SuccessfulMountVolume  58s           	kubelet, docker-for-desktop  MountVolume.SetUp succeeded for volume "default-token-ddpbc"
  Normal   Pulling            	57s           	kubelet, docker-for-desktop  pulling image "afakharany/node_delayed"
  Normal   Pulled             	53s           	kubelet, docker-for-desktop  Successfully pulled image "afakharany/node_delayed"
  Normal   Created            	52s           	kubelet, docker-for-desktop  Created container
  Normal   Started            	52s           	kubelet, docker-for-desktop  Started container
  Warning  Unhealthy          	8s (x5 over 48s)  kubelet, docker-for-desktop  Readiness probe failed: Get http://10.1.0.83:3000/: net/http: request canceled (Client.Timeout exceeded while awaiting headers

As you can see, the kubelet did not restart the pod when the probe exceeded two seconds; instead, it cancelled the request and marked the pod as not ready, so incoming connections are routed to other, healthy pods.

Notice that once the pod is no longer overloaded and the GET request no longer has delayed responses, the kubelet starts routing requests back to it.
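
The traffic isolation works because a Service only forwards requests to pods whose readiness checks are passing. As a rough sketch (the Service name and the app: nodedelayed label are hypothetical; the pod definition above does not actually set any labels), such a Service could look like this:

apiVersion: v1
kind: Service
metadata:
  name: nodedelayed-svc        # hypothetical Service name
spec:
  selector:
    app: nodedelayed           # assumes the pod carries this label
  ports:
    - port: 80                 # port exposed by the Service
      targetPort: 3000         # port the container listens on

While the readiness probe is failing, the pod is removed from this Service’s endpoints, so no new connections reach it; once the probe succeeds again, the pod is added back automatically.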

For reference, the following is the modified app.js file:

var http = require('http');

var server = http.createServer(function(req, res) {
   const sleep = (milliseconds) => {
       return new Promise(resolve => setTimeout(resolve, milliseconds))
   }
   sleep(5000).then(() => {
       res.writeHead(200, { "Content-type": "text/plain" });
       res.end("Hello\n");
   })
});

server.listen(3000, function() {
   console.log('Server is running at 3000')
})

TL;DR

Before cloud-native applications, logs were the primary way to monitor and analyze application health. However, there was no way to take corrective action against issues as they occurred. Logs are still useful and should be collected and shipped to a log-aggregation system for postmortem analysis and decision making.

Today, corrective actions must happen in near real time, so applications can no longer behave as black boxes. Instead, they must expose endpoints that allow the monitoring system to query and gain valuable health data, which enables it to respond immediately when needed. This is referred to as the Health Probe design pattern, which follows the High Observability Principle (HOP).

By default, Kubernetes offers two kinds of health checks: the readinessProbe and the livenessProbe. Both of them use the same types of probes (HTTP GET requests, TCP connections, and command execution). They differ in the decision they take in response to a failing pod. The livenessProbe restarts the container in anticipation that the error will no longer happen. The readinessProbe isolates the pod from incoming traffic until the cause of the failure is gone.
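
Putting the two together, a container commonly defines both probes side by side. The following is a minimal sketch that combines the fields used earlier in this article; the pod name and threshold values are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: node-combined                  # hypothetical name
spec:
  containers:
    - name: app
      image: afakharany/node_delayed   # image from the readiness example above
      ports:
        - containerPort: 3000
          protocol: TCP
      # Restart the container if it stops responding correctly.
      livenessProbe:
        httpGet:
          path: /
          port: 3000
        initialDelaySeconds: 5
        failureThreshold: 3            # placeholder threshold
      # Stop routing traffic to the container while responses are too slow.
      readinessProbe:
        httpGet:
          path: /
          port: 3000
        timeoutSeconds: 2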

Good application design involves both logging enough information, particularly when an exception is thrown, and exposing the necessary API endpoints that convey important health and status metrics for monitoring systems (like Prometheus) to consume.
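
One common, convention-based way to make such metrics endpoints discoverable is to annotate the pod so that a Prometheus scrape configuration that honours these annotations picks it up. Note that this is a community convention rather than something Kubernetes or Prometheus enforces by default, and the port and path below are placeholders:

metadata:
  annotations:
    prometheus.io/scrape: "true"     # only honoured if the scrape config looks for it
    prometheus.io/port: "3000"       # placeholder port
    prometheus.io/path: "/metrics"   # placeholder metrics path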

Mohamed Ahmed

Jul 21, 2019