TL;DR
- Logs and basic metrics are not enough to achieve high observability of your containers and microservices.
- For faster recovery and higher resilience, applications must apply the High Observability Principle (HOP).
- At the application level, HOP requires proper logging, detailed monitoring, health probes, and (performance/logical) tracing.
- Use the Kubernetes readinessProbe and livenessProbe as part of the High Observability Principle.
What Is The Health Probe Pattern?
When you’re designing a mission-critical, highly available application, resiliency is one of the most important aspects to take into consideration. An application is resilient when it can quickly recover from failures. A cloud-native application is typically designed using the microservices architecture, where each component lives in a container. To ensure that a Kubernetes-hosted application is highly available, there are specific patterns that you need to follow when designing the cluster. Among those patterns is the Health Probe Pattern, which defines how the application reports its health state to Kubernetes. The health state is not only about whether the pod is up and running, but also about whether it is capable of receiving and responding to requests. As Kubernetes gains more insight into the pod’s health, it can make more intelligent decisions about traffic routing and load balancing. Thus, applying the High Observability Principle (HOP) ensures that every request your application receives finds a timely response.
The High Observability Principle (HOP)
The High Observability Principle is one of the container-based application design principles. The microservices architecture entails that each service does not (and should not) care how its request gets processed and responded to by the recipient services. For example, a container issuing an HTTP request to another container to authenticate a user expects the response in a specific format, and that’s it. The request could be coming from NodeJS and the response handled by Python Flask; both containers treat each other like black boxes whose internals are hidden. However, the HOP dictates that each service must expose several API endpoints that reveal its health state and its readiness and liveness statuses. Kubernetes calls those endpoints and, hence, decides on the next routing and load-balancing steps.
A well-designed cloud-native application should also log essential events to the standard output (STDOUT) and standard error (STDERR) channels. A helper service like filebeat, logstash, or fluentd then ships those logs to a centralized monitoring system, such as Prometheus, or a log aggregation system, such as the ELK stack. The following diagram illustrates how a cloud-native application abides by the Health Probe Pattern and the High Observability Principle.
[Diagram: High Observability Principle (HOP)]
How to Apply Health Probe Pattern in Kubernetes?
Out of the box, Kubernetes monitors the state of pods using one of its controllers (Deployments, ReplicaSets, DaemonSets, StatefulSets, etc.). If the controller detects that a pod crashed for some reason, it attempts to restart it or reschedule it to another node. However, a pod may report that it is up and running while it is not actually working. Consider an example: you have an application that uses Apache as its web server, deployed on several pods in your cluster. Due to a misconfigured library, all requests to this application are answered with HTTP code 500 (Internal Server Error). When the Deployment checks the status of the pods, it sees that they are running. But that’s not what your clients experience. Let’s depict this undesired situation as follows:

In our example, Kubernetes performs a process health check: the kubelet continuously probes the container process, and if it detects that the process is down, it restarts it. If the error is resolved by merely restarting the application, and if the program is designed to shut itself down when any failure occurs, then process health checks are all you need to follow the HOP and the Health Probe Pattern. Unfortunately, not all errors disappear after a restart. For this reason, Kubernetes offers two more thorough ways of detecting pod failures: the liveness probe and the readiness probe.
Liveness Probes
Through the Liveness Probe, the kubelet can do three types of checks to ensure that the pod is not only running but also ready to receive and adequately respond to requests:
- Establish an HTTP request to the pod. The response should have an HTTP response code ranging from 200 to 399; 5xx and 4xx codes signal that the pod is having issues even though the process is running.
- For pods that expose non-HTTP services (for example, Postfix mail server), the check is to establish a successful TCP connection.
- Execute an arbitrary command inside the pod’s container. The check is successful if the command’s exit code is 0.
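In a pod spec, each of these three check types maps to a different probe handler. The fragments below are illustrative alternatives (the paths, ports, and command are assumptions); a single probe uses exactly one handler:

```yaml
# HTTP check: a GET to / on port 3000 must return a 200-399 code
livenessProbe:
  httpGet:
    path: /
    port: 3000

# TCP check: a successful connection to port 25 passes
livenessProbe:
  tcpSocket:
    port: 25

# Command check: an exit code of 0 passes
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]
```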
An example helps explain how this works. The following pod definition hosts a NodeJS application that responds to HTTP requests with a 500 error code. We use the livenessProbe parameter to ensure that the container gets restarted when it returns this error:
apiVersion: v1
kind: Pod
metadata:
  name: node500
spec:
  containers:
    - image: magalix/node500
      name: node500
      ports:
        - containerPort: 3000
          protocol: TCP
      livenessProbe:
        httpGet:
          path: /
          port: 3000
        initialDelaySeconds: 5
This is no different from any Pod definition except that we add the .spec.containers.livenessProbe object. The httpGet parameter accepts the path to which the HTTP GET request is sent (in our example, it’s /, but in real-world scenarios, it may be something like /api/v1/status). The livenessProbe also accepts an initialDelaySeconds parameter, which instructs the probe to wait a specified number of seconds before starting. This is useful when your container needs some time to start and become fully functional, and restarting it prematurely would make it indefinitely unavailable.
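Beyond initialDelaySeconds, probes accept a few more tuning parameters. The values below are illustrative, not recommendations:

```yaml
livenessProbe:
  httpGet:
    path: /
    port: 3000
  initialDelaySeconds: 5   # wait before the first probe
  periodSeconds: 10        # probe every 10 seconds
  timeoutSeconds: 1        # fail any probe that takes longer than this
  failureThreshold: 3      # act only after 3 consecutive failures
```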
Apply this configuration to the cluster using:
kubectl apply -f pod.yaml
Wait for a few seconds then investigate what’s happening inside the pod using the kubectl describe subcommand:
kubectl describe pods node500
At the end of the output, you find the following part:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m30s default-scheduler Successfully assigned node500 to docker-for-desktop
Normal SuccessfulMountVolume 5m29s kubelet, docker-for-desktop MountVolume.SetUp succeeded for volume "default-token-ddpbc"
Normal Created 3m35s (x3 over 5m24s) kubelet, docker-for-desktop Created container
Normal Started 3m35s (x3 over 5m24s) kubelet, docker-for-desktop Started container
Warning Unhealthy 3m18s (x7 over 5m18s) kubelet, docker-for-desktop Liveness probe failed: HTTP probe failed with statuscode: 500
Normal Pulling 2m48s (x4 over 5m29s) kubelet, docker-for-desktop pulling image "afakharany/node500"
Normal Pulled 2m46s (x4 over 5m24s) kubelet, docker-for-desktop Successfully pulled image "afakharany/node500"
Normal Killing 18s (x6 over 4m28s) kubelet, docker-for-desktop Killing container with id docker://node500:Container failed liveness probe.. Container will be killed and recreated.
As you see, the Liveness Probe initiated the HTTP GET request. When the container responded with 500 (as it is programmed), the kubelet restarted the container.
If you are interested in knowing how the NodeJS application was programmed, here is the app.js file and the Dockerfile that was used:
app.js
var http = require('http');
var server = http.createServer(function(req, res) {
  res.writeHead(500, { "Content-type": "text/plain" });
  res.end("We have run into an error\n");
});
server.listen(3000, function() {
  console.log('Server is running at 3000');
});
Dockerfile
FROM node
COPY app.js /
EXPOSE 3000
ENTRYPOINT [ "node","/app.js" ]
An important thing to notice here is that the livenessProbe can only restart a failing container. If a restart does not clear the error the container ran into, the kubelet cannot take any further corrective action to resolve it.
Readiness Probes
Readiness probes do the same kinds of checks as liveness probes (GET requests, TCP connections, and command executions), but the corrective action differs: instead of restarting the failing container, the kubelet temporarily isolates it from incoming traffic. Now, a container may be doing massive calculations or undergoing heavy operations that increase its response latency. A liveness probe would detect (through the timeoutSeconds probe parameter) that the GET request is taking too long to be answered, and in response the kubelet would restart the container; the container then becomes overloaded again as soon as it resumes its resource-intensive tasks. In time-critical applications, failing to send a quick response may have drastic effects. Imagine, for example, a connected car waiting for instructions from the server while on the road; if that response is delayed, the car may crash.
Let’s write a Readiness Probe definition that requires the GET request to get a response within no more than two seconds, while the application is programmed to respond to GET requests only after five seconds. Our pod.yaml file should look as follows:
apiVersion: v1
kind: Pod
metadata:
  name: nodedelayed
spec:
  containers:
    - image: afakharany/node_delayed
      name: nodedelayed
      ports:
        - containerPort: 3000
          protocol: TCP
      readinessProbe:
        httpGet:
          path: /
          port: 3000
        timeoutSeconds: 2
Let’s deploy the pod using kubectl:
kubectl apply -f pod.yaml
Give it a few seconds then let’s see how the readiness probe worked:
kubectl describe pods nodedelayed
At the end of the output, you can see that the events part looks similar to the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 58s default-scheduler Successfully assigned nodedelayed to docker-for-desktop
Normal SuccessfulMountVolume 58s kubelet, docker-for-desktop MountVolume.SetUp succeeded for volume "default-token-ddpbc"
Normal Pulling 57s kubelet, docker-for-desktop pulling image "afakharany/node_delayed"
Normal Pulled 53s kubelet, docker-for-desktop Successfully pulled image "afakharany/node_delayed"
Normal Created 52s kubelet, docker-for-desktop Created container
Normal Started 52s kubelet, docker-for-desktop Started container
Warning Unhealthy 8s (x5 over 48s) kubelet, docker-for-desktop Readiness probe failed: Get http://10.1.0.83:3000/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
As you can see, the kubelet did not restart the pod when the probe took longer than two seconds; instead, it canceled the request and marked the pod as not ready, so incoming connections are routed to other, healthy pods.
Notice that once the pod is no longer overloaded and the GET requests stop getting delayed responses, the kubelet starts routing requests back to it.
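This traffic isolation works through Services: a pod that fails its readiness probe is removed from the Service’s endpoints until it passes again. A hypothetical Service for this example could look as follows (it assumes the pod carries an app: nodedelayed label, which the pod definition above would need to add):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nodedelayed
spec:
  selector:
    app: nodedelayed     # matches pods labeled app: nodedelayed
  ports:
    - port: 80           # port the Service exposes
      targetPort: 3000   # port the container listens on
```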
For reference, the following is the modified app.js file:
var http = require('http');
var server = http.createServer(function(req, res) {
  const sleep = (milliseconds) => {
    return new Promise(resolve => setTimeout(resolve, milliseconds));
  };
  sleep(5000).then(() => {
    res.writeHead(200, { "Content-type": "text/plain" });
    res.end("Hello\n");
  });
});
server.listen(3000, function() {
  console.log('Server is running at 3000');
});
TL;DR
Before cloud-native applications, logs were the primary means of monitoring and analyzing application health, but there was no way to take corrective action automatically. Logs are still useful and should be collected and shipped to a log aggregation system for postmortem analysis and decision making.
Today, corrective actions must happen almost in real time. Hence, applications should no longer behave as black boxes. Instead, they must expose endpoints that allow the monitoring system to query and gain valuable health data, enabling it to respond immediately when needed. This is referred to as the Health Probe design pattern, which follows the High Observability Principle (HOP).
By default, Kubernetes offers two kinds of health checks: the readinessProbe and the livenessProbe. Both use the same types of probes (HTTP GET requests, TCP connections, and command execution). They differ in the action they take in response to a failing pod: the livenessProbe restarts the container, in anticipation that the error will not happen again, while the readinessProbe isolates the pod from incoming traffic until the cause of the failure is gone.
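The two probes are not mutually exclusive; a production container typically declares both. A minimal sketch (the image name and endpoint paths are illustrative assumptions):

```yaml
containers:
  - name: myapp
    image: example/myapp:1.0   # illustrative image
    ports:
      - containerPort: 3000
    livenessProbe:             # restart the container if this fails
      httpGet:
        path: /healthz
        port: 3000
      initialDelaySeconds: 5
    readinessProbe:            # stop routing traffic if this fails
      httpGet:
        path: /readyz
        port: 3000
      timeoutSeconds: 2
```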
Good application design involves both logging enough information (particularly when an exception is thrown) and exposing the necessary API endpoints that convey important health and status metrics for monitoring systems (like Prometheus) to consume.