Kubernetes features several controllers for managing pods: ReplicaSets, DaemonSets, StatefulSets, and Deployments. Each of them has its own scenario and use case. However, they all share one common property: they ensure that their pods are always running. If a pod fails, the controller restarts it or reschedules it to another node to make sure the application the pod is hosting keeps running.
What if we do want the pod to terminate? There are many scenarios in which you don’t want the process to keep running indefinitely. Think of a log rotation command. Log rotation is the process of archiving (compressing) log files that are older than a particular time threshold and deleting ancient ones. Such a process should not be running continuously. Instead, it gets executed, and once the task is complete, it returns an exit status that reports whether the result is a success or a failure.
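To make the idea of a run-once task concrete, here is a minimal sketch of such a log rotation process in Python. The function names and the 7-day and 30-day thresholds are assumptions chosen for illustration; they do not come from any particular tool. The point is only that the program does its work, returns an exit status, and stops.

```python
import gzip
import os
import shutil
import sys
import time
from pathlib import Path

DAY = 24 * 60 * 60  # seconds in a day


def rotate_logs(log_dir, compress_after_days=7, delete_after_days=30, now=None):
    """Compress old *.log files and delete ancient *.log.gz archives.

    Returns a (compressed, deleted) pair of counters.
    """
    now = time.time() if now is None else now
    compressed = deleted = 0
    # sorted() materialises the listing, so files created below are not revisited.
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        stat = path.stat()
        age = now - stat.st_mtime
        if path.name.endswith(".log.gz") and age > delete_after_days * DAY:
            path.unlink()
            deleted += 1
        elif path.suffix == ".log" and age > compress_after_days * DAY:
            archive = path.with_name(path.name + ".gz")
            with open(path, "rb") as src, gzip.open(archive, "wb") as dst:
                shutil.copyfileobj(src, dst)
            # Keep the original timestamps so the archive keeps ageing correctly.
            os.utime(archive, (stat.st_atime, stat.st_mtime))
            path.unlink()
            compressed += 1
    return compressed, deleted


def main(argv):
    """Run one rotation pass and report the outcome through the exit status."""
    try:
        compressed, deleted = rotate_logs(argv[1])
    except (IndexError, OSError) as err:
        print(f"log rotation failed: {err}", file=sys.stderr)
        return 1  # non-zero exit status: the task failed
    print(f"compressed={compressed} deleted={deleted}")
    return 0  # zero exit status: the task succeeded
```

A script built on this sketch would end with sys.exit(main(sys.argv)), so that whatever launches it (a Job's container, for instance) can tell from the exit status whether the run succeeded.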
Kubernetes Jobs ensure that one or more pods execute their commands and exit successfully. When all the pods have exited without errors, the Job gets completed. When the Job gets deleted, any created pods get deleted as well.
Like other Kubernetes resources, a Job is created through a definition file. Open a new file; you can name it my_job.yaml. Add the following content to the file:
apiVersion: batch/v1
kind: Job
metadata:
  name: say-something
spec:
  template:
    metadata:
      name: say-something
    spec:
      containers:
      - name: say-something
        image: busybox
        command: ["echo", "Running a job"]
      restartPolicy: OnFailure
As with other Kubernetes resources, we can apply this definition to a running Kubernetes cluster using kubectl as follows:
$ kubectl apply -f my_job.yaml
job.batch/say-something created
Let’s see what pods got created for us:
$ kubectl get pods
NAME                  READY   STATUS              RESTARTS   AGE
say-something-fqjfd   0/1     ContainerCreating   0          2s
Give it a few seconds and run the same command again:
$ kubectl get pods
NAME                  READY   STATUS      RESTARTS   AGE
say-something-fqjfd   0/1     Completed   0          9s
The pod status is not “Running”; it is “Completed” because the job ran and exited successfully. The job we’ve just defined had a trivial task: echo “Running a job” to the standard output.
Before moving any further, let’s ensure that the job indeed did what we instructed it to do:
$ kubectl logs say-something-fqjfd
Running a job
The logs show that this pod echoed “Running a job”. The job was successful.
The definition starts with the apiVersion, kind, and metadata, similar to other Kubernetes config. The spec part contains the pod template. The pod template looks precisely like a pod definition without the kind and apiVersion fields. In our example, we are basing our container on the busybox image. We’re instructing it to execute a command that prints “Running a job.”
The restartPolicy cannot be set to “Always”. By definition, a Job should not restart a pod when it terminates successfully. Thus, the options available for restartPolicy are “Never” and “OnFailure.”
Notice that we didn’t specify a pod selector like in other pod controllers (Deployments, ReplicaSets, etc.).
A Job does not need a pod selector because the controller automatically creates a label for its pods. It ensures that this label is not in use by other jobs or controllers, and it uses it to match and manage its pods.
So far, we’ve seen how we can run one task defined inside a Job object, more commonly known as the “run-once” pattern. However, real-world scenarios involve other patterns as well.
For example, we may have a queue of messages that needs processing. We must spawn consumer jobs that pull messages from the queue until it’s empty. To implement this pattern in Kubernetes Jobs, we set the .spec.completions parameter to a positive number. The Job keeps spawning pods until that many of them have completed successfully; because .spec.parallelism defaults to 1, it runs them one at a time. The Job regards itself as complete when all of those pods have terminated with a successful exit code. Let’s look at an example. Modify our definition file to look as follows:
apiVersion: batch/v1
kind: Job
metadata:
  name: consumer
spec:
  completions: 5
  template:
    metadata:
      name: consumer
    spec:
      containers:
      - name: consumer
        image: busybox
        command: ["/bin/sh","-c"]
        args: ["echo 'consuming a message'; sleep 5"]
      restartPolicy: OnFailure
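This definition is very similar to the one we used before, with some differences: the Job is now named consumer, .spec.completions is set to 5, and the container runs a shell command that prints a message and then sleeps for five seconds.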
Issue the following command:
kubectl apply -f my_job.yaml && kubectl get pods --watch
This command applies the new definition file to the cluster and immediately starts displaying the pods and their status. The --watch flag saves us from having to type the command over and over as it automatically displays any changes in the pod statuses:
job.batch/consumer created
NAME             READY   STATUS              RESTARTS   AGE
consumer-kwwxs   0/1     ContainerCreating   0          0s
consumer-kwwxs   1/1     Running             0          2s
consumer-kwwxs   0/1     Completed           0          7s
consumer-xvb2h   0/1     Pending             0          0s
consumer-xvb2h   0/1     Pending             0          0s
consumer-xvb2h   0/1     ContainerCreating   0          0s
consumer-xvb2h   1/1     Running             0          2s
consumer-xvb2h   0/1     Completed           0          7s
consumer-g58l5   0/1     Pending             0          0s
consumer-g58l5   0/1     Pending             0          0s
consumer-g58l5   0/1     ContainerCreating   0          0s
consumer-g58l5   1/1     Running             0          2s
consumer-g58l5   0/1     Completed           0          7s
consumer-595bl   0/1     Pending             0          0s
consumer-595bl   0/1     Pending             0          0s
consumer-595bl   0/1     ContainerCreating   0          0s
consumer-595bl   1/1     Running             0          2s
consumer-595bl   0/1     Completed           0          7s
consumer-whtmp   0/1     Pending             0          0s
consumer-whtmp   0/1     Pending             0          0s
consumer-whtmp   0/1     ContainerCreating   0          0s
consumer-whtmp   1/1     Running             0          2s
consumer-whtmp   0/1     Completed           0          7s
As you can see from the above output, the Job created the first pod. When that pod terminated without failure, the Job spawned the next one, and so on until the last of the five pods had been created and had terminated without failure.
Another pattern may involve the need to run multiple jobs, but instead of running them one after another, we need to run several of them in parallel. Parallel processing decreases the overall execution time. It has its application in many domains, like data science and AI.
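Modify the definition file to look as follows (if the consumer Job from the previous example still exists, delete it first with kubectl delete job consumer, because the pod template of an existing Job cannot be changed in place):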
apiVersion: batch/v1
kind: Job
metadata:
  name: consumer
spec:
  parallelism: 5
  template:
    metadata:
      name: consumer
    spec:
      containers:
      - name: consumer
        image: busybox
        command: ["/bin/sh","-c"]
        args: ["echo 'consuming a message'; sleep $(shuf -i 5-10 -n 1)"]
      restartPolicy: OnFailure
Here we didn’t set the .spec.completions parameter. Instead, we specified .spec.parallelism. The completions parameter in our case defaults to parallelism (5). The Job now has the following behavior: five pods get launched at the same time, and all of them execute the same task. When one of the pods terminates successfully, the whole Job is considered done. No more pods get spawned, and the Job terminates once the remaining pods have exited. Let’s apply this definition:
$ kubectl apply -f my_job.yaml && kubectl get pods --watch
job.batch/consumer created
NAME             READY   STATUS              RESTARTS   AGE
consumer-q99zs   0/1     Pending             0          0s
consumer-9k6fs   0/1     Pending             0          0s
consumer-5htz9   0/1     Pending             0          0s
consumer-9v6l6   0/1     Pending             0          0s
consumer-sb6wp   0/1     Pending             0          0s
consumer-9k6fs   0/1     Pending             0          1s
consumer-5htz9   0/1     Pending             0          1s
consumer-sb6wp   0/1     Pending             0          1s
consumer-9v6l6   0/1     Pending             0          1s
consumer-q99zs   0/1     Pending             0          1s
consumer-9k6fs   0/1     ContainerCreating   0          1s
consumer-5htz9   0/1     ContainerCreating   0          1s
consumer-sb6wp   0/1     ContainerCreating   0          1s
consumer-9v6l6   0/1     ContainerCreating   0          1s
consumer-q99zs   0/1     ContainerCreating   0          1s
consumer-q99zs   1/1     Running             0          11s
consumer-9k6fs   1/1     Running             0          14s
consumer-sb6wp   1/1     Running             0          17s
consumer-q99zs   0/1     Completed           0          19s
consumer-9k6fs   0/1     Completed           0          19s
consumer-9v6l6   1/1     Running             0          21s
consumer-5htz9   1/1     Running             0          25s
consumer-sb6wp   0/1     Completed           0          25s
consumer-9v6l6   0/1     Completed           0          27s
consumer-5htz9   0/1     Completed           0          33s
In this scenario, the Kubernetes Job spawns five pods at the same time. It is the responsibility of the pods to know whether or not their peers have finished. In our example, we assume that we are consuming messages from a message queue (like RabbitMQ). When there are no more messages to consume, the job receives a notification that it should exit. Once the first pod exits successfully, the Job creates no new pods, and it completes when the remaining pods have terminated.
In the above example, we changed the command that the pod executes to make it sleep for a random number of seconds (from five to ten) before it terminates. This way, we are roughly simulating how multiple pods can work together on an external data source like a message queue or an API.
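The work-queue behavior described above can be sketched in plain Python. This is only a local simulation under stated assumptions: an in-memory queue.Queue stands in for an external broker such as RabbitMQ, threads stand in for the Job's pods, and the names consume and run_workers are made up for this illustration.

```python
import queue
import threading


def consume(messages, worker_name, processed):
    """Pull messages until the queue is empty, then return exit status 0."""
    while True:
        try:
            message = messages.get_nowait()
        except queue.Empty:
            # Nothing left to do: exit successfully so the Job can finish.
            return 0
        processed.append((worker_name, message))


def run_workers(items, parallelism=5):
    """Start `parallelism` workers on one shared queue and wait for all of them."""
    messages = queue.Queue()
    for item in items:
        messages.put(item)

    processed = []  # list.append is safe to call from several threads in CPython
    workers = [
        threading.Thread(target=consume, args=(messages, f"consumer-{i}", processed))
        for i in range(parallelism)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return processed
```

Each message is handled by exactly one worker, and every worker decides on its own when to stop, which mirrors the requirement that the pods, not the Job controller, work out when the queue has been drained.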
A process running through a Kubernetes Job is different from a daemon. A daemon represents some service with a web interface (for example, an API). Traditionally, a daemon is programmed to be stopped or restarted without issues. On the other hand, a process that is designed to run once may run into errors when it exits prematurely. A Kubernetes Job restarts the container inside the pod if it fails. It may also recreate the whole pod or reschedule it to another node for several reasons, for example when the node is upgraded, rebooted, or deleted, or when a container fails while restartPolicy is set to "Never".
Even if you set .spec.parallelism = 1, .spec.completions = 1 and .spec.template.spec.restartPolicy = "Never", there is no guarantee that the Kubernetes Job runs the process only once; the same program may sometimes be started twice.
The bottom line is: the process that runs through a Job must be able to handle premature exit (lock files, cached data, etc.). It must also be capable of surviving while multiple instances of it are running.
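One way to meet both requirements is to make the task idempotent and to publish its result atomically. The sketch below is an assumption-laden illustration, not a prescribed method: the run_task name, the result.txt file, and the idea of storing the result on a shared directory are all hypothetical. Each instance writes to its own temporary file and renames it into place only when the work is complete, so a premature exit never leaves a half-written result and a second instance never sees partial data.

```python
import os
import tempfile
from pathlib import Path


def run_task(work_dir, compute):
    """Run `compute` at most usefully once and publish its output atomically.

    Returns "skipped" if an earlier attempt already finished, else "done".
    """
    work_dir = Path(work_dir)
    result = work_dir / "result.txt"
    if result.exists():
        return "skipped"  # a previous run completed; repeating it is harmless

    # A unique temporary name keeps concurrent instances from sharing a file.
    fd, tmp_name = tempfile.mkstemp(dir=work_dir, prefix="result-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(compute())
        # os.replace is atomic on the same filesystem: readers see either
        # no result at all or the complete one.
        os.replace(tmp_name, result)
    finally:
        # Clean up the temporary file if the task failed before publishing.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
    return "done"
```

If two instances race, both may compute the result, but the last rename simply overwrites an identical file. That trade-off (duplicate work instead of locking) avoids stale lock files left behind by a killed process.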
If you set restartPolicy = "OnFailure" and your Pod has a problem that makes it always fail (a configuration error, an unreachable database or API, etc.), does this mean that the Kubernetes Job keeps restarting the Pod indefinitely? Fortunately, no. A Kubernetes Job retries a failed pod after an exponentially growing delay: it starts at ten seconds and then doubles (10, 20, 40, 80 seconds, and so on), capped at six minutes. Once the number of retries reaches the backoff limit, which defaults to 6, the Job no longer restarts the failing Pod and is marked as failed.
You can override this behavior by setting the spec.backoffLimit to the number of times the Kubernetes Job should restart the failing Pod.
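The retry schedule is easy to work out by hand, and the short Python sketch below does the arithmetic. It is an idealised model, assuming a ten-second base delay that doubles on every retry and a cap of six minutes (360 seconds) as described in the Kubernetes documentation; the function name backoff_delays is made up for this example, and real clusters add scheduling and container start-up time on top.

```python
def backoff_delays(retries, base_seconds=10, cap_seconds=360):
    """Return the wait before each retry: 10, 20, 40, ... capped at cap_seconds."""
    return [min(base_seconds * 2 ** attempt, cap_seconds) for attempt in range(retries)]


# Delays before the first six retries.
print(backoff_delays(6))  # [10, 20, 40, 80, 160, 320]
```

Under these assumptions, six retries add up to 630 seconds of waiting, so lowering the backoff limit is a direct way to make a broken Job fail faster.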
Notice that with restartPolicy = "OnFailure", the pod running the Job is terminated when the backoff limit is reached. This can make it harder to trace what caused the Job to fail. In such a case, you may want to set restartPolicy = "Never" so that the failed pods are kept and you can inspect their logs while debugging the problem.
Sometimes, you are more interested in running your Job for a specific amount of time regardless of whether or not the process completes successfully. Think of an AI application that needs to consume data from Twitter. You’re using a cloud instance, and the provider charges you for the amount of CPU and network resources you are utilizing. You are using a Kubernetes Job for data consumption, and you don’t want it to run for more than one hour.
Kubernetes Jobs offer the spec.activeDeadlineSeconds parameter. Setting this parameter to a number will terminate the Job immediately once this number of seconds is reached.
Notice that this setting overrides .spec.backoffLimit, which means that if the pod fails and the Job reaches its deadline limit, it will not restart the failing pod. It will stop immediately.
In the following example, we are creating a Job that has both a backoff limit and a deadline:
apiVersion: batch/v1
kind: Job
metadata:
  name: twitter-consumer
spec:
  backoffLimit: 5
  activeDeadlineSeconds: 20
  template:
    spec:
      containers:
      - name: consumer
        image: busybox
        command: ["/bin/sh", "-c"]
        args: ["echo 'Consuming data'; sleep 1; exit 1"]
      restartPolicy: OnFailure
In this definition, we are instructing the container to echo some text, sleep for one second, and then exit with a non-zero status (unsuccessful exit). Notice that we intentionally set the deadline to twenty seconds while the backoffLimit is 5. Let’s see what happens when we deploy this job to the cluster:
$ kubectl apply -f my_job.yaml && kubectl get pods --watch
job.batch/twitter-consumer created
NAME                     READY   STATUS              RESTARTS   AGE
twitter-consumer-kfvrj   0/1     ContainerCreating   0          0s
NAME                     READY   STATUS              RESTARTS   AGE
twitter-consumer-kfvrj   1/1     Running             0          4s
twitter-consumer-kfvrj   0/1     Error               0          6s
twitter-consumer-kfvrj   1/1     Running             1          10s
twitter-consumer-kfvrj   0/1     Error               1          11s
twitter-consumer-kfvrj   0/1     Terminating         1          20s
twitter-consumer-kfvrj   0/1     Terminating         1          20s
As you can see from the above output, the Kubernetes Job started the pod, and since the container fails shortly after starting, it gets restarted. It would keep on restarting, but once the activeDeadlineSeconds is reached (twenty seconds), the Job terminates the pod immediately, although the container had run only twice and the backoffLimit of five was never reached.
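The interaction between the two limits can be checked with a rough back-of-the-envelope model in Python. This is a simplification under stated assumptions: the delay before each retry starts at ten seconds and doubles (capped at 360 seconds), image pull and scheduling time are ignored, and the function name attempts_before_deadline is hypothetical.

```python
def attempts_before_deadline(run_seconds, deadline_seconds, backoff_limit,
                             base_delay=10, cap=360):
    """Count container starts until the deadline or the backoff limit stops the Job.

    Returns (number_of_starts, reason), where reason names the limit that was hit.
    """
    clock = 0
    attempts = 0
    # One initial run plus up to `backoff_limit` retries.
    while attempts <= backoff_limit:
        if clock >= deadline_seconds:
            return attempts, "deadline"
        attempts += 1
        clock += run_seconds                                  # the container runs and fails
        clock += min(base_delay * 2 ** (attempts - 1), cap)   # wait before the next retry
    return attempts, "backoffLimit"


# The values from the definition above: 1s per run, 20s deadline, backoffLimit 5.
print(attempts_before_deadline(1, 20, 5))  # (2, 'deadline')
```

With the numbers from the example, the model predicts two container starts before the deadline cuts in, which agrees with the watch output; with a much longer deadline, the backoff limit would be the one to stop the Job.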
When a Kubernetes Job finishes, neither the Job nor the pods that it created get deleted automatically. You have to remove them manually. This feature ensures that you are still able to view the logs and the status of the finished Job and its pods.
A Job can be deleted by using kubectl as follows:
kubectl delete jobs job_name
The above command deletes the specified Job and all its child pods. As with other Kubernetes controllers, you can remove only the Job and leave its pods in place by passing the --cascade=false flag (recent kubectl versions spell this --cascade=orphan). For example:
kubectl delete jobs job_name --cascade=false
It’s worth noting that Kubernetes also lets you specify the number of seconds after which a finished Job gets deleted together with its pods. This feature uses the TTL controller. It was in alpha state when first introduced and has been stable since Kubernetes 1.23. To use it, we can modify our definition file to look as follows:
apiVersion: batch/v1
kind: Job
metadata:
  name: twitter-consumer
spec:
  backoffLimit: 5
  activeDeadlineSeconds: 20
  ttlSecondsAfterFinished: 60
  template:
    spec:
      containers:
      - name: consumer
        image: busybox
        command: ["/bin/sh", "-c"]
        args: ["echo 'Consuming data'; sleep 1; exit 1"]
      restartPolicy: OnFailure
With this definition, the finished Job and its pods are deleted one minute after the Job finishes.