Kubernetes Jobs Use Cases
Kubernetes features several controllers for managing pods: ReplicaSets, DaemonSets, StatefulSets, and Deployments. Each one has its own scenario and use case. However, they all share one common property: they ensure that their pods are always running. If a pod fails, the controller restarts it or reschedules it to another node to make sure the application the pod is hosting keeps running.
What if we do want the pod to terminate? There are many scenarios in which you don’t want the process to keep running indefinitely. Think of a log rotation command. Log rotation is the process of archiving (compressing) log files that are older than a particular time threshold and deleting ancient ones. Such a process should not run continuously. Instead, it gets executed, and once the task is complete, it returns an exit status that reports whether the result is a success or a failure.
Kubernetes Jobs ensure that one or more pods execute their commands and exit successfully. When all the pods have exited without errors, the Job gets completed. When the Job gets deleted, any created pods get deleted as well.
Your First Kubernetes Job
Like other Kubernetes resources, a Job is created from a definition file. Open a new file; you can name it my_job.yaml. Add the following content to the file:
apiVersion: batch/v1
kind: Job
metadata:
  name: say-something
spec:
  template:
    metadata:
      name: say-something
    spec:
      containers:
      - name: say-something
        image: busybox
        command: ["echo", "Running a job"]
      restartPolicy: OnFailure
As with other Kubernetes resources, we can apply this definition to a running Kubernetes cluster using kubectl as follows:
$ kubectl apply -f my_job.yaml
job.batch/say-something created
Let’s see what pods got created for us:
$ kubectl get pods
NAME                  READY   STATUS              RESTARTS   AGE
say-something-fqjfd   0/1     ContainerCreating   0          2s
Give it a few seconds and run the same command again:
$ kubectl get pods
NAME                  READY   STATUS      RESTARTS   AGE
say-something-fqjfd   0/1     Completed   0          9s
The pod status is not “Running”; it is “Completed” because the job ran and exited successfully. The job we’ve just defined had a trivial task: echo “Running a job” to the standard output.
Before moving any further, let’s ensure that the job indeed did what we instructed it to do:
$ kubectl logs say-something-fqjfd
Running a job
The logs show that this pod echoed “Running a job”. The job was successful.
The Kubernetes Job definition file
The definition starts with the apiVersion, kind, and metadata, similar to other Kubernetes config. The spec part contains the pod template. The pod template looks precisely like a pod definition without the kind and apiVersion fields. In our example, we are basing our container on the busybox image. We’re instructing it to execute a command that prints “Running a job.”
Kubernetes Job restartPolicy
The restartPolicy cannot be set to “Always”. By definition, a Job should not restart a pod when it terminates successfully. Thus, the options available for restartPolicy are “Never” and “OnFailure.”
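To tie the definition file and the restartPolicy rule together, here is a minimal Python sketch that builds the same Job as a dictionary and refuses any restartPolicy other than the two allowed values. The make_job helper and its parameter names are hypothetical and exist only for illustration; the field names are the ones used in the manifest above.

```python
import json

# Restart policies a Job's pod template may use; "Always" is not allowed.
ALLOWED_RESTART_POLICIES = {"Never", "OnFailure"}


def make_job(name, image, command, restart_policy="OnFailure"):
    """Return a batch/v1 Job manifest as a plain dictionary (hypothetical helper)."""
    if restart_policy not in ALLOWED_RESTART_POLICIES:
        raise ValueError(
            f"restartPolicy must be one of {sorted(ALLOWED_RESTART_POLICIES)}, "
            f"got {restart_policy!r}"
        )
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": name, "image": image, "command": list(command)}
                    ],
                    "restartPolicy": restart_policy,
                }
            }
        },
    }


if __name__ == "__main__":
    job = make_job("say-something", "busybox", ["echo", "Running a job"])
    # Print the manifest as JSON so it can be inspected or saved to a file.
    print(json.dumps(job, indent=2))
```

Since kubectl accepts JSON manifests as well as YAML, the printed output could be saved to a file and applied with kubectl apply -f.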
Kubernetes Job Needs no Pod Selector
Notice that we didn’t specify a pod selector like in other pod controllers (Deployments, ReplicaSets, etc.).
A Job does not need a pod selector because the controller automatically creates a label for its pods. It ensures that this label is not in use by other jobs or controllers, and it uses it to match and manage its pods.
Job Completions and Parallelism
So far, we’ve seen how we can run one task defined inside a Job object, more commonly known as the “run-once” pattern. However, real-world scenarios involve other patterns as well.
Multiple Single Jobs
For example, we may have a queue of messages that needs processing. We must spawn consumer jobs that pull messages from the queue until it’s empty. To implement this pattern with Kubernetes Jobs, we set the .spec.completions parameter to a positive number. The Job keeps spawning pods until that number of pods has terminated with a successful exit code, at which point the Job regards itself as complete. Let’s look at an example. Modify our definition file to look as follows:
apiVersion: batch/v1
kind: Job
metadata:
  name: consumer
spec:
  completions: 5
  template:
    metadata:
      name: consumer
    spec:
      containers:
      - name: consumer
        image: busybox
        command: ["/bin/sh","-c"]
        args: ["echo 'consuming a message'; sleep 5"]
      restartPolicy: OnFailure
This definition is very similar to the one we used before with some differences:
- We specify the completions parameter to be 5.
- We change the command that the container runs to include a five-second delay, which gives us time to see the pods being created and terminated.
Issue the following command:
kubectl apply -f my_job.yaml && kubectl get pods --watch
This command applies the new definition file to the cluster and immediately starts displaying the pods and their status. The --watch flag saves us from having to type the command over and over as it automatically displays any changes in the pod statuses:
job.batch/consumer created
NAME             READY   STATUS              RESTARTS   AGE
consumer-kwwxs   0/1     ContainerCreating   0          0s
consumer-kwwxs   1/1     Running             0          2s
consumer-kwwxs   0/1     Completed           0          7s
consumer-xvb2h   0/1     Pending             0          0s
consumer-xvb2h   0/1     Pending             0          0s
consumer-xvb2h   0/1     ContainerCreating   0          0s
consumer-xvb2h   1/1     Running             0          2s
consumer-xvb2h   0/1     Completed           0          7s
consumer-g58l5   0/1     Pending             0          0s
consumer-g58l5   0/1     Pending             0          0s
consumer-g58l5   0/1     ContainerCreating   0          0s
consumer-g58l5   1/1     Running             0          2s
consumer-g58l5   0/1     Completed           0          7s
consumer-595bl   0/1     Pending             0          0s
consumer-595bl   0/1     Pending             0          0s
consumer-595bl   0/1     ContainerCreating   0          0s
consumer-595bl   1/1     Running             0          2s
consumer-595bl   0/1     Completed           0          7s
consumer-whtmp   0/1     Pending             0          0s
consumer-whtmp   0/1     Pending             0          0s
consumer-whtmp   0/1     ContainerCreating   0          0s
consumer-whtmp   1/1     Running             0          2s
consumer-whtmp   0/1     Completed           0          7s
As you can see from the above output, the Job created the first pod. When that pod terminated without failure, the Job spawned the next one, and so on until the last of the five pods had been created and had terminated with no failure.
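The one-at-a-time behavior comes from .spec.parallelism, which defaults to 1 when it is not set. The following Python sketch is a rough model of how completions and parallelism interact, not the Job controller's actual algorithm: the pod_waves function is hypothetical, and it assumes that pods started together also finish together and never fail.

```python
def pod_waves(completions, parallelism=1):
    """Return how many pods start in each wave of a simplified Job model.

    Assumption: every pod in a wave succeeds and finishes before the next
    wave starts. The real Job controller replaces pods one by one instead.
    """
    if completions < 1 or parallelism < 1:
        raise ValueError("completions and parallelism must be positive")
    waves = []
    remaining = completions
    while remaining > 0:
        # Never start more pods than there are completions left to reach.
        started = min(parallelism, remaining)
        waves.append(started)
        remaining -= started
    return waves


if __name__ == "__main__":
    print(pod_waves(completions=5))                 # [1, 1, 1, 1, 1]
    print(pod_waves(completions=5, parallelism=2))  # [2, 2, 1]
```

The first call mirrors the example above, where five pods run strictly one after another. The second shows what the same model predicts if parallelism were raised to 2.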
Multiple Parallel Jobs (Work Queue)
Another pattern may involve the need to run multiple jobs, but instead of running them one after another, we need to run several of them in parallel. Parallel processing decreases the overall execution time. It has its application in many domains, like data science and AI.
Modify the definition file to look as follows:
apiVersion: batch/v1
kind: Job
metadata:
  name: consumer
spec:
  parallelism: 5
  template:
    metadata:
      name: consumer
    spec:
      containers:
      - name: consumer
        image: busybox
        command: ["/bin/sh","-c"]
        args: ["echo 'consuming a message'; sleep $(shuf -i 5-10 -n 1)"]
      restartPolicy: OnFailure
Here we left the .spec.completions parameter unset and specified .spec.parallelism instead. With completions unset, the Job does not count successes toward a fixed target; it treats its pods as workers sharing one pool of work. The Job now has the following behavior: five pods get launched at the same time, all of them executing the same task. When one of the pods terminates successfully, the Job considers the work done. No more pods get spawned, and the Job completes once the remaining pods have exited. Let’s apply this definition:
$ kubectl apply -f my_job.yaml && kubectl get pods --watch
job.batch/consumer created
NAME             READY   STATUS              RESTARTS   AGE
consumer-q99zs   0/1     Pending             0          0s
consumer-9k6fs   0/1     Pending             0          0s
consumer-5htz9   0/1     Pending             0          0s
consumer-9v6l6   0/1     Pending             0          0s
consumer-sb6wp   0/1     Pending             0          0s
consumer-9k6fs   0/1     Pending             0          1s
consumer-5htz9   0/1     Pending             0          1s
consumer-sb6wp   0/1     Pending             0          1s
consumer-9v6l6   0/1     Pending             0          1s
consumer-q99zs   0/1     Pending             0          1s
consumer-9k6fs   0/1     ContainerCreating   0          1s
consumer-5htz9   0/1     ContainerCreating   0          1s
consumer-sb6wp   0/1     ContainerCreating   0          1s
consumer-9v6l6   0/1     ContainerCreating   0          1s
consumer-q99zs   0/1     ContainerCreating   0          1s
consumer-q99zs   1/1     Running             0          11s
consumer-9k6fs   1/1     Running             0          14s
consumer-sb6wp   1/1     Running             0          17s
consumer-q99zs   0/1     Completed           0          19s
consumer-9k6fs   0/1     Completed           0          19s
consumer-9v6l6   1/1     Running             0          21s
consumer-5htz9   1/1     Running             0          25s
consumer-sb6wp   0/1     Completed           0          25s
consumer-9v6l6   0/1     Completed           0          27s
consumer-5htz9   0/1     Completed           0          33s
In this scenario, the Kubernetes Job spawns five pods at the same time. It is the responsibility of the pods to know whether or not their peers have finished. In our example, we assume that we are consuming messages from a message queue (like RabbitMQ). When there are no more messages to consume, each pod learns from the queue that it should exit. Once the first pod exits successfully:
- No more pods are spawned.
- Existing pods finish their work and exit as well.
In the above example, we changed the command that the pod executes to make it sleep for a random number of seconds (from five to ten) before it terminates. This way, we are roughly simulating how multiple pods can work together on an external data source like a message queue or an API.
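As a rough illustration of what each consumer pod does in this pattern, the sketch below uses Python threads in place of pods and an in-memory queue.Queue in place of a real message broker such as RabbitMQ. Both substitutions, and the names consume and run_workers, are assumptions made for this example only.

```python
import queue
import threading


def consume(messages, results):
    """Drain the shared queue; return when no messages are left."""
    while True:
        try:
            message = messages.get_nowait()
        except queue.Empty:
            return  # an empty queue is this worker's signal to exit
        results.append(f"{threading.current_thread().name} consumed {message}")


def run_workers(items, parallelism=5):
    """Start `parallelism` workers on one queue and wait for all of them."""
    messages = queue.Queue()
    for item in items:
        messages.put(item)
    results = []  # list.append is thread-safe in CPython
    workers = [
        threading.Thread(target=consume, args=(messages, results), name=f"pod-{i}")
        for i in range(parallelism)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    for line in run_workers(range(12)):
        print(line)
```

Each message is handled exactly once no matter which worker picks it up, and every worker exits on its own when the queue is empty. That is the property the pods in a work-queue Job need to have.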
Kubernetes Job Failure and Concurrency Considerations
A process running through a Kubernetes Job is different from a daemon. A daemon represents some service with a web interface (for example, an API). Traditionally, a daemon is programmed to be stopped or restarted without issues. On the other hand, a process that is designed to run once may run into errors when it exits prematurely. A Kubernetes Job restarts the container inside the pod if it fails. It may also restart the whole pod or reschedule it to another node for multiple reasons. For example:
- The node crashed.
- The node got rebooted or upgraded.
- The pod was consuming more resources than the node has.
Even if you set .spec.parallelism = 1, .spec.completions = 1 and .spec.template.spec.restartPolicy = "Never", there is no guarantee that the Kubernetes Job runs the process only once; the same program may sometimes be started twice.
The bottom line is: the process that runs through a Job must be able to recover from a premature exit that leaves lock files, cached data, or other partial state behind. It must also behave correctly while multiple instances of it are running.
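One common way to meet both requirements is to make the task idempotent and to publish its result atomically. The following is a minimal Python sketch under that assumption; the run_once function, its parameters, and the idea of using an output file as the "done" marker are hypothetical choices for illustration, not something Kubernetes prescribes.

```python
import os
import tempfile


def run_once(output_path, produce):
    """Write produce() to output_path atomically; skip if already done.

    Returns True when the work ran and False when a finished result existed.
    """
    if os.path.exists(output_path):
        return False  # a previous run (or a duplicate pod) already finished

    directory = os.path.dirname(os.path.abspath(output_path))
    # Write to a temporary file first so a crash never leaves a partial result.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(produce())
        os.replace(tmp_path, output_path)  # atomic rename on the same filesystem
    except BaseException:
        # Clean up the temporary file if the work failed part-way through.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

If the process is killed before os.replace runs, the output file does not exist yet, so a restarted pod simply redoes the work. If two instances run at once, both may do the work, but the final file is always a complete result from one of them.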
The Pod Failure Limit
If you set restartPolicy = "OnFailure" and your Pod has a problem that makes it always fail (a configuration error, an unreachable database or API, etc.), does this mean that the Kubernetes Job keeps restarting the Pod indefinitely? Fortunately, no. A Kubernetes Job retries a failed pod after a back-off delay. The delay starts at ten seconds and doubles with each retry (10, 20, 40, 80, and so on), capped at six minutes. Once the number of retries reaches the backoff limit, which defaults to six, the Job is marked as failed and no longer restarts the failing Pod.
You can override this behavior by setting .spec.backoffLimit to the number of times the Kubernetes Job should retry the failing Pod.
Notice that with restartPolicy = "OnFailure", the pod running the Job is terminated once the backoff limit is reached. This can make it harder to trace what caused the Job to fail. In such a case, you may want to set restartPolicy = "Never" while you debug the problem.
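To make the retry timing concrete, here is a small Python sketch of the delay schedule. It assumes, as the Kubernetes documentation describes, that the delay starts at ten seconds, doubles after every failure, and is capped at six minutes (360 seconds). The backoff_delays function and its parameter names are hypothetical.

```python
def backoff_delays(retries, base_seconds=10, cap_seconds=360):
    """Return the wait before each retry: doubling from base, capped at cap."""
    delays = []
    delay = base_seconds
    for _ in range(retries):
        delays.append(min(delay, cap_seconds))
        delay *= 2
    return delays


if __name__ == "__main__":
    delays = backoff_delays(retries=6)
    print(delays)       # [10, 20, 40, 80, 160, 320]
    print(sum(delays))  # 630
```

Under these assumptions, six retries add up to roughly ten and a half minutes of waiting, not counting the time the container itself takes to start and fail.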
Limiting the Kubernetes Job Execution Time
Sometimes, you are more interested in running your Job for a specific amount of time regardless of whether or not the process completes successfully. Think of an AI application that needs to consume data from Twitter. You’re using a cloud instance, and the provider charges you for the amount of CPU and network resources you are utilizing. You are using a Kubernetes Job for data consumption, and you don’t want it to run for more than one hour.
Kubernetes Jobs offer the .spec.activeDeadlineSeconds parameter. Once the Job has been active for this number of seconds, all of its running pods are terminated and the Job is marked as failed.
Notice that this setting takes precedence over .spec.backoffLimit, which means that if the pod fails and the Job reaches its deadline, it will not restart the failing pod. It will stop immediately.
In the following example, we are creating a Job that has both a backoff limit and a deadline:
apiVersion: batch/v1
kind: Job
metadata:
  name: twitter-consumer
spec:
  backoffLimit: 5
  activeDeadlineSeconds: 20
  template:
    spec:
      containers:
      - name: consumer
        image: busybox
        command: ["/bin/sh", "-c"]
        args: ["echo 'Consuming data'; sleep 1; exit 1"]
      restartPolicy: OnFailure
In this definition, we are instructing the container to echo some text, sleep for one second, and then exit with a non-zero status (unsuccessful exit). Notice that we intentionally set the deadline to twenty seconds while the backoffLimit is 5. Let’s see what happens when we deploy this job to the cluster:
$ kubectl apply -f my_job.yaml && kubectl get pods --watch
job.batch/twitter-consumer created
NAME                     READY   STATUS              RESTARTS   AGE
twitter-consumer-kfvrj   0/1     ContainerCreating   0          0s
NAME                     READY   STATUS              RESTARTS   AGE
twitter-consumer-kfvrj   1/1     Running             0          4s
twitter-consumer-kfvrj   0/1     Error               0          6s
twitter-consumer-kfvrj   1/1     Running             1          10s
twitter-consumer-kfvrj   0/1     Error               1          11s
twitter-consumer-kfvrj   0/1     Terminating         1          20s
twitter-consumer-kfvrj   0/1     Terminating         1          20s
As you can see from the above output, the Kubernetes Job started the pod, and since the container fails shortly after starting, it gets restarted. This would continue, but once activeDeadlineSeconds is reached (twenty seconds), the Job terminates the pod immediately, although the container had run only twice and the backoffLimit of five had not been reached.
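The interplay of the two limits can be approximated with a toy simulation. The Python sketch below is a simplified model, not the Job controller's logic: it ignores scheduling and image pull time, assumes the retry delay starts at ten seconds and doubles up to a six-minute cap, and assumes the Job gives up once the number of failures exceeds backoffLimit. The first_limit_reached function and its parameters are hypothetical.

```python
def first_limit_reached(backoff_limit, deadline_seconds, run_seconds,
                        base_delay=10, cap_delay=360):
    """Simulate an always-failing pod and report which limit ends the Job.

    Returns a (reason, failures, elapsed_seconds) tuple.
    """
    elapsed = 0
    failures = 0
    delay = base_delay
    while True:
        elapsed += run_seconds  # the container runs, then fails
        if elapsed >= deadline_seconds:
            return ("deadline", failures, deadline_seconds)
        failures += 1
        if failures > backoff_limit:
            return ("backoffLimit", failures, elapsed)
        elapsed += min(delay, cap_delay)  # wait before the next retry
        if elapsed >= deadline_seconds:
            return ("deadline", failures, deadline_seconds)
        delay *= 2


if __name__ == "__main__":
    # Values from the example: backoffLimit 5, deadline 20s, container runs ~1s.
    print(first_limit_reached(5, 20, 1))    # ('deadline', 2, 20)
    # With a one-hour deadline, the backoff limit is reached first.
    print(first_limit_reached(5, 3600, 1))  # ('backoffLimit', 6, 316)
```

With the values from our definition, the model reports two failures before the twenty-second deadline ends the Job, which lines up with the two runs visible in the output above.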
Kubernetes Job Deletion and Cleanup
When a Kubernetes Job finishes, neither the Job nor the pods that it created get deleted automatically. You have to remove them manually. This feature ensures that you are still able to view the logs and the status of the finished Job and its pods.
A Job can be deleted by using kubectl as follows:
kubectl delete jobs job_name
The above command deletes the specified Job and all its child pods. As with other Kubernetes controllers, you can opt to remove only the Job and leave its pods in place by passing the --cascade=false flag (newer kubectl versions spell this --cascade=orphan). For example:
kubectl delete jobs job_name --cascade=false
It’s worth noting that there is a new feature in Kubernetes that allows you to specify the number of seconds after which a completed Job gets deleted together with its pods. This feature uses the TTL controller. Notice that this feature is still in alpha state. To use it, we can modify our definition file to look as follows:
apiVersion: batch/v1
kind: Job
metadata:
  name: twitter-consumer
spec:
  backoffLimit: 5
  activeDeadlineSeconds: 20
  ttlSecondsAfterFinished: 60
  template:
    spec:
      containers:
      - name: consumer
        image: busybox
        command: ["/bin/sh", "-c"]
        args: ["echo 'Consuming data'; sleep 1; exit 1"]
      restartPolicy: OnFailure
This new definition ensures that the finished Job and its pods get deleted one minute after the Job finishes.
- Kubernetes Jobs are used when you want to create pods that will do a specific task and then exit.
- Kubernetes Jobs do not need pod selectors; the Job handles its pods’ labels and selectors automatically.
- The restartPolicy for a Job accepts “Never” and “OnFailure.”
- Jobs use the completions and parallelism parameters to control the pattern the pods run in. Job pods can run as a single task, as several sequential tasks, or as parallel tasks in which the first pod to finish successfully signals that the work is done, so no more pods are started and the rest finish and exit.
- You can control how many times a Job attempts to restart a failing pod using the .spec.backoffLimit. This limit defaults to six.
- You can control how long a job will run using the .spec.activeDeadlineSeconds. This limit overrules the backoffLimit. So the Job does not attempt to restart a failing pod if the deadline is reached.
- Jobs and their pods do not get deleted automatically when they finish. You have to delete them manually or use the ttlSecondsAfterFinished field, which is still in alpha stage as of the time of this writing.