
Kubernetes Jobs 101 - The Task Jobs

Kubernetes Jobs Use Cases

Kubernetes features several controllers for managing pods: ReplicaSets, DaemonSets, StatefulSets, and Deployments. Each one has its own scenarios and use cases, but they all share one common property: they ensure that their pods are always running. If a pod fails, the controller restarts it or reschedules it to another node to make sure the application the pods are hosting keeps running.

What if we do want the pod to terminate? There are many scenarios where you don’t want the process to keep running indefinitely. Think of a log rotation command. Log rotation is the process of archiving (compressing) log files that are older than a particular time threshold and deleting the oldest ones. Such a process should not run continuously. Instead, it gets executed, and once the task is complete, it returns an exit status that reports whether the result is a success or a failure.

Kubernetes Jobs ensure that one or more pods execute their commands and exit successfully. When all the pods have exited without errors, the Job is complete. When the Job gets deleted, any pods it created are deleted as well.

Your First Kubernetes Job

A Kubernetes Job, like other Kubernetes resources, is created through a definition file. Open a new file; you can name it my_job.yaml. Add the following content to it:

apiVersion: batch/v1
kind: Job
metadata:
  name: say-something
spec:
  template:
    metadata:
      name: say-something
    spec:
      containers:
      - name: say-something
        image: busybox
        command: ["echo", "Running a job"]
      restartPolicy: OnFailure

As with other Kubernetes resources, we can apply this definition to a running Kubernetes cluster using kubectl as follows:

$ kubectl apply -f my_job.yaml 
job.batch/say-something created

Let’s see what pods got created for us:

$ kubectl get pods 
NAME                  READY   STATUS              RESTARTS   AGE
say-something-fqjfd   0/1     ContainerCreating   0          2s

Give it a few seconds and run the same command again:

$ kubectl get pods 
NAME                  READY   STATUS      RESTARTS   AGE
say-something-fqjfd   0/1     Completed   0          9s

The pod status is not Running; it is Completed, because the Job ran and exited successfully. The Job we’ve just defined had a very simple task: echo “Running a job” to the standard output.

Before moving any further, let’s ensure that the job indeed did what we instructed it to do:

$ kubectl logs say-something-fqjfd
Running a job

The logs show that this pod echoed “Running a job”. The job was successful.
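
You can also check the Job object itself rather than its pod. The output below is illustrative (the duration and age will differ on your cluster), but the COMPLETIONS column should show 1/1 once the Job is done:

$ kubectl get jobs say-something
NAME            COMPLETIONS   DURATION   AGE
say-something   1/1           3s         30s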

The Kubernetes Job Definition File

The definition starts with the apiVersion, kind, and metadata fields, like other Kubernetes configurations. The spec part contains the pod template, which looks exactly like a pod definition without the kind and apiVersion fields. In our example, we base our container on the busybox image and instruct it to execute a command that prints “Running a job.”

Kubernetes Job restartPolicy

The restartPolicy cannot be set to Always. By definition, a Job should not restart a pod when it terminates successfully. Thus, the options available for restartPolicy are “Never” and “OnFailure.”
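
If you want a failed container to be retried in a brand-new pod rather than restarted in place, the pod template can use “Never” instead. The sketch below is a hypothetical variant of the earlier definition; only the name and the restartPolicy differ:

apiVersion: batch/v1
kind: Job
metadata:
  # Hypothetical name; only restartPolicy differs from the earlier example.
  name: say-something-never
spec:
  template:
    spec:
      containers:
      - name: say-something
        image: busybox
        command: ["echo", "Running a job"]
      # With Never, a failed container is not restarted in place;
      # the Job controller creates a new pod for the retry instead.
      restartPolicy: Never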

Kubernetes Job Needs no Pod Selector

Notice that we didn’t specify a pod selector like in other pod controllers (Deployments, ReplicaSets, etc.).

A Job does not need a pod selector because the controller automatically creates a label for its pods. It ensures that this label is not in use by other jobs or controllers, and it uses it to match and manage its pods.
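
You can see this in action by filtering pods on the job-name label that the Job controller adds to the pods it creates (it also adds a controller-uid label):

# List only the pods owned by our Job, using the label the controller added.
$ kubectl get pods -l job-name=say-something --show-labels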

Jobs Completions and Parallelism

So far, we’ve seen how we can run one task defined inside a Job object, more commonly known as the “run-once” pattern. However, real-world scenarios involve other patterns as well.

Multiple Sequential Jobs

For example, we may have a queue of messages that needs processing. We must spawn consumer jobs that pull messages from the queue until it’s empty. To implement this pattern with a Kubernetes Job, we set the .spec.completions parameter to a number (it must be a non-zero, positive integer). The Job starts spawning pods until it reaches the completions count. The Job regards itself as complete when all the pods terminate with a successful exit code. Let’s look at an example. Modify our definition file to look as follows:

apiVersion: batch/v1
kind: Job
metadata:
  name: consumer
spec:
  completions: 5
  template:
    metadata:
      name: consumer
    spec:
      containers:
      - name: consumer
        image: busybox
        command: ["/bin/sh","-c"]
        args: ["echo 'consuming a message'; sleep 5"]
      restartPolicy: OnFailure

This definition is very similar to the one we used before with some differences:

  • We specify the completions parameter to be 5.
  • We change the command that the container inside the pod runs to include a five-second delay, which ensures that we can watch the pods get created and terminated.

Issue the following command:

kubectl apply -f my_job.yaml && kubectl get pods --watch

This command applies the new definition file to the cluster and immediately starts displaying the pods and their statuses. The --watch flag saves us from having to type the command over and over, as it automatically displays any changes in pod status:

job.batch/consumer created
NAME             READY   STATUS              RESTARTS   AGE
consumer-kwwxs   0/1     ContainerCreating   0          0s
consumer-kwwxs   1/1     Running             0          2s
consumer-kwwxs   0/1     Completed           0          7s
consumer-xvb2h   0/1     Pending             0          0s
consumer-xvb2h   0/1     Pending             0          0s
consumer-xvb2h   0/1     ContainerCreating   0          0s
consumer-xvb2h   1/1     Running             0          2s
consumer-xvb2h   0/1     Completed           0          7s
consumer-g58l5   0/1     Pending             0          0s
consumer-g58l5   0/1     Pending             0          0s
consumer-g58l5   0/1     ContainerCreating   0          0s
consumer-g58l5   1/1     Running             0          2s
consumer-g58l5   0/1     Completed           0          7s
consumer-595bl   0/1     Pending             0          0s
consumer-595bl   0/1     Pending             0          0s
consumer-595bl   0/1     ContainerCreating   0          0s
consumer-595bl   1/1     Running             0          2s
consumer-595bl   0/1     Completed           0          7s
consumer-whtmp   0/1     Pending             0          0s
consumer-whtmp   0/1     Pending             0          0s
consumer-whtmp   0/1     ContainerCreating   0          0s
consumer-whtmp   1/1     Running             0          2s
consumer-whtmp   0/1     Completed           0          7s

As you can see from the above output, the Job created the first pod. When that pod terminated without failure, the Job spawned the next one, and so on, until the last of the five pods was created and terminated with no failure.
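
The Job object keeps count of these completions as well. The numbers below are only illustrative, but once all five pods have finished, the COMPLETIONS column should read 5/5:

$ kubectl get jobs consumer
NAME       COMPLETIONS   DURATION   AGE
consumer   5/5           37s        50s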

Multiple Parallel Jobs (Work Queue)

Another pattern may involve the need to run multiple jobs, but instead of running them one after another, we need to run several of them in parallel. Parallel processing decreases the overall execution time. It has applications in many domains, like data science and AI.

Modify the definition file to look as follows:

apiVersion: batch/v1
kind: Job
metadata:
  name: consumer
spec:
  parallelism: 5
  template:
    metadata:
      name: consumer
    spec:
      containers:
      - name: consumer
        image: busybox
        command: ["/bin/sh","-c"]
        args: ["echo 'consuming a message'; sleep $(shuf -i 5-10 -n 1)"]
      restartPolicy: OnFailure

Here we didn’t set the .spec.completions parameter; instead, we specified parallelism. When completions is left unset, the Job behaves like a work queue: five pods are launched at the same time, all executing the same task. When one of the pods terminates successfully, no more pods get spawned, and the Job finishes once the remaining pods exit. Let’s apply this definition:

$ kubectl apply -f my_job.yaml && kubectl get pods --watch
job.batch/consumer created
NAME             READY   STATUS              RESTARTS   AGE
consumer-q99zs   0/1     Pending             0          0s
consumer-9k6fs   0/1     Pending             0          0s
consumer-5htz9   0/1     Pending             0          0s
consumer-9v6l6   0/1     Pending             0          0s
consumer-sb6wp   0/1     Pending             0          0s
consumer-9k6fs   0/1     Pending             0          1s
consumer-5htz9   0/1     Pending             0          1s
consumer-sb6wp   0/1     Pending             0          1s
consumer-9v6l6   0/1     Pending             0          1s
consumer-q99zs   0/1     Pending             0          1s
consumer-9k6fs   0/1     ContainerCreating   0          1s
consumer-5htz9   0/1     ContainerCreating   0          1s
consumer-sb6wp   0/1     ContainerCreating   0          1s
consumer-9v6l6   0/1     ContainerCreating   0          1s
consumer-q99zs   0/1     ContainerCreating   0          1s
consumer-q99zs   1/1     Running             0          11s
consumer-9k6fs   1/1     Running             0          14s
consumer-sb6wp   1/1     Running             0          17s
consumer-q99zs   0/1     Completed           0          19s
consumer-9k6fs   0/1     Completed           0          19s
consumer-9v6l6   1/1     Running             0          21s
consumer-5htz9   1/1     Running             0          25s
consumer-sb6wp   0/1     Completed           0          25s
consumer-9v6l6   0/1     Completed           0          27s
consumer-5htz9   0/1     Completed           0          33s

In this scenario, the Kubernetes Job spawns five pods at the same time. It is the responsibility of the pods to know whether or not their peers have finished. In our example, we assume that we are consuming messages from a message queue (like RabbitMQ). When there are no more messages to consume, the pod learns that it should exit. Once the first pod exits successfully:

  • No more pods are spawned.
  • Existing pods finish their work and exit as well.

In the above example, we changed the command that the pod executes to make it sleep for a random number of seconds (from five to ten) before it terminates. This way, we are roughly simulating how multiple pods can work together on an external data source like a message queue or an API.
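
If you want to confirm how many of the parallel pods finished successfully, the Job records that number in its status; a quick way to read it with jsonpath is shown below:

# Prints the number of pods that terminated successfully for this Job.
$ kubectl get job consumer -o jsonpath='{.status.succeeded}'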

Kubernetes Job Failure and Concurrency Considerations

A process running through a Kubernetes Job is different from a daemon. A daemon is a long-running service (for example, an API) that is traditionally programmed to be stopped or restarted without issues. On the other hand, a process that is designed to run once may run into errors if it exits prematurely. A Kubernetes Job restarts the container inside the pod if it fails. It may also restart the whole pod or reschedule it to another node, for multiple reasons. For example:

  • The node crashed.
  • The node got rebooted or upgraded.
  • The pod was consuming more resources than the node could provide.

Even if you set .spec.parallelism = 1, .spec.completions = 1, and .spec.template.spec.restartPolicy = "Never", there is no guarantee that the process runs exactly once; the Job may still end up running it more than once.

The bottom line is: the process that runs through a Job must be able to handle a premature exit from a previous run (stale lock files, partially written data, etc.). It must also be able to run correctly while multiple instances of it are running at the same time.
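
As a rough illustration of this idea, the sketch below makes the task idempotent by leaving a marker file on a shared volume and skipping the work if the marker already exists. The Job name, the hostPath directory, and the marker file are all hypothetical, and a real setup would more likely use a persistent volume or an external store:

apiVersion: batch/v1
kind: Job
metadata:
  name: idempotent-task   # hypothetical example
spec:
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["/bin/sh", "-c"]
        # Skip the work if a previous (possibly interrupted) run already
        # finished and left its marker file behind.
        args:
        - |
          if [ -f /work/done ]; then
            echo 'work already done, exiting'
          else
            echo 'doing the work'
            sleep 5
            touch /work/done
          fi
        volumeMounts:
        - name: work
          mountPath: /work
      restartPolicy: OnFailure
      volumes:
      - name: work
        hostPath:
          path: /tmp/idempotent-task   # hypothetical node directory
          type: DirectoryOrCreate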

The Pod Failure Limit

If you set restartPolicy = "OnFailure" and your pod has a problem that makes it always fail (a configuration error, an unreachable database or API, etc.), does this mean that the Kubernetes Job keeps restarting the pod indefinitely? Fortunately, no. A Kubernetes Job retries a failed pod with an exponentially increasing back-off delay: the delay starts at ten seconds and doubles after each failure (10, 20, 40, 80 seconds, and so on), capped at six minutes. After a default of six failed retries, the Job stops restarting the failing pod and is marked as failed.

You can override this behavior by setting .spec.backoffLimit to the number of retries the Kubernetes Job should make before it gives up on the failing pod.

Notice that with restartPolicy = "OnFailure", the container is terminated once the back-off limit is reached, which can make it harder to trace back what caused the Job to fail. In such a case, you may want to set restartPolicy = "Never" and debug the problem from there.
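
With “Never”, each failed attempt leaves its pod behind, so you can list those pods and read their logs. The Job name and pod name below are placeholders:

# List the pods the Job created (replace my-failing-job with your Job's name),
# then read the logs of one of the failed attempts.
$ kubectl get pods -l job-name=my-failing-job
$ kubectl logs my-failing-job-abc12   # hypothetical pod name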

Limiting the Kubernetes Job Execution Time

Sometimes, you are more interested in running your Job for a specific amount of time regardless of whether or not the process completes successfully. Think of an AI application that needs to consume data from Twitter. You’re using a cloud instance, and the provider charges you for the amount of CPU and network resources you are utilizing. You are using a Kubernetes Job for data consumption, and you don’t want it to run for more than one hour.

Kubernetes Jobs offer the .spec.activeDeadlineSeconds parameter. Once the Job has been active for that many seconds, it is terminated immediately, together with all of its pods.

Notice that this setting takes precedence over .spec.backoffLimit: if the pod fails and the Job reaches its deadline, the failing pod is not restarted; the Job stops immediately.

In the following example, we are creating a Job that has both a backoff limit and a deadline:

apiVersion: batch/v1
kind: Job
metadata:
  name: twitter-consumer
spec:
  backoffLimit: 5
  activeDeadlineSeconds: 20
  template:
    spec:
      containers:
      - name: consumer
        image: busybox
        command: ["/bin/sh", "-c"]
        args: ["echo 'Consuming data'; sleep 1; exit 1"]
      restartPolicy: OnFailure

In this definition, we instruct the container to echo some text, sleep for one second, and then exit with a non-zero status (an unsuccessful exit). Notice that we intentionally set the deadline to twenty seconds while the backoffLimit is 5. Let’s see what happens when we deploy this Job to the cluster:

$ kubectl apply -f my_job.yaml && kubectl get pods --watch
job.batch/twitter-consumer created
NAME                     READY   STATUS              RESTARTS   AGE
twitter-consumer-kfvrj   0/1     ContainerCreating   0          0s
twitter-consumer-kfvrj   1/1     Running             0          4s
twitter-consumer-kfvrj   0/1     Error               0          6s
twitter-consumer-kfvrj   1/1     Running             1          10s
twitter-consumer-kfvrj   0/1     Error               1          11s
twitter-consumer-kfvrj   0/1     Terminating         1          20s
twitter-consumer-kfvrj   0/1     Terminating         1          20s

As you can see from the above output, the Kubernetes Job started the pod, but since the pod fails shortly after starting, the Job restarts it, and keeps restarting it. However, once activeDeadlineSeconds is reached (twenty seconds), the Job terminates immediately, even though it had restarted the pod only twice and had not reached the backoffLimit of five.
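
To confirm why the Job stopped, you can look at the conditions recorded in its status. The exact output will vary, but for a missed deadline the recorded reason is DeadlineExceeded:

# Prints the reason(s) recorded in the Job's status conditions.
$ kubectl get job twitter-consumer -o jsonpath='{.status.conditions[*].reason}'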

Kubernetes Job Deletion and Cleanup

When a Kubernetes Job finishes, neither the Job nor the pods that it created get deleted automatically. You have to remove them manually. This behavior ensures that you are still able to view the logs and the status of the finished Job and its pods.

A Job can be deleted by using kubectl as follows:

kubectl delete jobs job_name

The above command deletes the specified Job and all its child pods. Like with other Kubernetes controllers, you can opt to remove the Job only, leaving its pods behind, by passing the --cascade=false flag. For example:

kubectl delete jobs job_name --cascade=false

It’s worth noting that there is a newer feature in Kubernetes that allows you to specify the number of seconds after which a completed Job gets deleted together with its pods. This feature uses the TTL controller and is still in alpha state as of this writing. To use it, we can modify our definition file to look as follows:

apiVersion: batch/v1
kind: Job
metadata:
  name: twitter-consumer
spec:
  backoffLimit: 5
  activeDeadlineSeconds: 20
  ttlSecondsAfterFinished: 60
  template:
    spec:
      containers:
      - name: consumer
        image: busybox
        command: ["/bin/sh", "-c"]
        args: ["echo 'Consuming data'; sleep 1; exit 1"]
      restartPolicy: OnFailure

This new definition makes sure that the finished Job object gets deleted, together with its pods, one minute after it completes.

TL;DR

  • Kubernetes Jobs are used when you want to create pods that perform a specific task and then exit.
  • Kubernetes Jobs do not need a pod selector; the pods’ labels and the selector are handled automatically by the Job.
  • The restartPolicy for a Job accepts only “Never” and “OnFailure.”
  • Jobs use the completions and parallelism parameters to control the patterns the pods run through. Job pods can run as a single task, as several sequential tasks, or as parallel tasks in which the first pod that finishes signals the rest of the pods to complete and exit.
  • You can control how many times a Job attempts to restart a failing pod using .spec.backoffLimit. This limit defaults to six.
  • You can control how long a Job runs using .spec.activeDeadlineSeconds. This deadline overrules the backoffLimit, so the Job does not attempt to restart a failing pod once the deadline is reached.
  • Jobs and their pods do not get deleted automatically when they finish. You have to delete them manually, or use the ttlSecondsAfterFinished parameter, which relies on the TTL controller and is still in alpha stage as of the time of this writing.
Mohamed Ahmed

Aug 12, 2019