
Job Controller

Abstract

A proposal for implementing a new controller, the Job controller, which will be responsible for managing pod(s) that require running once to completion, even if the machine the pod is running on fails, in contrast to what ReplicationController currently offers.

Several existing issues and PRs were already created regarding that particular subject:

  * Job Controller #1624
  * New Job resource #7380

Use Cases

  1. Be able to start one or several pods tracked as a single entity.
  2. Be able to run batch-oriented workloads on Kubernetes.
  3. Be able to get the job status.
  4. Be able to specify the number of instances performing a job at any one time.
  5. Be able to specify the number of successfully finished instances required to finish a job.
  6. Be able to specify a backoff policy to apply when a job is continuously failing.

Motivation

Jobs are needed for executing multi-pod computation to completion; a good example here would be the ability to implement any type of batch-oriented task.

Backoff policy and failed pod limit

By design, Jobs have no notion of failure other than a pod's restartPolicy, which is sometimes mistakenly taken as the Job's restart policy (#30243, #43964). There are situations where one wants to fail a Job after some number of retries over a certain period of time, e.g. due to a logical error in configuration. To do so we are going to introduce two fields which will control the backoff policy: the number of retries and the initial retry interval. These two fields will allow fine-grained control over the backoff policy. Each field will use a default value if none is provided: BackoffLimit is set by default to 6 and BackoffSeconds to 10s. Since the delay doubles on each retry, this results in the following retry sequence: 10s, 20s, 40s, 1m20s, 2m40s, 5m20s, after which the job will be considered failed.

Additionally, to help debug issues with a Job, and to limit the impact of having too many failed pods left around (as mentioned in #30243), we are going to introduce a field which will allow specifying the maximum number of failed pods to keep. This limit will also take effect if none of the limits described above are set. By default it will take the value of 1, to allow debugging job issues without flooding the cluster with too many failed pods.

All of the above fields will be optional and will apply when restartPolicy is set to Never on the PodTemplate. With restart policy OnFailure only BackoffLimit applies, because failed pods are already restarted by the kubelet with an exponential backoff. Additionally, failures are counted differently depending on the restartPolicy setting. For Never we count actual pod failures (reflected in the .status.failed field). With OnFailure, we take an approximate value of pod restarts (as reported in .status.containerStatuses[*].restartCount). When .spec.parallelism is set to a value higher than 1, the failures are an aggregate number (as coming from .status.failed), because the controller does not hold information about failures coming from separate pods.
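The two counting modes can be sketched as follows. The trimmed-down status types and the `failureCount` helper are assumptions made for illustration; the real fields live on the Kubernetes Pod and Job API objects.

```go
package main

import "fmt"

// Minimal stand-ins for the relevant status fields; the real types
// live in the Kubernetes API packages.
type containerStatus struct{ RestartCount int32 }

type podsStatus struct {
	Failed            int32             // .status.failed
	ContainerStatuses []containerStatus // .status.containerStatuses
}

// failureCount is a hypothetical helper illustrating how failures are
// counted depending on the pod restart policy: actual pod failures for
// Never, and an approximation via container restart counts for OnFailure.
func failureCount(restartPolicy string, s podsStatus) int32 {
	if restartPolicy == "Never" {
		return s.Failed
	}
	// OnFailure: sum of .status.containerStatuses[*].restartCount.
	var total int32
	for _, cs := range s.ContainerStatuses {
		total += cs.RestartCount
	}
	return total
}

func main() {
	s := podsStatus{
		Failed:            2,
		ContainerStatuses: []containerStatus{{RestartCount: 3}, {RestartCount: 1}},
	}
	fmt.Println(failureCount("Never", s))     // counts pod failures
	fmt.Println(failureCount("OnFailure", s)) // counts container restarts
}
```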

Implementation

The Job controller is similar to the replication controller in that both manage pods. This implies it will follow the same controller framework that replication controllers already define. The biggest difference between a Job and a ReplicationController object is the purpose: a ReplicationController ensures that a specified number of Pods is running at any one time, whereas a Job is responsible for driving the desired number of Pods to the completion of a task. This difference will be represented by the RestartPolicy, which is required to always take the value of RestartPolicyNever or RestartPolicyOnFailure.

The new Job object will have the following content:

// Job represents the configuration of a single job.
type Job struct {
    TypeMeta
    ObjectMeta

    // Spec is a structure defining the expected behavior of a job.
    Spec JobSpec

    // Status is a structure describing current status of a job.
    Status JobStatus
}

// JobList is a collection of jobs.
type JobList struct {
    TypeMeta
    ListMeta

    Items []Job
}

The JobSpec structure is defined to contain all the information describing how the actual job execution will look.

// JobSpec describes how the job execution will look.
type JobSpec struct {

    // Parallelism specifies the maximum desired number of pods the job should
    // run at any given time. The actual number of pods running in steady state will
    // be less than this number when ((.spec.completions - .status.successful) < .spec.parallelism),
    // i.e. when the work left to do is less than max parallelism.
    Parallelism *int32

    // Completions specifies the desired number of successfully finished pods the
    // job should be run with. Defaults to 1.
    Completions *int32

    // Optional duration in seconds relative to the startTime that the job may be active
    // before the system tries to terminate it; value must be a positive integer.
    // It applies to overall job run time, no matter of the value of completions
    // or parallelism parameters.
    ActiveDeadlineSeconds *int64

    // Optional number of retries before marking this job failed.
    // Defaults to 6.
    BackoffLimit *int32

    // Optional time (in seconds) specifying how long the initial backoff will last.
    // Defaults to 10s.
    BackoffSeconds *int64

    // Optional number of failed pods to retain.
    FailedPodsLimit *int32

    // Selector is a label query over pods running a job.
    Selector LabelSelector

    // Template is the object that describes the pod that will be created when
    // executing a job.
    Template *PodTemplateSpec
}
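The defaulting behavior described above can be sketched with a small helper over a trimmed-down spec. The `applyDefaults` function is an illustrative assumption; the defaults themselves (Completions=1, BackoffLimit=6, BackoffSeconds=10, FailedPodsLimit=1) come from the proposal.

```go
package main

import "fmt"

// Trimmed-down JobSpec with only the fields relevant to defaulting;
// the full structure is defined in the proposal above.
type JobSpec struct {
	Completions     *int32
	BackoffLimit    *int32
	BackoffSeconds  *int64
	FailedPodsLimit *int32
}

// applyDefaults is an illustrative sketch of the defaulting described in
// the proposal, filling in any field the user left unset.
func applyDefaults(s *JobSpec) {
	if s.Completions == nil {
		v := int32(1)
		s.Completions = &v
	}
	if s.BackoffLimit == nil {
		v := int32(6)
		s.BackoffLimit = &v
	}
	if s.BackoffSeconds == nil {
		v := int64(10)
		s.BackoffSeconds = &v
	}
	if s.FailedPodsLimit == nil {
		v := int32(1)
		s.FailedPodsLimit = &v
	}
}

func main() {
	var spec JobSpec
	applyDefaults(&spec)
	fmt.Println(*spec.Completions, *spec.BackoffLimit,
		*spec.BackoffSeconds, *spec.FailedPodsLimit)
}
```

Pointer fields are used so that "unset" (nil) can be distinguished from an explicit zero, which is why defaulting only touches nil fields.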

The JobStatus structure is defined to hold information about the pods currently executing the specified job, as well as the job's overall progress.

// JobStatus represents the current state of a Job.
type JobStatus struct {
    Conditions []JobCondition

    // CreationTime represents the time when the job was created.
    CreationTime unversioned.Time

    // StartTime represents the time when the job was started.
    StartTime unversioned.Time

    // CompletionTime represents the time when the job was completed.
    CompletionTime unversioned.Time

    // Active is the number of actively running pods.
    Active int32

    // Succeeded is the number of pods which successfully completed their job.
    Succeeded int32

    // Failed is the number of pod failures; this applies only to jobs
    // created with RestartPolicyNever, otherwise this value will always be 0.
    Failed int32
}

type JobConditionType string

// These are valid conditions of a job.
const (
    // JobComplete means the job has completed its execution.
    JobComplete JobConditionType = "Complete"
)

// JobCondition describes current state of a job.
type JobCondition struct {
    Type               JobConditionType
    Status             ConditionStatus
    LastHeartbeatTime  unversioned.Time
    LastTransitionTime unversioned.Time
    Reason             string
    Message            string
}
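A consumer of this API would decide whether a job has finished by scanning its conditions. The `isComplete` helper below is a hypothetical sketch (the string-typed `Status` field stands in for the real ConditionStatus type); only JobConditionType and JobComplete come from the proposal.

```go
package main

import "fmt"

type JobConditionType string

// JobComplete means the job has completed its execution.
const JobComplete JobConditionType = "Complete"

// Trimmed-down condition; the real JobCondition also carries
// timestamps, a reason, and a message.
type JobCondition struct {
	Type   JobConditionType
	Status string // stands in for ConditionStatus: "True", "False", "Unknown"
}

// isComplete is a hypothetical convenience helper: a job is considered
// finished once it carries a JobComplete condition whose status is True.
func isComplete(conditions []JobCondition) bool {
	for _, c := range conditions {
		if c.Type == JobComplete && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	running := []JobCondition{}
	done := []JobCondition{{Type: JobComplete, Status: "True"}}
	fmt.Println(isComplete(running), isComplete(done))
}
```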

Events

The Job controller will be emitting the following events:

  * JobStart
  * JobFinish

Future evolution

Below are the possible future extensions to the Job controller:

  * Be able to limit the execution time for a job, similarly to ActiveDeadlineSeconds for Pods. (now implemented)
  * Be able to create a chain of jobs dependent one on another. (will be implemented in a separate type called Workflow)
  * Be able to specify the work each of the workers should execute (see type 1 from this comment).
  * Be able to inspect Pods running a Job, especially after a Job has finished, e.g. by providing pointers to Pods in the JobStatus (see comment).
  * Help users avoid non-unique label selectors (see this proposal).