container-init

Pod initialization

@smarterclayton

March 2016

Proposal and Motivation

Within a pod there is a need to initialize local data or adapt to the current cluster environment that is not easily achieved in the current container model. Containers start in parallel after volumes are mounted, leaving no opportunity for coordination between containers without specialization of the image. If two containers need to share common initialization data, both images must be altered to cooperate using filesystem or network semantics, which introduces coupling between images. Likewise, if an image requires configuration in order to start and that configuration is environment dependent, the image must be altered to add the necessary templating or retrieval.

This proposal introduces init containers: one or more containers started in sequence before the pod’s normal containers are started. These init containers may share volumes, perform network operations, and perform computation prior to the start of the remaining containers. They may also, by virtue of their sequencing, block or delay the startup of application containers until some precondition is met. In this document we refer to the existing pod containers as app containers.

This proposal also provides a high-level design of volume containers, which initialize a particular volume, as a feature that specializes some of the tasks defined for init containers. The init container design anticipates the existence of volume containers and highlights where they will require future work.

Design Points

  • Init containers should be able to:
    • Perform initialization of shared volumes
    • Download binaries that will be used in app containers as execution targets
    • Inject configuration or extension capability to generic images at startup
    • Perform complex templating of information available in the local environment
    • Initialize a database by starting a temporary execution process and applying schema info
    • Delay the startup of application containers until preconditions are met
    • Register the pod with other components of the system
  • Reduce coupling:
    • Between application images, eliminating the need to customize those images for Kubernetes generally or specific roles
    • Inside of images, by specializing which containers perform which tasks (install git into init container, use filesystem contents in web container)
    • Between initialization steps, by supporting multiple sequential init containers
  • Init containers allow simple start preconditions to be implemented that are decoupled from application code
    • The order init containers start should be predictable and allow users to easily reason about the startup of a container
    • Complex ordering and failure will not be supported - all complex workflows can if necessary be implemented inside of a single init container, and this proposal aims to enable that ordering without adding undue complexity to the system. Pods in general are not intended to support DAG workflows.
  • Both run-once and run-forever pods should be able to use init containers
  • As much as possible, an init container should behave like an app container to reduce complexity for end users, for clients, and for divergent use cases. An init container is a container with the minimum alterations to accomplish its goal.
  • Volume containers should be able to:
    • Perform initialization of a single volume
    • Start in parallel
    • Perform computation to initialize a volume, and delay start until that volume is initialized successfully.
    • Using a volume container that does not populate a volume to delay pod start (in the absence of init containers) would be an abuse of the goal of volume containers.
  • Container pre-start hooks are not sufficient for all initialization cases:
    • They cannot easily coordinate complex conditions across containers
    • They can only function with code in the image or code in a shared volume, which would have to be statically linked (not a common pattern in wide use)
    • They cannot be implemented with the current Docker implementation - see #140

Alternatives

  • Any mechanism that runs user code on a node before regular pod containers should itself be a container and modeled as such - we explicitly reject creating new mechanisms for running user processes.
  • The container pre-start hook (not yet implemented) requires execution within the container’s image and so cannot adapt existing images. It also cannot block startup of containers.
  • Running a “pre-pod” would defeat the purpose of the pod being an atomic unit of scheduling.

Design

Each pod may have 0..N init containers defined along with the existing 1..M app containers.

On startup of the pod, after the network and volumes are initialized, the init containers are started in order. Each container must exit successfully before the next is invoked. If a container fails to start (due to the runtime) or exits with failure, it is retried according to the pod RestartPolicy. RestartPolicyNever pods will immediately fail and exit. RestartPolicyAlways pods will retry the failing init container with increasing backoff until it succeeds. To align with the design of application containers, init containers will only support “infinite retries” (RestartPolicyAlways) or “no retries” (RestartPolicyNever).
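The sequencing and retry rules above can be sketched as a small simulation. This is an illustration only, not the Kubelet implementation: the function name, the string encoding of the restart policy, the doubling backoff, and the 300 second cap are all assumptions made for the example.

```python
def run_init_containers(init_containers, restart_policy,
                        max_backoff=300, sleep=lambda seconds: None):
    """Run init containers in order and report whether the pod may proceed.

    init_containers: list of (name, run) pairs; run() returns an exit code.
    restart_policy:  "Always" or "Never" (hypothetical string encoding).
    sleep:           injected so the sketch can run without real delays.
    """
    for name, run in init_containers:
        backoff = 1  # assumed initial backoff, in seconds
        while True:
            if run() == 0:
                break  # this init container succeeded; start the next one
            if restart_policy == "Never":
                return False  # the pod fails immediately, no retry
            # "Always": retry the same container with increasing backoff
            sleep(backoff)
            backoff = min(backoff * 2, max_backoff)
    return True  # all init containers succeeded; app containers may start
```

A later init container is never invoked until every earlier one has exited successfully, which is the property that lets init containers act as start preconditions.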

A pod cannot be ready until all init containers have succeeded. The ports on an init container are not aggregated under a service. A pod that is being initialized is in the Pending phase but should have a distinct condition. Each app container and all future init containers should have the reason PodInitializing. The pod should have a condition Initializing set to false until all init containers have succeeded, and true thereafter. If the pod is restarted, the Initializing condition should be set to false.

If the pod is “restarted” (all containers stopped and started due to a node restart, a change to the pod definition, or admin interaction), all init containers must execute again. Restartable conditions are defined as:

  • An init container image is changed
  • The pod infrastructure container is restarted (shared namespaces are lost)
  • The Kubelet detects that all containers in a pod are terminated AND no record of init container completion is available on disk (due to GC)
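The three restartable conditions above can be read as a single predicate. The sketch below assumes the Kubelet has already gathered the four observations; the `PodObservation` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PodObservation:
    # All fields are hypothetical names for facts the Kubelet would observe.
    init_image_changed: bool         # an init container image was changed
    infra_container_restarted: bool  # shared namespaces were lost
    all_containers_terminated: bool  # every container in the pod is terminated
    init_completion_recorded: bool   # a completion record still exists on disk


def must_rerun_init(obs):
    """Return True if any restartable condition holds."""
    return (
        obs.init_image_changed
        or obs.infra_container_restarted
        or (obs.all_containers_terminated and not obs.init_completion_recorded)
    )
```

Note that terminated containers alone do not force reinitialization as long as the completion record has survived garbage collection.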

Changes to the init container spec are limited to the container image field. Altering the container image field is equivalent to restarting the pod.

Because init containers can be restarted, retried, or reexecuted, container authors should make their init behavior idempotent by handling volumes that are already populated or the possibility that this instance of the pod has already contacted a remote system.

Each init container has all of the fields of an app container. The following fields are prohibited from being used on init containers by validation:

  • readinessProbe - init containers must exit for pod startup to continue, are not included in rotation, and so cannot define readiness distinct from completion.
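A validation rule of this kind could look like the following sketch. It treats containers as plain dictionaries; the function name and the error message format are assumptions, not the API server's actual validation code.

```python
def validate_init_containers(init_containers):
    """Return a list of validation errors for the given init containers.

    Each container is a plain dict; only the readinessProbe prohibition
    described above is checked.
    """
    errors = []
    for index, container in enumerate(init_containers):
        if container.get("readinessProbe") is not None:
            errors.append(
                f"spec.initContainers[{index}].readinessProbe: "
                "not permitted on init containers"
            )
    return errors
```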

Init container authors may use activeDeadlineSeconds on the pod and livenessProbe on the container to prevent init containers from failing forever. The active deadline includes init containers.

Because init containers are semantically different in lifecycle from app containers (they are run serially, rather than in parallel), for backwards compatibility and design clarity they will be identified as distinct fields in the API:

pod:
  spec:
    containers: ...
    initContainers:
    - name: init-container1
      image: ...
      ...
    - name: init-container2
      ...
  status:
    containerStatuses: ...
    initContainerStatuses:
    - name: init-container1
      ...
    - name: init-container2
      ...

This separation also serves to make the order of container initialization clear - init containers are executed in the order that they appear, then all app containers are started at once.

The name of each app and init container in a pod must be unique - it is a validation error for any container to share a name.
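The uniqueness rule spans both lists, so a check has to walk init and app containers together. A minimal sketch, assuming the pod spec is a plain dictionary with the `initContainers` and `containers` fields shown above (the function name is hypothetical):

```python
def duplicate_container_names(pod_spec):
    """Return names used by more than one container, init or app."""
    seen = set()
    duplicates = []
    for field in ("initContainers", "containers"):
        for container in pod_spec.get(field, []):
            name = container["name"]
            if name in seen and name not in duplicates:
                duplicates.append(name)
            seen.add(name)
    return duplicates
```

Validation would reject the pod whenever the returned list is non-empty.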

While init containers are in alpha state, they will be serialized as an annotation on the pod with the name pod.alpha.kubernetes.io/init-containers and the status of the containers will be stored as pod.alpha.kubernetes.io/init-container-statuses. Mutation of these annotations is prohibited on existing pods.
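As an illustration of the alpha encoding, the sketch below stores init containers under the annotation named above and reads them back. The use of JSON as the serialization format is an assumption of this example, as are the helper names; pods are plain dictionaries.

```python
import json

INIT_CONTAINERS_ANNOTATION = "pod.alpha.kubernetes.io/init-containers"


def encode_init_containers(pod, init_containers):
    """Return a copy of pod with init containers stored in the annotation."""
    metadata = dict(pod.get("metadata", {}))
    annotations = dict(metadata.get("annotations", {}))
    # Assumption: the annotation value is the JSON encoding of the list.
    annotations[INIT_CONTAINERS_ANNOTATION] = json.dumps(init_containers)
    metadata["annotations"] = annotations
    return dict(pod, metadata=metadata)


def decode_init_containers(pod):
    """Return the init containers stored in the annotation, or []."""
    annotations = pod.get("metadata", {}).get("annotations", {})
    raw = annotations.get(INIT_CONTAINERS_ANNOTATION)
    return json.loads(raw) if raw else []
```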

Resources

Given the ordering and execution for init containers, the following rules for resource usage apply:

  • The highest of any particular resource request or limit defined on all init containers is the effective init request/limit
  • The pod’s effective request/limit for a resource is the higher of:
    • sum of all app containers request/limit for a resource
    • effective init request/limit for a resource
  • Scheduling is done based on effective requests/limits, which means init containers can reserve resources for initialization that are not used during the life of the pod.
  • The lowest QoS tier of init containers per resource is the effective init QoS tier, and the highest QoS tier of both init containers and regular containers is the effective pod QoS tier.

So the following pod:

pod:
  spec:
    initContainers:
    - limits:
        cpu: 100m
        memory: 1GiB
    - limits:
        cpu: 50m
        memory: 2GiB
    containers:
    - limits:
        cpu: 10m
        memory: 1100MiB
    - limits:
        cpu: 10m
        memory: 1100MiB

has an effective pod limit of cpu: 100m, memory: 2200MiB (highest init container cpu is larger than sum of all app containers, sum of container memory is larger than the max of all init containers). The scheduler, node, and quota must respect the effective pod request/limit.
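The arithmetic in this example can be reproduced with a short sketch. To avoid quantity parsing, values are plain integers (millicores for CPU, MiB for memory, so 1GiB is 1024 and 2GiB is 2048). The function name and the resource keys are hypothetical, and treating a missing value as 0 is a simplification made only for this example.

```python
def effective_pod_limits(init_limits, app_limits):
    """Combine per-container limits into the effective pod limits.

    init_limits, app_limits: lists of dicts mapping resource name to an
    integer amount. Init containers run one at a time, so they contribute
    their maximum; app containers run together, so they contribute their sum.
    """
    resources = set()
    for limits in init_limits + app_limits:
        resources.update(limits)
    effective = {}
    for resource in sorted(resources):
        init_max = max((l.get(resource, 0) for l in init_limits), default=0)
        app_sum = sum(l.get(resource, 0) for l in app_limits)
        effective[resource] = max(init_max, app_sum)
    return effective


init_limits = [{"cpu_m": 100, "memory_mib": 1024},
               {"cpu_m": 50, "memory_mib": 2048}]
app_limits = [{"cpu_m": 10, "memory_mib": 1100},
              {"cpu_m": 10, "memory_mib": 1100}]
print(effective_pod_limits(init_limits, app_limits))
# prints {'cpu_m': 100, 'memory_mib': 2200}
```

The same function applies unchanged to requests, since the rule is stated identically for both.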

In the absence of a defined request or limit on a container, the effective request/limit will be applied. For example, the following pod:

pod:
  spec:
    initContainers:
    - limits:
        cpu: 100m
        memory: 1GiB
    containers:
    - request:
        cpu: 10m
        memory: 1100MiB

will have an effective request of 10m / 1100MiB, and an effective limit of 100m / 1GiB, i.e.:

pod:
  spec:
    initContainers:
    - request:
        cpu: 10m
        memory: 1100MiB
      limits:
        cpu: 100m
        memory: 1GiB
    containers:
    - request:
        cpu: 10m
        memory: 1100MiB
      limits:
        cpu: 100m
        memory: 1GiB

and thus have the QoS tier Burstable (because request is not equal to limit).

Quota and limits will be applied based on the effective pod request and limit.

Pod level cGroups will be based on the effective pod request and limit, the same as the scheduler.

Kubelet and container runtime details

Container runtimes should treat the set of init and app containers as one large pool. An individual init container execution should be identical to an app container, including all standard container environment setup (network, namespaces, hostnames, DNS, etc).

All app container operations are permitted on init containers. The logs for an init container should be available for the duration of the pod lifetime or until the pod is restarted.

During initialization, app container status should be shown with the reason PodInitializing if any init containers are present. Each init container should show appropriate container status, and all init containers that are waiting for earlier init containers to finish should have the reason PendingInitialization.
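The status reasons described in this paragraph can be sketched as a function of which init container is currently executing. The function name and the index-based representation are assumptions for illustration.

```python
def waiting_reasons(init_names, app_names, running_index):
    """Map each waiting container to its status reason.

    init_names:    init container names in declaration order.
    app_names:     app container names.
    running_index: index of the init container currently executing.
    """
    reasons = {}
    # Init containers queued behind the running one have not started yet.
    for name in init_names[running_index + 1:]:
        reasons[name] = "PendingInitialization"
    # App containers cannot start until every init container has succeeded.
    for name in app_names:
        reasons[name] = "PodInitializing"
    return reasons
```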

The container runtime should aggressively prune failed init containers. The container runtime should record whether all init containers have succeeded internally, and only invoke new init containers if a pod restart is needed (for Docker, if all containers terminate or if the pod infra container terminates). Init containers should follow backoff rules as necessary. The Kubelet must preserve at least the most recent instance of an init container to serve logs and data for end users and to track failure states. The Kubelet should prefer to garbage collect completed init containers over app containers, as long as the Kubelet is able to track that initialization has been completed. In the future, container state checkpointing in the Kubelet may remove or reduce the need to preserve old init containers.

For the initial implementation, the Kubelet will use the last termination container state of the highest indexed init container to determine whether the pod has completed initialization. During a pod restart, initialization will be restarted from the beginning (all initializers will be rerun).
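A minimal sketch of that check follows. The status shape (a list of dicts ordered by init container index, each with an optional `lastTermination` entry holding an `exitCode`) is a hypothetical stand-in for the real container status objects.

```python
def initialization_complete(init_statuses):
    """Decide whether the pod has finished initialization.

    init_statuses: list of dicts ordered by init container index. Only the
    last (highest indexed) entry is consulted, since it can only have run
    after every earlier init container succeeded.
    """
    if not init_statuses:
        return True  # no init containers, nothing to wait for
    last_termination = init_statuses[-1].get("lastTermination")
    return last_termination is not None and last_termination.get("exitCode") == 0
```

This is why the Kubelet must preserve at least the most recent instance of the final init container: if that record is garbage collected, completion can no longer be inferred.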

API Behavior

All APIs that access containers by name should operate on both init and app containers. Because names are unique, the addition of init containers should be transparent to existing use cases.
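A name-based lookup that covers both lists might look like the sketch below. The pod spec is a plain dictionary and the function name is hypothetical; the result includes which list the container came from so callers can tell the two kinds apart if they need to.

```python
def find_container(pod_spec, name):
    """Return (field, container) for the named container, or None.

    Searches init containers first, then app containers. Because names are
    unique across both lists, at most one container can match.
    """
    for field in ("initContainers", "containers"):
        for container in pod_spec.get(field, []):
            if container["name"] == name:
                return field, container
    return None
```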

A client with no knowledge of init containers should see appropriate container status reason and message fields while the pod is in the Pending phase, and so be able to communicate that to end users.

Example init containers

  • Wait for a service to be created

    pod:
      spec:
        initContainers:
        - name: wait
          image: centos:centos7
          command: ["/bin/sh", "-c", "for i in {1..100}; do sleep 1; if dig myservice; then exit 0; fi; done; exit 1"]
        containers:
        - name: run
          image: application-image
          command: ["/my_application_that_depends_on_myservice"]
    
  • Register this pod with a remote server

    pod:
      spec:
        initContainers:
        - name: register
          image: centos:centos7
          command: ["/bin/sh", "-c", "curl -X POST http://$MANAGEMENT_SERVICE_HOST:$MANAGEMENT_SERVICE_PORT/register -d 'instance=$(POD_NAME)&ip=$(POD_IP)'"]
          env:
          - name: POD_NAME
            valueFrom:
              field: metadata.name
          - name: POD_IP
            valueFrom:
              field: status.podIP
        containers:
        - name: run
          image: application-image
          command: ["/my_application_that_depends_on_myservice"]
    
  • Wait for an arbitrary period of time

    pod:
      spec:
        initContainers:
        - name: wait
          image: centos:centos7
          command: ["/bin/sh", "-c", "sleep 60"]
        containers:
        - name: run
          image: application-image
          command: ["/static_binary_without_sleep"]
    
  • Clone a git repository into a volume (can be implemented by volume containers in the future):

    pod:
      spec:
        initContainers:
        - name: download
          image: image-with-git
          command: ["git", "clone", "https://github.com/myrepo/myrepo.git", "/var/lib/data"]
          volumeMounts:
          - mountPath: /var/lib/data
            volumeName: git
        containers:
        - name: run
          image: centos:centos7
          command: ["/var/lib/data/binary"]
          volumeMounts:
          - mountPath: /var/lib/data
            volumeName: git
        volumes:
        - emptyDir: {}
          name: git
    
  • Execute a template transformation based on environment (can be implemented by volume containers in the future):

    pod:
      spec:
        initContainers:
        - name: copy
          image: application-image
          command: ["/bin/cp", "mytemplate.j2", "/var/lib/data/"]
          volumeMounts:
          - mountPath: /var/lib/data
            volumeName: data
        - name: transform
          image: image-with-jinja
          command: ["/bin/sh", "-c", "jinja /var/lib/data/mytemplate.j2 > /var/lib/data/mytemplate.conf"]
          volumeMounts:
          - mountPath: /var/lib/data
            volumeName: data
        containers:
        - name: run
          image: application-image
          command: ["/myapplication", "-conf", "/var/lib/data/mytemplate.conf"]
          volumeMounts:
          - mountPath: /var/lib/data
            volumeName: data
        volumes:
        - emptyDir: {}
          name: data
    
  • Perform a container build

    pod:
      spec:
        initContainers:
        - name: copy
          image: base-image
          workingDir: /home/user/source-tree
          command: ["make"]
        containers:
        - name: commit
          image: image-with-docker
          command:
          - /bin/sh
          - -c
          - docker commit $(complex_bash_to_get_container_id_of_copy) &&
            docker push $(commit_id) myrepo:latest
          volumeMounts:
          - mountPath: /var/run/docker.sock
            volumeName: dockersocket
    

Backwards compatibility implications

Since this is a net new feature in the API and Kubelet, new API servers during upgrade may not be able to rely on Kubelets implementing init containers. The management of feature skew between master and Kubelet is tracked in issue #4855.

Future work

  • Unify pod QoS class with init containers
  • Implement container / image volumes to make composition of runtime from images efficient