
Controller History

Author: kow3ns@

Status: Proposal

Abstract

In Kubernetes, in order to update and roll back the configuration and binary images of controller-managed Pods, users mutate DaemonSet, StatefulSet, and Deployment Objects, and the corresponding controllers attempt to transition the current state of the system to the new declared target state.

To facilitate update and rollback for these controllers, and to provide a primitive that third-party controllers can build on, we propose a mechanism that allows controllers to manage a bounded history of revisions to the declared target state of their generated Objects.

Affected Components

  1. API Machinery
  2. API Server
  3. Kubectl
  4. Controllers that utilize the feature

Requirements

  1. History is a collection of points in time, and each point in time must be represented by its own Object. While it is tempting to aggregate all of an Object’s history into a single container Object, experience with Borg and Mesos has taught us that this inevitably leads to exhausting the single Object size limit of the system’s storage backend.
  2. We must be able to select the Objects that contain point in time snapshots of versions of an Object to reconstruct the Object’s history.
  3. History respects causality. The Object type used to store point in time snapshots must be strictly ordered with respect to creation. CreationTimestamp should not be used, as this is susceptible to clock skew.
  4. History must not be revisionist. Once an Object corresponding to a version of a controller’s target state is created, it cannot be mutated.
  5. Controller history requires only current events. Storing an exhaustive history of all revisions to all controllers is out of scope for our purposes, and it can be solved by applying a version control system to manifests. Internal revision history must only store revisions to the controller’s target state that correspond to live Objects and (potentially) a small, configurable number of prior revisions.
  6. History is scale invariant. A revision to a controller is a modification that changes the specification of the Objects it generates. Changing the cardinality of those Objects is a scaling operation and does not constitute a revision.

Terminology

The following terminology is used throughout the rest of this proposal. We make its meaning explicit here.

  • The specification type of a controller is the type that contains the specification for the Objects generated by the controller.
    • For example, the specification types for the ReplicaSet, DaemonSet, and StatefulSet controllers are ReplicaSetSpec, DaemonSetSpec, and StatefulSetSpec respectively.
  • The generated type(s) for a controller is/are the type of the Object(s) generated by the controller.
    • Pod is a generated type for the ReplicaSet, DaemonSet, and StatefulSet controllers.
    • PersistentVolumeClaim is also a generated type for the StatefulSet controller.
  • The current state of a controller is the union of the states of its generated Objects along with its status.
    • For ReplicaSet, DaemonSet, and StatefulSet, the current state of the corresponding controllers can be derived from the Pods they contain and the ReplicaSetStatus, DaemonSetStatus, and StatefulSetStatus objects respectively.
  • For all specification type Objects for controllers, the target state is the set of fields in the Object that determine the state to which the controller attempts to evolve the system.
    • This may not necessarily be all fields of the Object.
    • For example, for the StatefulSet controller .Spec.Template, .Spec.Replicas, and .Spec.VolumeClaims determine the target state. The controller “wants” to create .Spec.Replicas Pods generated from .Spec.Template and .Spec.VolumeClaims.
  • The target Object state is the subset of the target state necessary to create Objects of the generated type(s).
    • To make this concrete, for the StatefulSet controller .Spec.Template and .Spec.VolumeClaims are the target Object state. This is enough information for the controller to generate Pods and corresponding PVCs.
  • If a version of the target Object state was used to generate an Object that has not yet been deleted, we refer to the version, and any snapshots of the version, as live.

API Objects

Kubernetes controllers already persist their current and target states to the API Server. In order to maintain a history of revisions to specification type Objects, we only need to persist snapshots of the target Object states contained in the specification type when they are revised.

One approach would be to, for every specification type, have a corresponding History type. For example, we could introduce a StatefulSetHistory object that aggregates a PodTemplateSpec and a slice of PersistentVolumeClaims. The StatefulSet controller could use this object to store point in time snapshots of versions of StatefulSetSpecs. However, this requires that we introduce a new History Kind for all current and future controllers. It has the benefit of type safety, but, for this benefit, we trade generality.

Another approach would be to use PodTemplate objects. This mechanism provides the desired generality, but it only provides for the recording of versions of PodTemplateSpecs (e.g. for StatefulSet, we cannot use PodTemplates to record revisions to PersistentVolumeClaims). Also, it introduces the potential for overlapping histories for two Objects of different Kinds with the same .Name in the same Namespace. Lastly, it constrains the PodTemplate Kind from evolving to fulfill its original intention.

We propose an approach that has analogs with the approach taken by the Mesos community. Mesos frameworks, which are in some ways like Kubernetes controllers, are responsible for checkpointing, persisting, and recovering their own state. This problem is so common that Mesos provides a “State Abstraction” that allows frameworks to persist their state in either ZooKeeper or the Mesos Replicated Log (a Multi-Paxos based state machine used by the Mesos Masters). This State Abstraction is a mutable, durable dictionary where keys and values are opaque strings. As controllers only need the capability to persist an immutable point in time snapshot of target Object states to implement a revision history, we propose to use the ControllerRevision object for this purpose.

// ControllerRevision implements an immutable snapshot of state data. Clients 
// are responsible for serializing and deserializing the objects that contain 
// their internal state. 
// Once a ControllerRevision has been successfully created, it can not be updated. 
// The API Server will fail validation of all requests that attempt to mutate 
// the Data field. ControllerRevisions may, however, be deleted.
type ControllerRevision struct {
    metav1.TypeMeta
    // +optional
    metav1.ObjectMeta
    // Data contains the serialized state.
    Data runtime.RawExtension
    // Revision indicates the revision of the state represented by Data.
    Revision int64
}

API Server

The API Server must support the creation and deletion of ControllerRevision objects. As we have no mechanism for declarative immutability, the API server must fail any update request that updates the .Data field of a ControllerRevision Object.
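
The update check described above can be sketched as follows. This is a minimal illustration and not the API Server's actual validation code: the ControllerRevision struct is a simplified stand-in that holds raw bytes instead of a runtime.RawExtension, and ValidateUpdate and ErrDataImmutable are hypothetical names.

```go
package main

import (
	"bytes"
	"errors"
)

// ControllerRevision is a simplified stand-in for the proposed API type.
// Data holds raw serialized bytes rather than a runtime.RawExtension.
type ControllerRevision struct {
	Name     string
	Data     []byte
	Revision int64
}

// ErrDataImmutable is a hypothetical error returned for rejected updates.
var ErrDataImmutable = errors.New("ControllerRevision .Data is immutable")

// ValidateUpdate fails any update request that changes the .Data field.
// Changes to other fields are not examined in this sketch.
func ValidateUpdate(oldRev, newRev *ControllerRevision) error {
	if !bytes.Equal(oldRev.Data, newRev.Data) {
		return ErrDataImmutable
	}
	return nil
}
```

An update that leaves .Data byte-for-byte identical passes this check, while any update that alters .Data is rejected.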

Controllers

This section is presented as a generalization of how an arbitrary controller can use ControllerRevision to persist a history of revisions to its specification type Objects. The technique is applicable, without loss of generality, to the existing Kubernetes controllers that have Pod as a generated type.

When a controller detects a revision to the target Object state of a specification type Object it will do the following.

  1. The controller will create a snapshot of the current target Object state.
  2. The controller will reconstruct the history of revisions to the Object’s target Object state.
  3. The controller will test the current target Object state for equivalence with all other versions in the Object’s revision history.
    • If the current version is semantically equivalent to its immediate predecessor, no update to the Object’s target state has been performed.
    • If the current version is equivalent to a version prior to its immediate predecessor, this indicates a rollback.
    • If the current version is not equivalent to any prior version, this indicates an update or a roll forward.
    • Controllers should use their status objects for bookkeeping with respect to current and prior revisions.
  4. The controller will reconcile its generated Objects with the new target Object state.
  5. The controller will maintain the length of its history to be no greater than the configured limit.
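
The equivalence test in step 3 can be sketched as a small classification function. This is an illustration under stated assumptions: snapshots are represented as plain strings, history holds only prior revisions ordered from oldest to newest, and the names Change, Classify, and the equal callback are hypothetical.

```go
package main

// Change classifies a newly observed target Object state.
type Change int

const (
	NoChange Change = iota // equivalent to the immediate predecessor
	Rollback               // equivalent to a revision before the predecessor
	Update                 // equivalent to no prior revision
)

// Classify compares current against history, which holds the prior
// snapshots ordered from oldest to newest. equal is the controller's
// semantic equivalence test.
func Classify(history []string, current string, equal func(a, b string) bool) Change {
	n := len(history)
	if n == 0 {
		return Update
	}
	if equal(history[n-1], current) {
		return NoChange
	}
	for _, prior := range history[:n-1] {
		if equal(prior, current) {
			return Rollback
		}
	}
	return Update
}
```

With a history of v1 then v2, observing v2 again yields NoChange, observing v1 yields Rollback, and observing an unseen v3 yields Update.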

Version Snapshot Creation

To take a snapshot of the target Object state contained in a specification type Object, a controller will do the following.

  1. The controller will serialize the Object’s target Object state and store the serialized representation in the ControllerRevision’s .Data.
  2. The controller will store a unique, monotonically increasing revision number in the Revision field.
  3. The controller will compute the hash of the ControllerRevision’s .Data.
  4. The controller will attach a label to the ControllerRevision so that it is selectable with a low probability of overlap.
    • ControllerRefs will be used as the authoritative test for ownership.
    • The specification type Object’s .Selector should be used where applicable.
    • Alternatively, a Kind unique label may be set to the .Name of the specification type Object.
  5. The controller will add a ControllerRef indicating the specification type Object as the owner of the ControllerRevision in the ControllerRevision’s .OwnerReferences.
  6. The controller will use the hash from above, along with a user identifiable prefix, to generate a unique .Name for the ControllerRevision.
    • The controller should, where possible, use the .Name of the specification type Object.
  7. The controller will persist the ControllerRevision via the API Server.
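
Steps 1 through 6 can be sketched as below. The types are simplified stand-ins for the API types, JSON is assumed as the serialization format, 32-bit FNV-1a is assumed as the hash, and NewSnapshot is a hypothetical helper. Persisting the result (step 7) and collision resolution are left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// OwnerRef and ControllerRevision are simplified stand-ins for the API types.
type OwnerRef struct {
	Kind, Name, UID string
	Controller      bool
}

type ControllerRevision struct {
	Name            string
	Labels          map[string]string
	OwnerReferences []OwnerRef
	Data            []byte
	Revision        int64
}

// NewSnapshot builds, but does not persist, a ControllerRevision.
func NewSnapshot(owner OwnerRef, labels map[string]string, targetState interface{}, revision int64) (*ControllerRevision, error) {
	data, err := json.Marshal(targetState) // step 1: serialize the state
	if err != nil {
		return nil, err
	}
	h := fnv.New32a() // step 3: hash .Data (FNV-1a is an assumption)
	h.Write(data)
	owner.Controller = true // step 5: the owner is the managing controller
	copied := make(map[string]string, len(labels))
	for k, v := range labels {
		copied[k] = v
	}
	return &ControllerRevision{
		Name:            fmt.Sprintf("%s-%08x", owner.Name, h.Sum32()), // step 6
		Labels:          copied,                                        // step 4
		OwnerReferences: []OwnerRef{owner},
		Data:            data,
		Revision:        revision, // step 2
	}, nil
}
```

Because the name is derived from the serialized data, two snapshots of identical state for the same owner produce the same name, which is what allows a creation conflict to signal an existing snapshot.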

Revision Number Selection

We propose two methods for selecting the .Revision used to order a specification type Object’s revision history.

  1. Set the .Revision field to the .Generation field.
    • This approach has the benefit of leveraging the existing monotonically increasing sequence generated by .Generation field.
    • The downside of this approach is that history will not survive the destruction of an Object.
  2. Use an approach analogous to Deployment.
    1. Reconstruct the Object’s revision history.
    2. If the history is empty, use a .Revision of 0.
    3. If the history is not empty, set the .Revision to a value greater than the maximum value of all previous .Revisions.
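
The second method can be sketched as a short function over the .Revision values found in the reconstructed history. NextRevision is a hypothetical name, and choosing exactly one more than the maximum is an assumption; the proposal only requires a greater value.

```go
package main

// NextRevision returns the .Revision for a new snapshot, given the
// .Revision values of every ControllerRevision in the existing history.
func NextRevision(existing []int64) int64 {
	if len(existing) == 0 {
		return 0 // empty history starts at 0
	}
	highest := existing[0]
	for _, r := range existing[1:] {
		if r > highest {
			highest = r
		}
	}
	return highest + 1
}
```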

History Reconstruction

To reconstruct the history of a specification type Object, a controller will do the following.

  1. Select all ControllerRevision Objects labeled as described above.
  2. Filter out any ControllerRevisions that do not have a ControllerRef in their .OwnerReferences indicating ownership by the Object.
  3. Sort the ControllerRevisions by the .Revision field.
  4. This produces a strictly ordered set of ControllerRevisions that comprises the ordered revision history of the specification type Object.
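
The reconstruction steps can be sketched over an in-memory list. The types are simplified stand-ins, the equality-based selector is an assumption (real label selectors are richer), and ReconstructHistory is a hypothetical name.

```go
package main

import "sort"

// OwnerRef and ControllerRevision are simplified stand-ins for the API types.
type OwnerRef struct {
	UID        string
	Controller bool
}

type ControllerRevision struct {
	Name            string
	Labels          map[string]string
	OwnerReferences []OwnerRef
	Revision        int64
}

// matches reports whether labels contain every key/value pair in selector.
func matches(labels, selector map[string]string) bool {
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// controlledBy reports whether rev has a ControllerRef pointing at uid.
func controlledBy(rev ControllerRevision, uid string) bool {
	for _, ref := range rev.OwnerReferences {
		if ref.Controller && ref.UID == uid {
			return true
		}
	}
	return false
}

// ReconstructHistory selects by label, filters by ownership, and sorts
// the result in ascending order of .Revision.
func ReconstructHistory(all []ControllerRevision, selector map[string]string, ownerUID string) []ControllerRevision {
	var history []ControllerRevision
	for _, rev := range all {
		if matches(rev.Labels, selector) && controlledBy(rev, ownerUID) {
			history = append(history, rev)
		}
	}
	sort.Slice(history, func(i, j int) bool {
		return history[i].Revision < history[j].Revision
	})
	return history
}
```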

History Maintenance

Controllers should be configured, either globally or on a per specification type Object basis, to have a RevisionHistoryLimit. This field will indicate the number of non-live revisions the controller should maintain in its history for each specification type Object. Every time a controller observes a specification type Object it will do the following.

  1. The controller will reconstruct the Object’s revision history.
    • Note that the process of reconstructing the Object’s history filters any ControllerRevisions not owned by the Object.
  2. The controller will filter out any ControllerRevisions that represent a live version.
  3. If the number of remaining ControllerRevisions is greater than the configured RevisionHistoryLimit, the controller will delete the excess, oldest first as ordered by .Revision, until the number of remaining ControllerRevisions is equal to the RevisionHistoryLimit.

This ensures that the number of recorded, non-live revisions is less than or equal to the configured RevisionHistoryLimit.
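
The truncation logic can be sketched as follows. It assumes the history has already been reconstructed and sorted in ascending order of .Revision, that the set of live revision names is known, and that the oldest non-live revisions are deleted first. RevisionsToDelete is a hypothetical name.

```go
package main

// ControllerRevision is a simplified stand-in for the API type.
type ControllerRevision struct {
	Name     string
	Revision int64
}

// RevisionsToDelete returns the names of the oldest non-live revisions
// that exceed limit. history must be sorted in ascending .Revision order.
func RevisionsToDelete(history []ControllerRevision, live map[string]bool, limit int) []string {
	if limit < 0 {
		limit = 0
	}
	var nonLive []string
	for _, rev := range history {
		if !live[rev.Name] {
			nonLive = append(nonLive, rev.Name)
		}
	}
	excess := len(nonLive) - limit
	if excess <= 0 {
		return nil
	}
	return nonLive[:excess]
}
```

For a history r1 through r5 with r2 and r5 live and a limit of 2, the non-live revisions are r1, r3, and r4, so only r1 is deleted.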

Version Tracking

Controllers must track the version of the target Object state that corresponds to their generated Objects. This information is necessary to determine which versions are live, and to track which Objects need to be updated during a target state update or rollback. We propose two methods that controllers may use to track live versions and their association with generated Objects.

  1. The most straightforward method is labeling. In this method the generated Objects are labeled with the .Name of the ControllerRevision object that corresponds to the version of the target Object state that was used to generate them. As we have taken care to ensure the uniqueness of the .Names of the ControllerRevisions, this approach is reasonable.
    • A revision is considered to be live while any generated Object labeled with its .Name is live.
    • This method has the benefit of providing visibility, via the label, to users with respect to the historical provenance of a generated Object.
    • The primary drawback is the lack of support for using garbage collection to ensure that only non-live version snapshots are collected.
  2. Controllers may also use the OwnerReferences field of the ControllerRevision to record all Objects that are generated from the target Object state version represented by the ControllerRevision as its owners.
    • A revision is considered to be live while any generated Object that owns it is live.
    • This method allows for the implementation of generic garbage collection.
    • The primary drawback of this method is that the bookkeeping is complex, and deciding if a generated Object corresponds to a particular revision will require testing each Object for membership in the OwnerReferences of all ControllerRevisions.

Note that, since we are labeling the generated Objects to indicate their provenance with respect to the version of the controller’s target Object state, we are susceptible to downstream mutations by other controllers changing the controller’s product. The best we can do is guarantee that our product meets the specification at the time of creation. If a third party mutates the product downstream (as long as it does so in a consistent and intentional way), we don’t want to recall it and make it conform to the original specification. This would cause the controllers to “fight” indefinitely.

At the cost of the complexity of implementing both labeling and ownership, controllers may use a combination of both approaches to mitigate the deficiencies of each.
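
The labeling method can be sketched as below. The label key is a hypothetical placeholder chosen for this illustration, GeneratedObject is a simplified stand-in for a Pod or other generated Object, and the function names are assumptions.

```go
package main

// RevisionLabel is a hypothetical label key used only in this sketch.
const RevisionLabel = "example.com/controller-revision-name"

// GeneratedObject is a simplified stand-in for a generated Object.
type GeneratedObject struct {
	Name   string
	Labels map[string]string
}

// LiveRevisions returns the set of ControllerRevision names that are
// referenced by at least one existing generated Object.
func LiveRevisions(objs []GeneratedObject) map[string]bool {
	live := make(map[string]bool)
	for _, o := range objs {
		if name, ok := o.Labels[RevisionLabel]; ok {
			live[name] = true
		}
	}
	return live
}

// NeedsUpdate reports whether obj was generated from a revision other
// than the one named currentRevisionName. Unlabeled Objects need an update.
func NeedsUpdate(obj GeneratedObject, currentRevisionName string) bool {
	return obj.Labels[RevisionLabel] != currentRevisionName
}
```

The set returned by LiveRevisions is the input a controller would need when deciding which snapshots are safe to delete during history maintenance.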

Version Equivalence

When the target Object state of a specification type Object is revised, we wish to minimize the number of mutations to generated Objects as the controller seeks to conform the system to its target state. That is, if a generated Object already conforms to the revised target Object state, it is imperative that we do not mutate it.

Failure to implement this correctly could result in the simultaneous rolling restart of every Pod in every StatefulSet and DaemonSet in the system when additions are made to PodTemplateSpec during a master upgrade. It is therefore necessary to determine if the current target Object state is equivalent to a prior version.

Since we track the version of generated Objects, this reduces to deciding if the version of the target Object state associated with the generated Object is equivalent to the current target Object state. Hashing is used to generate the .Name of the ControllerRevisions that encapsulate versions of the target Object state. However, because we do not require cryptographically strong collision resistance, and because we use a collision resolution technique, we can’t use the generated names of ControllerRevisions to decide equality.

We propose that two ControllerRevisions can be considered equal if their .Data is equivalent, but that it is not sufficient to compare the serialized representation of their .Data. Consider that the addition of new fields to the Objects that represent the target Object state may cause the serialized representation of those Objects to be unequal even when they are semantically equivalent.

The controller should deserialize the values of the ControllerRevisions representing their target Object state and perform a deep, semantic equality test. Here all differences that do not constitute a mutation to the target Object state are disregarded during the equivalence test.
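
A minimal sketch of such a test follows. It assumes JSON serialization, uses a hypothetical and much reduced TargetObjectState type, and uses reflect.DeepEqual as a stand-in for a semantic comparison. Real controllers would also have to account for API defaulting, which this sketch does not model.

```go
package main

import (
	"encoding/json"
	"reflect"
)

// TargetObjectState is a hypothetical, much reduced stand-in for the
// fields a controller snapshots (for example, parts of a PodTemplateSpec).
type TargetObjectState struct {
	Image string            `json:"image"`
	Env   map[string]string `json:"env"`
}

// Equivalent deserializes two snapshots and compares the decoded values,
// so that differences confined to the serialized form are ignored.
func Equivalent(a, b []byte) (bool, error) {
	var x, y TargetObjectState
	if err := json.Unmarshal(a, &x); err != nil {
		return false, err
	}
	if err := json.Unmarshal(b, &y); err != nil {
		return false, err
	}
	return reflect.DeepEqual(x, y), nil
}
```

For example, a snapshot written before the env field existed, {"image":"nginx:1.13"}, and one written afterwards, {"env":null,"image":"nginx:1.13"}, differ as bytes but decode to equal values. Within Kubernetes, equality.Semantic.DeepEqual from k8s.io/apimachinery/pkg/api/equality is a likely candidate for the comparison itself.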

Target Object State Reconciliation

There are three ways for a controller to reconcile a generated Object with the declared target Object state.

  1. If the target Object state is equivalent to the target Object state associated with the generated Object, the controller will update the associated version tracking information.
  2. If the Object can be updated in place to reconcile its state with the current target Object state, a controller may update the Object in place provided that the associated version tracking information is updated as well.
  3. Otherwise, the controller must destroy the Object and recreate it from the current target Object state.

Kubernetes Upgrades

During the upgrade process from a version of Kubernetes that does not support controller history to a version that does, controllers that implement history based update mechanisms may find that they have specification type Objects with no history and with generated Objects. For instance, a StatefulSet may exist with several Pods and no history. We defer requirements for handling history initialization to the individual proposals pertaining to those controllers’ update mechanisms. However, implementors should take note of the following.

  1. If the history of an Object is not initialized, controllers should continue to (re)create generated Objects based on the current target Object state.
  2. The history should be initialized on the first mutation to the specification type Object for which the history will be generated.
  3. After the history has been initialized, any generated Objects that have no indication of the revision from which they were generated may be treated as if they have a nil revision. That is, without respect to the method of version tracking used, the generated Objects may be treated as if they have a version that corresponds to no revision, and the controller may proceed to reconcile their state as appropriate to the internal implementation.

Kubectl

Modifications to kubectl to leverage controller history are an optional extension. Users can trigger rolling updates and rollbacks by modifying their manifests and using kubectl apply. Controllers will be able to detect revisions to their target Object state and perform reconciliation as necessary.

Viewing History

Users can view a controller’s revision history with the following command.

> kubectl rollout history

To view the details of the revision indicated by <revision>, users can use the following command.

> kubectl rollout history --revision <revision>

Rollback

For future work, kubectl rollout undo can be implemented in the general case as an extension of the above.

> kubectl rollout undo

Here kubectl rollout undo simply uses strategic merge patch to apply the state contained at a particular revision.

Tests

  1. Controllers can create a ControllerRevision containing a revision of their target Object state.
  2. Controllers can reconstruct their revision history.
  3. Controllers can’t update a ControllerRevision’s .Data.
  4. Controllers can delete a ControllerRevision to maintain their history with respect to the configured RevisionHistoryLimit.

Appendix

Hashing

We will require a CRHF (collision resistant hash function), but, as we expect no adversaries, such a function need not be resistant to pre-image and second pre-image attacks. As the property of interest is primarily collision resistance, and as we provide a method of collision resolution, both cryptographically strong functions, such as Secure Hash Algorithm 2 (SHA-2), and non-cryptographic functions, such as Fowler-Noll-Vo (FNV), are applicable.

Collision Resolution

As the function selected for hashing may not be cryptographically strong and may produce collisions, we need a method for collision resolution. To demonstrate its feasibility, we construct such a scheme here. However, this proposal does not mandate its use.

Given a hash function with output size HashSize defined as func H(s string) [HashSize]byte, in order to resolve collisions we define a new function func H'(s string, n int) [HashSize]byte, where H' returns the result of invoking H on the concatenation of s with the string value of n. We define a third function func H''(s string, exists func(string) bool) (int, [HashSize]byte). H'' starts with n := 0 and computes s' := H'(s, n), incrementing n and recomputing s' for as long as exists(s') returns true. Once exists(s') returns false, it returns n, s'.

For our purposes, the implementation of the exists function will attempt to create a ControllerRevision with the candidate .Name via the API Server, using the unique name generation described below. If creation fails due to a conflict, the name is already taken and the function returns true. If creation succeeds, it returns false.
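
The scheme can be sketched in Go as below. The primes are spelled out as HPrime and HDoublePrime because Go identifiers cannot contain an apostrophe. The choice of 32-bit FNV-1a (HashSize of 4) and the hex encoding passed to exists are assumptions made for illustration.

```go
package main

import (
	"encoding/hex"
	"hash/fnv"
	"strconv"
)

// HashSize is 4 bytes because this sketch assumes 32-bit FNV-1a.
const HashSize = 4

// H hashes s.
func H(s string) [HashSize]byte {
	h := fnv.New32a()
	h.Write([]byte(s))
	var out [HashSize]byte
	copy(out[:], h.Sum(nil))
	return out
}

// HPrime (H' in the text) hashes s concatenated with the string value of n.
func HPrime(s string, n int) [HashSize]byte {
	return H(s + strconv.Itoa(n))
}

// HDoublePrime (H'' in the text) increments n until exists reports that
// the candidate, passed as a hex string, is not already taken.
func HDoublePrime(s string, exists func(string) bool) (int, [HashSize]byte) {
	for n := 0; ; n++ {
		sum := HPrime(s, n)
		if !exists(hex.EncodeToString(sum[:])) {
			return n, sum
		}
	}
}
```

If exists reports the first two candidates as taken, HDoublePrime returns n equal to 2 together with HPrime(s, 2). The loop terminates only if exists eventually returns false, so a real implementation would likely bound the number of attempts.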

Unique Name Generation

We can use our hash function and collision resolution scheme to generate a system wide unique identifier for an Object based on a deterministic non-unique prefix and a serialized representation of the Object. The .Name fields of Kubernetes Objects must conform to a DNS subdomain. Therefore, the total length of the unique identifier must not exceed 255, and in practice 253, characters. We can generate a unique identifier that meets this constraint by selecting a hash function such that the output length is equal to 253-len(prefix) and applying our hash function and collision resolution scheme to the serialized representation of the Object’s data. The unique hash and integer can be combined to produce a unique suffix for the Object’s .Name.

  1. We must also ensure that the unique name does not contain any bad words.
  2. We may also wish to spend additional characters to prettify the generated name for readability.
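
One way to combine the prefix, hash, and collision counter is sketched below. The name layout (prefix, hex hash, and counter joined by hyphens) and the choice to truncate the prefix rather than size the hash to fit are assumptions of this sketch, and UniqueName is a hypothetical helper. It does not check the other DNS subdomain rules or filter bad words.

```go
package main

import (
	"encoding/hex"
	"strconv"
)

// maxNameLen is the practical length limit for a DNS subdomain name.
const maxNameLen = 253

// UniqueName joins prefix, the hex encoding of sum, and the collision
// counter n. The prefix is truncated if the result would be too long.
func UniqueName(prefix string, sum []byte, n int) string {
	suffix := "-" + hex.EncodeToString(sum) + "-" + strconv.Itoa(n)
	room := maxNameLen - len(suffix)
	if room < 0 {
		room = 0 // suffix alone is too long; callers should use a shorter hash
	}
	if len(prefix) > room {
		prefix = prefix[:room]
	}
	return prefix + suffix
}
```

For a prefix of web, a hash of the bytes de ad be ef, and a counter of 2, this yields web-deadbeef-2.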