
Federated ReplicaSets

Requirements & Design Document

This document is a markdown version converted from a working Google Doc. Please refer to the original for extended commentary and discussion.

Author: Marcin Wielgus mwielgus@google.com Based on discussions with Quinton Hoole quinton@google.com, Wojtek Tyczyński wojtekt@google.com

Overview

Summary & Vision

When running a global application on a federation of Kubernetes clusters, the owner currently has to start it in multiple clusters and check that enough application replicas are running both locally in each of the clusters (so that, for example, users are handled by a nearby cluster, with low latency) and globally (so that there is always enough capacity to handle all traffic). If one of the clusters has issues or doesn’t have enough capacity to run the given set of replicas, the replicas should be automatically moved to some other cluster to keep the application responsive.

In single-cluster Kubernetes there is a concept of a ReplicaSet that manages replicas locally. We want to expand this concept to the federation level.

Goals

  • Win large enterprise customers who want to easily run applications across multiple clusters
  • Create a reference controller implementation to facilitate bringing other Kubernetes concepts to Federated Kubernetes.

Glossary

Federation Cluster - a cluster that is a member of the federation.

Local ReplicaSet (LRS) - a ReplicaSet defined and running in a cluster that is a member of the federation.

Federated ReplicaSet (FRS) - a ReplicaSet defined and running inside the Federated K8S server.

Federated ReplicaSet Controller (FRSC) - a controller running inside the Federated K8S server that controls FRSs.

User Experience

Critical User Journeys

  • [CUJ1] User wants to create a ReplicaSet in each of the federation clusters. They create a definition of a Federated ReplicaSet on the federated master and (local) ReplicaSets are automatically created in each of the federation clusters. The number of replicas in each of the Local ReplicaSets is (perhaps indirectly) configurable by the user.
  • [CUJ2] When the current number of replicas in a cluster drops below the desired number and new replicas cannot be scheduled there, they should be started in some other cluster.

Features Enabling Critical User Journeys

Feature #1 -> CUJ1: A component which looks for newly created Federated ReplicaSets and creates the appropriate Local ReplicaSet definitions in the federated clusters.

Feature #2 -> CUJ2: A component that checks how many replicas are actually running in each of the subclusters and whether the number matches the FederatedReplicaSet preferences (by default spread replicas evenly across the clusters, but custom preferences are allowed - see below). If it doesn’t, and the situation is unlikely to improve soon, then the replicas should be moved to other subclusters.

API and CLI

All interaction with a FederatedReplicaSet will be done by issuing kubectl commands pointed at the Federated Master API Server. All the commands behave in a similar way to those on a regular master; however, in later versions (1.5+) some of the commands may give slightly different output. For example, kubectl describe on a federated replica set should also give some information about the subclusters.

Moreover, for safety, some defaults will be different. For example, for kubectl delete federatedreplicaset, cascade will be set to false.

FederatedReplicaSet will have the same object schema as the local ReplicaSet (although it will be accessible in a different part of the API). Scheduling preferences (how many replicas in which cluster) will be passed as annotations.

FederatedReplicaSet preferences

The preferences are expressed by the following structure, passed as serialized JSON inside annotations.

type FederatedReplicaSetPreferences struct {
    // If set to true then already scheduled and running replicas may be moved to other clusters
    // in order to bring cluster replicasets towards a desired state. Otherwise, if set to false,
    // up and running replicas will not be moved.
    Rebalance bool `json:"rebalance,omitempty"`

    // Map from cluster name to preferences for that cluster. It is assumed that if a cluster
    // doesn't have a matching entry then it should not have local replicas. A cluster matches
    // the "*" entry if there is no entry with its real name.
    Clusters map[string]ClusterReplicaSetPreferences `json:"clusters,omitempty"`
}

// Preferences regarding the number of replicas assigned to a cluster replicaset within a federated replicaset.
type ClusterReplicaSetPreferences struct {
    // Minimum number of replicas that should be assigned to this Local ReplicaSet. 0 by default.
    MinReplicas int64 `json:"minReplicas,omitempty"`

    // Maximum number of replicas that should be assigned to this Local ReplicaSet. Unbounded if no value provided (default).
    MaxReplicas *int64 `json:"maxReplicas,omitempty"`

    // A number expressing the preference for putting an additional replica in this Local ReplicaSet. 0 by default.
    Weight int64 `json:"weight,omitempty"`
}

How this works in practice:

Scenario 1. I want to spread my 50 replicas evenly across all available clusters. Config:

FederatedReplicaSetPreferences {
   Rebalance : true
   Clusters : map[string]ClusterReplicaSetPreferences {
     "*" : ClusterReplicaSetPreferences{ Weight: 1}
   }
}

Example:

  • Clusters A,B,C, all have capacity. Replica layout: A=16 B=17 C=17.
  • Clusters A,B,C; C has capacity for only 6 replicas. Replica layout: A=22 B=22 C=6.
  • Clusters A,B,C; B and C are offline. Replica layout: A=50.

Scenario 2. I want to have only 2 replicas in each of the clusters.

FederatedReplicaSetPreferences {
   Rebalance : true
   Clusters : map[string]ClusterReplicaSetPreferences {
     "*" : ClusterReplicaSetPreferences{ MaxReplicas: 2, Weight: 1}
   }
}

Or

FederatedReplicaSetPreferences {
   Rebalance : true
   Clusters : map[string]ClusterReplicaSetPreferences {
     "*" : ClusterReplicaSetPreferences{ MinReplicas: 2, Weight: 0 }
   }
}

Or

FederatedReplicaSetPreferences {
   Rebalance : true
   Clusters : map[string]ClusterReplicaSetPreferences {
     "*" : ClusterReplicaSetPreferences{ MinReplicas: 2, MaxReplicas: 2}
   }
}

There is a global target of 50, however if there are 3 clusters there will be only 6 replicas running.

Scenario 3. I want to have 20 replicas in each of 3 clusters.

FederatedReplicaSetPreferences {
   Rebalance : true
   Clusters : map[string]ClusterReplicaSetPreferences {
     "*" : ClusterReplicaSetPreferences{ MinReplicas: 20, Weight: 0}
   }
}

There is a global target of 50, however the clusters require 60 in total, so some clusters will have fewer replicas. Replica layout: A=20 B=20 C=10.

Scenario 4. I want to have equal number of replicas in clusters A,B,C, however don’t put more than 20 replicas to cluster C.

FederatedReplicaSetPreferences {
   Rebalance : true
   Clusters : map[string]ClusterReplicaSetPreferences {
     "*" : ClusterReplicaSetPreferences{ Weight: 1},
     "C" : ClusterReplicaSetPreferences{ MaxReplicas: 20, Weight: 1}
   }
}

Example:

  • All have capacity. Replica layout: A=16 B=17 C=17.
  • B is offline/has no capacity. Replica layout: A=30 B=0 C=20.
  • A and B are offline. Replica layout: C=20.

Scenario 5. I want to run my application in cluster A, however if there is trouble FRS can also use clusters B and C, equally.

FederatedReplicaSetPreferences {
   Clusters : map[string]ClusterReplicaSetPreferences {
     "A" : ClusterReplicaSetPreferences{ Weight: 1000000},
     "B" : ClusterReplicaSetPreferences{ Weight: 1},
     "C" : ClusterReplicaSetPreferences{ Weight: 1}
   }
}

Example:

  • All have capacity. Replica layout: A=50 B=0 C=0.
  • A has capacity for only 40 replicas. Replica layout: A=40 B=5 C=5.

Scenario 6. I want to run my application in clusters A, B and C. Cluster A gets twice the QPS of the other clusters.

FederatedReplicaSetPreferences {
   Clusters : map[string]ClusterReplicaSetPreferences {
     "A" : ClusterReplicaSetPreferences{ Weight: 2},
     "B" : ClusterReplicaSetPreferences{ Weight: 1},
     "C" : ClusterReplicaSetPreferences{ Weight: 1}
   }
}

Scenario 7. I want to spread my 50 replicas evenly across all available clusters, but if there are already some replicas, please do not move them. Config:

FederatedReplicaSetPreferences {
   Rebalance : false
   Clusters : map[string]ClusterReplicaSetPreferences {
     "*" : ClusterReplicaSetPreferences{ Weight: 1}
   }
}

Example:

  • Clusters A,B,C all have capacity, but A already has 20 replicas. Replica layout: A=20 B=15 C=15.
  • Clusters A,B,C; C has capacity for 6 replicas, and A already has 20 replicas. Replica layout: A=22 B=22 C=6.
  • Clusters A,B,C; C has capacity for 6 replicas, and A already has 30 replicas. Replica layout: A=30 B=14 C=6.

The Idea

A new federated controller - the Federated ReplicaSet Controller (FRSC) - will be created inside the federated controller manager. The key elements of the idea are enumerated below:

  • [I0] It is considered OK to have a slightly higher number of replicas globally for some time.

  • [I1] FRSC starts an informer on FederatedReplicaSets that listens for FRS create, update and delete events. On each create/update the scheduling code is run to calculate where to put the replicas. The default behavior is to start the same number of replicas in each of the clusters. While creating Local ReplicaSets (LRS) the following errors/issues can occur:

    • [E1] Master rejects LRS creation (for a known or unknown reason). In this case another attempt to create the LRS should be made in 1m or so. This action can be tied with [I5]. Until the LRS is created the situation is the same as [E5]. If this happens multiple times, all due replicas should be moved elsewhere and later moved back once the LRS is created.

    • [E2] An LRS with the same name but a different configuration already exists. The LRS is then overwritten and an appropriate event is created to explain what happened. Pods under the control of the old LRS are left intact and the new LRS may adopt them if they match the selector.

    • [E3] The LRS is new but pods that match the selector exist. The pods are adopted by the RS (if not owned by some other RS). However, they may have a different image, configuration, etc., just like with a regular LRS.

  • [I2] For each of the clusters FRSC starts a store and an informer on LRSs that listen for status updates. These status changes are only interesting in case of trouble. Otherwise it is assumed that the LRS runs trouble-free and there is always the right number of pods created, though possibly not scheduled.

    • [E4] An LRS is manually deleted from the local cluster. In this case a new LRS should be created. This is the same case as [E1]. Any pods that were left behind won’t be killed and will be adopted after the LRS is recreated.

    • [E5] An LRS fails to create (not necessarily schedule) the desired number of pods due to master troubles, admission control, etc. This should be treated the same as replicas being unable to schedule (see [I4]).

    • [E6] It is impossible to tell whether an informer lost its connection with a remote cluster or has some other synchronization problem, so this should be handled by the cluster liveness probe and cluster deletion [I6].

  • [I3] For each of the clusters, start a store and an informer to monitor whether the created pods are eventually scheduled and what the current number of correctly running, ready pods is. Errors:

    • [E7] It is impossible to tell whether an informer lost its connection with a remote cluster or has some other synchronization problem, so this should be handled by the cluster liveness probe and cluster deletion [I6].

  • [I4] It is assumed that an unscheduled pod is a normal situation and can last up to X min if there is heavy traffic on the cluster. However, if the replicas are not scheduled within that time, FRSC should consider moving most of the unscheduled replicas elsewhere. For that purpose FRSC will maintain a data structure where, for each FRS-controlled LRS, we store the list of pods belonging to that LRS along with their current status and status-change timestamp.

  • [I5] If a new cluster is added to the federation then it doesn’t have an LRS, and the situation is equivalent to [E1]/[E4].

  • [I6] If a cluster is removed from the federation then the situation is equivalent to multiple [E4]s. It is assumed that if the connection with a cluster is lost completely then the cluster is removed from the cluster list (or marked accordingly), so [E6] and [E7] don’t need to be handled.

  • [I7] All ToBeChecked FRSs are browsed every 1 min (configurable), checked against the current list of clusters, and all missing LRSs are created. This is executed in combination with [I8].

  • [I8] All pods from ToBeChecked FRSs/LRSs are browsed every 1 min (configurable) to check whether some replica move between clusters is needed.

  • FRSC never moves replicas to an LRS that has unscheduled/not-running pods or pods that failed to be created.

    • When FRSC notices that a number of pods have been unscheduled/not running, or not even created, in one LRS for more than Y minutes, it takes most of them away from that LRS, leaving a couple still waiting, so that once they are scheduled FRSC knows that it is OK to put more replicas in that cluster.
  • [I9] An FRS becomes ToBeChecked if:

    • it is newly created
    • some replica set inside changed its status
    • some pods inside a cluster changed their status
    • some cluster is added or deleted

    An FRS stops being ToBeChecked if it is in the desired configuration (or is stable enough).

(RE)Scheduling algorithm

To calculate the (re)scheduling moves for a given FRS:

  1. For each cluster FRSC calculates the number of replicas that are placed (not necessarily up and running) in the cluster and the number of replicas that failed to be scheduled. Cluster capacity is the difference between the number placed and the number that failed to schedule.

  2. Order all clusters by their weight and a hash of their name so that every time the same replica set is processed the clusters are processed in the same order. Include the federated replica set name in the cluster-name hash so that different RSs get slightly different orderings, and not all RSs of size 1 end up in the same cluster.

  3. Assign the minimum preferred number of replicas to each of the clusters, if there are enough replicas and capacity.

  4. If rebalance = false, assign the previously present replicas to the clusters and remember the number of extra replicas added (ER) - of course, only if there are enough replicas and capacity.

  5. Distribute the remaining replicas with regard to weights and cluster capacity. In multiple iterations calculate how many of the replicas should end up in each cluster. For each cluster, cap the number of assigned replicas by the max number of replicas and the cluster capacity. If some extra replicas were added to a cluster in step 4, don’t actually add new replicas but balance them against the ER from step 4.
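The steps above can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions: it covers steps 1-3 and 5 but skips the rebalance = false bookkeeping of step 4, replaces the weight+hash ordering of step 2 with a plain weight-then-name sort, and the `clusterPref` type and `distribute` function are inventions for this sketch, not the controller's real code.

```go
package main

import (
	"fmt"
	"sort"
)

type clusterPref struct {
	Name     string
	Weight   int64
	Min      int64
	Max      int64 // -1 means unbounded
	Capacity int64 // estimated schedulable replicas in the cluster (step 1)
}

// distribute assigns `total` replicas across clusters: minimums first
// (step 3), then the remainder in proportion to weight, never exceeding
// Max or Capacity (step 5).
func distribute(total int64, prefs []clusterPref) map[string]int64 {
	// Deterministic order: higher weight first, then name (a stand-in for
	// the weight + name-hash ordering described in step 2).
	sorted := append([]clusterPref(nil), prefs...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].Weight != sorted[j].Weight {
			return sorted[i].Weight > sorted[j].Weight
		}
		return sorted[i].Name < sorted[j].Name
	})

	limitFor := func(c clusterPref) int64 {
		limit := c.Capacity
		if c.Max >= 0 && c.Max < limit {
			limit = c.Max
		}
		return limit
	}

	assigned := make(map[string]int64)
	remaining := total
	// Step 3: satisfy minimums while replicas and capacity allow.
	for _, c := range sorted {
		n := c.Min
		if n > limitFor(c) {
			n = limitFor(c)
		}
		if n > remaining {
			n = remaining
		}
		assigned[c.Name] = n
		remaining -= n
	}
	// Step 5: hand out the rest, Weight replicas per cluster per pass.
	for remaining > 0 {
		progress := false
		for _, c := range sorted {
			give := c.Weight
			if give > remaining {
				give = remaining
			}
			if room := limitFor(c) - assigned[c.Name]; give > room {
				give = room
			}
			if give > 0 {
				assigned[c.Name] += give
				remaining -= give
				progress = true
			}
		}
		if !progress {
			break // every cluster is at its cap; extra replicas stay unplaced
		}
	}
	return assigned
}

func main() {
	// Scenario 1: 50 replicas, three equal-weight clusters with capacity.
	prefs := []clusterPref{
		{Name: "A", Weight: 1, Max: -1, Capacity: 100},
		{Name: "B", Weight: 1, Max: -1, Capacity: 100},
		{Name: "C", Weight: 1, Max: -1, Capacity: 100},
	}
	fmt.Println(distribute(50, prefs)) // prints map[A:17 B:17 C:16]
}
```

Note how the layouts in the scenarios above fall out of this loop: equal weights spread the replicas to within one of each other, and a cluster at its Max or Capacity simply stops absorbing replicas, which then overflow to the others.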

Goroutines layout

  • [GR1] Involved in the FRS informer (see [I1]). Whenever an FRS is created or updated it puts the new/updated FRS on FRS_TO_CHECK_QUEUE with delay 0.

  • [GR2_1…GR2_N] Involved in informers/stores on LRSs (see [I2]). On all changes the FRS is put on FRS_TO_CHECK_QUEUE with delay 1 min.

  • [GR3_1…GR3_N] Involved in informers/stores on pods (see [I3] and [I4]). They maintain the status store so that for each LRS we know the number of pods that are actually running and ready in O(1) time. They also put the corresponding FRS on FRS_TO_CHECK_QUEUE with delay 1 min.

  • [GR4] Involved in the cluster informer (see [I5] and [I6]). It puts all FRSs on FRS_TO_CHECK_QUEUE with delay 0.

  • [GR5_*] Goroutines handling FRS_TO_CHECK_QUEUE that put an FRS on FRS_CHANNEL after the given delay (and remove it from FRS_TO_CHECK_QUEUE). Every time an already present FRS is added to FRS_TO_CHECK_QUEUE the delays are compared and updated so that the shorter delay is used.

  • [GR6] Contains a selector that listens on FRS_CHANNEL. Whenever an FRS is received it is put on a work queue. The work queue has no delay and makes sure that a single replica set is processed by only one goroutine.

  • [GR7_*] Goroutines related to the work queue. They fire DoFrsCheck on the FRS. Multiple replica sets can be processed in parallel, but two goroutines cannot process the same FRS at the same time.

Func DoFrsCheck

The function does [I7] and [I8]. It is assumed to run on a single thread/goroutine, so the same FRS is not checked and evaluated by many goroutines at once (however, if needed, the function can be parallelized across different FRSs). It takes data only from the stores maintained by GR2_* and GR3_*. External communication is only required to:

  • Create an LRS. If an LRS doesn’t exist it is created after the rescheduling, when we know how many replicas it should have.

  • Update LRS replica targets.

If the FRS is not in the desired state then it is put on FRS_TO_CHECK_QUEUE with delay 1 min (possibly increasing).

Monitoring and status reporting

FRSC should expose a number of metrics from its run, like:

  • FRSC -> LRS communication latency
  • Total time spent in various elements of DoFrsCheck

FRSC should also expose the status of each FRS as an annotation on the FRS and as events.

Workflow

Here is the sequence of tasks that need to be done in order for a typical FRS to be split into a number of LRSs and created in the underlying federated clusters.

Note a: the reason the workflow is helpful at this phase is that for every one or two steps we can create corresponding PRs to start the development.

Note b: we assume that the federation is already in place and the federated clusters are added to the federation.

Step 1. The client sends an RS create request to the federation-apiserver.

Step 2. The federation-apiserver persists an FRS into the federation etcd.

Note c: the federation-apiserver populates the clusterid field in the FRS before persisting it into the federation etcd.

Step 3. The federation-level “informer” in FRSC watches the federation etcd for new/modified FRSs with an empty clusterid or a clusterid equal to the federation ID; if one is detected, it calls the scheduling code.

Step 4. The scheduling code decides how many replicas each target cluster should run, and the resulting LRSs are posted to the target clusters.

Note d: the scheduler populates the clusterid field in each LRS with the ID of its target cluster.

Note e: at this point let us assume that it only does an even distribution, i.e., equal weights for all of the underlying clusters.

Step 5. As soon as the scheduler function returns control to FRSC, the FRSC starts a number of cluster-level “informer”s, one per target cluster, to watch changes in every target cluster’s etcd regarding the posted LRSs; if any deviation from the scheduled number of replicas is detected, the scheduling code is called again for rescheduling purposes.