Running various types of workloads with different priorities is a common practice in medium and large clusters to achieve higher resource utilization. In such scenarios, the amount of work submitted can be larger than what the total resources of the cluster can handle. If so, the cluster chooses the most important workloads and runs them. The importance of a workload is specified by a combination of priority, QoS, and other cluster-specific metrics. The potential to have more work than cluster resources can handle is called “overcommitment”. Overcommitment is very common in on-prem clusters where the number of nodes is fixed, but it can similarly happen in the cloud, as cloud customers may choose to run their clusters overcommitted at times in order to save money. For example, a cloud customer may choose to run at most 100 nodes, knowing that all of their critical workloads fit on 100 nodes; any additional work is not critical and can wait until cluster load decreases.
When a new pod has scheduling requirements that make it infeasible on any node in the cluster, the scheduler may choose to kill lower priority pods to satisfy those requirements. We call this operation “preemption”. Preemption is distinguished from “eviction”, where the kubelet kills a pod on a node because that particular node is running out of resources.
This document describes how preemption works in Kubernetes. Preemption is the action taken when an important pod requires resources or conditions that are not available in the cluster, so one or more pods must be killed to make room for the important pod.
In this proposal, the only scenario under which a group of pods may be preempted is when a higher priority pod cannot be scheduled due to unmet scheduling requirements, such as lack of resources or unsatisfied affinity or anti-affinity rules, and preempting the lower priority pods allows the higher priority pod to be scheduled. If preempting the lower priority pods would not help schedule the higher priority pod, those lower priority pods keep running and the higher priority pod stays pending.
Please note the terminology here. The above scenario does not include “evictions” that are performed by the Kubelet when a node runs out of resources.
Please also note that the scheduler may preempt a pod on one node in order to meet the scheduling requirements of a pending pod on another node. For example, suppose a low-priority pod is running on node N in rack R, a high-priority pod is pending, and one or both of the pods have a requiredDuringScheduling anti-affinity rule saying they cannot run in the same rack. The lower-priority pod might then be preempted to enable the higher-priority pod to schedule onto some node M != N in rack R (or, of course, M == N, which is the more standard same-node preemption scenario).
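For illustration, such a rack-level required anti-affinity rule might be expressed with the core/v1 API types roughly as follows. This is only a sketch: the `app=critical-service` selector and the `example.com/rack` topology key are made-up values, and nodes would need a matching rack label for the rule to take effect.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical anti-affinity: never co-locate, within the same rack,
	// with pods labeled app=critical-service.
	antiAffinity := &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
				{
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"app": "critical-service"},
					},
					// Assumes nodes carry a rack label with this key.
					TopologyKey: "example.com/rack",
				},
			},
		},
	}
	fmt.Printf("%+v\n", antiAffinity)
}
```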
We propose that preemption be done by the scheduler, which performs it by deleting the victim pods. The component that performs preemption must have the logic to find the right node for the pending pod. It must also have the logic to check whether preempting the chosen pods allows the pending pod to be scheduled. These requirements mean the component must have knowledge of the predicate and priority functions.
We believe the scheduler is the right component to perform preemption; alternative components (the rescheduler and the Kubelet) are discussed later in this document.
When scheduling a pending pod, the scheduler tries to place the pod on a node that does not require preemption. If there is no such node, the scheduler may favor a node where the number and/or priority of the victims (the preempted pods) is smallest. After choosing the node, the scheduler considers the lowest priority pods for preemption first: starting from the lowest priority, it selects just enough pods whose preemption allows the pending pod to schedule. The scheduler only considers pods that have lower priority than the pending pod.
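As a rough illustration of this victim-selection order, here is a minimal sketch. It is not the scheduler's actual code: it models a node by a single CPU quantity, whereas the real scheduler re-runs its full predicate functions; the `pod` type and `pickVictims` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// Simplified model for illustration only.
type pod struct {
	name     string
	priority int32
	cpu      int64 // requested milli-CPU
}

// pickVictims sketches the order described above: consider only pods with
// lower priority than the pending pod, preempt the lowest priority pods
// first, and stop as soon as the pending pod would fit.
func pickVictims(pending pod, running []pod, nodeCapacity int64) []pod {
	var candidates []pod
	used := int64(0)
	for _, p := range running {
		used += p.cpu
		if p.priority < pending.priority {
			candidates = append(candidates, p)
		}
	}
	// Lowest priority first.
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].priority < candidates[j].priority
	})
	var victims []pod
	free := nodeCapacity - used
	for _, v := range candidates {
		if free >= pending.cpu {
			break
		}
		victims = append(victims, v)
		free += v.cpu
	}
	if free < pending.cpu {
		return nil // preempting every lower priority pod would not help
	}
	return victims
}

func main() {
	running := []pod{{"a", 100, 2000}, {"b", 500, 1000}, {"c", 900, 1000}}
	victims := pickVictims(pod{"p", 1000, 2500}, running, 4000)
	fmt.Println(victims) // [{a 100 2000} {b 500 1000}]
}
```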
“Eviction” is the act of killing one or more pods on a node when the node is under resource pressure; it is performed by the kubelet. The eviction process is described in a separate document by sig-node, but since it is closely related to preemption, we explain it briefly here.
Kubelet uses a function of priority, usage, and requested resources to determine which pod(s) should be evicted. When pods with the same priority are considered for eviction, the one with the highest percentage of usage over “requests” is evicted first.
This implies that Best Effort pods are more likely to be evicted among a set of pods with the same priority. The reason is that any amount of resource usage by a Best Effort pod translates into a very large percentage of usage over “requests”, as Best Effort pods request zero resources. So, while the scheduler does not preempt Best Effort pods to release resources on a node, these pods are likely to be evicted by the Kubelet after the scheduler schedules a higher priority pod on the node.
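To make the ratio concrete, here is a minimal sketch assuming a single memory resource. The `podUsage` model and `usageOverRequests` function are hypothetical, not the kubelet's actual code; they only illustrate why zero requests push Best Effort pods to the front of the eviction order.

```go
package main

import (
	"fmt"
	"math"
)

// Simplified model for illustration only.
type podUsage struct {
	name      string
	requested int64 // requested bytes of memory; 0 for Best Effort
	used      int64 // bytes actually in use
}

// usageOverRequests returns the fraction of usage over requests, the
// tie-breaker described above for pods of equal priority. A Best Effort pod
// (zero requests) with any usage at all ranks infinitely over its requests.
func usageOverRequests(p podUsage) float64 {
	if p.requested == 0 {
		if p.used > 0 {
			return math.Inf(1)
		}
		return 0
	}
	return float64(p.used) / float64(p.requested)
}

func main() {
	burstable := podUsage{"burstable", 2 << 30, 1 << 30} // uses 50% of requests
	bestEffort := podUsage{"best-effort", 0, 512 << 20}  // requests nothing
	fmt.Println(usageOverRequests(burstable))  // 0.5
	fmt.Println(usageOverRequests(bestEffort)) // +Inf => evicted first
}
```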
Here is an example: suppose a node has 10GB of memory. A Burstable pod with priority 100 requests 2GB and currently uses 1GB, and a Best Effort pod with priority 100 requests nothing and currently uses 4GB. A pending pod with priority 200 requests 6GB. From the scheduler's point of view the node has 8GB of unrequested memory, so the pending pod is scheduled there without any preemption. Once the pending pod starts using its memory, the node comes under memory pressure, and the kubelet evicts the Best Effort pod, whose usage exceeds its (zero) requests by the largest percentage.
So, best effort pods may be killed to make room for higher priority pods, although the scheduler does not preempt them directly.
Now, assume everything in the above example, except that the best effort pod has priority 2000. In this scenario, the scheduler still schedules the pending pod with priority 200 on the node, but that pod may later be evicted by the Kubelet, because the Kubelet's eviction function may determine that the best effort pod should stay, given its higher priority and despite its usage above requests. In this scenario, if its pod is evicted by the Kubelet, the scheduler should avoid the node and try scheduling the pod on a different node. This is an optimization to prevent possible ping-pong behavior between the Kubelet and the scheduler.
Kubernetes allows a cluster to have more than one scheduler. This introduces a race condition: one scheduler (scheduler A) may preempt one or more pods, and another scheduler (scheduler B) may schedule a different pod into the space opened by those preemptions before scheduler A has a chance to schedule its initial pending pod. In this case, scheduler A goes ahead and schedules the initial pending pod on the node, thinking that the space is still available. However, the pod from A will be rejected by the kubelet admission process if there are not enough free resources on the node after the pod from B has been bound (or if any other predicate that kubelet admission checks fails). This is not a major issue, as schedulers will try again to schedule the rejected pod.
Our assumption is that multiple schedulers cooperate with one another. If they do not, scheduler A may schedule pod A, scheduler B may preempt pod A to schedule pod B, scheduler A may then preempt pod B to schedule pod A again, and so on in a loop.
Evicting victim(s) and binding the pending Pod (P) are not transactional. Preemption victims may have “TerminationGracePeriodSeconds”, which creates an even larger time gap between the eviction point and the binding point: when a victim with a termination grace period receives its termination signal, it keeps running on the node until it terminates successfully or its grace period is over. This creates a time gap between the point at which the scheduler preempts Pods and the time when the pending Pod (P) can be scheduled on the Node (N). Note that the pending queue is a FIFO; when a Pod is considered for scheduling and cannot be scheduled, it goes to the end of the queue. When P is determined unschedulable and preempts victims, it goes to the end of the queue as well. After preempting victims, the scheduler keeps scheduling other pending Pods. As victims exit or get terminated, the scheduler tries to schedule the Pods in the pending queue, and one or more of them may be considered and scheduled on N before the scheduler considers P again. In such a case, it is likely that by the time all the victims have exited, P no longer fits on N. The scheduler will then have to preempt other Pods, on N or on another Node, so that P can be scheduled. This scenario might repeat for the second and subsequent rounds of preemption, and P might not get scheduled for a while. This can cause problems in various clusters, but it is particularly problematic in clusters with a high Pod creation rate.
When determining feasibility of a pod on a node, assume that all the pods with higher or equal priority in the unschedulable list are already running on their respective “nominated” nodes. Pods in the unschedulable list that do not have a nominated node are not considered running.
If the pod is schedulable on the node in the presence of those higher priority pods, run the predicates again without the higher priority pods on the node. If the pod is still schedulable, it can run there. This second step is needed because the higher priority pods are not actually running on the node yet; without it, certain predicates, such as inter-pod affinity, might be satisfied only because of pods that may never actually run on the node.
This applies to preemption logic as well, i.e., preemption logic must follow the two steps when it considers viability of preemption.
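Here is a rough sketch of these two steps. The simplified `Pod` and `Node` types, the `fitsPredicates` placeholder, and the `feasible` function are hypothetical; the real scheduler evaluates its full set of predicate functions against complete pod and node objects.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for illustration.
type Pod struct {
	Name          string
	Priority      int32
	NominatedNode string // empty if the pod has no nominated node
}

type Node struct{ Name string }

// fitsPredicates is a placeholder for running all predicates for pod p on
// node n, given the set of pods assumed to be running there.
func fitsPredicates(p Pod, n Node, assumed []Pod) bool {
	return true // placeholder; real predicate logic omitted
}

// feasible sketches the two-step check described above.
func feasible(p Pod, n Node, running, unschedulable []Pod) bool {
	// Step 1: assume unschedulable pods with higher or equal priority are
	// already running on their nominated nodes.
	withNominated := append([]Pod{}, running...)
	for _, u := range unschedulable {
		if u.Priority >= p.Priority && u.NominatedNode == n.Name {
			withNominated = append(withNominated, u)
		}
	}
	if !fitsPredicates(p, n, withNominated) {
		return false
	}
	// Step 2: run the predicates again without the nominated pods, since
	// they are not actually running yet; a predicate such as inter-pod
	// affinity must not pass only because of a pod that may never run here.
	return fitsPredicates(p, n, running)
}

func main() {
	n := Node{Name: "node-1"}
	pending := Pod{Name: "p", Priority: 100}
	higher := Pod{Name: "q", Priority: 200, NominatedNode: "node-1"}
	fmt.Println(feasible(pending, n, nil, []Pod{higher}))
}
```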
The alpha version of preemption already contains logic that limits the scenarios in which preemption is performed for a pod. The new changes for Beta are as follows:
Scheduler preemption will support PDBs for Beta, but respecting a PDB is not guaranteed: preemption tries to avoid violating PDBs, but if it cannot find any set of lower priority pods to preempt without violating a PDB, it goes ahead and preempts victims despite the violation. This is to guarantee that higher priority pods always get precedence over lower priority pods in obtaining cluster resources.
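A rough sketch of this victim ordering follows, under the simplifying assumption that we already know which evictions would violate a PDB. The `victim` type and `orderVictims` are hypothetical; the real scheduler re-evaluates predicates as victims are removed rather than working from a flat list.

```go
package main

import (
	"fmt"
	"sort"
)

// Simplified model for illustration only.
type victim struct {
	name        string
	priority    int32
	violatesPDB bool // evicting this pod would violate its PodDisruptionBudget
}

// orderVictims puts PDB-respecting victims ahead of PDB-violating ones and,
// within each group, lower priority pods first. Taking victims from the
// front of this list until the pending pod fits means a PDB is violated only
// when there is no other way to make room for the higher priority pod.
func orderVictims(candidates []victim) []victim {
	ordered := append([]victim{}, candidates...)
	sort.SliceStable(ordered, func(i, j int) bool {
		if ordered[i].violatesPDB != ordered[j].violatesPDB {
			return !ordered[i].violatesPDB // PDB-respecting victims first
		}
		return ordered[i].priority < ordered[j].priority // then lowest priority
	})
	return ordered
}

func main() {
	fmt.Println(orderVictims([]victim{
		{"a", 100, true},
		{"b", 200, false},
		{"c", 50, false},
	}))
	// Output: [{c 50 false} {b 200 false} {a 100 true}]
}
```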
Here is what preemption will do:
The first step of the preemption algorithm is to find out whether a given Node (N) has the potential to run the pending pod (P). To do so, the preemption logic simulates removal of all Pods with lower priority than P from N and then checks whether P can be scheduled on N. If P still cannot be scheduled on N, then N is considered infeasible.
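A minimal sketch of this potential check, with a placeholder standing in for the real predicate functions (`simPod`, `fitsPredicates`, and `hasPreemptionPotential` are hypothetical names):

```go
package main

import "fmt"

// Minimal stand-in for illustration.
type simPod struct {
	name     string
	priority int32
}

// fitsPredicates stands in for running the scheduler's predicate functions
// for the pending pod against a node running the given set of pods.
func fitsPredicates(pending simPod, node string, podsOnNode []simPod) bool {
	return true // placeholder; real predicate logic omitted
}

// hasPreemptionPotential sketches the first step described above: simulate
// removal of all pods with lower priority than P from N, then check whether
// P would pass the predicates on N. If not, N is infeasible for preemption.
func hasPreemptionPotential(p simPod, node string, running []simPod) bool {
	var kept []simPod
	for _, r := range running {
		if r.priority >= p.priority {
			kept = append(kept, r) // only higher/equal priority pods remain
		}
	}
	return fitsPredicates(p, node, kept)
}

func main() {
	running := []simPod{{"low", 10}, {"high", 500}}
	fmt.Println(hasPreemptionPotential(simPod{"pending", 100}, "node-1", running))
}
```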
The problem with this approach is that if P has an inter-pod affinity to one of those lower priority pods on N, the preemption logic determines that N is infeasible for preemption, while N might actually be able to run both P and the Pod(s) that P has affinity to.
Solving this problem properly would require a considerably more elaborate algorithm.
Supporting inter-pod affinity to lower priority pods needs fairly complex logic that could degrade performance when many pods match the pending pod's affinity rules. We could have limited the maximum number of matching pods supported in order to address the performance issue, but that would have been very confusing to users and would have removed the predictability of scheduling. Moreover, inter-pod affinity is a way for users to define dependencies among pods, and inter-pod affinity to lower priority pods creates a dependency on lower priority pods; such a dependency is probably not desired in most realistic scenarios. Given these points, we decided not to implement this feature.
In certain scenarios, scheduling a pending pod (P) on a node (N1) requires the preemption of one or more pods on other nodes. One example is a lower priority pod, running on a different node in the same zone, that has anti-affinity to P with “zone” as its topology. Another example is a lower priority pod, running on a different node than N1, that is consuming a non-local resource P needs. In all such cases, preemption of one or more pods on nodes other than N1 is required to make P schedulable on N1. Such preemption is called “cross node preemption”.
When a pod P is not schedulable on a node N even after the removal of all lower priority pods from N, there may be pods on other nodes that prevent it from scheduling. Since the scheduler's preemption logic should not rely on the internals of its predicate functions, it would have to perform an exhaustive search for other pods whose removal might allow P to be scheduled. Such an exhaustive search would be prohibitively expensive in large clusters.
Given that we do not have a solution with reasonable performance for supporting cross node preemption, we have decided not to implement this feature.
Preemption gives higher precedence to the most important pods in the cluster and tries to provide better availability of cluster resources for such pods. As a result, we may not need to scale the cluster up for all pending pods. In particular, scaling up the cluster may not be necessary in two scenarios:

1. The pending pod can be scheduled by preempting lower priority pods (it has a nominated node).
1. The priority of the pending pod is so low that the user does not want the cluster to be scaled up for it.
In order to address these cases:
1. Cluster Autoscaler will not scale up the cluster for pods that have a nominated node, as such pods are expected to be scheduled once their preemption victims terminate.
1. Cluster Autoscaler ignores all the pods whose priority is below a certain value. This value may be configured by a command line flag and is zero by default.
There are two potential alternatives for the component that performs preemption: rescheduler and Kubelet.
Kubernetes has a “rescheduler” that performs a rudimentary form of preemption today. The more sophisticated form of preemption proposed in this document would require many changes to the current rescheduler; most importantly, the rescheduler would have to replicate much of the scheduler's logic (its predicate and priority functions) in order to find feasible nodes and choose victims correctly.
Another option is for the scheduler to send the pending pod to a node without doing any preemption, relying on the kubelet to perform the preemption(s). Similar to the rescheduler option, this option requires replicating the preemption and scheduling logic: the kubelet already has logic to evict pods when a node is under resource pressure, but this logic is much simpler than the full scheduling logic, which considers various scheduling parameters such as affinity, anti-affinity, and PodDisruptionBudget. That is why we believe the scheduler is the right component to perform preemption.
An alternative to preempting by priority and breaking ties with QoS, as described earlier, is to preempt by QoS first and break ties with priority. We believe this could cause confusion for users and might reduce cluster utilization. Imagine the following scenario: a user runs a high priority web server as a Burstable pod, with its “requests” set below its peak usage, while lower priority “Guaranteed” pods run in the same cluster.
If the scheduler uses QoS as the first metric for preemption, the web server will be preempted ahead of the lower priority “Guaranteed” pods. This can be counterintuitive to users, as they probably do not expect a lower priority pod to effectively preempt a higher priority one.
To work around the problem, the user might try running the web server as Guaranteed, but in that case they would have to set much higher “requests” than the web server normally uses. This would prevent other pods from being scheduled on the node running the web server and would therefore lower the node's resource utilization.