Implementation Owner: @pospispa
User can delete a Persistent Volume Claim (PVC) that is being used by a pod. This may have negative impact on the pod and it may result in data loss.
For more details see issue https://github.com/kubernetes/kubernetes/issues/45143
Postpone the PVC deletion until the PVC is not used by any pod.
Depending on the storage type the data loss occurs in one of the below scenarios:
- in case the dynamic provisioning is used and reclaim policy is
Delete the PVC deletion triggers deletion of the associated storage asset and PV.
- the same as above applies for the static provisioning and
Delete reclaim policy.
A new plugin for PVC admission controller will be created. The plugin will automatically add finalizer information into newly created PVC’s metadata.
Scheduler will check if a pod uses a PVC and if any of the PVCs has
deletionTimestamp set. In case this is true an error will be logged: “PVC (%pvcName) is in scheduled for deletion state” and scheduler will behave as if PVC was not found.
Kubelet does currently live lookup of PVC(s) that are used by a pod.
In case any of the PVC(s) used by the pod has the
deletionTimestamp set kubelet won’t start the pod but will report and error: “can’t start pod (%pod) because it’s using PVC (%pvcName) that is being deleted”. Kubelet will follow the same code path as if PVC(s) do not exist.
PVC finalizing controller is a new internal controller.
PVC finalizing controller watches for both PVC and pod events that are processed as described below:
1. PVC add/update/delete events:
nil and finalizer is missing, the PVC is added to PVC queue.
non-nil and finalizer is present, the PVC is added to PVC queue.
2. Pod add events:
- If pod is terminated, all referenced PVCs are added to PVC queue.
3. Pod update events:
- If pod is changing from non-terminated to terminated state, all referenced PVCs are added to PVC queue.
4. Pod delete events:
- All referenced PVCs are added to PVC queue.
PVC and pod information are kept in a cache that is done inherently for an informer.
The PVC queue holds PVCs that need to be processed according to the below rules:
- If PVC is not found in cache, the PVC is skipped.
- If PVC is in cache with
deletionTimestamp and missing finalizer, finalizer is added to the PVC. In case the adding finalizer operation fails, the PVC is re-queued into the PVC queue.
- If PVC is in cache with
deletionTimestamp and finalizer is present, live pod list is done for the PVC namespace. If all pods referencing the PVC are not yet bound to a node or are terminated, the finalizer removal is attempted. In case the finalizer removal operation fails the PVC is re-queued.
In case a PVC has the
deletionTimestamp set the commands
kubectl get pvc and
kubectl describe pvc will display that the PVC is in terminating state.
There were alternatives discussed in issue https://github.com/kubernetes/kubernetes/issues/45143
The implementation proposes that kubelet live updates PVC(s) used by a pod before it starts the pod in order not to start a pod that uses a PVC that has the
An alternative is that scheduler will live update PVC(s) used by a pod in order not to schedule a pod that uses a PVC that has the
But live update represents a performance penalty. As the live update performance penalty is already present in the kubelet it’s better to do the live update in kubelet.
Scheduler will maintain information on both pods and PVCs from API server.
In case a pod is being scheduled and is using PVCs that do not have condition PVCUsedByPod set it will set this condition for these PVCs.
In case a pod is terminated and was using PVCs the scheduler will update PVCUsedByPod condition for these PVCs accordingly.
PVC finalizing controller won’t watch pods because the information whether a PVC is used by a pod or not is now maintained by the scheduler.
In case PVC finalizing controller gets an update of a PVC and this PVC has
deletionTimestamp set it will do live PVC update for this PVC in order to get up-to-date value of its PVCUsedByPod field. In case the PVCUsedByPod is not true it will remove the finalizer information from this PVC.
Scheduler will be responsible for removing the finalizer information from PVCs that are being deleted.
So scheduler will watch pods and PVCs and will maintain internal cache of pods and PVCs.
In case a PVC is deleted scheduler will do one of the below: - In case the PVC is used by a pod it will add the PVC into its internal set of PVCs that are waiting for deletion. - In case the PVC is not used by a pod it will remove the finalizer information from the PVC metadata.
Note: scheduler is the source of truth of pods that are being started. The information on active pods may be a little bit outdated that causes that deletion of a PVC may be postponed (pod status in schedular is active while the pod is terminated in API server), but this does not cause any harm.
The disadvantage is that scheduler will become responsible for PVC deletion postponing that will make scheduler bigger.