Kubernetes 1.26: Pod Scheduling Readiness
Author: Wei Huang (Apple), Abdullah Gharaibeh (Google)
Kubernetes 1.26 introduced a new Pod feature: scheduling gates. In Kubernetes, scheduling gates are keys that tell the scheduler when a Pod is ready to be considered for scheduling.
What problem does it solve?
When a Pod is created, the scheduler will continuously attempt to find a node that fits it. This infinite loop continues until the scheduler either finds a node for the Pod, or the Pod gets deleted.
Pods that remain unschedulable for long periods of time (e.g., ones that are blocked on some external event) waste scheduling cycles. A scheduling cycle may take ≅20ms or more depending on the complexity of the Pod's scheduling constraints. Therefore, at scale, those wasted cycles significantly impact the scheduler's performance. See the arrows in the "scheduler" box below.
Scheduling gates helps address this problem. It allows declaring that newly created Pods are not ready for scheduling. When scheduling gates are present on a Pod, the scheduler ignores the Pod and therefore saves unnecessary scheduling attempts. Those Pods will also be ignored by Cluster Autoscaler if you have it installed in the cluster.
Clearing the gates is the responsibility of external controllers with knowledge of when the Pod should be considered for scheduling (e.g., a quota manager).
How does it work?
Scheduling gates in general works very similar to Finalizers. Pods with a non-empty
spec.schedulingGates field will show as status
SchedulingGated and be blocked from
scheduling. Note that more than one gate can be added, but they all should be added upon Pod
creation (e.g., you can add them as part of the spec or via a mutating webhook).
NAME READY STATUS RESTARTS AGE test-pod 0/1 SchedulingGated 0 10s
To clear the gates, you update the Pod by removing all of the items from the Pod's
field. The gates do not need to be removed all at once, but only when all the gates are removed the
scheduler will start to consider the Pod for scheduling.
Under the hood, scheduling gates are implemented as a PreEnqueue scheduler plugin, a new scheduler framework extension point that is invoked at the beginning of each scheduling cycle.
An important use case this feature enables is dynamic quota management. Kubernetes supports ResourceQuota, however the API Server enforces quota at the time you attempt Pod creation. For example, if a new Pod exceeds the CPU quota, it gets rejected. The API Server doesn't queue the Pod; therefore, whoever created the Pod needs to continuously attempt to recreate it again. This either means a delay between resources becoming available and the Pod actually running, or it means load on the API server and Scheduler due to constant attempts.
Scheduling gates allows an external quota manager to address the above limitation of ResourceQuota.
Specifically, the manager could add a
example.com/quota-check scheduling gate to all Pods created in the
cluster (using a mutating webhook). The manager would then remove the gate when there is quota to
start the Pod.
To use this feature, the
PodSchedulingReadiness feature gate must be enabled in the API Server
and scheduler. You're more than welcome to test it out and tell us (SIG Scheduling) what you think!