Workload API
FEATURE STATE:
Kubernetes v1.35 alpha (disabled by default)
The Workload API resource defines the scheduling requirements and structure of a multi-Pod
application. While workload controllers such as Job
manage the application's runtime state, the Workload specifies how groups of Pods
should be scheduled. The Job controller is the only built-in controller that creates
PodGroup objects from the Workload's
PodGroupTemplates at runtime.
What is a Workload?
The Workload API resource is part of the scheduling.k8s.io/v1alpha2
API group
and your cluster must have that API group enabled, as well as the GenericWorkload
feature gate,
before you can use this API.
A Workload is a static, long-lived policy template. It defines what scheduling
policies should be applied to groups of Pods, but does not track runtime state itself.
Runtime scheduling state is maintained by PodGroup
objects, which controllers create from the Workload's PodGroupTemplates.
API structure
A Workload consists of two fields: a list of PodGroupTemplates and an optional controller
reference. The entire Workload spec is immutable after creation: you cannot modify
existing templates, add new templates, or remove templates from podGroupTemplates.
PodGroupTemplates
The spec.podGroupTemplates list defines the distinct components of your workload.
For example, a machine learning job might have a driver template and a worker template.
Each entry in podGroupTemplates must have:
- A unique name that will be used to reference the template in the PodGroup's spec.podGroupTemplateRef.
- A scheduling policy (basic or gang).
If the WorkloadAwarePreemption feature gate is enabled, each entry in podGroupTemplates can also specify a priority and a disruption mode.
The maximum number of PodGroupTemplates in a single Workload is 8.
apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  controllerRef:
    apiGroup: batch
    kind: Job
    name: training-job
  podGroupTemplates:
  - name: workers
    schedulingPolicy:
      gang:
        # The gang is schedulable only if 4 pods can run at once
        minCount: 4
    priorityClassName: high-priority # Only applicable with WorkloadAwarePreemption feature gate
    disruptionMode: PodGroup # Only applicable with WorkloadAwarePreemption feature gate
When a workload controller creates a PodGroup from one of these templates, it copies the
schedulingPolicy into the PodGroup's own spec. Changes to the Workload only affect
newly created PodGroups, not existing ones.
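For illustration, a PodGroup created from the workers template above might look like the following sketch. The exact shape of spec.podGroupTemplateRef and the generated object name shown here are assumptions for this alpha API, not a definitive layout:

```yaml
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  # Hypothetical name chosen by the workload controller
  name: training-job-workers
  namespace: some-ns
spec:
  # Assumed field shape: a reference back to the template it was created from
  podGroupTemplateRef:
    workloadName: training-job-workload
    name: workers
  # The schedulingPolicy is copied verbatim from the template
  schedulingPolicy:
    gang:
      minCount: 4
```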
Referencing the workload's controlling object
The controllerRef field links the Workload back to the specific high-level object defining the application,
such as a Job or a custom CRD. This is useful for observability and tooling.
This data is not used to schedule or manage the Workload.
Gang scheduling with Jobs
FEATURE STATE:
Kubernetes v1.36 alpha (disabled by default)
When the
EnableWorkloadWithJob
feature gate is enabled, the
Job controller automatically
creates Workload and PodGroup objects for parallel indexed Jobs where
.spec.parallelism equals .spec.completions. The gang policy's minCount
is set to the Job's parallelism, so all Pods must be schedulable together
before any of them are bound to nodes.
This is the built-in path for using gang scheduling with Jobs.
You do not need to create Workload or PodGroup objects yourself as the Job
controller handles it automatically. Other workload controllers (such as
JobSet) may manage their own Workload and PodGroup objects independently.
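For example, an Indexed Job like the following qualifies for automatic Workload creation, because its .spec.parallelism equals its .spec.completions (the image name is a placeholder):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
spec:
  completionMode: Indexed
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example/trainer:latest # placeholder image
```

With the EnableWorkloadWithJob feature gate on, the Job controller would create a Workload whose gang policy has minCount set to 4, matching the Job's parallelism.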
Pod Group Disruption and Priority
FEATURE STATE:
Kubernetes v1.36 alpha (disabled by default)
A PodGroup can declare a disruption mode. This mode dictates how
the scheduler may disrupt a running PodGroup, for example to accommodate
a higher-priority PodGroup. A PodGroup also has a priority,
which overrides the priorities of the individual Pods in the group
during workload-aware preemption events.
Disruption mode types
Note:
As of v1.36, the priority and disruptionMode fields of the PodGroup are
only respected by workload-aware preemption.
During the pod scheduling phase, the scheduler does not take the
priority or disruptionMode fields of the PodGroup into account.
The API supports two disruption modes: Pod and PodGroup.
The default one is Pod.
Pod
The Pod mode instructs the scheduler to treat all Pods in the group as separate entities,
allowing independent disruption of a single pod from a PodGroup.
PodGroup
The PodGroup mode emphasizes "all-or-nothing" semantics for disruption.
It instructs the scheduler that all pods from the PodGroup have to be disrupted together.
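As a sketch (field placement may change while the API is alpha), a PodGroup requesting all-or-nothing disruption could look like:

```yaml
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-workers
spec:
  # All Pods in this group must be disrupted together
  disruptionMode: PodGroup
  schedulingPolicy:
    gang:
      minCount: 4
```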
Pod group priority
PodGroup uses the same concept of PriorityClass as single Pods.
Once you have created one or more PriorityClasses,
you can create a PodGroup that specifies one of those PriorityClass names in its specification.
The priority admission controller uses the priorityClassName field and populates the integer value of the priority.
If the priority class is not found, the PodGroup is rejected.
When priorityClassName is not set for a PodGroup, Kubernetes looks for a default (a PriorityClass with globalDefault set to true).
If there is no PriorityClass with globalDefault set to true, a PodGroup with no priorityClassName has priority zero.
The priority of the PodGroup is the authoritative priority for all Pods in the group during workload-aware preemption events, even when the priorities of the individual Pods in the group differ.
The following YAML is an example of a PodGroup configuration that uses the high-priority PriorityClass,
which maps to the integer priority value of 1000000.
The priority admission controller checks the specification and resolves the priority of the PodGroup to 1000000.
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  namespace: ns-1
  name: job-1
spec:
  priorityClassName: high-priority
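The referenced PriorityClass could be defined like this; the value 1000000 matches the resolution described above:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Priority for critical PodGroups."
```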
PodGroup Scheduling Policies
FEATURE STATE:
Kubernetes v1.35 alpha (disabled by default)
Every PodGroup must declare a scheduling policy
in its spec.schedulingPolicy field. This policy dictates how the scheduler treats the
collection of Pods in the group.
Policy types
The schedulingPolicy field supports two policy types: basic and gang.
You must specify exactly one.
Basic policy
The basic policy instructs the scheduler to evaluate all Pods on a best-effort basis.
Unlike the gang policy, a PodGroup using the basic policy is considered feasible
regardless of how many of its Pods are currently schedulable.
The primary reason to use the basic policy is to organize Pods into a group for better
observability and management, while still evaluating them together within a single, atomic
PodGroup scheduling cycle.
This policy is suited for groups that do not require simultaneous startup but logically
belong together, or for applying group-level constraints that do not imply
"all-or-nothing" placement.
schedulingPolicy:
  basic: {}
Gang policy
The gang policy enforces "all-or-nothing" scheduling. This is essential for tightly-coupled
workloads where partial startup results in deadlocks or wasted resources.
This can be used for Jobs
or any other batch process where all workers must run concurrently to make progress.
The gang policy requires a minCount field, which is the minimum number of Pods that must be
schedulable simultaneously for the group to be feasible:
schedulingPolicy:
  gang:
    # The number of Pods that must be schedulable simultaneously
    # for the group to be admitted.
    minCount: 4
Setting policies via PodGroupTemplates
When using the Workload API, you define scheduling
policies inside PodGroupTemplates. The workload controller copies the policy from the
template into each PodGroup it creates, making the PodGroup self-contained. Changes to the
Workload's templates only affect newly created PodGroups, not existing ones.
For standalone PodGroups (created without a Workload), you set spec.schedulingPolicy
directly on the PodGroup itself.
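For example, a standalone PodGroup with the basic policy set directly on its spec might look like:

```yaml
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: standalone-group
spec:
  # Set directly, since no Workload template provides the policy
  schedulingPolicy:
    basic: {}
```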
Topology-Aware Workload Scheduling
FEATURE STATE:
Kubernetes v1.36 alpha (disabled by default)
Topology-Aware Scheduling (TAS) is a feature of the Workload API that optimizes the placement of
pods within the cluster.
TAS ensures that all pods within a PodGroup are co-located into a specific topology domain,
such as a single server rack or zone. This minimizes inter-pod communication latency and prevents
workload fragmentation across the cluster infrastructure.
Topology-aware scheduling with gang scheduling policy
When applied to PodGroups with gang scheduling policy, TAS simulates the potential assignment
(placement) of the full group of pods at once. It guarantees that at least the specified
minCount pods can fit together into the same topology domain before committing resources.
If no feasible placement is found, the entire PodGroup becomes unschedulable.
This is the recommended approach for workloads like distributed AI and ML training that strictly
require proximity to minimize inter-pod communication latency.
If new pods are added to a PodGroup where some pods are already scheduled (for example, if pods
are recreated), the scheduler forces all new incoming pods to land on the exact same topology
domain where the existing pods currently reside. If that specific domain lacks sufficient capacity
for the new pods, they remain pending, even if fewer than minCount pods are scheduled as a result.
Note:
As of v1.36, Topology-Aware Scheduling does not trigger workload or pod preemption. If no
feasible placement can be found without triggering preemption, the PodGroup becomes unschedulable.
Topology-aware scheduling with basic scheduling policy
Using TAS with basic scheduling policy may exhibit inconsistent behavior. The scheduler may only
observe a subset of pods when entering the PodGroup scheduling cycle - therefore placement
feasibility is only evaluated for the observed pods, rather than the entire PodGroup. To partially
mitigate this limitation, you can use scheduling gates to hold off PodGroup scheduling until all
pods within the PodGroup are in the scheduling queue.
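One way to hold Pods back is the standard Pod schedulingGates field. The gate name below is a hypothetical example; some controller of yours would need to remove the gate once all Pods of the group are present in the scheduling queue:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
spec:
  # The Pod stays in the scheduling queue until this gate is removed
  schedulingGates:
  - name: example.com/wait-for-pod-group
  containers:
  - name: worker
    image: registry.example/worker:latest # placeholder image
```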
If no feasible placement is found for the entire PodGroup, only a subset of pods may be scheduled,
and they are guaranteed to meet the scheduling constraints.
If new pods are added to a PodGroup where some pods are already scheduled, the scheduler acts
the same as with the gang policy: it forces the new pods into the same domain, unless there is
insufficient capacity (in which case the new pods remain pending).
API configuration: scheduling constraints
Every PodGroup (or PodGroupTemplate) may optionally declare the schedulingConstraints field,
which is interpreted by the placement-based PodGroup scheduling algorithm.
If constraints are defined in PodGroupTemplate, they will be copied to referencing PodGroups.
As of Kubernetes v1.36, the API supports topology constraints.
Note:
As of Kubernetes v1.36, you can specify only a single topology constraint in each PodGroup.
Topology constraint
To define a topology constraint for a PodGroup you need to set a key, which corresponds to
a Kubernetes node label, representing the target topology domain (for example, a rack or a zone).
The scheduler strictly enforces that all pods within the PodGroup are placed onto nodes that share
the exact same value for this specified label.
Here is an example of a PodGroup configured with a topology constraint:
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: example-podgroup
spec:
  schedulingPolicy:
    gang:
      minCount: 4
  schedulingConstraints:
    topology:
    - key: topology.example.com/rack