Good practices for Dynamic Resource Allocation as a Cluster Admin
This page describes good practices when configuring a Kubernetes cluster utilizing Dynamic Resource Allocation (DRA). These instructions are for cluster administrators.
Separate permissions to DRA related APIs
DRA is orchestrated through a number of different APIs. Use authorization tools (like RBAC, or another solution) to control access to the right APIs depending on the persona of your user.
In general, DeviceClasses and ResourceSlices should be restricted to admins and the DRA drivers. Cluster operators that will be deploying Pods with claims will need access to ResourceClaim and ResourceClaimTemplate APIs; both of these APIs are namespace scoped.
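For example, with RBAC you might keep the cluster-scoped DeviceClass and ResourceSlice APIs admin-only, while granting workload operators access to the namespaced claim APIs. A minimal sketch, with an illustrative role name:

```yaml
# ClusterRole for the persona that deploys Pods with claims; bind it with a
# RoleBinding in the namespaces where those workloads run. Access to the
# cluster-scoped DeviceClass and ResourceSlice APIs is deliberately omitted
# and stays with cluster admins and the DRA drivers.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dra-claim-user   # illustrative name
rules:
- apiGroups: ["resource.k8s.io"]
  resources: ["resourceclaims", "resourceclaimtemplates"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```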
DRA driver deployment and maintenance
DRA drivers are third-party applications that run on each node of your cluster to interface with the hardware of that node and Kubernetes' native DRA components. The installation procedure depends on the driver you choose, but the driver is likely deployed as a DaemonSet to all nodes or a selection of nodes (using node selectors or similar mechanisms) in your cluster.
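As a sketch of what such a deployment can look like (the driver name, image, label, and mounts below are placeholders, not a real driver), a DaemonSet restricted to accelerator nodes with a node selector might be:

```yaml
# Illustrative DaemonSet for a hypothetical DRA driver; consult your driver's
# documentation for the real image, socket paths, and required mounts.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-dra-driver
  namespace: dra-driver
spec:
  selector:
    matchLabels:
      app: example-dra-driver
  template:
    metadata:
      labels:
        app: example-dra-driver
    spec:
      nodeSelector:
        example.com/has-accelerator: "true"   # limit to nodes with the hardware
      containers:
      - name: driver
        image: registry.example.com/example-dra-driver:v0.1.0
        volumeMounts:
        - name: plugins
          mountPath: /var/lib/kubelet/plugins
      volumes:
      - name: plugins
        hostPath:
          path: /var/lib/kubelet/plugins
          type: Directory
```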
Use drivers with seamless upgrade if available
DRA drivers implement the kubeletplugin package interface. Your driver may support seamless upgrades by implementing a property of this interface that allows two versions of the same DRA driver to coexist for a short time. This is only available for kubelet versions 1.33 and above and may not be supported by your driver for heterogeneous clusters with attached nodes running older versions of Kubernetes; check your driver's documentation to be sure.
If seamless upgrades are available for your situation, consider using them to minimize scheduling delays when your driver updates.
If you cannot use seamless upgrades, you may observe the following during driver downtime for upgrades:
- Pods cannot start unless the claims they depend on were already prepared for use.
- Cleanup after the last Pod that used a claim is delayed until the driver is available again. The Pod is not marked as terminated, which prevents the resources it used from being reused by other Pods.
- Running pods will continue to run.
Confirm your DRA driver exposes a liveness probe and utilize it
Your DRA driver likely implements a gRPC socket for health checks as part of DRA driver good practices. The easiest way to use this gRPC socket is to configure it as a liveness probe for the DaemonSet that deploys your DRA driver. Your driver's documentation or deployment tooling may already include this, but if you are building your configuration separately or not running your DRA driver as a Kubernetes Pod, be sure that your orchestration tooling restarts the DRA driver on failed health checks against this gRPC socket. Doing so minimizes accidental downtime of the DRA driver and gives it more opportunities to self-heal, reducing scheduling delays and troubleshooting time.
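For example, if your driver serves its health check over gRPC on a TCP port (the port below is an assumption, not something every driver exposes), the driver container in the DaemonSet could declare a native gRPC liveness probe:

```yaml
# Container excerpt for the DRA driver DaemonSet. The health port is an
# assumption; drivers that only expose a Unix domain socket need an exec-based
# probe instead (for example, a bundled health-check binary).
containers:
- name: driver
  image: registry.example.com/example-dra-driver:v0.1.0
  ports:
  - name: healthz
    containerPort: 51515
  livenessProbe:
    grpc:
      port: 51515
    initialDelaySeconds: 10
    periodSeconds: 10
    failureThreshold: 3
```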
When draining a node, drain the DRA driver as late as possible
The DRA driver is responsible for unpreparing any devices that were allocated to Pods, and if the DRA driver is drained before Pods with claims have been deleted, it will not be able to finalize its cleanup. If you implement custom drain logic for nodes, consider checking that there are no allocated or reserved ResourceClaims or ResourceClaimTemplates before terminating the DRA driver itself, as sketched below.
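A minimal sketch of such a check, run before deleting the driver Pod from a node that is being drained (the node name is a placeholder), is to confirm that no Pods on that node still reference resource claims:

```bash
# Before removing the DRA driver from a node being drained, confirm no
# remaining Pods on that node still reference resource claims.
NODE=worker-1   # placeholder for the node being drained
kubectl get pods --all-namespaces \
  --field-selector spec.nodeName="${NODE}" -o json \
  | jq -r '.items[] | select(.spec.resourceClaims != null) | "\(.metadata.namespace)/\(.metadata.name)"'
# An empty result means the DRA driver Pod can be terminated safely.
```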
Monitor and tune components for higher load, especially in high scale environments
The kube-scheduler control plane component and the internal ResourceClaim controller in the kube-controller-manager component do the heavy lifting when scheduling Pods with claims, based on metadata stored in the DRA APIs. Compared to Pods scheduled without DRA, the number of API server calls and the memory and CPU utilization needed by these components increase for Pods using DRA claims. In addition, node-local components like the DRA driver and the kubelet use the DRA APIs to fulfill the hardware request at Pod sandbox creation time. Especially in high scale environments, where clusters have many nodes or deploy many workloads that heavily use DRA resource claims, the cluster administrator should configure the relevant components to anticipate the increased load.
Mistuned components can have direct or snowballing effects, causing different symptoms during the Pod lifecycle. If the kube-scheduler component's QPS and burst configurations are too low, the scheduler might quickly identify a suitable node for a Pod but take longer to bind the Pod to that node. With DRA, during Pod scheduling, the QPS and Burst parameters in the client-go configuration within kube-controller-manager are critical.
The specific values to tune your cluster to depend on a variety of factors like the number of nodes and Pods, the rate of Pod creation, and churn, even in non-DRA environments; see the SIG-Scalability README on Kubernetes scalability thresholds for more information. In scale tests performed against a DRA-enabled cluster with 100 nodes, involving 720 long-lived pods (90% saturation) and 80 churn pods (10% churn, 10 times), with a job creation QPS of 10, kube-controller-manager QPS could be set as low as 75 and Burst to 150 to meet equivalent metric targets for non-DRA deployments. At this lower bound, the client-side rate limiter was triggered often enough to protect the API server from explosive bursts, but was still high enough that Pod startup SLOs were not impacted.
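As a starting point matching the scale-test values above, the kube-controller-manager client QPS and burst can be raised via its --kube-api-qps and --kube-api-burst flags. A minimal sketch for a kubeadm-style static Pod manifest (image version and values are illustrative; adjust them for your own cluster size and churn):

```yaml
# Excerpt of a kubeadm-managed static Pod manifest for kube-controller-manager.
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    image: registry.k8s.io/kube-controller-manager:v1.33.0   # example version
    command:
    - kube-controller-manager
    - --kube-api-qps=75     # client-go QPS used for requests to the API server
    - --kube-api-burst=150  # short-term burst allowance above the QPS limit
    # ... other flags unchanged ...
```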
While this is a good starting point, you can get a better idea of how to tune
the different components that have the biggest effect on DRA performance for
your deployment by monitoring the following metrics.
kube-controller-manager metrics
The following metrics look closely at the internal ResourceClaim controller managed by the kube-controller-manager component.
- Workqueue Add Rate: Monitor sum(rate(workqueue_adds_total{name="resource_claim"}[5m])) to gauge how quickly items are added to the ResourceClaim controller.
- Workqueue Depth: Track sum(workqueue_depth{endpoint="kube-controller-manager", name="resource_claim"}) to identify any backlogs in the ResourceClaim controller.
- Workqueue Work Duration: Observe histogram_quantile(0.99, sum(rate(workqueue_work_duration_seconds_bucket{name="resource_claim"}[5m])) by (le)) to understand the speed at which the ResourceClaim controller processes work.
If you are experiencing low Workqueue Add Rate, high Workqueue Depth, and/or high Workqueue Work Duration, this suggests the controller isn't performing optimally. Consider tuning parameters like QPS, burst, and CPU/memory configurations.
If you are experiencing a high Workqueue Add Rate and high Workqueue Depth but reasonable Workqueue Work Duration, this indicates the controller is processing work, but concurrency might be insufficient. Concurrency is hardcoded in the controller, so as a cluster administrator you can compensate by reducing the Pod creation QPS, making the add rate to the ResourceClaim workqueue more manageable.
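If you run the Prometheus Operator, the workqueue queries above can also feed an alert so that a sustained backlog in the ResourceClaim controller is surfaced automatically. A sketch, assuming the prometheus-operator PrometheusRule CRD is installed; the threshold and duration are illustrative:

```yaml
# Illustrative PrometheusRule; tune the threshold and "for" duration to what is
# normal for your cluster before relying on this alert.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dra-resourceclaim-controller
  namespace: monitoring
spec:
  groups:
  - name: dra-controller
    rules:
    - alert: ResourceClaimWorkqueueBacklog
      expr: sum(workqueue_depth{endpoint="kube-controller-manager", name="resource_claim"}) > 100
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: ResourceClaim controller workqueue is backing up
```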
kube-scheduler metrics
The following scheduler metrics are high level metrics aggregating performance across all scheduled Pods, not just those using DRA. Note that the end-to-end metrics are ultimately influenced by the kube-controller-manager's performance in creating ResourceClaims from ResourceClaimTemplates in deployments that heavily use ResourceClaimTemplates.
- Scheduler End-to-End Duration: Monitor histogram_quantile(0.99, sum(increase(scheduler_pod_scheduling_sli_duration_seconds_bucket[5m])) by (le)).
- Scheduler Algorithm Latency: Track histogram_quantile(0.99, sum(increase(scheduler_scheduling_algorithm_duration_seconds_bucket[5m])) by (le)).
kubelet metrics
When a Pod bound to a node must have a ResourceClaim satisfied, the kubelet calls the NodePrepareResources and NodeUnprepareResources methods of the DRA driver. You can observe this behavior from the kubelet's point of view with the following metrics.
- Kubelet NodePrepareResources: Monitor histogram_quantile(0.99, sum(rate(dra_operations_duration_seconds_bucket{operation_name="PrepareResources"}[5m])) by (le)).
- Kubelet NodeUnprepareResources: Track histogram_quantile(0.99, sum(rate(dra_operations_duration_seconds_bucket{operation_name="UnprepareResources"}[5m])) by (le)).
DRA kubeletplugin operations
DRA drivers implement the kubeletplugin package interface, which surfaces its own metrics for the underlying gRPC operations NodePrepareResources and NodeUnprepareResources. You can observe this behavior from the point of view of the internal kubeletplugin with the following metrics.
- DRA kubeletplugin gRPC NodePrepareResources operation: Observe histogram_quantile(0.99, sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodePrepareResources"}[5m])) by (le)).
- DRA kubeletplugin gRPC NodeUnprepareResources operation: Observe histogram_quantile(0.99, sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodeUnprepareResources"}[5m])) by (le)).