1 - Limit Ranges
By default, containers run with unbounded
compute resources on a Kubernetes cluster.
Using Kubernetes resource quotas,
administrators (also termed cluster operators) can restrict consumption and creation
of cluster resources (such as CPU time, memory, and persistent storage) within a specified
namespace.
Within a namespace, a Pod can consume as much CPU and memory as is allowed by the ResourceQuotas that apply to that namespace.
As a cluster operator, or as a namespace-level administrator, you might also be concerned
about making sure that a single object cannot monopolize all available resources within a namespace.
A LimitRange is a policy to constrain the resource allocations (limits and requests) that you can specify for
each applicable object kind (such as Pod or PersistentVolumeClaim) in a namespace.
A LimitRange provides constraints that can:
- Enforce minimum and maximum compute resource usage per Pod or Container in a namespace.
- Enforce minimum and maximum storage request per
PersistentVolumeClaim in a namespace.
- Enforce a ratio between request and limit for a resource in a namespace.
- Set default request/limit for compute resources in a namespace and automatically
inject them into Containers at runtime.
Kubernetes constrains resource allocations to Pods in a particular namespace
whenever there is at least one LimitRange object in that namespace.
The name of a LimitRange object must be a valid
DNS subdomain name.
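For illustration, here is a minimal sketch of a LimitRange that constrains the storage a single PersistentVolumeClaim may request; the object name and the 1Gi/2Gi bounds are example values chosen for this sketch, not defaults.
apiVersion: v1
kind: LimitRange
metadata:
  name: storage-limit-range  # illustrative name
spec:
  limits:
  - type: PersistentVolumeClaim
    min:
      storage: 1Gi  # smallest storage request a PVC in this namespace may make
    max:
      storage: 2Gi  # largest storage request a PVC in this namespace may make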
Constraints on resource limits and requests
- The administrator creates a LimitRange in a namespace.
- Users create (or try to create) objects in that namespace, such as Pods or
PersistentVolumeClaims.
- First, the LimitRange admission controller applies default request and limit values
for all Pods (and their containers) that do not set compute resource requirements.
- Second, the LimitRange tracks usage to ensure it does not exceed resource minimum,
maximum and ratio defined in any LimitRange present in the namespace.
- If you attempt to create or update an object (Pod or PersistentVolumeClaim) that violates
a LimitRange constraint, your request to the API server will fail with an HTTP status code
403 Forbidden and a message explaining the constraint that has been violated.
- If you add a LimitRange in a namespace that applies to compute-related resources
such as cpu and memory, you must specify requests or limits for those values.
Otherwise, the system may reject Pod creation.
- LimitRange validations occur only at the Pod admission stage, not on running Pods.
If you add or modify a LimitRange, the Pods that already exist in that namespace
continue unchanged.
- If two or more LimitRange objects exist in the namespace, it is not deterministic
which default value will be applied.
LimitRange and admission checks for Pods
A LimitRange does not check the consistency of the default values it applies.
This means that a default value for the limit that is set by LimitRange may be
less than the request value specified for the container in the spec that a client
submits to the API server. If that happens, the final Pod will not be schedulable.
For example, you define a LimitRange with the manifest below:
Note:
The following examples operate within the default namespace of your cluster, as the namespace
parameter is undefined and the LimitRange scope is limited to the namespace level.
This implies that any references or operations within these examples will interact
with elements within the default namespace of your cluster. You can override the
operating namespace by configuring namespace in the metadata.namespace
field.
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-resource-constraint
spec:
  limits:
  - default: # this section defines default limits
      cpu: 500m
    defaultRequest: # this section defines default requests
      cpu: 500m
    max: # max and min define the limit range
      cpu: "1"
    min:
      cpu: 100m
    type: Container
along with a Pod that declares a CPU resource request of 700m, but not a limit:
apiVersion: v1
kind: Pod
metadata:
  name: example-conflict-with-limitrange-cpu
spec:
  containers:
  - name: demo
    image: registry.k8s.io/pause:3.8
    resources:
      requests:
        cpu: 700m
then that Pod will not be scheduled, failing with an error similar to:
Pod "example-conflict-with-limitrange-cpu" is invalid: spec.containers[0].resources.requests: Invalid value: "700m": must be less than or equal to cpu limit
If you set both request and limit, then that new Pod will be scheduled successfully
even with the same LimitRange in place:
apiVersion: v1
kind: Pod
metadata:
  name: example-no-conflict-with-limitrange-cpu
spec:
  containers:
  - name: demo
    image: registry.k8s.io/pause:3.8
    resources:
      requests:
        cpu: 700m
      limits:
        cpu: 700m
Example resource constraints
Examples of policies that could be created using LimitRange are:
- In a 2 node cluster with a capacity of 8 GiB RAM and 16 cores, constrain Pods in a
namespace to request 100m of CPU with a max limit of 500m for CPU and request 200Mi
for Memory with a max limit of 600Mi for Memory.
- Define default CPU limit and request to 150m and memory default request to 300Mi for
Containers started with no cpu and memory requests in their specs.
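As a sketch, both policies above could be expressed in a single LimitRange similar to the following; the object name is illustrative and the values simply mirror the bullets.
apiVersion: v1
kind: LimitRange
metadata:
  name: example-resource-constraints  # illustrative name
spec:
  limits:
  - type: Pod  # per-Pod bounds from the first bullet
    min:
      cpu: 100m
      memory: 200Mi
    max:
      cpu: 500m
      memory: 600Mi
  - type: Container  # defaults from the second bullet
    default:
      cpu: 150m  # default CPU limit
    defaultRequest:
      cpu: 150m  # default CPU request
      memory: 300Mi  # default memory request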
In the case where the total limits of the namespace are less than the sum of the limits
of the Pods/Containers, there may be contention for resources. In this case, the
Containers or Pods will not be created.
Neither contention nor changes to a LimitRange will affect already created resources.
What's next
For examples on using limits, see:
Refer to the LimitRanger design document
for context and historical information.
2 - Resource Quotas
When several users or teams share a cluster with a fixed number of nodes,
there is a concern that one team could use more than its fair share of resources.
Resource quotas are a tool for administrators to address this concern.
A resource quota, defined by a ResourceQuota
object, provides constraints that limit
aggregate resource consumption per namespace. It can limit the quantity of objects that can
be created in a namespace by type, as well as the total amount of compute resources that may
be consumed by resources in that namespace.
Resource quotas work like this:
Different teams work in different namespaces. This can be enforced with
RBAC.
The administrator creates one ResourceQuota for each namespace.
Users create resources (pods, services, etc.) in the namespace, and the quota system
tracks usage to ensure it does not exceed hard resource limits defined in a ResourceQuota.
If creating or updating a resource violates a quota constraint, the request will fail with HTTP
status code 403 FORBIDDEN
with a message explaining the constraint that would have been violated.
If quota is enabled in a namespace for compute resources like cpu and memory, users must specify
requests or limits for those values; otherwise, the quota system may reject pod creation. Hint: Use
the LimitRanger
admission controller to force defaults for pods that make no compute resource requirements.
See the walkthrough
for an example of how to avoid this problem.
Note:
- For cpu and memory resources, ResourceQuotas enforce that every
(new) pod in that namespace sets a limit for that resource.
If you enforce a resource quota in a namespace for either cpu or memory,
you, and other clients, must specify either requests or limits for that resource
for every new Pod you submit. If you don't, the control plane may reject admission
for that Pod.
- For other resources: ResourceQuota works and will ignore pods in the namespace without
setting a limit or request for that resource. It means that you can create a new pod
without limit/request for ephemeral storage if the resource quota limits the ephemeral
storage of this namespace.
You can use a LimitRange to automatically set
a default request for these resources.
The name of a ResourceQuota object must be a valid
DNS subdomain name.
Examples of policies that could be created using namespaces and quotas are:
- In a cluster with a capacity of 32 GiB RAM, and 16 cores, let team A use 20 GiB and 10 cores,
let B use 10GiB and 4 cores, and hold 2GiB and 2 cores in reserve for future allocation.
- Limit the "testing" namespace to using 1 core and 1GiB RAM. Let the "production" namespace
use any amount.
In the case where the total capacity of the cluster is less than the sum of the quotas of the namespaces,
there may be contention for resources. This is handled on a first-come-first-served basis.
Neither contention nor changes to quota will affect already created resources.
Enabling Resource Quota
ResourceQuota support is enabled by default for many Kubernetes distributions. It is
enabled when the API server --enable-admission-plugins= flag has ResourceQuota as
one of its arguments.
A resource quota is enforced in a particular namespace when there is a
ResourceQuota in that namespace.
Compute Resource Quota
You can limit the total sum of
compute resources
that can be requested in a given namespace.
The following resource types are supported:
Resource Name | Description |
---|---|
limits.cpu | Across all pods in a non-terminal state, the sum of CPU limits cannot exceed this value. |
limits.memory | Across all pods in a non-terminal state, the sum of memory limits cannot exceed this value. |
requests.cpu | Across all pods in a non-terminal state, the sum of CPU requests cannot exceed this value. |
requests.memory | Across all pods in a non-terminal state, the sum of memory requests cannot exceed this value. |
hugepages-<size> | Across all pods in a non-terminal state, the number of huge page requests of the specified size cannot exceed this value. |
cpu | Same as requests.cpu |
memory | Same as requests.memory |
Resource Quota For Extended Resources
In addition to the resources mentioned above, quota support for
extended resources was added in release 1.10.
As overcommit is not allowed for extended resources, it makes no sense to specify both requests
and limits for the same extended resource in a quota. So for extended resources, only quota items
with the prefix requests. are allowed.
Take the GPU resource as an example: if the resource name is nvidia.com/gpu, and you want to
limit the total number of GPUs requested in a namespace to 4, you can define a quota as follows:
requests.nvidia.com/gpu: 4
See Viewing and Setting Quotas for more details.
Storage Resource Quota
You can limit the total sum of storage resources
that can be requested in a given namespace.
In addition, you can limit consumption of storage resources based on associated storage-class.
Resource Name | Description |
---|---|
requests.storage | Across all persistent volume claims, the sum of storage requests cannot exceed this value. |
persistentvolumeclaims | The total number of PersistentVolumeClaims that can exist in the namespace. |
<storage-class-name>.storageclass.storage.k8s.io/requests.storage | Across all persistent volume claims associated with the <storage-class-name> , the sum of storage requests cannot exceed this value. |
<storage-class-name>.storageclass.storage.k8s.io/persistentvolumeclaims | Across all persistent volume claims associated with the <storage-class-name> , the total number of persistent volume claims that can exist in the namespace. |
For example, if you want to quota storage requested with the gold StorageClass separately from
storage requested with the bronze StorageClass, you can define a quota as follows:
gold.storageclass.storage.k8s.io/requests.storage: 500Gi
bronze.storageclass.storage.k8s.io/requests.storage: 100Gi
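These entries can be carried in an ordinary ResourceQuota object; a minimal sketch (the object name is illustrative):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-class-quota  # illustrative name
spec:
  hard:
    gold.storageclass.storage.k8s.io/requests.storage: 500Gi
    bronze.storageclass.storage.k8s.io/requests.storage: 100Gi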
In release 1.8, quota support for local ephemeral storage was added as an alpha feature:
Resource Name | Description |
---|---|
requests.ephemeral-storage | Across all pods in the namespace, the sum of local ephemeral storage requests cannot exceed this value. |
limits.ephemeral-storage | Across all pods in the namespace, the sum of local ephemeral storage limits cannot exceed this value. |
ephemeral-storage | Same as requests.ephemeral-storage . |
Note:
When using a CRI container runtime, container logs will count against the ephemeral storage quota.
This can result in the unexpected eviction of pods that have exhausted their storage quotas.
Refer to
Logging Architecture for details.
Object Count Quota
You can set quota for the total number of one particular resource kind in the Kubernetes API,
using the following syntax:
- count/<resource>.<group> for resources from non-core groups
- count/<resource> for resources from the core group
Here is an example set of resources users may want to put under object count quota:
count/persistentvolumeclaims
count/services
count/secrets
count/configmaps
count/replicationcontrollers
count/deployments.apps
count/replicasets.apps
count/statefulsets.apps
count/jobs.batch
count/cronjobs.batch
If you define a quota this way, it applies to Kubernetes' APIs that are part of the API server, and
to any custom resources backed by a CustomResourceDefinition. If you use
API aggregation to
add additional, custom APIs that are not defined as CustomResourceDefinitions, the core Kubernetes
control plane does not enforce quota for the aggregated API. The extension API server is expected to
provide quota enforcement if that's appropriate for the custom API.
For example, to create a quota on a widgets custom resource in the example.com
API group, use count/widgets.example.com.
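As a sketch, such a quota could look like the following; the object name and the limit of 10 are arbitrary example values:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: widget-count-quota  # illustrative name
spec:
  hard:
    count/widgets.example.com: "10"  # at most 10 widgets objects may exist in the namespace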
When using such a resource quota (nearly for all object kinds), an object is charged
against the quota if the object kind exists (is defined) in the control plane.
These types of quotas are useful to protect against exhaustion of storage resources. For example, you may
want to limit the number of Secrets in a server given their large size. Too many Secrets in a cluster can
actually prevent servers and controllers from starting. You can set a quota for Jobs to protect against
a poorly configured CronJob. CronJobs that create too many Jobs in a namespace can lead to a denial of service.
There is also a simpler syntax that sets the same type of quota, but only for certain resources.
The following types are supported:
Resource Name | Description |
---|---|
configmaps | The total number of ConfigMaps that can exist in the namespace. |
persistentvolumeclaims | The total number of PersistentVolumeClaims that can exist in the namespace. |
pods | The total number of Pods in a non-terminal state that can exist in the namespace. A pod is in a terminal state if .status.phase in (Failed, Succeeded) is true. |
replicationcontrollers | The total number of ReplicationControllers that can exist in the namespace. |
resourcequotas | The total number of ResourceQuotas that can exist in the namespace. |
services | The total number of Services that can exist in the namespace. |
services.loadbalancers | The total number of Services of type LoadBalancer that can exist in the namespace. |
services.nodeports | The total number of NodePorts allocated to Services of type NodePort or LoadBalancer that can exist in the namespace. |
secrets | The total number of Secrets that can exist in the namespace. |
For example, pods
quota counts and enforces a maximum on the number of pods
created in a single namespace that are not terminal. You might want to set a pods
quota on a namespace to avoid the case where a user creates many small pods and
exhausts the cluster's supply of Pod IPs.
You can find more examples on Viewing and Setting Quotas.
Quota Scopes
Each quota can have an associated set of scopes. A quota will only measure usage for a resource if it matches
the intersection of enumerated scopes.
When a scope is added to the quota, it limits the number of resources it supports to those that pertain to the scope.
Resources specified on the quota outside of the allowed set result in a validation error.
Scope | Description |
---|---|
Terminating | Match pods where .spec.activeDeadlineSeconds >= 0 |
NotTerminating | Match pods where .spec.activeDeadlineSeconds is nil |
BestEffort | Match pods that have best effort quality of service. |
NotBestEffort | Match pods that do not have best effort quality of service. |
PriorityClass | Match pods that reference the specified priority class. |
CrossNamespacePodAffinity | Match pods that have cross-namespace pod (anti)affinity terms. |
The BestEffort scope restricts a quota to tracking the following resource:
pods
The Terminating, NotTerminating, NotBestEffort and PriorityClass
scopes restrict a quota to tracking the following resources:
pods
cpu
memory
requests.cpu
requests.memory
limits.cpu
limits.memory
Note that you cannot specify both the Terminating and the NotTerminating
scopes in the same quota, and you cannot specify both the BestEffort and
NotBestEffort scopes in the same quota either.
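For illustration, a quota that only counts BestEffort pods can declare the scope using the spec.scopes field; this is a sketch, and the object name and pod limit are example values:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: besteffort-pod-quota  # illustrative name
spec:
  hard:
    pods: "5"  # only BestEffort pods count against this limit
  scopes:
  - BestEffort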
The scopeSelector supports the following values in the operator field:
In
NotIn
Exists
DoesNotExist
When using one of the following values as the scopeName when defining the
scopeSelector, the operator must be Exists.
Terminating
NotTerminating
BestEffort
NotBestEffort
If the operator is In or NotIn, the values field must have at least
one value. For example:
scopeSelector:
  matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values:
        - middle
If the operator is Exists or DoesNotExist, the values field must NOT be specified.
Resource Quota Per PriorityClass
FEATURE STATE:
Kubernetes v1.17 [stable]
Pods can be created at a specific priority.
You can control a pod's consumption of system resources based on a pod's priority, by using the scopeSelector
field in the quota spec.
A quota is matched and consumed only if scopeSelector
in the quota spec selects the pod.
When quota is scoped for priority class using the scopeSelector field, the quota object
is restricted to track only the following resources:
pods
cpu
memory
ephemeral-storage
limits.cpu
limits.memory
limits.ephemeral-storage
requests.cpu
requests.memory
requests.ephemeral-storage
This example creates a quota object and matches it with pods at specific priorities. The example
works as follows:
- Pods in the cluster have one of the three priority classes, "low", "medium", "high".
- One quota object is created for each priority.
Save the following YAML to a file quota.yml.
apiVersion: v1
kind: List
items:
- apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: pods-high
  spec:
    hard:
      cpu: "1000"
      memory: 200Gi
      pods: "10"
    scopeSelector:
      matchExpressions:
        - operator : In
          scopeName: PriorityClass
          values: ["high"]
- apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: pods-medium
  spec:
    hard:
      cpu: "10"
      memory: 20Gi
      pods: "10"
    scopeSelector:
      matchExpressions:
        - operator : In
          scopeName: PriorityClass
          values: ["medium"]
- apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: pods-low
  spec:
    hard:
      cpu: "5"
      memory: 10Gi
      pods: "10"
    scopeSelector:
      matchExpressions:
        - operator : In
          scopeName: PriorityClass
          values: ["low"]
Apply the YAML using kubectl create.
kubectl create -f ./quota.yml
resourcequota/pods-high created
resourcequota/pods-medium created
resourcequota/pods-low created
Verify that Used quota is 0 using kubectl describe quota.
Name: pods-high
Namespace: default
Resource Used Hard
-------- ---- ----
cpu 0 1k
memory 0 200Gi
pods 0 10
Name: pods-low
Namespace: default
Resource Used Hard
-------- ---- ----
cpu 0 5
memory 0 10Gi
pods 0 10
Name: pods-medium
Namespace: default
Resource Used Hard
-------- ---- ----
cpu 0 10
memory 0 20Gi
pods 0 10
Create a pod with priority "high". Save the following YAML to a file high-priority-pod.yml.
apiVersion: v1
kind: Pod
metadata:
  name: high-priority
spec:
  containers:
  - name: high-priority
    image: ubuntu
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo hello; sleep 10;done"]
    resources:
      requests:
        memory: "10Gi"
        cpu: "500m"
      limits:
        memory: "10Gi"
        cpu: "500m"
  priorityClassName: high
Apply it with kubectl create.
kubectl create -f ./high-priority-pod.yml
Verify that "Used" stats for "high" priority quota, pods-high
, has changed and that
the other two quotas are unchanged.
Name: pods-high
Namespace: default
Resource Used Hard
-------- ---- ----
cpu 500m 1k
memory 10Gi 200Gi
pods 1 10
Name: pods-low
Namespace: default
Resource Used Hard
-------- ---- ----
cpu 0 5
memory 0 10Gi
pods 0 10
Name: pods-medium
Namespace: default
Resource Used Hard
-------- ---- ----
cpu 0 10
memory 0 20Gi
pods 0 10
Cross-namespace Pod Affinity Quota
FEATURE STATE:
Kubernetes v1.24 [stable]
Operators can use CrossNamespacePodAffinity
quota scope to limit which namespaces are allowed to
have pods with affinity terms that cross namespaces. Specifically, it controls which pods are allowed
to set namespaces
or namespaceSelector
fields in pod affinity terms.
Preventing users from using cross-namespace affinity terms might be desired since a pod
with anti-affinity constraints can block pods from all other namespaces
from getting scheduled in a failure domain.
Using this scope, operators can prevent certain namespaces (foo-ns in the example below)
from having pods that use cross-namespace pod affinity by creating a resource quota object in
that namespace with the CrossNamespacePodAffinity scope and a hard limit of 0:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: disable-cross-namespace-affinity
  namespace: foo-ns
spec:
  hard:
    pods: "0"
  scopeSelector:
    matchExpressions:
    - scopeName: CrossNamespacePodAffinity
      operator: Exists
If operators want to disallow using namespaces
and namespaceSelector
by default, and
only allow it for specific namespaces, they could configure CrossNamespacePodAffinity
as a limited resource by setting the kube-apiserver flag --admission-control-config-file
to the path of the following configuration file:
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: "ResourceQuota"
  configuration:
    apiVersion: apiserver.config.k8s.io/v1
    kind: ResourceQuotaConfiguration
    limitedResources:
    - resource: pods
      matchScopes:
      - scopeName: CrossNamespacePodAffinity
        operator: Exists
With the above configuration, pods can use namespaces and namespaceSelector in pod affinity only
if the namespace where they are created has a resource quota object with the
CrossNamespacePodAffinity scope and a hard limit greater than or equal to the number of pods using those fields.
Requests compared to Limits
When allocating compute resources, each container may specify a request and a limit value for either CPU or memory.
The quota can be configured to quota either value.
If the quota has a value specified for requests.cpu or requests.memory, then it requires that every incoming
container makes an explicit request for those resources. If the quota has a value specified for limits.cpu or limits.memory,
then it requires that every incoming container specifies an explicit limit for those resources.
Viewing and Setting Quotas
kubectl supports creating, updating, and viewing quotas:
kubectl create namespace myspace
cat <<EOF > compute-resources.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
    requests.nvidia.com/gpu: 4
EOF
kubectl create -f ./compute-resources.yaml --namespace=myspace
cat <<EOF > object-counts.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts
spec:
  hard:
    configmaps: "10"
    persistentvolumeclaims: "4"
    pods: "4"
    replicationcontrollers: "20"
    secrets: "10"
    services: "10"
    services.loadbalancers: "2"
EOF
kubectl create -f ./object-counts.yaml --namespace=myspace
kubectl get quota --namespace=myspace
NAME AGE
compute-resources 30s
object-counts 32s
kubectl describe quota compute-resources --namespace=myspace
Name: compute-resources
Namespace: myspace
Resource Used Hard
-------- ---- ----
limits.cpu 0 2
limits.memory 0 2Gi
requests.cpu 0 1
requests.memory 0 1Gi
requests.nvidia.com/gpu 0 4
kubectl describe quota object-counts --namespace=myspace
Name: object-counts
Namespace: myspace
Resource Used Hard
-------- ---- ----
configmaps 0 10
persistentvolumeclaims 0 4
pods 0 4
replicationcontrollers 0 20
secrets 1 10
services 0 10
services.loadbalancers 0 2
kubectl also supports object count quota for all standard namespaced resources
using the syntax count/<resource>.<group>:
kubectl create namespace myspace
kubectl create quota test --hard=count/deployments.apps=2,count/replicasets.apps=4,count/pods=3,count/secrets=4 --namespace=myspace
kubectl create deployment nginx --image=nginx --namespace=myspace --replicas=2
kubectl describe quota --namespace=myspace
Name: test
Namespace: myspace
Resource Used Hard
-------- ---- ----
count/deployments.apps 1 2
count/pods 2 3
count/replicasets.apps 1 4
count/secrets 1 4
Quota and Cluster Capacity
ResourceQuotas are independent of the cluster capacity. They are
expressed in absolute units. So, if you add nodes to your cluster, this does not
automatically give each namespace the ability to consume more resources.
Sometimes more complex policies may be desired, such as:
- Proportionally divide total cluster resources among several teams.
- Allow each tenant to grow resource usage as needed, but have a generous
limit to prevent accidental resource exhaustion.
- Detect demand from one namespace, add nodes, and increase quota.
Such policies could be implemented using ResourceQuotas
as building blocks, by
writing a "controller" that watches the quota usage and adjusts the quota
hard limits of each namespace according to other signals.
Note that resource quota divides up aggregate cluster resources, but it creates no
restrictions around nodes: pods from several namespaces may run on the same node.
Limit Priority Class consumption by default
It may be desired that pods at a particular priority, such as "cluster-services",
should be allowed in a namespace, if and only if, a matching quota object exists.
With this mechanism, operators are able to restrict usage of certain high
priority classes to a limited number of namespaces and not every namespace
will be able to consume these priority classes by default.
To enforce this, the kube-apiserver flag --admission-control-config-file should be
used to pass the path to the following configuration file:
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: "ResourceQuota"
  configuration:
    apiVersion: apiserver.config.k8s.io/v1
    kind: ResourceQuotaConfiguration
    limitedResources:
    - resource: pods
      matchScopes:
      - scopeName: PriorityClass
        operator: In
        values: ["cluster-services"]
Then, create a resource quota object in the kube-system
namespace:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pods-cluster-services
spec:
  scopeSelector:
    matchExpressions:
      - operator : In
        scopeName: PriorityClass
        values: ["cluster-services"]
kubectl apply -f https://k8s.io/examples/policy/priority-class-resourcequota.yaml -n kube-system
resourcequota/pods-cluster-services created
In this case, a pod creation will be allowed if:
- the Pod's priorityClassName is not specified.
- the Pod's priorityClassName is specified to a value other than cluster-services.
- the Pod's priorityClassName is set to cluster-services, it is to be created
in the kube-system namespace, and it has passed the resource quota check.
A Pod creation request is rejected if its priorityClassName is set to cluster-services
and it is to be created in a namespace other than kube-system.
What's next
3 - Process ID Limits And Reservations
FEATURE STATE:
Kubernetes v1.20 [stable]
Kubernetes allows you to limit the number of process IDs (PIDs) that a
Pod can use.
You can also reserve a number of allocatable PIDs for each node
for use by the operating system and daemons (rather than by Pods).
Process IDs (PIDs) are a fundamental resource on nodes. It is trivial to hit the
task limit without hitting any other resource limits, which can then cause
instability to a host machine.
Cluster administrators require mechanisms to ensure that Pods running in the
cluster cannot induce PID exhaustion that prevents host daemons (such as the
kubelet or
kube-proxy,
and potentially also the container runtime) from running.
In addition, it is important to ensure that PIDs are limited among Pods in order
to ensure they have limited impact on other workloads on the same node.
Note:
On certain Linux installations, the operating system sets the PIDs limit to a low default,
such as 32768. Consider raising the value of /proc/sys/kernel/pid_max.

You can configure a kubelet to limit the number of PIDs a given Pod can consume.
For example, if your node's host OS is set to use a maximum of 262144 PIDs and
you expect it to host fewer than 250 Pods, you can give each Pod a budget of 1000
PIDs to prevent using up that node's overall number of available PIDs. If the
admin wants to overcommit PIDs similar to CPU or memory, they may do so as well
with some additional risks. Either way, a single Pod will not be able to bring
the whole machine down. This kind of resource limiting helps to prevent simple
fork bombs from affecting operation of an entire cluster.
Per-Pod PID limiting allows administrators to protect one Pod from another, but
does not ensure that all Pods scheduled onto that host are unable to impact the node overall.
Per-Pod limiting also does not protect the node agents themselves from PID exhaustion.
You can also reserve an amount of PIDs for node overhead, separate from the
allocation to Pods. This is similar to how you can reserve CPU, memory, or other
resources for use by the operating system and other facilities outside of Pods
and their containers.
PID limiting is an important sibling to compute
resource requests
and limits. However, you specify it in a different way: rather than defining a
Pod's resource limit in the .spec
for a Pod, you configure the limit as a
setting on the kubelet. Pod-defined PID limits are not currently supported.
Caution:
This means that the limit that applies to a Pod may be different depending on
where the Pod is scheduled. To make things simple, it's easiest if all Nodes use
the same PID resource limits and reservations.
Node PID limits
Kubernetes allows you to reserve a number of process IDs for system use. To
configure the reservation, use the parameter pid=<number> in the
--system-reserved and --kube-reserved command line options to the kubelet.
The value you specify declares that the given number of process IDs will
be reserved for the system as a whole and for Kubernetes system daemons
respectively.
Pod PID limits
Kubernetes allows you to limit the number of processes running in a Pod. You
specify this limit at the node level, rather than configuring it as a resource
limit for a particular Pod. Each Node can have a different PID limit.
To configure the limit, you can specify the command line parameter --pod-max-pids
to the kubelet, or set PodPidsLimit
in the kubelet
configuration file.
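For illustration, a kubelet configuration fragment that combines a per-Pod PID limit with PID reservations might look like the following; this is a sketch based on the podPidsLimit, systemReserved and kubeReserved fields of the kubelet configuration, and the numbers are example values only:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 1000      # per-Pod PID limit, equivalent to --pod-max-pids
systemReserved:
  pid: "1000"           # PIDs reserved for operating system daemons
kubeReserved:
  pid: "1000"           # PIDs reserved for Kubernetes system daemons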
PID based eviction
You can configure the kubelet to start terminating a Pod when it is misbehaving and consuming an abnormal amount of resources.
This feature is called eviction. You can
Configure Out of Resource Handling
for various eviction signals.
Use the pid.available eviction signal to configure the threshold for the number of PIDs used by a Pod.
You can set soft and hard eviction policies.
However, even with the hard eviction policy, if the number of PIDs is growing very fast,
the node can still get into an unstable state by hitting the node PIDs limit.
The eviction signal value is calculated periodically and does NOT enforce the limit.
PID limiting - per Pod and per Node - sets the hard limit.
Once the limit is hit, the workload will start experiencing failures when trying to get a new PID.
It may or may not lead to rescheduling of a Pod,
depending on how the workload reacts to these failures and how liveness and readiness
probes are configured for the Pod. However, if limits were set correctly,
you can guarantee that other Pods' workloads and system processes will not run out of PIDs
when one Pod is misbehaving.
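For illustration, a hard eviction threshold on the pid.available signal can be set in the kubelet configuration; this is a sketch, and the 10% value is an arbitrary example:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  pid.available: "10%"  # begin evicting pods when less than 10% of allocatable PIDs remain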
What's next
4 - Node Resource Managers
In order to support latency-critical and high-throughput workloads, Kubernetes offers a suite of Resource Managers. The managers aim to coordinate and optimize the alignment of node resources for pods configured with specific requirements for CPUs, devices, and memory (hugepages) resources.
Hardware topology alignment policies
Topology Manager is a kubelet component that aims to coordinate the set of components that are
responsible for these optimizations. The overall resource management process is governed using
the policy you specify.
To learn more, read Control Topology Management Policies on a Node.
Policies for assigning CPUs to Pods
FEATURE STATE:
Kubernetes v1.26 [stable]
(enabled by default: true)
Once a Pod is bound to a Node, the kubelet on that node may need to either multiplex the existing
hardware (for example, sharing CPUs across multiple Pods) or allocate hardware by dedicating some
resource (for example, assigning one or more CPUs for a Pod's exclusive use).
By default, the kubelet uses CFS quota
to enforce pod CPU limits. When the node runs many CPU-bound pods, the workload can move to different CPU cores depending on
whether the pod is throttled and which CPU cores are available at scheduling time. Many workloads are not sensitive to this migration and thus
work fine without any intervention.
However, in workloads where CPU cache affinity and scheduling latency significantly affect workload performance, the kubelet allows alternative CPU
management policies to determine some placement preferences on the node.
This is implemented using the CPU Manager and its policy.
There are two available policies:
- none: the none policy explicitly enables the existing default CPU
affinity scheme, providing no affinity beyond what the OS scheduler does
automatically. Limits on CPU usage for Guaranteed pods and Burstable pods
are enforced using CFS quota.
- static: the static policy allows containers in Guaranteed pods with integer CPU
requests access to exclusive CPUs on the node. This exclusivity is enforced
using the cpuset cgroup controller.
Note:
System services such as the container runtime and the kubelet itself can continue to run on these exclusive CPUs. The exclusivity only extends to other pods.

CPU Manager doesn't support offlining and onlining of CPUs at runtime.
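To illustrate, selecting the static policy is done in the kubelet configuration; this is a sketch assuming the cpuManagerPolicy and reservedSystemCPUs fields, and the reserved CPU IDs are example values:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static     # the default policy is "none"
reservedSystemCPUs: "0,1"    # the static policy requires a non-zero CPU reservation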
Static policy
The static policy enables finer-grained CPU management and exclusive CPU assignment.
This policy manages a shared pool of CPUs that initially contains all CPUs in the
node. The amount of exclusively allocatable CPUs is equal to the total
number of CPUs in the node minus any CPU reservations set by the kubelet configuration.
CPUs reserved by these options are taken, in integer quantity, from the initial shared pool in ascending order by physical
core ID. This shared pool is the set of CPUs on which any containers in
BestEffort
and Burstable
pods run. Containers in Guaranteed
pods with fractional
CPU requests
also run on CPUs in the shared pool. Only containers that are
both part of a Guaranteed
pod and have integer CPU requests
are assigned
exclusive CPUs.
Note:
The kubelet requires a CPU reservation greater than zero when the static policy is enabled.
This is because zero CPU reservation would allow the shared pool to become empty.

As Guaranteed
pods whose containers fit the requirements for being statically
assigned are scheduled to the node, CPUs are removed from the shared pool and
placed in the cpuset for the container. CFS quota is not used to bound
the CPU usage of these containers as their usage is bound by the scheduling domain
itself. In other words, the number of CPUs in the container cpuset is equal to the integer
CPU limit
specified in the pod spec. This static assignment increases CPU
affinity and decreases context switches due to throttling for the CPU-bound
workload.
Consider the containers in the following pod specs:
spec:
  containers:
  - name: nginx
    image: nginx
The pod above runs in the BestEffort
QoS class because no resource requests
or
limits
are specified. It runs in the shared pool.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"
The pod above runs in the Burstable
QoS class because resource requests
do not
equal limits
and the cpu
quantity is not specified. It runs in the shared
pool.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "100Mi"
        cpu: "1"
The pod above runs in the Burstable
QoS class because resource requests
do not
equal limits
. It runs in the shared pool.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "200Mi"
        cpu: "2"
The pod above runs in the Guaranteed
QoS class because requests
are equal to limits
.
And the container's resource limit for the CPU resource is an integer greater than
or equal to one. The nginx
container is granted 2 exclusive CPUs.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "1.5"
      requests:
        memory: "200Mi"
        cpu: "1.5"
The pod above runs in the Guaranteed
QoS class because requests
are equal to limits
.
But the container's resource limit for the CPU resource is a fraction. It runs in
the shared pool.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
The pod above runs in the Guaranteed
QoS class because only limits
are specified
and requests
are set equal to limits
when not explicitly specified. And the
container's resource limit for the CPU resource is an integer greater than or
equal to one. The nginx
container is granted 2 exclusive CPUs.
Static policy options
The behavior of the static policy can be fine-tuned using the CPU Manager policy options.
The following policy options exist for the static CPU management policy:
- align-by-socket (alpha, hidden by default) - Align CPUs by physical package / socket boundary, rather than logical NUMA boundaries (available since Kubernetes v1.25)
- distribute-cpus-across-cores (alpha, hidden by default) - Allocate virtual cores, sometimes called hardware threads, across different physical cores (available since Kubernetes v1.31)
- distribute-cpus-across-numa (alpha, hidden by default) - Spread CPUs across different NUMA domains, aiming for an even balance between the selected domains (available since Kubernetes v1.23)
- full-pcpus-only (beta, visible by default) - Always allocate full physical cores (available since Kubernetes v1.22)
- strict-cpu-reservation (alpha, hidden by default) - Prevent all pods, regardless of their Quality of Service class, from running on reserved CPUs (available since Kubernetes v1.32)
- prefer-align-cpus-by-uncorecache (alpha, hidden by default) - Align CPUs by uncore (Last-Level) cache boundary in a best-effort way (available since Kubernetes v1.32)
You can toggle groups of options on and off based upon their maturity level
using the following feature gates:
- CPUManagerPolicyBetaOptions (default enabled). Disable to hide beta-level options.
- CPUManagerPolicyAlphaOptions (default disabled). Enable to show alpha-level options.
You will still have to enable each option using the cpuManagerPolicyOptions
field in the
kubelet configuration file.
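For illustration, a kubelet configuration fragment that enables one beta-level and one alpha-level option might look like the following; this is a sketch, the option combination is only an example, and the feature gate is needed only because an alpha-level option is enabled:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  full-pcpus-only: "true"              # beta option, visible by default
  distribute-cpus-across-numa: "true"  # alpha option, requires the feature gate below
featureGates:
  CPUManagerPolicyAlphaOptions: true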
For more detail about the individual options you can configure, read on.
full-pcpus-only
If the full-pcpus-only
policy option is specified, the static policy will always allocate full physical cores.
By default, without this option, the static policy allocates CPUs using a topology-aware best-fit allocation.
On SMT enabled systems, the policy can allocate individual virtual cores, which correspond to hardware threads.
This can lead to different containers sharing the same physical cores; this behaviour in turn contributes
to the noisy neighbours problem.
With the option enabled, the pod will be admitted by the kubelet only if the CPU request of all its containers
can be fulfilled by allocating full physical cores.
If the pod does not pass admission, it will be put in the Failed state with the message SMTAlignmentError.
distribute-cpus-across-numa
If the distribute-cpus-across-numa
policy option is specified, the static
policy will evenly distribute CPUs across NUMA nodes in cases where more than
one NUMA node is required to satisfy the allocation.
By default, the CPUManager
will pack CPUs onto one NUMA node until it is
filled, with any remaining CPUs simply spilling over to the next NUMA node.
This can cause undesired bottlenecks in parallel code relying on barriers (and
similar synchronization primitives), as this type of code tends to run only as
fast as its slowest worker (which is slowed down by the fact that fewer CPUs
are available on at least one NUMA node).
By distributing CPUs evenly across NUMA nodes, application developers can more
easily ensure that no single worker suffers from NUMA effects more than any
other, improving the overall performance of these types of applications.
align-by-socket
If the align-by-socket
policy option is specified, CPUs will be considered
aligned at the socket boundary when deciding how to allocate CPUs to a
container. By default, the CPUManager
aligns CPU allocations at the NUMA
boundary, which could result in performance degradation if CPUs need to be
pulled from more than one NUMA node to satisfy the allocation. Although it
tries to ensure that all CPUs are allocated from the minimum number of NUMA
nodes, there is no guarantee that those NUMA nodes will be on the same socket.
By directing the CPUManager
to explicitly align CPUs at the socket boundary
rather than the NUMA boundary, we are able to avoid such issues. Note, this
policy option is not compatible with TopologyManager
single-numa-node
policy and does not apply to hardware where the number of sockets is greater
than the number of NUMA nodes.
distribute-cpus-across-cores
If the distribute-cpus-across-cores
policy option is specified, the static policy
will attempt to allocate virtual cores (hardware threads) across different physical cores.
By default, the CPUManager
tends to pack cpus onto as few physical cores as possible,
which can lead to contention among cpus on the same physical core and result
in performance bottlenecks. By enabling the distribute-cpus-across-cores
policy,
the static policy ensures that cpus are distributed across as many physical cores
as possible, reducing the contention on the same physical core and thereby
improving overall performance. However, it is important to note that this strategy
might be less effective when the system is heavily loaded. Under such conditions,
the benefit of reducing contention diminishes. Conversely, default behavior
can help in reducing inter-core communication overhead, potentially providing
better performance under high load conditions.
strict-cpu-reservation
The reservedSystemCPUs
parameter in KubeletConfiguration,
or the deprecated kubelet command line option --reserved-cpus
, defines an explicit CPU set for OS system daemons
and kubernetes system daemons. More details of this parameter can be found on the
Explicitly Reserved CPU List page.
By default this isolation is implemented only for guaranteed pods with integer CPU requests, not for burstable and best-effort pods
(and guaranteed pods with fractional CPU requests). Admission only compares the CPU requests against the allocatable CPUs.
Since the cpu limit is higher than the request, the default behaviour allows burstable and best-effort pods to use up the capacity
of reservedSystemCPUs
and cause host OS services to starve in real life deployments.
If the strict-cpu-reservation
policy option is enabled, the static policy will not allow
any workload to use the CPU cores specified in reservedSystemCPUs
.
prefer-align-cpus-by-uncorecache
If the prefer-align-cpus-by-uncorecache
policy is specified, the static policy
will allocate CPU resources for individual containers such that all CPUs assigned
to a container share the same uncore cache block (also known as the Last-Level Cache
or LLC). By default, the CPUManager
will tightly pack CPU assignments which can
result in containers being assigned CPUs from multiple uncore caches. This option
enables the CPUManager
to allocate CPUs in a way that maximizes the efficient use
of the uncore cache. Allocation is performed on a best-effort basis, aiming to
affine as many CPUs as possible within the same uncore cache. If the container's
CPU requirement exceeds the CPU capacity of a single uncore cache, the CPUManager
minimizes the number of uncore caches used in order to maintain optimal uncore
cache alignment. Specific workloads can benefit in performance from the reduction
of inter-cache latency and noisy neighbors at the cache level. If the CPUManager
cannot align optimally while the node has sufficient resources, the container will
still be admitted using the default packed behavior.
Memory Management Policies
FEATURE STATE:
Kubernetes v1.32 [stable]
(enabled by default: true)
The Kubernetes Memory Manager enables the feature of guaranteed memory (and hugepages)
allocation for pods in the Guaranteed
QoS class.
The Memory Manager employs a hint generation protocol to yield the most suitable NUMA affinity for a pod.
The Memory Manager feeds the central manager (Topology Manager) with these affinity hints.
Based on both the hints and Topology Manager policy, the pod is rejected or admitted to the node.
Moreover, the Memory Manager ensures that the memory which a pod requests
is allocated from a minimum number of NUMA nodes.
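For illustration, enabling the Static memory manager policy is also done through the kubelet configuration; this is a sketch assuming the memoryManagerPolicy and reservedMemory fields, and the reservation size is an example value:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
memoryManagerPolicy: Static   # the default policy, None, performs no memory pinning
reservedMemory:
- numaNode: 0
  limits:
    memory: 1Gi               # example reservation for node-level overhead on NUMA node 0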
Other resource managers
The configuration of individual managers is elaborated in dedicated documents: