1 - Considerations for large clusters
A cluster is a set of nodes (physical or virtual machines) running Kubernetes agents, managed by the control plane. Kubernetes v1.21 supports clusters with up to 5000 nodes. More specifically, Kubernetes is designed to accommodate configurations that meet all of the following criteria:
- No more than 110 pods per node
- No more than 5000 nodes
- No more than 150000 total pods
- No more than 300000 total containers
You can scale your cluster by adding or removing nodes. The way you do this depends on how your cluster is deployed.
Cloud provider resource quotas
To avoid running into cloud provider quota issues, when creating a cluster with many nodes, consider:
- Requesting a quota increase for cloud resources such as:
  - Compute instances
  - Storage volumes
  - In-use IP addresses
  - Packet filtering rule sets
  - Number of load balancers
  - Network subnets
  - Log streams
- Gating the cluster scaling actions to bring up new nodes in batches, with a pause between batches, because some cloud providers rate limit the creation of new instances.
Control plane components
For a large cluster, you need a control plane with sufficient compute and other resources.
Typically you would run one or two control plane instances per failure zone, scaling those instances vertically first and then scaling horizontally after reaching the point of diminishing returns to (vertical) scale.
You should run at least one instance per failure zone to provide fault-tolerance. Kubernetes nodes do not automatically steer traffic towards control-plane endpoints that are in the same failure zone; however, your cloud provider might have its own mechanisms to do this.
For example, using a managed load balancer, you can configure the load balancer to take traffic that originates from the kubelet and Pods in failure zone A and direct it only to the control plane hosts that are also in zone A. If a single control-plane host or endpoint in failure zone A goes offline, all the control-plane traffic for nodes in zone A is then sent between zones. Running multiple control plane hosts in each zone makes that outcome less likely.
To improve performance of large clusters, you can store Event objects in a separate dedicated etcd instance.
When creating a cluster, you can (using custom tooling):
- start and configure an additional etcd instance
- configure the API server to use it for storing events
See Operating etcd clusters for Kubernetes and Set up a High Availability etcd cluster with kubeadm for details on configuring and managing etcd for a large cluster.
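As a rough sketch of the second step, the API server's `--etcd-servers-overrides` flag can route Event objects to the dedicated instance. The fragment below assumes the API server runs as a static Pod; both etcd hostnames are hypothetical placeholders, not values from this page.

```yaml
# Fragment of a kube-apiserver static Pod manifest (only the relevant flags shown).
# Both etcd endpoints below are hypothetical placeholders.
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    # Main etcd cluster, used for every resource without an override.
    - --etcd-servers=https://etcd-main.example.internal:2379
    # Override format is group/resource#servers. Events live in the core
    # (empty) API group, hence the leading slash.
    - --etcd-servers-overrides=/events#https://etcd-events.example.internal:2379
```

Keeping Events on their own etcd instance means that a burst of event writes does not compete with reads and writes for the rest of the cluster state.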
Kubernetes resource limits help to minimize the impact of memory leaks and other ways that pods and containers can affect other components. These resource limits apply to addon resources just as they apply to application workloads.
For example, you can set CPU and memory limits for a logging component:
```yaml
  ...
  containers:
  - name: fluentd-cloud-logging
    image: fluent/fluentd-kubernetes-daemonset:v1
    resources:
      limits:
        cpu: 100m
        memory: 200Mi
```
Addons' default limits are typically based on data collected from experience running each addon on small or medium Kubernetes clusters. When running on large clusters, addons often consume more of some resources than their default limits. If a large cluster is deployed without adjusting these values, the addon(s) may continuously get killed because they keep hitting the memory limit. Alternatively, the addon may run but with poor performance due to CPU time slice restrictions.
To avoid running into cluster addon resource issues, when creating a cluster with many nodes, consider the following:
- Some addons scale vertically - there is one replica of the addon for the cluster or serving a whole failure zone. For these addons, increase requests and limits as you scale out your cluster.
- Many addons scale horizontally - you add capacity by running more pods - but with a very large cluster you may also need to raise CPU or memory limits slightly. The VerticalPodAutoscaler can run in recommender mode to provide suggested figures for requests and limits.
- Some addons run as one copy per node, controlled by a DaemonSet: for example, a node-level log aggregator. Similar to the case with horizontally-scaled addons, you may also need to raise CPU or memory limits slightly.
VerticalPodAutoscaler is a custom resource that you can deploy into your cluster
to help you manage resource requests and limits for pods.
Visit Vertical Pod Autoscaler to learn more about
VerticalPodAutoscaler and how you can use it to scale cluster
components, including cluster-critical addons.
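A minimal sketch of recommender mode, assuming the VerticalPodAutoscaler custom resource definitions and recommender component are already installed in the cluster. The object and Deployment names are hypothetical.

```yaml
# Recommendation-only VPA for a hypothetical addon Deployment.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-addon-vpa        # hypothetical name
  namespace: kube-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-addon          # hypothetical addon Deployment
  updatePolicy:
    # "Off" computes recommendations only; Pods are never evicted or changed.
    updateMode: "Off"
```

With this mode the suggested figures appear in the object's status, and you decide whether to copy them into the addon's requests and limits.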
The cluster autoscaler integrates with a number of cloud providers to help you run the right number of nodes for the level of resource demand in your cluster.
2 - Running in multiple zones
This page describes running Kubernetes across multiple zones.
Kubernetes is designed so that a single Kubernetes cluster can run across multiple failure zones, typically where these zones fit within a logical grouping called a region. Major cloud providers define a region as a set of failure zones (also called availability zones) that provide a consistent set of features: within a region, each zone offers the same APIs and services.
Typical cloud architectures aim to minimize the chance that a failure in one zone also impairs services in another zone.
Control plane behavior
All control plane components support running as a pool of interchangeable resources, replicated per component.
When you deploy a cluster control plane, place replicas of control plane components across multiple failure zones. If availability is an important concern, select at least three failure zones and replicate each individual control plane component (API server, scheduler, etcd, cluster controller manager) across at least three failure zones. If you are running a cloud controller manager then you should also replicate this across all the failure zones you selected.
Note: Kubernetes does not provide cross-zone resilience for the API server endpoints. You can use various techniques to improve availability for the cluster API server, including DNS round-robin, SRV records, or a third-party load balancing solution with health checking.
If your cluster spans multiple zones or regions, you can use node labels in conjunction with Pod topology spread constraints to control how Pods are spread across your cluster among fault domains: regions, zones, and even specific nodes. These hints enable the scheduler to place Pods for better expected availability, reducing the risk that a correlated failure affects your whole workload.
For example, you can set a constraint to make sure that the 3 replicas of a StatefulSet are all running in different zones to each other, whenever that is feasible. You can define this declaratively without explicitly defining which availability zones are in use for each workload.
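The StatefulSet case above might look roughly like the following sketch. The workload name, labels, and image are hypothetical; the zone label `topology.kubernetes.io/zone` is the well-known node label, which the manifest assumes your nodes carry.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-db                    # hypothetical workload
spec:
  serviceName: example-db             # hypothetical headless Service
  replicas: 3
  selector:
    matchLabels:
      app: example-db
  template:
    metadata:
      labels:
        app: example-db
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                    # zones may differ by at most one matching Pod
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway   # soft constraint: spread when feasible
        labelSelector:
          matchLabels:
            app: example-db
      containers:
      - name: db
        image: registry.example/db:1.0      # placeholder image
```

`ScheduleAnyway` matches the "whenever that is feasible" wording; switching to `DoNotSchedule` would turn the spread into a hard requirement and leave Pods pending when a zone is unavailable.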
Distributing nodes across zones
Kubernetes' core does not create nodes for you; you need to do that yourself, or use a tool such as the Cluster API to manage nodes on your behalf.
Using tools such as the Cluster API you can define sets of machines to run as worker nodes for your cluster across multiple failure domains, and rules to automatically heal the cluster in case of whole-zone service disruption.
Manual zone assignment for Pods
You can apply node selector constraints to Pods that you create, as well as to Pod templates in workload resources such as Deployment, StatefulSet, or Job.
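For illustration, a Pod pinned to a single zone through a node selector could be written as below. The Pod name, image, and zone value are placeholders; substitute a zone name that your nodes actually report in their `topology.kubernetes.io/zone` label.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: zone-pinned-example           # hypothetical name
spec:
  nodeSelector:
    # Only nodes whose zone label has this exact value are eligible.
    topology.kubernetes.io/zone: example-zone-a   # placeholder zone name
  containers:
  - name: app
    image: registry.example/app:1.0   # placeholder image
```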
Storage access for zones
When persistent volumes are created, zone labels are automatically added to any
PersistentVolumes that are linked to a specific zone. The scheduler then ensures,
through its NoVolumeZoneConflict predicate, that pods which claim a given PersistentVolume
are only placed into the same zone as that volume.
You can specify a StorageClass for PersistentVolumeClaims that specifies the failure domains (zones) that the storage in that class may use. To learn about configuring a StorageClass that is aware of failure domains or zones, see Allowed topologies.
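A zone-restricted StorageClass might be sketched as follows. The class name, provisioner, and zone values are hypothetical placeholders; the topology key your provisioner honors may differ, so treat the key shown here as an assumption.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zonal-example                            # hypothetical name
provisioner: example.com/example-provisioner     # placeholder provisioner
# Delay volume creation until a Pod is scheduled, so the volume is
# provisioned in the zone where the Pod lands.
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: topology.kubernetes.io/zone
    values:
    - example-zone-a                             # placeholder zones
    - example-zone-b
```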
By itself, Kubernetes does not include zone-aware networking. You can use a network plugin
to configure cluster networking, and that network solution might have zone-specific
elements. For example, if your cloud provider supports Services with
type=LoadBalancer, the load balancer might only send traffic to Pods running in the
same zone as the load balancer element processing a given connection.
Check your cloud provider's documentation for details.
For custom or on-premises deployments, similar considerations apply. Service and Ingress behavior, including handling of different failure zones, does vary depending on exactly how your cluster is set up.
When you set up your cluster, you might also need to consider whether and how
your setup can restore service if all the failure zones in a region go
off-line at the same time. For example, do you rely on there being at least
one node able to run Pods in a zone?
Make sure that any cluster-critical repair work does not rely on there being at least one healthy node in your cluster. For example: if all nodes are unhealthy, you might need to run a repair Job with a special toleration so that the repair can complete enough to bring at least one node into service.
Kubernetes doesn't come with an answer for this challenge; however, it's something to consider.
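One possible shape for such a repair Job is sketched below. It tolerates the `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` taints that Kubernetes places on unhealthy nodes; the Job name and image are hypothetical, and a real cluster may carry additional taints that need their own tolerations.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: node-repair-example           # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never
      tolerations:
      # No effect is listed, so each toleration matches every effect
      # (NoSchedule and NoExecute) for its key.
      - key: node.kubernetes.io/not-ready
        operator: Exists
      - key: node.kubernetes.io/unreachable
        operator: Exists
      containers:
      - name: repair
        image: registry.example/repair-tool:1.0   # placeholder image
```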
To learn how the scheduler places Pods in a cluster, honoring the configured constraints, visit Scheduling and Eviction.
3 - Validate node setup
Node Conformance Test
Node conformance test is a containerized test framework that provides a system verification and functionality test for a node. The test validates whether the node meets the minimum requirements for Kubernetes; a node that passes the test is qualified to join a Kubernetes cluster.
To run node conformance test, a node must satisfy the same prerequisites as a standard Kubernetes node. At a minimum, the node should have the following daemons installed:
- Container Runtime (Docker)
Running Node Conformance Test
To run the node conformance test, perform the following steps:
- Work out the value of the --kubeconfig option for the kubelet; for example: --kubeconfig=/var/lib/kubelet/config.yaml. Because the test framework starts a local control plane to test the kubelet, use http://localhost:8080 as the URL of the API server. There are some other kubelet command line parameters you may want to use:
  - --pod-cidr: If you are using kubenet, you should specify an arbitrary CIDR to the kubelet.
  - --cloud-provider: If you are using --cloud-provider=gce, you should remove the flag to run the test.
- Run the node conformance test with command:
```shell
# $CONFIG_DIR is the pod manifest path of your Kubelet.
# $LOG_DIR is the test output path.
sudo docker run -it --rm --privileged --net=host \
  -v /:/rootfs -v $CONFIG_DIR:$CONFIG_DIR -v $LOG_DIR:/var/result \
  k8s.gcr.io/node-test:0.2
```
Running Node Conformance Test for Other Architectures
Kubernetes also provides node conformance test docker images for other architectures:
Running Selected Test
To run specific tests, overwrite the environment variable FOCUS with the regular expression of tests you want to run.
```shell
# Only run the MirrorPod test
sudo docker run -it --rm --privileged --net=host \
  -v /:/rootfs:ro -v $CONFIG_DIR:$CONFIG_DIR -v $LOG_DIR:/var/result \
  -e FOCUS=MirrorPod \
  k8s.gcr.io/node-test:0.2
```
To skip specific tests, overwrite the environment variable SKIP with the regular expression of tests you want to skip.
```shell
# Run all conformance tests but skip the MirrorPod test
sudo docker run -it --rm --privileged --net=host \
  -v /:/rootfs:ro -v $CONFIG_DIR:$CONFIG_DIR -v $LOG_DIR:/var/result \
  -e SKIP=MirrorPod \
  k8s.gcr.io/node-test:0.2
```
Node conformance test is a containerized version of node e2e test. By default, it runs all conformance tests.
Theoretically, you can run any node e2e test if you configure the container and mount the required volumes properly. However, it is strongly recommended to run only the conformance tests, because running non-conformance tests requires much more complex configuration.
- The test leaves some docker images on the node, including the node conformance test image and images of containers used in the functionality test.
- The test leaves dead containers on the node. These containers are created during the functionality test.
4 - PKI certificates and requirements
Kubernetes requires PKI certificates for authentication over TLS. If you install Kubernetes with kubeadm, the certificates that your cluster requires are automatically generated. You can also generate your own certificates -- for example, to keep your private keys more secure by not storing them on the API server. This page explains the certificates that your cluster requires.
How certificates are used by your cluster
Kubernetes requires PKI for the following operations:
- Client certificates for the kubelet to authenticate to the API server
- Server certificate for the API server endpoint
- Client certificates for administrators of the cluster to authenticate to the API server
- Client certificates for the API server to talk to the kubelets
- Client certificate for the API server to talk to etcd
- Client certificate/kubeconfig for the controller manager to talk to the API server
- Client certificate/kubeconfig for the scheduler to talk to the API server.
- Client and server certificates for the front-proxy
front-proxy certificates are required only if you run kube-proxy to support an extension API server.
etcd also implements mutual TLS to authenticate clients and peers.
Where certificates are stored
If you install Kubernetes with kubeadm, certificates are stored in
/etc/kubernetes/pki. All paths in this documentation are relative to that directory.
Configure certificates manually
If you don't want kubeadm to generate the required certificates, you can create them in either of the following ways.
Single root CA
You can create a single root CA, controlled by an administrator. This root CA can then create multiple intermediate CAs, and delegate all further creation to Kubernetes itself.
|path||Default CN||description|
|ca.crt,key||kubernetes-ca||Kubernetes general CA|
|etcd/ca.crt,key||etcd-ca||For all etcd-related functions|
|front-proxy-ca.crt,key||kubernetes-front-proxy-ca||For the front-end proxy|
On top of the above CAs, it is also necessary to get a public/private key pair for service account management.
If you don't wish to copy the CA private keys to your cluster, you can generate all certificates yourself.
|Default CN||Parent CA||O (in Subject)||kind||hosts (SAN)|
: any other IP or DNS name you contact your cluster on (as used by kubeadm
the load balancer stable IP and/or DNS name,
kind maps to one or more of the x509 key usage types:
|kind||key usage|
|server||digital signature, key encipherment, server auth|
|client||digital signature, key encipherment, client auth|
Note: Hosts/SAN listed above are the recommended ones for getting a working cluster; if required by a specific setup, it is possible to add additional SANs on all the server certificates.
For kubeadm users only:
- The scenario where you copy CA certificates to your cluster without private keys is referred to as external CA in the kubeadm documentation.
- If you are comparing the above list with a kubeadm generated PKI, please be aware that kube-etcd-healthcheck-client certificates are not generated in case of external etcd.
Certificates should be placed in a recommended path (as used by kubeadm). Paths should be specified using the given argument regardless of location.
|Default CN||recommended key path||recommended cert path||command||key argument||cert argument|
|kubernetes-ca||ca.key||ca.crt||kube-controller-manager||--cluster-signing-key-file||--client-ca-file, --root-ca-file, --cluster-signing-cert-file|
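As an illustration of the row above, the arguments could appear in a kube-controller-manager static Pod manifest roughly as follows. This is a fragment only, and it assumes the kubeadm default directory /etc/kubernetes/pki mentioned earlier on this page.

```yaml
# Fragment of a kube-controller-manager static Pod manifest
# (only the certificate-related flags from the table are shown).
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    # Private key of kubernetes-ca, used to sign cluster certificates.
    - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
    # The same CA certificate is passed to all three cert arguments.
    - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --root-ca-file=/etc/kubernetes/pki/ca.crt
```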
Same considerations apply for the service account key pair:
|private key path||public key path||command||argument|
Configure certificates for user accounts
You must manually configure these administrator and service accounts:
|filename||credential name||Default CN||O (in Subject)|
Note: The node name used in kubelet.conf must match precisely the value of the node name provided by the kubelet as it registers with the apiserver. For further details, read Node Authorization.
For each config, generate an x509 cert/key pair with the given CN and O.
Run kubectl as follows for each config:
```shell
KUBECONFIG=<filename> kubectl config set-cluster default-cluster --server=https://<host ip>:6443 --certificate-authority <path-to-kubernetes-ca> --embed-certs
KUBECONFIG=<filename> kubectl config set-credentials <credential-name> --client-key <path-to-key>.pem --client-certificate <path-to-cert>.pem --embed-certs
KUBECONFIG=<filename> kubectl config set-context default-system --cluster default-cluster --user <credential-name>
KUBECONFIG=<filename> kubectl config use-context default-system
```
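After these commands run, each kubeconfig file should look roughly like the sketch below. The angle-bracket values are the same placeholders used in the commands; because of --embed-certs, the certificate and key contents are stored inline as base64 data and not as file paths. The exact output of kubectl may include additional fields.

```yaml
apiVersion: v1
kind: Config
clusters:
- name: default-cluster
  cluster:
    server: https://<host ip>:6443
    certificate-authority-data: <base64-encoded CA certificate>
users:
- name: <credential-name>
  user:
    client-certificate-data: <base64-encoded client certificate>
    client-key-data: <base64-encoded client key>
contexts:
- name: default-system
  context:
    cluster: default-cluster
    user: <credential-name>
current-context: default-system
```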
These files are used as follows:
|admin.conf||kubectl||Configures administrator user for the cluster|
|kubelet.conf||kubelet||One required for each node in the cluster.|
|controller-manager.conf||kube-controller-manager||Must be added to manifest in |
|scheduler.conf||kube-scheduler||Must be added to manifest in |