This section of the Kubernetes documentation contains pages that
show how to do individual tasks. A task page shows how to do a
single thing, typically by giving a short sequence of steps.
The Kubernetes command-line tool, kubectl, allows
you to run commands against Kubernetes clusters.
You can use kubectl to deploy applications, inspect and manage cluster resources,
and view logs. For more information including a complete list of kubectl operations, see the
kubectl reference documentation.
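For example, the three activities mentioned above map onto commands like the following; the manifest path and Pod name are placeholders for your own resources:
kubectl apply -f ./my-app.yaml   # deploy an application from a manifest
kubectl get pods                 # inspect cluster resources
kubectl logs my-app-pod          # view logs from a Pod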
kubectl is installable on a variety of Linux platforms, macOS and Windows.
Find your preferred operating system below.
Like kind, minikube is a tool that lets you run Kubernetes
locally. minikube runs a single-node Kubernetes cluster on your personal
computer (including Windows, macOS and Linux PCs) so that you can try out
Kubernetes, or for daily development work.
You can follow the official
Get Started! guide if your focus is
on getting the tool installed.
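Once minikube is installed, starting a local cluster is typically a single command:
minikube start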
You can use the kubeadm tool to create and manage Kubernetes clusters.
It performs the actions necessary to get a minimum viable, secure cluster up and running in a user friendly way.
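For example, bootstrapping the first control plane node is a single command (run as root or with sudo); kubeadm then prints the join command to run on the remaining nodes:
sudo kubeadm init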
You must use a kubectl version that is within one minor version difference of your cluster. For example, a v1.24 client can communicate with v1.23, v1.24, and v1.25 control planes.
Using the latest compatible version of kubectl helps avoid unforeseen issues.
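If kubectl is already configured against a cluster, you can compare the client and server versions it reports, for example with:
kubectl version --output=yaml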
Install kubectl on Linux
The following methods exist for installing kubectl on Linux:
If you are on Ubuntu or another Linux distribution that supports the snap package manager, kubectl is available as a snap application.
snap install kubectl --classic
kubectl version --client
If you are on Linux and using Homebrew package manager, kubectl is available for installation.
brew install kubectl
kubectl version --client
Verify kubectl configuration
In order for kubectl to find and access a Kubernetes cluster, it needs a
kubeconfig file,
which is created automatically when you create a cluster using
kube-up.sh
or successfully deploy a Minikube cluster.
By default, kubectl configuration is located at ~/.kube/config.
Check that kubectl is properly configured by getting the cluster state:
kubectl cluster-info
If you see a URL response, kubectl is correctly configured to access your cluster.
If you see a message similar to the following, kubectl is not configured correctly or is not able to connect to a Kubernetes cluster.
The connection to the server <server-name:port> was refused - did you specify the right host or port?
For example, if you are intending to run a Kubernetes cluster on your laptop (locally), you will need a tool like Minikube to be installed first and then re-run the commands stated above.
If kubectl cluster-info returns the URL response but you can't access your cluster, check whether it is configured properly by using:
kubectl cluster-info dump
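kubectl cluster-info dump prints a large amount of diagnostic information; if you prefer files, you can write the output to a directory instead (the path below is just an example):
kubectl cluster-info dump --output-directory=/tmp/cluster-state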
Optional kubectl configurations and plugins
Enable shell autocompletion
kubectl provides autocompletion support for Bash, Zsh, Fish, and PowerShell, which can save you a lot of typing.
Below are the procedures to set up autocompletion for Bash, Fish, and Zsh.
The kubectl completion script for Bash can be generated with the command kubectl completion bash. Sourcing the completion script in your shell enables kubectl autocompletion.
However, the completion script depends on bash-completion, which means that you have to install this software first (you can test if you have bash-completion already installed by running type _init_completion).
Install bash-completion
bash-completion is provided by many package managers (see here). You can install it with apt-get install bash-completion or yum install bash-completion, etc.
The above commands create /usr/share/bash-completion/bash_completion, which is the main script of bash-completion. Depending on your package manager, you have to manually source this file in your ~/.bashrc file.
To find out, reload your shell and run type _init_completion. If the command succeeds, you're already set, otherwise add the following to your ~/.bashrc file:
source /usr/share/bash-completion/bash_completion
Reload your shell and verify that bash-completion is correctly installed by typing type _init_completion.
Enable kubectl autocompletion
Bash
You now need to ensure that the kubectl completion script gets sourced in all your shell sessions. There are two ways in which you can do this:
Source the completion script in your ~/.bashrc file:
echo 'source <(kubectl completion bash)' >> ~/.bashrc
Add the completion script to the /etc/bash_completion.d directory:
kubectl completion bash | sudo tee /etc/bash_completion.d/kubectl > /dev/null
If you have an alias for kubectl, you can extend shell completion to work with that alias:
echo 'alias k=kubectl' >> ~/.bashrc
echo 'complete -F __start_kubectl k' >> ~/.bashrc
Note: bash-completion sources all completion scripts in /etc/bash_completion.d.
Both approaches are equivalent. After reloading your shell, kubectl autocompletion should be working.
The kubectl completion script for Fish can be generated with the command kubectl completion fish. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following line to your ~/.config/fish/config.fish file:
kubectl completion fish | source
After reloading your shell, kubectl autocompletion should be working.
The kubectl completion script for Zsh can be generated with the command kubectl completion zsh. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following to your ~/.zshrc file:
source <(kubectl completion zsh)
If you have an alias for kubectl, kubectl autocompletion will automatically work with it.
After reloading your shell, kubectl autocompletion should be working.
If you get an error like 2: command not found: compdef, then add the following to the beginning of your ~/.zshrc file:
autoload -Uz compinit
compinit
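Putting the pieces together, the relevant part of a ~/.zshrc could look like this, with compinit running before the completion script is sourced:
autoload -Uz compinit
compinit
source <(kubectl completion zsh)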
Install kubectl convert plugin
A plugin for Kubernetes command-line tool kubectl, which allows you to convert manifests between different API
versions. This can be particularly helpful to migrate manifests to a non-deprecated api version with newer Kubernetes release.
For more info, visit migrate to non deprecated apis
You must use a kubectl version that is within one minor version difference of your cluster. For example, a v1.24 client can communicate with v1.23, v1.24, and v1.25 control planes.
Using the latest compatible version of kubectl helps avoid unforeseen issues.
Install kubectl on macOS
The following methods exist for installing kubectl on macOS:
Note: Make sure /usr/local/bin is in your PATH environment variable.
Test to ensure the version you installed is up-to-date:
kubectl version --client
Or use this for detailed view of version:
kubectl version --client --output=yaml
Install with Homebrew on macOS
If you are on macOS and using Homebrew package manager, you can install kubectl with Homebrew.
Run the installation command:
brew install kubectl
or
brew install kubernetes-cli
Test to ensure the version you installed is up-to-date:
kubectl version --client
Install with Macports on macOS
If you are on macOS and using Macports package manager, you can install kubectl with Macports.
Run the installation command:
sudo port selfupdate
sudo port install kubectl
Test to ensure the version you installed is up-to-date:
kubectl version --client
Verify kubectl configuration
In order for kubectl to find and access a Kubernetes cluster, it needs a
kubeconfig file,
which is created automatically when you create a cluster using
kube-up.sh
or successfully deploy a Minikube cluster.
By default, kubectl configuration is located at ~/.kube/config.
Check that kubectl is properly configured by getting the cluster state:
kubectl cluster-info
If you see a URL response, kubectl is correctly configured to access your cluster.
If you see a message similar to the following, kubectl is not configured correctly or is not able to connect to a Kubernetes cluster.
The connection to the server <server-name:port> was refused - did you specify the right host or port?
For example, if you are intending to run a Kubernetes cluster on your laptop (locally), you will need a tool like Minikube to be installed first and then re-run the commands stated above.
If kubectl cluster-info returns the URL response but you can't access your cluster, check whether it is configured properly by using:
kubectl cluster-info dump
Optional kubectl configurations and plugins
Enable shell autocompletion
kubectl provides autocompletion support for Bash, Zsh, Fish, and PowerShell which can save you a lot of typing.
Below are the procedures to set up autocompletion for Bash, Fish, and Zsh.
The kubectl completion script for Bash can be generated with kubectl completion bash. Sourcing this script in your shell enables kubectl completion.
However, the kubectl completion script depends on bash-completion which you thus have to previously install.
Warning: There are two versions of bash-completion, v1 and v2. V1 is for Bash 3.2 (which is the default on macOS), and v2 is for Bash 4.1+. The kubectl completion script doesn't work correctly with bash-completion v1 and Bash 3.2. It requires bash-completion v2 and Bash 4.1+. Thus, to be able to correctly use kubectl completion on macOS, you have to install and use Bash 4.1+ (instructions). The following instructions assume that you use Bash 4.1+ (that is, any Bash version of 4.1 or newer).
Upgrade Bash
The instructions here assume you use Bash 4.1+. You can check your Bash's version by running:
echo $BASH_VERSION
If it is too old, you can install/upgrade it using Homebrew:
brew install bash
Reload your shell and verify that the desired version is being used:
echo $BASH_VERSION $SHELL
Homebrew usually installs it at /usr/local/bin/bash.
Install bash-completion
Note: As mentioned, these instructions assume you use Bash 4.1+, which means you will install bash-completion v2 (in contrast to Bash 3.2 and bash-completion v1, in which case kubectl completion won't work).
You can test if you have bash-completion v2 already installed with type _init_completion. If not, you can install it with Homebrew:
brew install bash-completion@2
As stated in the output of this command, add the following to your ~/.bash_profile file:
export BASH_COMPLETION_COMPAT_DIR="/usr/local/etc/bash_completion.d"
[[ -r "/usr/local/etc/profile.d/bash_completion.sh" ]] && . "/usr/local/etc/profile.d/bash_completion.sh"
If you have an alias for kubectl, you can extend shell completion to work with that alias:
echo 'alias k=kubectl' >> ~/.bash_profile
echo 'complete -F __start_kubectl k' >> ~/.bash_profile
If you installed kubectl with Homebrew (as explained here), then the kubectl completion script should already be in /usr/local/etc/bash_completion.d/kubectl. In that case, you don't need to do anything.
Note: The Homebrew installation of bash-completion v2 sources all the files in the BASH_COMPLETION_COMPAT_DIR directory, that's why the latter two methods work.
In any case, after reloading your shell, kubectl completion should be working.
The kubectl completion script for Fish can be generated with the command kubectl completion fish. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following line to your ~/.config/fish/config.fish file:
kubectl completion fish | source
After reloading your shell, kubectl autocompletion should be working.
The kubectl completion script for Zsh can be generated with the command kubectl completion zsh. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following to your ~/.zshrc file:
source <(kubectl completion zsh)
If you have an alias for kubectl, kubectl autocompletion will automatically work with it.
After reloading your shell, kubectl autocompletion should be working.
If you get an error like 2: command not found: compdef, then add the following to the beginning of your ~/.zshrc file:
autoload -Uz compinit
compinit
Install kubectl convert plugin
A plugin for Kubernetes command-line tool kubectl, which allows you to convert manifests between different API
versions. This can be particularly helpful to migrate manifests to a non-deprecated api version with newer Kubernetes release.
For more info, visit migrate to non deprecated apis
You must use a kubectl version that is within one minor version difference of your cluster. For example, a v1.24 client can communicate with v1.23, v1.24, and v1.25 control planes.
Using the latest compatible version of kubectl helps avoid unforeseen issues.
Install kubectl on Windows
The following methods exist for installing kubectl on Windows:
Append or prepend the kubectl binary folder to your PATH environment variable.
Test to ensure the version of kubectl is the same as downloaded:
kubectl version --client
Or use this for detailed view of version:
kubectl version --client --output=yaml
Note: Docker Desktop for Windows adds its own version of kubectl to PATH.
If you have installed Docker Desktop before, you may need to place your PATH entry before the one added by the Docker Desktop installer or remove the Docker Desktop's kubectl.
Install on Windows using Chocolatey or Scoop
To install kubectl on Windows you can use either Chocolatey package manager or Scoop command-line installer.
Test to ensure the version you installed is up-to-date:
kubectl version --client
Navigate to your home directory:
# If you're using cmd.exe, run: cd %USERPROFILE%
cd ~
Create the .kube directory:
mkdir .kube
Change to the .kube directory you just created:
cd .kube
Configure kubectl to use a remote Kubernetes cluster:
New-Item config -type file
Note: Edit the config file with a text editor of your choice, such as Notepad.
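For reference, a kubeconfig for a remote cluster has roughly the following shape; every value here (cluster name, server address, user name, credentials) is a placeholder to replace with your cluster's details:
apiVersion: v1
kind: Config
clusters:
- cluster:
    server: https://my-cluster.example.com:6443
    certificate-authority-data: <base64-encoded-ca-certificate>
  name: my-cluster
contexts:
- context:
    cluster: my-cluster
    user: my-user
  name: my-context
current-context: my-context
users:
- name: my-user
  user:
    token: <bearer-token>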
Verify kubectl configuration
In order for kubectl to find and access a Kubernetes cluster, it needs a
kubeconfig file,
which is created automatically when you create a cluster using
kube-up.sh
or successfully deploy a Minikube cluster.
By default, kubectl configuration is located at ~/.kube/config.
Check that kubectl is properly configured by getting the cluster state:
kubectl cluster-info
If you see a URL response, kubectl is correctly configured to access your cluster.
If you see a message similar to the following, kubectl is not configured correctly or is not able to connect to a Kubernetes cluster.
The connection to the server <server-name:port> was refused - did you specify the right host or port?
For example, if you are intending to run a Kubernetes cluster on your laptop (locally), you will need a tool like Minikube to be installed first and then re-run the commands stated above.
If kubectl cluster-info returns the URL response but you can't access your cluster, check whether it is configured properly by using:
kubectl cluster-info dump
Optional kubectl configurations and plugins
Enable shell autocompletion
kubectl provides autocompletion support for Bash, Zsh, Fish, and PowerShell, which can save you a lot of typing.
Below are the procedures to set up autocompletion for PowerShell.
The kubectl completion script for PowerShell can be generated with the command kubectl completion powershell.
To do so in all your shell sessions, add the following line to your $PROFILE file:
kubectl completion powershell | Out-String | Invoke-Expression
This command will regenerate the auto-completion script on every PowerShell start up. You can also add the generated script directly to your $PROFILE file.
To add the generated script to your $PROFILE file, run the following line in your powershell prompt:
kubectl completion powershell >> $PROFILE
After reloading your shell, kubectl autocompletion should be working.
Install kubectl convert plugin
A plugin for Kubernetes command-line tool kubectl, which allows you to convert manifests between different API
versions. This can be particularly helpful to migrate manifests to a non-deprecated api version with newer Kubernetes release.
For more info, visit migrate to non deprecated apis
Snippets to be included in the main kubectl-installs-*.md pages.
1.4.1 - bash auto-completion on Linux
Some optional configuration for bash auto-completion on Linux.
Introduction
The kubectl completion script for Bash can be generated with the command kubectl completion bash. Sourcing the completion script in your shell enables kubectl autocompletion.
However, the completion script depends on bash-completion, which means that you have to install this software first (you can test if you have bash-completion already installed by running type _init_completion).
Install bash-completion
bash-completion is provided by many package managers (see here). You can install it with apt-get install bash-completion or yum install bash-completion, etc.
The above commands create /usr/share/bash-completion/bash_completion, which is the main script of bash-completion. Depending on your package manager, you have to manually source this file in your ~/.bashrc file.
To find out, reload your shell and run type _init_completion. If the command succeeds, you're already set, otherwise add the following to your ~/.bashrc file:
source /usr/share/bash-completion/bash_completion
Reload your shell and verify that bash-completion is correctly installed by typing type _init_completion.
Enable kubectl autocompletion
Bash
You now need to ensure that the kubectl completion script gets sourced in all your shell sessions. There are two ways in which you can do this:
Source the completion script in your ~/.bashrc file:
echo 'source <(kubectl completion bash)' >> ~/.bashrc
Add the completion script to the /etc/bash_completion.d directory:
kubectl completion bash | sudo tee /etc/bash_completion.d/kubectl > /dev/null
If you have an alias for kubectl, you can extend shell completion to work with that alias:
echo 'alias k=kubectl' >> ~/.bashrc
echo 'complete -F __start_kubectl k' >> ~/.bashrc
Note: bash-completion sources all completion scripts in /etc/bash_completion.d.
Both approaches are equivalent. After reloading your shell, kubectl autocompletion should be working.
1.4.2 - bash auto-completion on macOS
Some optional configuration for bash auto-completion on macOS.
Introduction
The kubectl completion script for Bash can be generated with kubectl completion bash. Sourcing this script in your shell enables kubectl completion.
However, the kubectl completion script depends on bash-completion which you thus have to previously install.
Warning: There are two versions of bash-completion, v1 and v2. V1 is for Bash 3.2 (which is the default on macOS), and v2 is for Bash 4.1+. The kubectl completion script doesn't work correctly with bash-completion v1 and Bash 3.2. It requires bash-completion v2 and Bash 4.1+. Thus, to be able to correctly use kubectl completion on macOS, you have to install and use Bash 4.1+ (instructions). The following instructions assume that you use Bash 4.1+ (that is, any Bash version of 4.1 or newer).
Upgrade Bash
The instructions here assume you use Bash 4.1+. You can check your Bash's version by running:
echo $BASH_VERSION
If it is too old, you can install/upgrade it using Homebrew:
brew install bash
Reload your shell and verify that the desired version is being used:
echo $BASH_VERSION $SHELL
Homebrew usually installs it at /usr/local/bin/bash.
Install bash-completion
Note: As mentioned, these instructions assume you use Bash 4.1+, which means you will install bash-completion v2 (in contrast to Bash 3.2 and bash-completion v1, in which case kubectl completion won't work).
You can test if you have bash-completion v2 already installed with type _init_completion. If not, you can install it with Homebrew:
brew install bash-completion@2
As stated in the output of this command, add the following to your ~/.bash_profile file:
export BASH_COMPLETION_COMPAT_DIR="/usr/local/etc/bash_completion.d"
[[ -r "/usr/local/etc/profile.d/bash_completion.sh" ]] && . "/usr/local/etc/profile.d/bash_completion.sh"
If you have an alias for kubectl, you can extend shell completion to work with that alias:
echo 'alias k=kubectl' >> ~/.bash_profile
echo 'complete -F __start_kubectl k' >> ~/.bash_profile
If you installed kubectl with Homebrew (as explained here), then the kubectl completion script should already be in /usr/local/etc/bash_completion.d/kubectl. In that case, you don't need to do anything.
Note: The Homebrew installation of bash-completion v2 sources all the files in the BASH_COMPLETION_COMPAT_DIR directory, that's why the latter two methods work.
In any case, after reloading your shell, kubectl completion should be working.
1.4.3 - fish auto-completion
Optional configuration to enable fish shell auto-completion.
The kubectl completion script for Fish can be generated with the command kubectl completion fish. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following line to your ~/.config/fish/config.fish file:
kubectl completion fish | source
After reloading your shell, kubectl autocompletion should be working.
1.4.4 - kubectl-convert overview
A kubectl plugin that allows you to convert manifests from one version of a Kubernetes API to a different version.
A plugin for Kubernetes command-line tool kubectl, which allows you to convert manifests between different API
versions. This can be particularly helpful to migrate manifests to a non-deprecated api version with newer Kubernetes release.
For more info, visit migrate to non deprecated apis
1.4.5 - PowerShell auto-completion
Some optional configuration for powershell auto-completion.
The kubectl completion script for PowerShell can be generated with the command kubectl completion powershell.
To do so in all your shell sessions, add the following line to your $PROFILE file:
kubectl completion powershell | Out-String | Invoke-Expression
This command will regenerate the auto-completion script on every PowerShell start up. You can also add the generated script directly to your $PROFILE file.
To add the generated script to your $PROFILE file, run the following line in your powershell prompt:
kubectl completion powershell >> $PROFILE
After reloading your shell, kubectl autocompletion should be working.
1.4.6 - verify kubectl install
How to verify kubectl.
In order for kubectl to find and access a Kubernetes cluster, it needs a
kubeconfig file,
which is created automatically when you create a cluster using
kube-up.sh
or successfully deploy a Minikube cluster.
By default, kubectl configuration is located at ~/.kube/config.
Check that kubectl is properly configured by getting the cluster state:
kubectl cluster-info
If you see a URL response, kubectl is correctly configured to access your cluster.
If you see a message similar to the following, kubectl is not configured correctly or is not able to connect to a Kubernetes cluster.
The connection to the server <server-name:port> was refused - did you specify the right host or port?
For example, if you are intending to run a Kubernetes cluster on your laptop (locally), you will need a tool like Minikube to be installed first and then re-run the commands stated above.
If kubectl cluster-info returns the URL response but you can't access your cluster, check whether it is configured properly by using:
kubectl cluster-info dump
1.4.7 - zsh auto-completion
Some optional configuration for zsh auto-completion.
The kubectl completion script for Zsh can be generated with the command kubectl completion zsh. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following to your ~/.zshrc file:
source <(kubectl completion zsh)
If you have an alias for kubectl, kubectl autocompletion will automatically work with it.
After reloading your shell, kubectl autocompletion should be working.
If you get an error like 2: command not found: compdef, then add the following to the beginning of your ~/.zshrc file:
autoload -Uz compinit
compinit
2 - Administer a Cluster
Learn common tasks for administering a cluster.
2.1 - Administration with kubeadm
2.1.1 - Certificate Management with kubeadm
FEATURE STATE: Kubernetes v1.15 [stable]
Client certificates generated by kubeadm expire after 1 year.
This page explains how to manage certificate renewals with kubeadm. It also covers other tasks related
to kubeadm certificate management.
By default, kubeadm generates all the certificates needed for a cluster to run.
You can override this behavior by providing your own certificates.
To do so, you must place them in whatever directory is specified by the
--cert-dir flag or the certificatesDir field of kubeadm's ClusterConfiguration.
By default this is /etc/kubernetes/pki.
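For example, a ClusterConfiguration that makes the default location explicit would contain a snippet like this:
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
certificatesDir: /etc/kubernetes/pki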
If a given certificate and private key pair exists before running kubeadm init,
kubeadm does not overwrite them. This means you can, for example, copy an existing
CA into /etc/kubernetes/pki/ca.crt and /etc/kubernetes/pki/ca.key,
and kubeadm will use this CA for signing the rest of the certificates.
External CA mode
It is also possible to provide only the ca.crt file and not the
ca.key file (this is only available for the root CA file, not other cert pairs).
If all other certificates and kubeconfig files are in place, kubeadm recognizes
this condition and activates the "External CA" mode. kubeadm will proceed without the
CA key on disk.
Instead, run the controller-manager standalone with --controllers=csrsigner and
point to the CA certificate and key.
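A sketch of the relevant controller-manager flags (paths are examples; all other flags omitted):
kube-controller-manager --controllers=csrsigner \
  --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt \
  --cluster-signing-key-file=/etc/kubernetes/pki/ca.key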
You can use the check-expiration subcommand to check when certificates expire:
kubeadm certs check-expiration
The output is similar to this:
CERTIFICATE EXPIRES RESIDUAL TIME CERTIFICATE AUTHORITY EXTERNALLY MANAGED
admin.conf Dec 30, 2020 23:36 UTC 364d no
apiserver Dec 30, 2020 23:36 UTC 364d ca no
apiserver-etcd-client Dec 30, 2020 23:36 UTC 364d etcd-ca no
apiserver-kubelet-client Dec 30, 2020 23:36 UTC 364d ca no
controller-manager.conf Dec 30, 2020 23:36 UTC 364d no
etcd-healthcheck-client Dec 30, 2020 23:36 UTC 364d etcd-ca no
etcd-peer Dec 30, 2020 23:36 UTC 364d etcd-ca no
etcd-server Dec 30, 2020 23:36 UTC 364d etcd-ca no
front-proxy-client Dec 30, 2020 23:36 UTC 364d front-proxy-ca no
scheduler.conf Dec 30, 2020 23:36 UTC 364d no
CERTIFICATE AUTHORITY EXPIRES RESIDUAL TIME EXTERNALLY MANAGED
ca Dec 28, 2029 23:36 UTC 9y no
etcd-ca Dec 28, 2029 23:36 UTC 9y no
front-proxy-ca Dec 28, 2029 23:36 UTC 9y no
The command shows expiration/residual time for the client certificates in the /etc/kubernetes/pki folder and for the client certificate embedded in the KUBECONFIG files used by kubeadm (admin.conf, controller-manager.conf and scheduler.conf).
Additionally, kubeadm informs the user if the certificate is externally managed; in this case, the user should take care of managing certificate renewal manually/using other tools.
Warning: kubeadm cannot manage certificates signed by an external CA.
On nodes created with kubeadm init, prior to kubeadm version 1.17, there is a
bug where you manually have to modify the contents of kubelet.conf. After kubeadm init finishes, you should update kubelet.conf to point to the
rotated kubelet client certificates, by replacing client-certificate-data and client-key-data with:
client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
client-key: /var/lib/kubelet/pki/kubelet-client-current.pem
kubeadm renews all the certificates during control plane upgrade.
This feature is designed for addressing the simplest use cases;
if you don't have specific requirements on certificate renewal and perform Kubernetes version upgrades regularly (less than 1 year in between each upgrade), kubeadm will take care of keeping your cluster up to date and reasonably secure.
Note: It is a best practice to upgrade your cluster frequently in order to stay secure.
If you have more complex requirements for certificate renewal, you can opt out from the default behavior by passing --certificate-renewal=false to kubeadm upgrade apply or to kubeadm upgrade node.
Warning: Prior to kubeadm version 1.17 there is a bug
where the default value for --certificate-renewal is false for the kubeadm upgrade node
command. In that case, you should explicitly set --certificate-renewal=true.
Manual certificate renewal
You can renew your certificates manually at any time with the kubeadm certs renew command.
This command performs the renewal using the CA (or front-proxy-CA) certificate and key stored in /etc/kubernetes/pki.
After running the command you should restart the control plane Pods. This is required since
dynamic certificate reload is currently not supported for all components and certificates.
Static Pods are managed by the local kubelet
and not by the API Server, thus kubectl cannot be used to delete and restart them.
To restart a static Pod you can temporarily remove its manifest file from /etc/kubernetes/manifests/
and wait for 20 seconds (see the fileCheckFrequency value in the KubeletConfiguration struct).
The kubelet will terminate the Pod if it's no longer in the manifest directory.
You can then move the file back and after another fileCheckFrequency period, the kubelet will recreate
the Pod and the certificate renewal for the component can complete.
Warning: If you are running an HA cluster, this command needs to be executed on all the control-plane nodes.
Note: certs renew uses the existing certificates as the authoritative source for attributes (Common Name, Organization, SAN, etc.) instead of the kubeadm-config ConfigMap. It is strongly recommended to keep them both in sync.
kubeadm certs renew provides the following options:
The Kubernetes certificates normally reach their expiration date after one year.
--csr-only can be used to renew certificates with an external CA by generating certificate signing requests (without actually renewing certificates in place); see next paragraph for more information.
It's also possible to renew a single certificate instead of all.
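For example, on a control plane node:
kubeadm certs renew all          # renew every certificate managed by kubeadm
kubeadm certs renew apiserver    # renew a single certificate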
Renew certificates with the Kubernetes certificates API
This section provides more details about how to execute manual certificate renewal using the Kubernetes certificates API.
Caution: These are advanced topics for users who need to integrate their organization's certificate infrastructure into a kubeadm-built cluster. If the default kubeadm configuration satisfies your needs, you should let kubeadm manage certificates instead.
Set up a signer
The Kubernetes Certificate Authority does not work out of the box.
You can configure an external signer such as cert-manager, or you can use the built-in signer.
This section provides more details about how to execute manual certificate renewal using an external CA.
To better integrate with external CAs, kubeadm can also produce certificate signing requests (CSRs).
A CSR represents a request to a CA for a signed certificate for a client.
In kubeadm terms, any certificate that would normally be signed by an on-disk CA can be produced as a CSR instead. A CA, however, cannot be produced as a CSR.
Create certificate signing requests (CSR)
You can create certificate signing requests with kubeadm certs renew --csr-only.
Both the CSR and the accompanying private key are given in the output.
You can pass in a directory with --csr-dir to output the CSRs to the specified location.
If --csr-dir is not specified, the default certificate directory (/etc/kubernetes/pki) is used.
Certificates can be renewed with kubeadm certs renew --csr-only.
As with kubeadm init, an output directory can be specified with the --csr-dir flag.
A CSR contains a certificate's name, domains, and IPs, but it does not specify usages.
It is the responsibility of the CA to specify the correct cert usages
when issuing a certificate.
After a certificate is signed using your preferred method, the certificate and the private key must be copied to the PKI directory (by default /etc/kubernetes/pki).
Certificate authority (CA) rotation
Kubeadm does not support rotation or replacement of CA certificates out of the box.
By default the kubelet serving certificate deployed by kubeadm is self-signed.
This means a connection from external services like the
metrics-server to a
kubelet cannot be secured with TLS.
To configure the kubelets in a new kubeadm cluster to obtain properly signed serving
certificates you must pass the following minimal configuration to kubeadm init:
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serverTLSBootstrap: true
If you have already created the cluster you must adapt it by doing the following:
Find and edit the kubelet-config-1.24 ConfigMap in the kube-system namespace.
In that ConfigMap, the kubelet key has a
KubeletConfiguration
document as its value. Edit the KubeletConfiguration document to set serverTLSBootstrap: true.
On each node, add the serverTLSBootstrap: true field in /var/lib/kubelet/config.yaml
and restart the kubelet with systemctl restart kubelet
The field serverTLSBootstrap: true will enable the bootstrap of kubelet serving
certificates by requesting them from the certificates.k8s.io API. One known limitation
is that the CSRs (Certificate Signing Requests) for these certificates cannot be automatically
approved by the default signer in the kube-controller-manager -
kubernetes.io/kubelet-serving.
This will require action from the user or a third party controller.
These CSRs can be viewed using:
kubectl get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-9wvgt 112s kubernetes.io/kubelet-serving system:node:worker-1 Pending
csr-lz97v 1m58s kubernetes.io/kubelet-serving system:node:control-plane-1 Pending
To approve them you can do the following:
kubectl certificate approve <CSR-name>
By default, these serving certificates will expire after one year. Kubeadm sets the
KubeletConfiguration field rotateCertificates to true, which means that close
to expiration a new set of CSRs for the serving certificates will be created and must
be approved to complete the rotation. To understand more see
Certificate Rotation.
If you are looking for a solution for automatic approval of these CSRs it is recommended
that you contact your cloud provider and ask if they have a CSR signer that verifies
the node identity with an out of band mechanism.
Note:
This section links to third party projects that provide functionality required by Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are listed alphabetically. To add a project to this list, read the content guide before submitting a change. More information.
Such a controller is not a secure mechanism unless it not only verifies the CommonName
in the CSR but also verifies the requested IPs and domain names. This would prevent
a malicious actor that has access to a kubelet client certificate from creating
CSRs requesting serving certificates for any IP or domain name.
Generating kubeconfig files for additional users
During cluster creation, kubeadm signs the certificate in the admin.conf to have
Subject: O = system:masters, CN = kubernetes-admin.
system:masters
is a break-glass, super user group that bypasses the authorization layer (e.g. RBAC).
Sharing the admin.conf with additional users is not recommended!
Instead, you can use the kubeadm kubeconfig user
command to generate kubeconfig files for additional users.
The command accepts a mixture of command line flags and
kubeadm configuration options.
The generated kubeconfig will be written to stdout and can be piped to a file
using kubeadm kubeconfig user ... > somefile.conf.
Example configuration file that can be used with --config:
# example.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
# Will be used as the target "cluster" in the kubeconfig
clusterName: "kubernetes"
# Will be used as the "server" (IP or DNS name) of this cluster in the kubeconfig
controlPlaneEndpoint: "some-dns-address:6443"
# The cluster CA key and certificate will be loaded from this local directory
certificatesDir: "/etc/kubernetes/pki"
Make sure that these settings match the desired target cluster settings.
To see the settings of an existing cluster use:
kubectl get cm kubeadm-config -n kube-system -o=jsonpath="{.data.ClusterConfiguration}"
The following example will generate a kubeconfig file with credentials valid for 24 hours
for a new user johndoe that is part of the appdevs group:
kubeadm kubeconfig user --config example.yaml --org appdevs --client-name johndoe --validity-period 24h
2.1.2 - Configuring a cgroup driver
The Container runtimes page
explains that the systemd driver is recommended for kubeadm based setups instead
of the cgroupfs driver, because kubeadm manages the kubelet as a systemd service.
The page also provides details on how to setup a number of different container runtimes with the
systemd driver by default.
Configuring the kubelet cgroup driver
kubeadm allows you to pass a KubeletConfiguration structure during kubeadm init.
This KubeletConfiguration can include the cgroupDriver field which controls the cgroup
driver of the kubelet.
Note: In v1.22, if the user is not setting the cgroupDriver field under KubeletConfiguration,
kubeadm will default it to systemd.
A minimal example of configuring the field explicitly:
# kubeadm-config.yaml
kind: ClusterConfiguration
apiVersion: kubeadm.k8s.io/v1beta3
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: systemd
Such a configuration file can then be passed to the kubeadm command:
kubeadm init --config kubeadm-config.yaml
Note:
Kubeadm uses the same KubeletConfiguration for all nodes in the cluster.
The KubeletConfiguration is stored in a ConfigMap
object under the kube-system namespace.
Executing the subcommands init, join and upgrade would result in kubeadm
writing the KubeletConfiguration as a file under /var/lib/kubelet/config.yaml
and passing it to the local node kubelet.
Using the cgroupfs driver
As this guide explains, using the cgroupfs driver with kubeadm is not recommended.
To continue using cgroupfs and to prevent kubeadm upgrade from modifying the
KubeletConfiguration cgroup driver on existing setups, you must be explicit
about its value. This applies to a case where you do not wish future versions
of kubeadm to apply the systemd driver by default.
See the below section on "Modify the kubelet ConfigMap" for details on
how to be explicit about the value.
If you wish to configure a container runtime to use the cgroupfs driver,
you must refer to the documentation of the container runtime of your choice.
Migrating to the systemd driver
To change the cgroup driver of an existing kubeadm cluster to systemd in-place,
a similar procedure to a kubelet upgrade is required. This must include both
steps outlined below.
Note: Alternatively, it is possible to replace the old nodes in the cluster with new ones
that use the systemd driver. This requires executing only the first step below
before joining the new nodes and ensuring the workloads can safely move to the new
nodes before deleting the old nodes.
Modify the kubelet ConfigMap
Call kubectl edit cm kubelet-config -n kube-system.
Either modify the existing cgroupDriver value or add a new field that looks like this:
cgroupDriver: systemd
This field must be present under the kubelet: section of the ConfigMap.
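As a rough sketch, the relevant part of the ConfigMap then looks like this (other kubelet fields omitted):
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubelet-config
  namespace: kube-system
data:
  kubelet: |
    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    cgroupDriver: systemd
    # ...remaining kubelet configuration unchanged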
Update the cgroup driver on all nodes
For each node in the cluster:
Drain the node using kubectl drain <node-name> --ignore-daemonsets
Stop the kubelet using systemctl stop kubelet
Stop the container runtime
Modify the container runtime cgroup driver to systemd
Set cgroupDriver: systemd in /var/lib/kubelet/config.yaml
Start the container runtime
Start the kubelet using systemctl start kubelet
Uncordon the node using kubectl uncordon <node-name>
Execute these steps on nodes one at a time to ensure workloads
have sufficient time to schedule on different nodes.
Once the process is complete ensure that all nodes and workloads are healthy.
2.1.3 - Reconfiguring a kubeadm cluster
kubeadm does not support automated ways of reconfiguring components that
were deployed on managed nodes. One way of automating this would be
by using a custom operator.
To modify the components configuration you must manually edit associated cluster
objects and files on disk.
This guide shows the correct sequence of steps that need to be performed
to achieve kubeadm cluster reconfiguration.
Before you begin
You need a cluster that was deployed using kubeadm
Have administrator credentials (/etc/kubernetes/admin.conf) and network connectivity
to a running kube-apiserver in the cluster from a host that has kubectl installed
Have a text editor installed on all hosts
Reconfiguring the cluster
kubeadm writes a set of cluster wide component configuration options in
ConfigMaps and other objects. These objects must be manually edited. The command kubectl edit
can be used for that.
The kubectl edit command will open a text editor where you can edit and save the object directly.
You can use the environment variables KUBECONFIG and KUBE_EDITOR to specify the location of
the kubectl consumed kubeconfig file and preferred text editor.
Note: Upon saving any changes to these cluster objects, components running on nodes may not be
automatically updated. The steps below instruct you on how to perform that manually.
Warning: Component configuration in ConfigMaps is stored as unstructured data (YAML string).
This means that validation will not be performed upon updating the contents of a ConfigMap.
You have to be careful to follow the documented API format for a particular
component configuration and avoid introducing typos and YAML indentation mistakes.
Applying cluster configuration changes
Updating the ClusterConfiguration
During cluster creation and upgrade, kubeadm writes its
ClusterConfiguration
in a ConfigMap called kubeadm-config in the kube-system namespace.
To change a particular option in the ClusterConfiguration you can edit the ConfigMap with this command:
kubectl edit cm -n kube-system kubeadm-config
The configuration is located under the data.ClusterConfiguration key.
Note: The ClusterConfiguration includes a variety of options that affect the configuration of individual
components such as kube-apiserver, kube-scheduler, kube-controller-manager, CoreDNS, etcd and kube-proxy.
Changes to the configuration must be reflected on node components manually.
Reflecting ClusterConfiguration changes on control plane nodes
kubeadm manages the control plane components as static Pod manifests located in
the directory /etc/kubernetes/manifests.
Any changes to the ClusterConfiguration under the apiServer, controllerManager, scheduler or etcd
keys must be reflected in the associated files in the manifests directory on a control plane node.
Such changes may include:
extraArgs - requires updating the list of flags passed to a component container
extraMounts - requires updating the volume mounts for a component container
*SANs - requires writing new certificates with updated Subject Alternative Names.
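For illustration, an extraArgs change of this kind in the ClusterConfiguration could look like the following snippet under the apiServer key; the audit-log-path flag is only an example:
apiServer:
  extraArgs:
    audit-log-path: /var/log/kubernetes/audit.log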
Before proceeding with these changes, make sure you have backed up the directory /etc/kubernetes/.
To write new certificates you can use:
kubeadm init phase certs <component-name> --config <config-file>
To write new manifest files in /etc/kubernetes/manifests you can use:
kubeadm init phase control-plane <component-name> --config <config-file>
The <config-file> contents must match the updated ClusterConfiguration.
The <component-name> value must be the name of the component.
Note: Updating a file in /etc/kubernetes/manifests will tell the kubelet to restart the static Pod for the corresponding component.
Try doing these changes one node at a time to leave the cluster without downtime.
Applying kubelet configuration changes
Updating the KubeletConfiguration
During cluster creation and upgrade, kubeadm writes its
KubeletConfiguration
in a ConfigMap called kubelet-config in the kube-system namespace.
You can edit the ConfigMap with this command:
kubectl edit cm -n kube-system kubelet-config
The configuration is located under the data.kubelet key.
Reflecting the kubelet changes
To reflect the change on kubeadm nodes you must do the following:
Log in to a kubeadm node
Run kubeadm upgrade node phase kubelet-config to download the latest kubelet-config
ConfigMap contents into the local file /var/lib/kubelet/config.conf
Edit the file /var/lib/kubelet/kubeadm-flags.env to apply additional configuration with
flags
Restart the kubelet service with systemctl restart kubelet
Note: Do these changes one node at a time to allow workloads to be rescheduled properly.
Note: During kubeadm upgrade, kubeadm downloads the KubeletConfiguration from the
kubelet-config ConfigMap and overwrites the contents of /var/lib/kubelet/config.conf.
This means that node local configuration must be applied either by flags in
/var/lib/kubelet/kubeadm-flags.env or by manually updating the contents of
/var/lib/kubelet/config.conf after kubeadm upgrade, and then restarting the kubelet.
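For instance, /var/lib/kubelet/kubeadm-flags.env holds a single KUBELET_KUBEADM_ARGS variable; a node-local flag can be appended to whatever kubeadm already wrote there (the extra verbosity flag below is just an example):
KUBELET_KUBEADM_ARGS="<flags already written by kubeadm> --v=2"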
Applying kube-proxy configuration changes
Updating the KubeProxyConfiguration
During cluster creation and upgrade, kubeadm writes its
KubeProxyConfiguration
in a ConfigMap in the kube-system namespace called kube-proxy.
This ConfigMap is used by the kube-proxy DaemonSet in the kube-system namespace.
To change a particular option in the KubeProxyConfiguration, you can edit the ConfigMap with this command:
kubectl edit cm -n kube-system kube-proxy
The configuration is located under the data.config.conf key.
Reflecting the kube-proxy changes
Once the kube-proxy ConfigMap is updated, you can restart all kube-proxy Pods:
Obtain the Pod names:
kubectl get po -n kube-system | grep kube-proxy
Delete a Pod with:
kubectl delete po -n kube-system <pod-name>
New Pods that use the updated ConfigMap will be created.
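Alternatively, a single command restarts the whole DaemonSet and achieves the same result:
kubectl -n kube-system rollout restart daemonset kube-proxy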
Note: Because kubeadm deploys kube-proxy as a DaemonSet, node specific configuration is unsupported.
Applying CoreDNS configuration changes
Updating the CoreDNS Deployment and Service
kubeadm deploys CoreDNS as a Deployment called coredns and with a Service kube-dns,
both in the kube-system namespace.
To update any of the CoreDNS settings, you can edit the Deployment and
Service objects:
kubectl edit deployment coredns -n kube-system
kubectl edit service kube-dns -n kube-system
Once the CoreDNS changes are applied you can delete the CoreDNS Pods:
Obtain the Pod names:
kubectl get po -n kube-system | grep coredns
Delete a Pod with:
kubectl delete po -n kube-system <pod-name>
New Pods with the updated CoreDNS configuration will be created.
Note: kubeadm does not allow CoreDNS configuration during cluster creation and upgrade.
This means that if you execute kubeadm upgrade apply, your changes to the CoreDNS
objects will be lost and must be reapplied.
Persisting the reconfiguration
During the execution of kubeadm upgrade on a managed node, kubeadm might overwrite configuration
that was applied after the cluster was created (reconfiguration).
Persisting Node object reconfiguration
kubeadm writes Labels, Taints, CRI socket and other information on the Node object for a particular
Kubernetes node. To change any of the contents of this Node object you can use:
kubectl edit no <node-name>
During kubeadm upgrade the contents of such a Node might get overwritten.
If you would like to persist your modifications to the Node object after upgrade,
you can prepare a kubectl patch
and apply it to the Node object:
kubectl patch no <node-name> --patch-file <patch-file>
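A patch file for this purpose can be a small strategic merge patch; the label below is a made-up example of node-specific state you might want to keep:
# node-patch.yaml
metadata:
  labels:
    example.com/rack: rack-1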
Persisting control plane component reconfiguration
The main source of control plane configuration is the ClusterConfiguration
object stored in the cluster. To extend the static Pod manifests configuration,
patches can be used.
These patch files must remain as files on the control plane nodes to ensure that
they can be used by the kubeadm upgrade ... --patches <directory>.
If reconfiguration is done to the ClusterConfiguration and static Pod manifests on disk,
the set of node specific patches must be updated accordingly.
Persisting kubelet reconfiguration
Any changes to the KubeletConfiguration stored in /var/lib/kubelet/config.conf will be overwritten on
kubeadm upgrade by downloading the contents of the cluster wide kubelet-config ConfigMap.
To persist kubelet node specific configuration either the file /var/lib/kubelet/config.conf
has to be updated manually post-upgrade or the file /var/lib/kubelet/kubeadm-flags.env can include flags.
The kubelet flags override the associated KubeletConfiguration options, but note that
some of the flags are deprecated.
A kubelet restart will be required after changing /var/lib/kubelet/config.conf or
/var/lib/kubelet/kubeadm-flags.env.
2.1.4 - Upgrading kubeadm clusters
This page explains how to upgrade a Kubernetes cluster created with kubeadm from version
1.23.x to version 1.24.x, and from version
1.24.x to 1.24.y (where y > x). Skipping MINOR versions
when upgrading is unsupported. For more details, please visit Version Skew Policy.
To see information about upgrading clusters created using older versions of kubeadm,
please refer to following pages instead:
The cluster should use a static control plane and etcd pods or external etcd.
Make sure to back up any important components, such as app-level state stored in a database.
kubeadm upgrade does not touch your workloads, only components internal to Kubernetes, but backups are always a best practice.
The instructions below outline when to drain each node during the upgrade process.
If you are performing a minor version upgrade for any kubelet, you must
first drain the node (or nodes) that you are upgrading. In the case of control plane nodes,
they could be running CoreDNS Pods or other critical workloads. For more information see
Draining nodes.
All containers are restarted after upgrade, because the container spec hash value is changed.
To verify that the kubelet service has successfully restarted after the kubelet has been upgraded,
you can execute systemctl status kubelet or view the service logs with journalctl -xeu kubelet.
apt update
apt-cache madison kubeadm
# find the latest 1.24 version in the list
# it should look like 1.24.x-00, where x is the latest patch
yum list --showduplicates kubeadm --disableexcludes=kubernetes
# find the latest 1.24 version in the list
# it should look like 1.24.x-0, where x is the latest patch
Upgrading control plane nodes
The upgrade procedure on control plane nodes should be executed one node at a time.
Pick a control plane node that you wish to upgrade first. It must have the /etc/kubernetes/admin.conf file.
# replace x in 1.24.x-00 with the latest patch version
apt-mark unhold kubeadm && \
apt-get update && apt-get install -y kubeadm=1.24.x-00 && \
apt-mark hold kubeadm
# replace x in 1.24.x-0 with the latest patch version
yum install -y kubeadm-1.24.x-0 --disableexcludes=kubernetes
Verify that the download works and has the expected version:
kubeadm version
Verify the upgrade plan:
kubeadm upgrade plan
This command checks that your cluster can be upgraded, and fetches the versions you can upgrade to.
It also shows a table with the component config version states.
Note: kubeadm upgrade also automatically renews the certificates that it manages on this node.
To opt-out of certificate renewal the flag --certificate-renewal=false can be used.
For more information see the certificate management guide.
Note: If kubeadm upgrade plan shows any component configs that require manual upgrade, users must provide
a config file with replacement configs to kubeadm upgrade apply via the --config command line flag.
Failing to do so will cause kubeadm upgrade apply to exit with an error and not perform an upgrade.
Choose a version to upgrade to, and run the appropriate command. For example:
# replace x with the patch version you picked for this upgrade
sudo kubeadm upgrade apply v1.24.x
Once the command finishes you should see:
[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.24.x". Enjoy!
[upgrade/kubelet] Now that your control plane is upgraded, please proceed with upgrading your kubelets if you haven't already done so.
Manually upgrade your CNI provider plugin.
Your Container Network Interface (CNI) provider may have its own upgrade instructions to follow.
Check the addons page to
find your CNI provider and see whether additional upgrade steps are required.
This step is not required on additional control plane nodes if the CNI provider runs as a DaemonSet.
For the other control plane nodes
Same as the first control plane node but use:
sudo kubeadm upgrade node
instead of:
sudo kubeadm upgrade apply
Also calling kubeadm upgrade plan and upgrading the CNI provider plugin is no longer needed.
Drain the node
Prepare the node for maintenance by marking it unschedulable and evicting the workloads:
# replace <node-to-drain> with the name of your node you are draining
kubectl drain <node-to-drain> --ignore-daemonsets
# replace x in 1.24.x-00 with the latest patch version
apt-mark unhold kubelet kubectl && \
apt-get update && apt-get install -y kubelet=1.24.x-00 kubectl=1.24.x-00 && \
apt-mark hold kubelet kubectl
# replace x in 1.24.x-0 with the latest patch version
yum install -y kubelet-1.24.x-0 kubectl-1.24.x-0 --disableexcludes=kubernetes
Bring the node back online by marking it schedulable:
# replace <node-to-drain> with the name of your node
kubectl uncordon <node-to-drain>
Upgrade worker nodes
The upgrade procedure on worker nodes should be executed one node at a time or a few nodes at a time,
without compromising the minimum required capacity for running your workloads.
# replace x in 1.24.x-00 with the latest patch version
apt-mark unhold kubeadm && \
apt-get update && apt-get install -y kubeadm=1.24.x-00 && \
apt-mark hold kubeadm
# replace x in 1.24.x-0 with the latest patch version
yum install -y kubeadm-1.24.x-0 --disableexcludes=kubernetes
Call "kubeadm upgrade"
For worker nodes this upgrades the local kubelet configuration:
sudo kubeadm upgrade node
Drain the node
Prepare the node for maintenance by marking it unschedulable and evicting the workloads:
# replace <node-to-drain> with the name of your node you are draining
kubectl drain <node-to-drain> --ignore-daemonsets
# replace x in 1.24.x-00 with the latest patch version
apt-mark unhold kubelet kubectl && \
apt-get update && apt-get install -y kubelet=1.24.x-00 kubectl=1.24.x-00 && \
apt-mark hold kubelet kubectl
# replace x in 1.24.x-0 with the latest patch version
yum install -y kubelet-1.24.x-0 kubectl-1.24.x-0 --disableexcludes=kubernetes
Bring the node back online by marking it schedulable:
# replace <node-to-drain> with the name of your node
kubectl uncordon <node-to-drain>
Verify the status of the cluster
After the kubelet is upgraded on all nodes, verify that all nodes are available again by running the following command
from anywhere kubectl can access the cluster:
kubectl get nodes
The STATUS column should show Ready for all your nodes, and the version number should be updated.
Recovering from a failure state
If kubeadm upgrade fails and does not roll back, for example because of an unexpected shutdown during execution, you can run kubeadm upgrade again.
This command is idempotent and eventually makes sure that the actual state is the desired state you declare.
To recover from a bad state, you can also run kubeadm upgrade apply --force without changing the version that your cluster is running.
During upgrade kubeadm writes the following backup folders under /etc/kubernetes/tmp:
kubeadm-backup-etcd-<date>-<time>
kubeadm-backup-manifests-<date>-<time>
kubeadm-backup-etcd contains a backup of the local etcd member data for this control plane Node.
In case of an etcd upgrade failure and if the automatic rollback does not work, the contents of this folder
can be manually restored in /var/lib/etcd. In case external etcd is used this backup folder will be empty.
kubeadm-backup-manifests contains a backup of the static Pod manifest files for this control plane Node.
In case of an upgrade failure and if the automatic rollback does not work, the contents of this folder can be
manually restored in /etc/kubernetes/manifests. If for some reason there is no difference between a pre-upgrade
and post-upgrade manifest file for a certain component, a backup file for it will not be written.
How it works
kubeadm upgrade apply does the following:
Checks that your cluster is in an upgradeable state:
The API server is reachable
All nodes are in the Ready state
The control plane is healthy
Enforces the version skew policies.
Makes sure the control plane images are available or available to pull to the machine.
Generates replacements and/or uses user supplied overwrites if component configs require version upgrades.
Upgrades the control plane components or rolls back if any of them fails to come up.
Applies the new CoreDNS and kube-proxy manifests and makes sure that all necessary RBAC rules are created.
Creates new certificate and key files of the API server and backs up old files if they're about to expire in 180 days.
kubeadm upgrade node does the following on additional control plane nodes:
Fetches the kubeadm ClusterConfiguration from the cluster.
Optionally backs up the kube-apiserver certificate.
Upgrades the static Pod manifests for the control plane components.
Upgrades the kubelet configuration for this node.
kubeadm upgrade node does the following on worker nodes:
Fetches the kubeadm ClusterConfiguration from the cluster.
Upgrades the kubelet configuration for this node.
2.1.5 - Adding Windows nodes
FEATURE STATE: Kubernetes v1.18 [beta]
You can use Kubernetes to run a mixture of Linux and Windows nodes, so you can mix Pods that run on Linux with Pods that run on Windows. This page shows how to register Windows nodes to your cluster.
Before you begin
Your Kubernetes server must be at or later than version 1.17.
To check the version, enter kubectl version.
Obtain a Windows Server 2019 license
(or higher) in order to configure the Windows node that hosts Windows containers.
If you are using VXLAN/Overlay networking, you must also have KB4489899 installed.
Configure networking so Pods and Services on Linux and Windows can communicate with each other
Getting Started: Adding a Windows Node to Your Cluster
Networking Configuration
Once you have a Linux-based Kubernetes control-plane node you are ready to choose a networking solution. This guide illustrates using Flannel in VXLAN mode for simplicity.
Configuring Flannel
Prepare Kubernetes control plane for Flannel
Some minor preparation is recommended on the Kubernetes control plane in your cluster. It is recommended to enable bridged IPv4 traffic to iptables chains when using Flannel. The following command must be run on all Linux nodes:
sudo sysctl net.bridge.bridge-nf-call-iptables=1
Note: The VNI must be set to 4096 and port 4789 for Flannel on Linux to interoperate with Flannel on Windows. See the VXLAN documentation
for an explanation of these fields.
Note: To use L2Bridge/Host-gateway mode instead, change the value of Type to "host-gw" and omit VNI and Port.
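For reference, the VXLAN backend section of net-conf.json inside kube-flannel.yml would then look roughly like the sketch below; the Network CIDR is illustrative and should match your cluster's Pod network.
net-conf.json: |
  {
    "Network": "10.244.0.0/16",
    "Backend": {
      "Type": "vxlan",
      "VNI": 4096,
      "Port": 4789
    }
  }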
Apply the Flannel manifest and validate
Let's apply the Flannel configuration:
kubectl apply -f kube-flannel.yml
After a few minutes, you should see all the pods as running if the Flannel pod network was deployed.
kubectl get pods -n kube-system
The output should include the Linux flannel DaemonSet as running:
NAMESPACE NAME READY STATUS RESTARTS AGE
...
kube-system kube-flannel-ds-54954 1/1 Running 0 1m
Add Windows Flannel and kube-proxy DaemonSets
Now you can add Windows-compatible versions of Flannel and kube-proxy. In order
to ensure that you get a compatible version of kube-proxy, you'll need to substitute
the tag of the image. The following example shows usage for Kubernetes v1.24.0,
but you should adjust the version for your own deployment.
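A sketch of those commands is shown below; the release asset names follow the sig-windows-tools layout used by the flannel-overlay.yml example further down and may change over time.
curl -L https://github.com/kubernetes-sigs/sig-windows-tools/releases/latest/download/kube-proxy.yml | sed 's/VERSION/v1.24.0/g' | kubectl apply -f -
kubectl apply -f https://github.com/kubernetes-sigs/sig-windows-tools/releases/latest/download/flannel-overlay.yml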
If you're using an interface other than Ethernet (for example, "Ethernet0 2") on the Windows nodes, you have to modify the line:
wins cli process run --path /k/flannel/setup.exe --args "--mode=overlay --interface=Ethernet"
in the flannel-host-gw.yml or flannel-overlay.yml file and specify your interface accordingly.
# Example
curl -L https://github.com/kubernetes-sigs/sig-windows-tools/releases/latest/download/flannel-overlay.yml | sed 's/Ethernet/Ethernet0 2/g' | kubectl apply -f -
Joining a Windows worker node
Note: All code snippets in Windows sections are to be run in a PowerShell environment
with elevated permissions (Administrator) on the Windows worker node.
Install crictl from the cri-tools project
which is required so that kubeadm can talk to the CRI endpoint.
Run kubeadm to join the node
Use the command that was given to you when you ran kubeadm init on a control plane host.
If you no longer have this command, or the token has expired, you can run kubeadm token create --print-join-command
(on a control plane host) to generate a new token and join command.
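The generated join command looks similar to the following; the address, token, and hash below are placeholders.
kubeadm join <control-plane-host>:<port> --token <token> --discovery-token-ca-cert-hash sha256:<hash>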
If you are using Docker Engine as the container runtime on the Windows node, install cri-dockerd, which is required so that the kubelet
can communicate with Docker on a CRI compatible endpoint.
Note: Docker Engine does not implement the CRI
which is a requirement for a container runtime to work with Kubernetes.
For that reason, an additional service cri-dockerd
has to be installed. cri-dockerd is a project based on the legacy built-in
Docker Engine support that was removed from the kubelet in version 1.24.
Install crictl from the cri-tools project
which is required so that kubeadm can talk to the CRI endpoint.
Use the command that was given to you when you ran kubeadm init on a control plane host.
If you no longer have this command, or the token has expired, you can run kubeadm token create --print-join-command
(on a control plane host) to generate a new token and join command.
Verifying your installation
You should now be able to view the Windows node in your cluster by running:
kubectl get nodes -o wide
If your new node is in the NotReady state it is likely because the flannel image is still downloading.
You can check the progress as before by checking on the flannel pods in the kube-system namespace:
kubectl -n kube-system get pods -l app=flannel
Once the flannel Pod is running, your node should enter the Ready state and then be available to handle workloads.
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
This section presents information you need to know when migrating from
dockershim to other container runtimes.
Since the announcement of dockershim deprecation
in Kubernetes 1.20, there have been questions on how this will affect various workloads and Kubernetes
installations. Our Dockershim Removal FAQ is there to help you
understand the problem better.
Dockershim was removed from Kubernetes with the release of v1.24.
If you use Docker via dockershim as your container runtime, and wish to upgrade to v1.24,
it is recommended that you either migrate to another runtime or find an alternative means to obtain Docker Engine support.
Check out the container runtimes
section to learn about your options. Make sure to
report issues you encountered
with the migration, so that the issues can be fixed in a timely manner and your cluster will be
ready for dockershim removal.
Your cluster might have more than one kind of node, although this is not a common
configuration.
Check out container runtimes
to understand your options for a container runtime.
There is a
GitHub issue
to track discussion about the deprecation and removal of dockershim.
If you found a defect or other technical concern relating to migrating away from dockershim,
you can report an issue
to the Kubernetes project.
2.2.1 - Changing the Container Runtime on a Node from Docker Engine to containerd
This task outlines the steps needed to update your container runtime to containerd from Docker. It
is applicable for cluster operators running Kubernetes 1.23 or earlier. This also covers an
example scenario for migrating from dockershim to containerd; alternative container runtimes
can be picked from this page.
Before you begin
Note:
This section links to third party projects that provide functionality required by Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are listed alphabetically. To add a project to this list, read the content guide before submitting a change. More information.
Install the containerd.io package from the official Docker repositories.
Instructions for setting up the Docker repository for your respective Linux distribution and
installing the containerd.io package can be found at
Getting started with containerd.
Configure the kubelet to use containerd as its container runtime
Edit the file /var/lib/kubelet/kubeadm-flags.env and add the containerd runtime to the flags:
--container-runtime=remote and
--container-runtime-endpoint=unix:///run/containerd/containerd.sock.
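After the edit, the line in /var/lib/kubelet/kubeadm-flags.env might look similar to the sketch below; keep any other flags that are already present on your node.
KUBELET_KUBEADM_ARGS="--container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock"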
Users using kubeadm should be aware that the kubeadm tool stores the CRI socket for each host as
an annotation in the Node object for that host. To change it you can execute the following command
on a machine that has the kubeadm /etc/kubernetes/admin.conf file.
kubectl edit no <node-name>
This will start a text editor where you can edit the Node object.
To choose a text editor you can set the KUBE_EDITOR environment variable.
Change the value of kubeadm.alpha.kubernetes.io/cri-socket from /var/run/dockershim.sock
to the CRI socket path of your choice (for example unix:///run/containerd/containerd.sock).
Note that new CRI socket paths should ideally be prefixed with unix://.
Save the changes in the text editor, which will update the Node object.
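For reference, after the edit the relevant part of the Node object looks similar to this illustrative excerpt (other fields omitted):
apiVersion: v1
kind: Node
metadata:
  annotations:
    kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock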
Restart the kubelet
systemctl restart kubelet
Verify that the node is healthy
Run kubectl get nodes -o wide and verify that containerd appears as the runtime for the node you just changed.
Remove Docker Engine
Note:
This section links to third party projects that provide functionality required by Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are listed alphabetically. To add a project to this list, read the content guide before submitting a change. More information.
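The exact removal commands depend on how Docker Engine was installed. As a sketch, on a Debian or Ubuntu node where Docker Engine came from the Docker apt repository, you might run:
# Do not remove containerd.io, which the node now uses as its container runtime
sudo apt-get purge docker-ce docker-ce-cli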
2.2.2 - Migrate Docker Engine nodes from dockershim to cri-dockerd
Note:
This section links to third party projects that provide functionality required by Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are listed alphabetically. To add a project to this list, read the content guide before submitting a change. More information.
This page shows you how to migrate your Docker Engine nodes to use cri-dockerd
instead of dockershim. You should follow these steps in these scenarios:
You want to switch away from dockershim and still use Docker Engine to run
containers in Kubernetes.
You want to upgrade to Kubernetes v1.24 and your
existing cluster relies on dockershim, in which case you must migrate
from dockershim and cri-dockerd is one of your options.
To learn more about the removal of dockershim, read the FAQ page.
What is cri-dockerd?
In Kubernetes 1.23 and earlier, you could use Docker Engine with Kubernetes,
relying on a built-in component of Kubernetes named dockershim.
The dockershim component was removed in the Kubernetes 1.24 release; however,
a third-party replacement, cri-dockerd, is available. The cri-dockerd adapter
lets you use Docker Engine through the Container Runtime Interface.
If you want to migrate to cri-dockerd so that you can continue using Docker
Engine as your container runtime, you should do the following for each affected
node:
Install cri-dockerd.
Cordon and drain the node.
Configure the kubelet to use cri-dockerd.
Restart the kubelet.
Verify that the node is healthy.
Test the migration on non-critical nodes first.
You should perform the following steps for each node that you want to migrate
to cri-dockerd.
Cordon the node to stop new Pods scheduling on it:
kubectl cordon <NODE_NAME>
Replace <NODE_NAME> with the name of the node.
Drain the node to safely evict running Pods:
kubectl drain <NODE_NAME> \
--ignore-daemonsets
Configure the kubelet to use cri-dockerd
The following steps apply to clusters set up using the kubeadm tool. If you use
a different tool, you should modify the kubelet using the configuration
instructions for that tool.
Open /var/lib/kubelet/kubeadm-flags.env on each affected node.
Modify the --container-runtime-endpoint flag to
unix:///var/run/cri-dockerd.sock.
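For example, the endpoint portion of the KUBELET_KUBEADM_ARGS line would then read as follows (a sketch; keep your other flags unchanged):
KUBELET_KUBEADM_ARGS="--container-runtime=remote --container-runtime-endpoint=unix:///var/run/cri-dockerd.sock"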
The kubeadm tool stores the node's socket as an annotation on the Node object
in the control plane. To modify this socket for each affected node:
Edit the YAML representation of the Node object:
KUBECONFIG=/path/to/admin.conf kubectl edit no <NODE_NAME>
Replace the following:
/path/to/admin.conf: the path to the kubectl configuration file,
admin.conf.
<NODE_NAME>: the name of the node you want to modify.
Change kubeadm.alpha.kubernetes.io/cri-socket from
/var/run/dockershim.sock to unix:///var/run/cri-dockerd.sock.
Save the changes. The Node object is updated on save.
Restart the kubelet
systemctl restart kubelet
Verify that the node is healthy
To check whether the node uses the cri-dockerd endpoint, follow the
instructions in Find out which runtime you use.
The --container-runtime-endpoint flag for the kubelet should be unix:///var/run/cri-dockerd.sock.
2.2.3 - Find Out What Container Runtime is Used on a Node
This page outlines steps to find out what container runtime
the nodes in your cluster use.
Depending on the way you run your cluster, the container runtime for the nodes may
have been pre-configured or you need to configure it. If you're using a managed
Kubernetes service, there might be vendor-specific ways to check what container runtime is
configured for the nodes. The method described on this page should work whenever
the execution of kubectl is allowed.
Before you begin
Install and configure kubectl. See Install Tools section for details.
Find out the container runtime used on a Node
Use kubectl to fetch and show node information:
kubectl get nodes -o wide
The output is similar to the following. The column CONTAINER-RUNTIME outputs
the runtime and its version.
For Docker Engine, the output is similar to this:
NAME STATUS VERSION CONTAINER-RUNTIME
node-1 Ready v1.16.15 docker://19.3.1
node-2 Ready v1.16.15 docker://19.3.1
node-3 Ready v1.16.15 docker://19.3.1
If your runtime shows as Docker Engine, you still might not be affected by the
removal of dockershim in Kubernetes 1.24. Check the runtime
endpoint to see if you use dockershim. If you don't use
dockershim, you aren't affected.
For containerd, the output is similar to this:
NAME STATUS VERSION CONTAINER-RUNTIME
node-1 Ready v1.19.6 containerd://1.4.1
node-2 Ready v1.19.6 containerd://1.4.1
node-3 Ready v1.19.6 containerd://1.4.1
Find out more information about container runtimes
on Container Runtimes
page.
Find out what container runtime endpoint you use
The container runtime talks to the kubelet over a Unix socket using the CRI
protocol, which is based on the gRPC
framework. The kubelet acts as a client, and the runtime acts as the server.
In some cases, you might find it useful to know which socket your nodes use. For
example, with the removal of dockershim in Kubernetes 1.24 and later, you might
want to know whether you use Docker Engine with dockershim.
Note: If you currently use Docker Engine in your nodes with cri-dockerd, you aren't
affected by the dockershim removal.
You can check which socket you use by checking the kubelet configuration on your
nodes.
Read the starting commands for the kubelet process:
tr \\0 ' ' < /proc/"$(pgrep kubelet)"/cmdline
If you don't have tr or pgrep, check the command line for the kubelet
process manually.
In the output, look for the --container-runtime flag and the
--container-runtime-endpoint flag.
If your nodes use Kubernetes v1.23 and earlier and these flags aren't
present or if the --container-runtime flag is not remote,
you use the dockershim socket with Docker Engine.
If the --container-runtime-endpoint flag is present, check the socket
name to find out which runtime you use. For example,
unix:///run/containerd/containerd.sock is the containerd endpoint.
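As a convenience, the following sketch prints just the runtime-related flags from the kubelet command line (it assumes a grep that supports -o):
tr \\0 ' ' < /proc/"$(pgrep kubelet)"/cmdline | grep -o -- '--container-runtime[^ ]*'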
If you use Docker Engine with the dockershim, migrate to a different runtime,
or, if you want to continue using Docker Engine in v1.24 and later, migrate to a
CRI-compatible adapter like cri-dockerd.
2.2.4 - Troubleshooting CNI plugin-related errors
To avoid CNI plugin-related errors, verify that you are using or upgrading to a
container runtime that has been tested to work correctly with your version of
Kubernetes.
For example, the following container runtimes are being prepared, or have already been prepared, for Kubernetes v1.24:
containerd v1.6.4 and later, v1.5.11 and later
CRI-O v1.24.0 and later
About the "Incompatible CNI versions" and "Failed to destroy network for sandbox" errors
Service issues exist for pod CNI network setup and tear down in containerd
v1.6.0-v1.6.3 when the CNI plugins have not been upgraded and/or the CNI config
version is not declared in the CNI config files. The containerd team reports, "these issues are resolved in containerd v1.6.4."
With containerd v1.6.0-v1.6.3, if you do not upgrade the CNI plugins and/or
declare the CNI config version, you might encounter the following "Incompatible
CNI versions" or "Failed to destroy network for sandbox" error conditions.
Incompatible CNI versions error
If the version of your CNI plugin does not correctly match the plugin version in
the config because the config version is later than the plugin version, the
containerd log will likely show an error message on startup of a pod similar
to:
If the version of the plugin is missing in the CNI plugin config, the pod may
run. However, stopping the pod generates an error similar to:
ERRO[2022-04-26T00:43:24.518165483Z] StopPodSandbox for "b" failed
error="failed to destroy network for sandbox \"bbc85f891eaf060c5a879e27bba9b6b06450210161dfdecfbb2732959fb6500a\": invalid version \"\": the version is empty"
This error leaves the pod in the not-ready state with a network namespace still
attached. To recover from this problem, edit the CNI config file to add
the missing version information. The next attempt to stop the pod should
be successful.
Updating your CNI plugins and CNI config files
If you're using containerd v1.6.0-v1.6.3 and encountered "Incompatible CNI
versions" or "Failed to destroy network for sandbox" errors, consider updating
your CNI plugins and editing the CNI config files.
Here's an overview of the typical steps for each node:
After stopping your container runtime and kubelet services, perform the
following upgrade operations:
If you're running CNI plugins, upgrade them to the latest version.
If you're using non-CNI plugins, replace them with CNI plugins. Use the
latest version of the plugins.
Update the plugin configuration file to specify or match a version of the
CNI specification that the plugin supports, as shown in the following "An
example containerd configuration
file" section.
For containerd, ensure that you have installed the latest version (v1.0.0
or later) of the CNI loopback plugin.
Upgrade node components (for example, the kubelet) to Kubernetes v1.24
Upgrade to or install the most current version of the container runtime.
Bring the node back into your cluster by restarting your container runtime
and kubelet. Uncordon the node (kubectl uncordon <nodename>).
An example containerd configuration file
The following example shows a configuration for containerd runtime v1.6.x,
which supports a recent version of the CNI specification (v1.0.0).
Please see the documentation from your plugin and networking provider for
further instructions on configuring your system.
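As a sketch, the CNI-related part of /etc/containerd/config.toml typically looks like this; the paths shown are the containerd defaults and may differ on your system.
[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/opt/cni/bin"
  conf_dir = "/etc/cni/net.d"
  max_conf_num = 1
  conf_template = ""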
On Kubernetes, containerd runtime adds a loopback interface, lo, to pods as a
default behavior. The containerd runtime configures the loopback interface via a
CNI plugin, loopback. The loopback plugin is distributed as part of the
containerd release packages that have the cni designation. containerd
v1.6.0 and later includes a CNI v1.0.0-compatible loopback plugin as well as
other default CNI plugins. The configuration for the loopback plugin is done
internally by containerd, and is set to use CNI v1.0.0. This also means that the
version of the loopback plugin must be v1.0.0 or later when this newer version
containerd is started.
The following bash command generates an example CNI config. Here, the 1.0.0
value for the config version is assigned to the cniVersion field for use when
containerd invokes the CNI bridge plugin.
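A sketch of such a command is shown below; the file name and the subnet are illustrative only.
cat << EOF | sudo tee /etc/cni/net.d/10-containerd-net.conflist
{
  "cniVersion": "1.0.0",
  "name": "containerd-net",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "isGateway": true,
      "ipMasq": true,
      "ipam": {
        "type": "host-local",
        "ranges": [
          [{ "subnet": "10.88.0.0/16" }]
        ],
        "routes": [
          { "dst": "0.0.0.0/0" }
        ]
      }
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true }
    }
  ]
}
EOF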
Update the IP address ranges in the preceding example with ones that are based
on your use case and network addressing plan.
2.2.5 - Check whether dockershim removal affects you
The dockershim component of Kubernetes allows you to use Docker as Kubernetes's
container runtime.
Kubernetes' built-in dockershim component was removed in release v1.24.
This page explains how your cluster could be using Docker as a container runtime,
provides details on the role that dockershim plays when in use, and shows steps
you can take to check whether any workloads could be affected by dockershim removal.
Finding whether your app has a dependency on Docker
If you are using Docker for building your application containers, you can still
run these containers on any container runtime. This use of Docker does not count
as a dependency on Docker as a container runtime.
When an alternative container runtime is used, executing Docker commands may either
not work or yield unexpected output. This is how you can find out whether you have a
dependency on Docker:
Make sure no privileged Pods execute Docker commands (like docker ps),
restart the Docker service (commands such as systemctl restart docker.service),
or modify Docker-specific files such as /etc/docker/daemon.json.
Check for any private registries or image mirror settings in the Docker
configuration file (like /etc/docker/daemon.json). Those typically need to
be reconfigured for another container runtime.
Check that scripts and apps running on nodes outside of your Kubernetes
infrastructure do not execute Docker commands. It might be:
SSH to nodes to troubleshoot;
Node startup scripts;
Monitoring and security agents installed on nodes directly.
Make sure there are no indirect dependencies on dockershim behavior.
This is an edge case and unlikely to affect your application. Some tooling may be configured
to react to Docker-specific behaviors, for example, to raise an alert on specific metrics or to search for
a specific log message as part of troubleshooting instructions.
If you have such tooling configured, test the behavior on a test
cluster before migration.
Dependency on Docker explained
A container runtime is software that can
execute the containers that make up a Kubernetes pod. Kubernetes is responsible for orchestration
and scheduling of Pods; on each node, the kubelet
uses the container runtime interface as an abstraction so that you can use any compatible
container runtime.
In its earliest releases, Kubernetes offered compatibility with one container runtime: Docker.
Later in the Kubernetes project's history, cluster operators wanted to adopt additional container runtimes.
The CRI was designed to allow this kind of flexibility - and the kubelet began supporting CRI. However,
because Docker existed before the CRI specification was invented, the Kubernetes project created an
adapter component, dockershim. The dockershim adapter allows the kubelet to interact with Docker as
if Docker were a CRI compatible runtime.
Switching to containerd as a container runtime eliminates the middleman. All the
same containers can be run by container runtimes like containerd as before. But
now, since containers are scheduled directly with the container runtime, they are not visible to Docker.
So any Docker tooling or fancy UI you might have used
before to check on these containers is no longer available.
You cannot get container information using the docker ps or docker inspect
commands. As you cannot list containers, you cannot get logs, stop containers,
or execute something inside a container using docker exec.
Note: If you're running workloads via Kubernetes, the best way to stop a container is through
the Kubernetes API rather than directly through the container runtime (this advice applies
for all container runtimes, not only Docker).
You can still pull images or build them using the docker build command. But images
built or pulled by Docker will not be visible to the container runtime and
Kubernetes. They need to be pushed to a registry to allow them to be used
by Kubernetes.
2.2.6 - Migrating telemetry and security agents from dockershim
Kubernetes' support for direct integration with Docker Engine is deprecated, and will be removed. Most apps do not have a direct dependency on the runtime hosting their containers. However, there are still a lot of telemetry and monitoring agents that have a dependency on Docker to collect container metadata, logs, and metrics. This document aggregates information on how to detect these dependencies and links to information on how to migrate these agents to use generic tools or alternative runtimes.
Telemetry and security agents
Within a Kubernetes cluster there are a few different ways to run telemetry or security agents.
Some agents have a direct dependency on Docker Engine when they run as DaemonSets or
directly on nodes.
Why do some telemetry agents communicate with Docker Engine?
Historically, Kubernetes was written to work specifically with Docker Engine.
Kubernetes took care of networking and scheduling, relying on Docker Engine for launching
and running containers (within Pods) on a node. Some information that is relevant to telemetry,
such as a pod name, is only available from Kubernetes components. Other data, such as container
metrics, is not the responsibility of the container runtime. Early telemetry agents needed to query the
container runtime and Kubernetes to report an accurate picture. Over time, Kubernetes gained
the ability to support multiple runtimes, and now supports any runtime that is compatible with
the container runtime interface.
Some telemetry agents rely specifically on Docker Engine tooling. For example, an agent
might run a command such as
docker ps
or docker top to list
containers and processes or docker logs
to receive streamed logs. If nodes in your existing cluster use
Docker Engine, and you switch to a different container runtime,
these commands will not work any longer.
Identify DaemonSets that depend on Docker Engine
If a pod wants to make calls to the dockerd running on the node, the pod must either:
mount the filesystem containing the Docker daemon's privileged socket, as a
volume; or
mount the specific path of the Docker daemon's privileged socket directly, also as a volume.
For example: on COS images, Docker exposes its Unix domain socket at
/var/run/docker.sock. This means that the pod spec will include a
hostPath volume mount of /var/run/docker.sock.
Here's a sample shell script to find Pods that have a mount directly mapping the
Docker socket. This script outputs the namespace and name of the pod. You can
remove the grep '/var/run/docker.sock' to review other mounts.
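A sketch of such a script, using a jsonpath query over each Pod's hostPath volumes, is shown below.
kubectl get pods --all-namespaces \
  -o=jsonpath='{range .items[*]}{"\n"}{.metadata.namespace}{":\t"}{.metadata.name}{":\t"}{range .spec.volumes[*]}{.hostPath.path}{", "}{end}{end}' \
  | sort \
  | grep '/var/run/docker.sock'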
Note: There are alternative ways for a pod to access Docker on the host. For instance, the parent
directory /var/run may be mounted instead of the full path (like in this
example).
The script above only detects the most common uses.
Detecting Docker dependency from node agents
If your cluster nodes are customized and install additional security and
telemetry agents on the node, make sure to check with the vendor of each agent whether it has a dependency on Docker.
Telemetry and security agent vendors
We keep the work-in-progress version of migration instructions for various telemetry and security agent vendors
in a Google doc.
Please contact the vendor to get up-to-date instructions for migrating from dockershim.
2.3 - Generate Certificates Manually
When using client certificate authentication, you can generate certificates
manually through easyrsa, openssl or cfssl.
easyrsa
easyrsa can manually generate certificates for your cluster.
Download, unpack, and initialize the patched version of easyrsa3.
curl -LO https://storage.googleapis.com/kubernetes-release/easy-rsa/easy-rsa.tar.gz
tar xzf easy-rsa.tar.gz
cd easy-rsa-master/easyrsa3
./easyrsa init-pki
Generate a new certificate authority (CA). --batch sets automatic mode;
--req-cn specifies the Common Name (CN) for the CA's new root certificate.
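As a sketch, the CA can be generated as follows; MASTER_IP is a placeholder for the address of your API server.
MASTER_IP=<your-api-server-ip>
./easyrsa --batch "--req-cn=${MASTER_IP}@$(date +%s)" build-ca nopass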
Generate server certificate and key.
The argument --subject-alt-name sets the possible IPs and DNS names the API server will
be accessed with. The MASTER_CLUSTER_IP is usually the first IP from the service CIDR
that is specified as the --service-cluster-ip-range argument for both the API server and
the controller manager component. The argument --days is used to set the number of days
after which the certificate expires.
The sample below also assumes that you are using cluster.local as the default
DNS domain name.
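A sketch of the command, with MASTER_IP and MASTER_CLUSTER_IP as placeholders, looks like this:
./easyrsa --subject-alt-name="IP:${MASTER_IP},"\
"IP:${MASTER_CLUSTER_IP},"\
"DNS:kubernetes,"\
"DNS:kubernetes.default,"\
"DNS:kubernetes.default.svc,"\
"DNS:kubernetes.default.svc.cluster,"\
"DNS:kubernetes.default.svc.cluster.local" \
--days=10000 \
build-server-full server nopass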
Create a config file for generating a Certificate Signing Request (CSR).
Be sure to substitute the values marked with angle brackets (e.g. <MASTER_IP>)
with real values before saving this to a file (e.g. csr.conf).
Note that the value for MASTER_CLUSTER_IP is the service cluster IP for the
API server as described in previous subsection.
The sample below also assumes that you are using cluster.local as the default
DNS domain name.
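A sketch of such a csr.conf is shown below; every bracketed value is a placeholder you must replace.
[ req ]
default_bits = 2048
prompt = no
default_md = sha256
req_extensions = req_ext
distinguished_name = dn

[ dn ]
C = <country>
ST = <state>
L = <city>
O = <organization>
OU = <organization unit>
CN = <MASTER_IP>

[ req_ext ]
subjectAltName = @alt_names

[ alt_names ]
DNS.1 = kubernetes
DNS.2 = kubernetes.default
DNS.3 = kubernetes.default.svc
DNS.4 = kubernetes.default.svc.cluster
DNS.5 = kubernetes.default.svc.cluster.local
IP.1 = <MASTER_IP>
IP.2 = <MASTER_CLUSTER_IP>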
Finally, add the same parameters into the API server start parameters.
cfssl
cfssl is another tool for certificate generation.
Download, unpack and prepare the command line tools as shown below.
Note that you may need to adapt the sample commands based on the hardware
architecture and cfssl version you are using.
Create a JSON config file for CA certificate signing request (CSR), for example,
ca-csr.json. Be sure to replace the values marked with angle brackets with
real values you want to use.
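A sketch of ca-csr.json, with placeholders in angle brackets, looks like this:
{
  "CN": "kubernetes",
  "key": {
    "algo": "rsa",
    "size": 2048
  },
  "names": [{
    "C": "<country>",
    "ST": "<state>",
    "L": "<city>",
    "O": "<organization>",
    "OU": "<organization unit>"
  }]
}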
Generate CA key (ca-key.pem) and certificate (ca.pem):
../cfssl gencert -initca ca-csr.json | ../cfssljson -bare ca
Create a JSON config file for generating keys and certificates for the API
server, for example, server-csr.json. Be sure to replace the values in angle brackets with
real values you want to use. The MASTER_CLUSTER_IP is the service cluster
IP for the API server as described in previous subsection.
The sample below also assumes that you are using cluster.local as the default
DNS domain name.
A client node may refuse to recognize a self-signed CA certificate as valid.
For a non-production deployment, or for a deployment that runs behind a company
firewall, you can distribute a self-signed CA certificate to all clients and
refresh the local list for valid certificates.
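On a Debian-based client, for example, this can be done as follows; the certificate file name is illustrative.
sudo cp ca.crt /usr/local/share/ca-certificates/kubernetes.crt
sudo update-ca-certificates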
Updating certificates in /etc/ssl/certs...
1 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d....
done.
Certificates API
You can use the certificates.k8s.io API to provision
x509 certificates to use for authentication as documented
here.
2.4 - Manage Memory, CPU, and API Resources
2.4.1 - Configure Default Memory Requests and Limits for a Namespace
Define a default memory resource limit for a namespace, so that every new Pod in that namespace has a memory resource limit configured.
This page shows how to configure default memory requests and limits for a
namespace.
A Kubernetes cluster can be divided into namespaces. Once you have a namespace that
has a default memory
limit,
and you then try to create a Pod with a container that does not specify its own memory
limit, then the
control plane assigns the default
memory limit to that container.
Kubernetes assigns a default memory request under certain conditions that are explained later in this topic.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
Now if you create a Pod in the default-mem-example namespace, and any container
within that Pod does not specify its own values for memory request and memory limit,
then the control plane
applies default values: a memory request of 256MiB and a memory limit of 512MiB.
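Those defaults come from a LimitRange object in the namespace; a minimal sketch (the object name is illustrative) looks like this:
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
spec:
  limits:
  - default:
      memory: 512Mi
    defaultRequest:
      memory: 256Mi
    type: Container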
Here's an example manifest for a Pod that has one container. The container
does not specify a memory request and limit.
kubectl get pod default-mem-demo --output=yaml --namespace=default-mem-example
The output shows that the Pod's container has a memory request of 256 MiB and
a memory limit of 512 MiB. These are the default values specified by the LimitRange.
kubectl get pod default-mem-demo-2 --output=yaml --namespace=default-mem-example
The output shows that the container's memory request is set to match its memory limit.
Notice that the container was not assigned the default memory request value of 256Mi.
kubectl get pod default-mem-demo-3 --output=yaml --namespace=default-mem-example
The output shows that the container's memory request is set to the value specified in the
container's manifest. The container is limited to use no more than 512MiB of
memory, which matches the default memory limit for the namespace.
If your namespace has a memory resource quota
configured,
it is helpful to have a default value in place for memory limit.
Here are three of the restrictions that a resource quota imposes on a namespace:
For every Pod that runs in the namespace, the Pod and each of its containers must have a memory limit.
(If you specify a memory limit for every container in a Pod, Kubernetes can infer the Pod-level memory
limit by adding up the limits for its containers).
Memory limits apply a resource reservation on the node where the Pod in question is scheduled.
The total amount of memory reserved for all Pods in the namespace must not exceed a specified limit.
The total amount of memory actually used by all Pods in the namespace must also not exceed a specified limit.
When you add a LimitRange:
If any Pod in that namespace that includes a container does not specify its own memory limit,
the control plane applies the default memory limit to that container, and the Pod can be
allowed to run in a namespace that is restricted by a memory ResourceQuota.
2.4.2 - Configure Default CPU Requests and Limits for a Namespace
Define default CPU resource limits for a namespace, so that every new Pod in that namespace has a CPU resource limit configured.
This page shows how to configure default CPU requests and limits for a
namespace.
A Kubernetes cluster can be divided into namespaces. If you create a Pod within a
namespace that has a default CPU
limit, and any container in that Pod does not specify
its own CPU limit, then the
control plane assigns the default
CPU limit to that container.
Kubernetes assigns a default CPU
request,
but only under certain conditions that are explained later in this page.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
Now if you create a Pod in the default-cpu-example namespace, and any container
in that Pod does not specify its own values for CPU request and CPU limit,
then the control plane applies default values: a CPU request of 0.5 and a default
CPU limit of 1.
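Those defaults come from a LimitRange object in the namespace; a minimal sketch (the object name is illustrative) looks like this:
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-limit-range
spec:
  limits:
  - default:
      cpu: 1
    defaultRequest:
      cpu: 0.5
    type: Container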
Here's a manifest for a Pod that has one container. The container
does not specify a CPU request and limit.
kubectl get pod default-cpu-demo --output=yaml --namespace=default-cpu-example
The output shows that the Pod's only container has a CPU request of 500m cpu
(which you can read as “500 millicpu”), and a CPU limit of 1 cpu.
These are the default values specified by the LimitRange.
kubectl get pod default-cpu-demo-2 --output=yaml --namespace=default-cpu-example
The output shows that the container's CPU request is set to match its CPU limit.
Notice that the container was not assigned the default CPU request value of 0.5 cpu:
resources:
limits:
cpu: "1"
requests:
cpu: "1"
What if you specify a container's request, but not its limit?
Here's an example manifest for a Pod that has one container. The container
specifies a CPU request, but not a limit:
kubectl get pod default-cpu-demo-3 --output=yaml --namespace=default-cpu-example
The output shows that the container's CPU request is set to the value you specified at
the time you created the Pod (in other words: it matches the manifest).
However, the same container's CPU limit is set to 1 cpu, which is the default CPU limit
for that namespace.
resources:
limits:
cpu: "1"
requests:
cpu: 750m
Motivation for default CPU limits and requests
If your namespace has a CPU resource quota
configured,
it is helpful to have a default value in place for CPU limit.
Here are two of the restrictions that a CPU resource quota imposes on a namespace:
For every Pod that runs in the namespace, each of its containers must have a CPU limit.
CPU limits apply a resource reservation on the node where the Pod in question is scheduled.
The total amount of CPU that is reserved for use by all Pods in the namespace must not
exceed a specified limit.
When you add a LimitRange:
If any Pod in that namespace that includes a container does not specify its own CPU limit,
the control plane applies the default CPU limit to that container, and the Pod can be
allowed to run in a namespace that is restricted by a CPU ResourceQuota.
2.4.3 - Configure Minimum and Maximum Memory Constraints for a Namespace
Define a range of valid memory resource limits for a namespace, so that every new Pod in that namespace falls within the range you configure.
This page shows how to set minimum and maximum values for memory used by containers
running in a namespace. You specify minimum and maximum memory values in a
LimitRange
object. If a Pod does not meet the constraints imposed by the LimitRange,
it cannot be created in the namespace.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
kubectl get limitrange mem-min-max-demo-lr --namespace=constraints-mem-example --output=yaml
The output shows the minimum and maximum memory constraints as expected. But
notice that even though you didn't specify default values in the configuration
file for the LimitRange, they were created automatically.
Now whenever you define a Pod within the constraints-mem-example namespace, Kubernetes
performs these steps:
If any container in that Pod does not specify its own memory request and limit, assign
the default memory request and limit to that container.
Verify that every container in that Pod requests at least 500 MiB of memory.
Verify that every container in that Pod requests no more than 1024 MiB (1 GiB)
of memory.
Here's a manifest for a Pod that has one container. Within the Pod spec, the sole
container specifies a memory request of 600 MiB and a memory limit of 800 MiB. These satisfy the
minimum and maximum memory constraints imposed by the LimitRange.
Verify that the Pod is running and that its container is healthy:
kubectl get pod constraints-mem-demo --namespace=constraints-mem-example
View detailed information about the Pod:
kubectl get pod constraints-mem-demo --output=yaml --namespace=constraints-mem-example
The output shows that the container within that Pod has a memory request of 600 MiB and
a memory limit of 800 MiB. These satisfy the constraints imposed by the LimitRange for
this namespace:
The output shows that the Pod does not get created, because it defines a container that
requests more memory than is allowed:
Error from server (Forbidden): error when creating "examples/admin/resource/memory-constraints-pod-2.yaml":
pods "constraints-mem-demo-2" is forbidden: maximum memory usage per Container is 1Gi, but limit is 1536Mi.
Attempt to create a Pod that does not meet the minimum memory request
Here's a manifest for a Pod that has one container. That container specifies a
memory request of 100 MiB and a memory limit of 800 MiB.
The output shows that the Pod does not get created, because it defines a container
that requests less memory than the enforced minimum:
Error from server (Forbidden): error when creating "examples/admin/resource/memory-constraints-pod-3.yaml":
pods "constraints-mem-demo-3" is forbidden: minimum memory usage per Container is 500Mi, but request is 100Mi.
Create a Pod that does not specify any memory request or limit
Here's a manifest for a Pod that has one container. The container does not
specify a memory request, and it does not specify a memory limit.
Because your Pod did not define any memory request and limit for that container, the cluster
applied a
default memory request and limit
from the LimitRange.
This means that the definition of that Pod shows those values. You can check it using
kubectl describe:
# Look for the "Requests:" section of the output
kubectl describe pod constraints-mem-demo-4 --namespace=constraints-mem-example
At this point, your Pod might be running or it might not be running. Recall that a prerequisite
for this task is that your Nodes have at least 1 GiB of memory. If each of your Nodes has only
1 GiB of memory, then there is not enough allocatable memory on any Node to accommodate a memory
request of 1 GiB. If you happen to be using Nodes with 2 GiB of memory, then you probably have
enough space to accommodate the 1 GiB request.
Delete your Pod:
kubectl delete pod constraints-mem-demo-4 --namespace=constraints-mem-example
Enforcement of minimum and maximum memory constraints
The maximum and minimum memory constraints imposed on a namespace by a LimitRange are enforced only
when a Pod is created or updated. If you change the LimitRange, it does not affect
Pods that were created previously.
Motivation for minimum and maximum memory constraints
As a cluster administrator, you might want to impose restrictions on the amount of memory that Pods can use.
For example:
Each Node in a cluster has 2 GiB of memory. You do not want to accept any Pod that requests
more than 2 GiB of memory, because no Node in the cluster can support the request.
A cluster is shared by your production and development departments.
You want to allow production workloads to consume up to 8 GiB of memory, but
you want development workloads to be limited to 512 MiB. You create separate namespaces
for production and development, and you apply memory constraints to each namespace.
2.4.4 - Configure Minimum and Maximum CPU Constraints for a Namespace
Define a range of valid CPU resource limits for a namespace, so that every new Pod in that namespace falls within the range you configure.
This page shows how to set minimum and maximum values for the CPU resources used by containers
and Pods in a namespace. You specify minimum
and maximum CPU values in a
LimitRange
object. If a Pod does not meet the constraints imposed by the LimitRange, it cannot be created
in the namespace.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
kubectl get limitrange cpu-min-max-demo-lr --output=yaml --namespace=constraints-cpu-example
The output shows the minimum and maximum CPU constraints as expected. But
notice that even though you didn't specify default values in the configuration
file for the LimitRange, they were created automatically.
Now whenever you create a Pod in the constraints-cpu-example namespace (or some other client
of the Kubernetes API creates an equivalent Pod), Kubernetes performs these steps:
If any container in that Pod does not specify its own CPU request and limit, the control plane
assigns the default CPU request and limit to that container.
Verify that every container in that Pod specifies a CPU request that is greater than or equal to 200 millicpu.
Verify that every container in that Pod specifies a CPU limit that is less than or equal to 800 millicpu.
Note: When creating a LimitRange object, you can specify limits on huge-pages
or GPUs as well. However, when both default and defaultRequest are specified
on these resources, the two values must be the same.
Here's a manifest for a Pod that has one container. The container manifest
specifies a CPU request of 500 millicpu and a CPU limit of 800 millicpu. These satisfy the
minimum and maximum CPU constraints imposed by the LimitRange.
Verify that the Pod is running and that its container is healthy:
kubectl get pod constraints-cpu-demo --namespace=constraints-cpu-example
View detailed information about the Pod:
kubectl get pod constraints-cpu-demo --output=yaml --namespace=constraints-cpu-example
The output shows that the Pod's only container has a CPU request of 500 millicpu and CPU limit
of 800 millicpu. These satisfy the constraints imposed by the LimitRange.
resources:
  limits:
    cpu: 800m
  requests:
    cpu: 500m
Delete the Pod
kubectl delete pod constraints-cpu-demo --namespace=constraints-cpu-example
Attempt to create a Pod that exceeds the maximum CPU constraint
Here's a manifest for a Pod that has one container. The container specifies a
CPU request of 500 millicpu and a cpu limit of 1.5 cpu.
The output shows that the Pod does not get created, because it defines an unacceptable container.
That container is not acceptable because it specifies a CPU limit that is too large:
Error from server (Forbidden): error when creating "examples/admin/resource/cpu-constraints-pod-2.yaml":
pods "constraints-cpu-demo-2" is forbidden: maximum cpu usage per Container is 800m, but limit is 1500m.
Attempt to create a Pod that does not meet the minimum CPU request
Here's a manifest for a Pod that has one container. The container specifies a
CPU request of 100 millicpu and a CPU limit of 800 millicpu.
The output shows that the Pod does not get created, because it defines an unacceptable container.
That container is not acceptable because it specifies a CPU request that is lower than the
enforced minimum:
Error from server (Forbidden): error when creating "examples/admin/resource/cpu-constraints-pod-3.yaml":
pods "constraints-cpu-demo-3" is forbidden: minimum cpu usage per Container is 200m, but request is 100m.
Create a Pod that does not specify any CPU request or limit
Here's a manifest for a Pod that has one container. The container does not
specify a CPU request, nor does it specify a CPU limit.
kubectl get pod constraints-cpu-demo-4 --namespace=constraints-cpu-example --output=yaml
The output shows that the Pod's single container has a CPU request of 800 millicpu and a
CPU limit of 800 millicpu.
How did that container get those values?
resources:
  limits:
    cpu: 800m
  requests:
    cpu: 800m
Because that container did not specify its own CPU request and limit, the control plane
applied the
default CPU request and limit
from the LimitRange for this namespace.
At this point, your Pod might be running or it might not be running. Recall that a prerequisite for this task is that your cluster must have at least 1 CPU available for use. If each of your Nodes has only 1 CPU, then there might not be enough allocatable CPU on any Node to accommodate a request of 800 millicpu. If you happen to be using Nodes with 2 CPU, then you probably have enough CPU to accommodate the 800 millicpu request.
Delete your Pod:
kubectl delete pod constraints-cpu-demo-4 --namespace=constraints-cpu-example
Enforcement of minimum and maximum CPU constraints
The maximum and minimum CPU constraints imposed on a namespace by a LimitRange are enforced only
when a Pod is created or updated. If you change the LimitRange, it does not affect
Pods that were created previously.
Motivation for minimum and maximum CPU constraints
As a cluster administrator, you might want to impose restrictions on the CPU resources that Pods can use.
For example:
Each Node in a cluster has 2 CPU. You do not want to accept any Pod that requests
more than 2 CPU, because no Node in the cluster can support the request.
A cluster is shared by your production and development departments.
You want to allow production workloads to consume up to 3 CPU, but you want development workloads to be limited
to 1 CPU. You create separate namespaces for production and development, and you apply CPU constraints to
each namespace.
2.4.5 - Configure Memory and CPU Quotas for a Namespace
Define overall memory and CPU resource limits for a namespace.
This page shows how to set quotas for the total amount of memory and CPU that
can be used by all Pods running in a namespace.
You specify quotas in a
ResourceQuota
object.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
Verify that the Pod is running and that its (only) container is healthy:
kubectl get pod quota-mem-cpu-demo --namespace=quota-mem-cpu-example
Once again, view detailed information about the ResourceQuota:
kubectl get resourcequota mem-cpu-demo --namespace=quota-mem-cpu-example --output=yaml
The output shows the quota along with how much of the quota has been used.
You can see that the memory and CPU requests and limits for your Pod do not
exceed the quota.
In the manifest, you can see that the Pod has a memory request of 700 MiB.
Notice that the sum of the used memory request and this new memory
request exceeds the memory request quota: 600 MiB + 700 MiB > 1 GiB.
The second Pod does not get created. The output shows that creating the second Pod
would cause the memory request total to exceed the memory request quota.
Error from server (Forbidden): error when creating "examples/admin/resource/quota-mem-cpu-pod-2.yaml":
pods "quota-mem-cpu-demo-2" is forbidden: exceeded quota: mem-cpu-demo,
requested: requests.memory=700Mi, used: requests.memory=600Mi, limited: requests.memory=1Gi
Discussion
As you have seen in this exercise, you can use a ResourceQuota to restrict
the memory request total for all Pods running in a namespace.
You can also restrict the totals for memory limit, cpu request, and cpu limit.
Instead of managing total resource use within a namespace, you might want to restrict
individual Pods, or the containers in those Pods. To achieve that kind of limiting, use a
LimitRange.
Restrict how many Pods you can create within a namespace.
This page shows how to set a quota for the total number of Pods that can run
in a Namespace. You specify quotas in a
ResourceQuota
object.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
kubectl get deployment pod-quota-demo --namespace=quota-pod-example --output=yaml
The output shows that even though the Deployment specifies three replicas, only two
Pods were created because of the quota you defined earlier:
spec:
  ...
  replicas: 3
...
status:
  availableReplicas: 2
...
  lastUpdateTime: 2021-04-02T20:57:05Z
  message: 'unable to create pods: pods "pod-quota-demo-1650323038-" is forbidden:
    exceeded quota: pod-demo, requested: pods=1, used: pods=2, limited: pods=2'
Choice of resource
In this task you have defined a ResourceQuota that limited the total number of Pods, but
you could also limit the total number of other kinds of object. For example, you
might decide to limit how many CronJobs
can live in a single namespace.
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
To get familiar with Cilium easily you can follow the
Cilium Kubernetes Getting Started Guide
to perform a basic DaemonSet installation of Cilium in minikube.
To start minikube (the minimal required version is v1.5.2), run it with the
following arguments:
minikube version
minikube version: v1.5.2
minikube start --network-plugin=cni
For minikube you can install Cilium using its CLI tool. Cilium will
automatically detect the cluster configuration and will install the appropriate
components for a successful installation:
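A sketch of the install steps with the Cilium CLI is shown below; the download assumes an amd64 Linux host, so adjust the asset name for your platform.
curl -LO https://github.com/cilium/cilium-cli/releases/latest/download/cilium-linux-amd64.tar.gz
sudo tar xzvfC cilium-linux-amd64.tar.gz /usr/local/bin
rm cilium-linux-amd64.tar.gz
cilium install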
🔮 Auto-detected Kubernetes kind: minikube
✨ Running "minikube" validation checks
✅ Detected minikube version "1.20.0"
ℹ️ Cilium version not set, using default version "v1.10.0"
🔮 Auto-detected cluster name: minikube
🔮 Auto-detected IPAM mode: cluster-pool
🔮 Auto-detected datapath mode: tunnel
🔑 Generating CA...
2021/05/27 02:54:44 [INFO] generate received request
2021/05/27 02:54:44 [INFO] received CSR
2021/05/27 02:54:44 [INFO] generating key: ecdsa-256
2021/05/27 02:54:44 [INFO] encoded CSR
2021/05/27 02:54:44 [INFO] signed certificate with serial number 48713764918856674401136471229482703021230538642
🔑 Generating certificates for Hubble...
2021/05/27 02:54:44 [INFO] generate received request
2021/05/27 02:54:44 [INFO] received CSR
2021/05/27 02:54:44 [INFO] generating key: ecdsa-256
2021/05/27 02:54:44 [INFO] encoded CSR
2021/05/27 02:54:44 [INFO] signed certificate with serial number 3514109734025784310086389188421560613333279574
🚀 Creating Service accounts...
🚀 Creating Cluster roles...
🚀 Creating ConfigMap...
🚀 Creating Agent DaemonSet...
🚀 Creating Operator Deployment...
⌛ Waiting for Cilium to be installed...
The remainder of the Getting Started Guide explains how to enforce both L3/L4
(i.e., IP address + port) security policies, as well as L7 (e.g., HTTP) security
policies using an example application.
Deploying Cilium for Production Use
For detailed instructions around deploying Cilium for production, see:
Cilium Kubernetes Installation Guide
This documentation includes detailed requirements, instructions and example
production DaemonSet files.
Understanding Cilium components
Deploying a cluster with Cilium adds Pods to the kube-system namespace. To see
this list of Pods run:
kubectl get pods --namespace=kube-system -l k8s-app=cilium
You'll see a list of Pods similar to this:
NAME READY STATUS RESTARTS AGE
cilium-kkdhz 1/1 Running 0 3m23s
...
A cilium Pod runs on each node in your cluster and enforces network policy
on the traffic to/from Pods on that node using Linux BPF.
What's next
Once your cluster is running, you can follow the
Declare Network Policy
to try out Kubernetes NetworkPolicy with Cilium.
Have fun, and if you have questions, contact us using the
Cilium Slack Channel.
2.5.4 - Use Kube-router for NetworkPolicy
This page shows how to use Kube-router for NetworkPolicy.
Before you begin
You need to have a Kubernetes cluster running. If you do not already have a cluster, you can create one by using any of the cluster installers like Kops, Bootkube, Kubeadm etc.
Installing Kube-router addon
The Kube-router addon comes with a Network Policy Controller that watches the Kubernetes API server for NetworkPolicy and Pod updates, and configures iptables rules and ipsets to allow or block traffic as directed by the policies. Please follow the trying Kube-router with cluster installers guide to install the Kube-router addon.
What's next
Once you have installed the Kube-router addon, you can follow the Declare Network Policy to try out Kubernetes NetworkPolicy.
2.5.5 - Romana for NetworkPolicy
This page shows how to use Romana for NetworkPolicy.
The Weave Net addon for Kubernetes comes with a
Network Policy Controller
that automatically monitors Kubernetes for any NetworkPolicy annotations on all
namespaces and configures iptables rules to allow or block traffic as directed by the policies.
Test the installation
Verify that Weave Net is working.
Enter the following command:
kubectl get pods -n kube-system -o wide
The output is similar to this:
NAME READY STATUS RESTARTS AGE IP NODE
weave-net-1t1qg 2/2 Running 0 9d 192.168.2.10 worknode3
weave-net-231d7 2/2 Running 1 7d 10.2.0.17 worknodegpu
weave-net-7nmwt 2/2 Running 3 9d 192.168.2.131 masternode
weave-net-pmw8w 2/2 Running 0 9d 192.168.2.216 worknode2
Each Node has a weave Pod, and all Pods are Running and 2/2 READY. (2/2 means that each Pod has weave and weave-npc.)
This page shows how to access clusters using the Kubernetes API.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
When accessing the Kubernetes API for the first time, use the
Kubernetes command-line tool, kubectl.
To access a cluster, you need to know the location of the cluster and have credentials
to access it. Typically, this is automatically set up when you work through
a Getting started guide,
or someone else set up the cluster and provided you with credentials and a location.
Check the location and credentials that kubectl knows about with this command:
kubectl config view
Many of the examples provide an introduction to using
kubectl. Complete documentation is found in the kubectl manual.
Directly accessing the REST API
kubectl handles locating and authenticating to the API server. If you want to directly access the REST API with an http client like
curl or wget, or a browser, there are multiple ways you can locate and authenticate against the API server:
Run kubectl in proxy mode (recommended). This method is recommended, since it uses the stored apiserver location and verifies the identity of the API server using a self-signed cert. No man-in-the-middle (MITM) attack is possible using this method.
Alternatively, you can provide the location and credentials directly to the http client. This works with client code that is confused by proxies. To protect against man in the middle attacks, you'll need to import a root cert into your browser.
Using the Go or Python client libraries also provides access via kubectl in proxy mode.
Using kubectl proxy
The following command runs kubectl in a mode where it acts as a reverse proxy. It handles
locating the API server and authenticating.
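For example, to start the proxy on a local port and then query the API through it:
kubectl proxy --port=8080 &
curl http://localhost:8080/api/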
It is possible to avoid using kubectl proxy by passing an authentication token
directly to the API server, like this:
For example, using the jsonpath approach:
# Check all possible clusters, as your .KUBECONFIG may have multiple contexts:
kubectl config view -o jsonpath='{"Cluster name\tServer\n"}{range .clusters[*]}{.name}{"\t"}{.cluster.server}{"\n"}{end}'

# Select name of cluster you want to interact with from above output:
export CLUSTER_NAME="some_server_name"

# Point to the API server referring the cluster name
APISERVER=$(kubectl config view -o jsonpath="{.clusters[?(@.name==\"$CLUSTER_NAME\")].cluster.server}")

# Create a secret to hold a token for the default service account
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: default-token
  annotations:
    kubernetes.io/service-account.name: default
type: kubernetes.io/service-account-token
EOF

# Wait for the token controller to populate the secret with a token:
while ! kubectl describe secret default-token | grep -E '^token' >/dev/null; do
  echo "waiting for token..." >&2
  sleep 1
done

# Get the token value
TOKEN=$(kubectl get secret default-token -o jsonpath='{.data.token}' | base64 --decode)

# Explore the API with TOKEN
curl -X GET $APISERVER/api --header "Authorization: Bearer $TOKEN" --insecure
The above example uses the --insecure flag. This leaves it subject to MITM
attacks. When kubectl accesses the cluster it uses a stored root certificate
and client certificates to access the server. (These are installed in the
~/.kube directory). Since cluster certificates are typically self-signed, it
may take special configuration to get your http client to use the root
certificate.
On some clusters, the API server does not require authentication; it may serve
on localhost, or be protected by a firewall. There is not a standard
for this. Controlling Access to the Kubernetes API
describes how you can configure this as a cluster administrator.
Programmatic access to the API
Kubernetes officially supports client libraries for Go, Python, Java, dotnet, JavaScript, and Haskell. There are other client libraries that are provided and maintained by their authors, not the Kubernetes team. See client libraries for accessing the API from other languages and how they authenticate.
Go client
To get the library, run the following command:
go get k8s.io/client-go@kubernetes-<kubernetes-version-number>
See https://github.com/kubernetes/client-go/releases to see which versions are supported.
Write an application atop the client-go clients.
Note: client-go defines its own API objects, so if needed, import API definitions from client-go rather than from the main repository. For example, import "k8s.io/client-go/kubernetes" is correct.
The Go client can use the same kubeconfig file
as the kubectl CLI does to locate and authenticate to the API server. See this example:
package main

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// uses the current context in kubeconfig
	// path-to-kubeconfig -- for example, /root/.kube/config
	config, _ := clientcmd.BuildConfigFromFlags("", "<path-to-kubeconfig>")
	// creates the clientset
	clientset, _ := kubernetes.NewForConfig(config)
	// access the API to list pods
	pods, _ := clientset.CoreV1().Pods("").List(context.TODO(), v1.ListOptions{})
	fmt.Printf("There are %d pods in the cluster\n", len(pods.Items))
}
The Python client can use the same kubeconfig file
as the kubectl CLI does to locate and authenticate to the API server. See this example:
from kubernetes import client, config

config.load_kube_config()

v1 = client.CoreV1Api()
print("Listing pods with their IPs:")
ret = v1.list_pod_for_all_namespaces(watch=False)
for i in ret.items:
    print("%s\t%s\t%s" % (i.status.pod_ip, i.metadata.namespace, i.metadata.name))
The Java client can use the same kubeconfig file
as the kubectl CLI does to locate and authenticate to the API server. See this example:
package io.kubernetes.client.examples;

import io.kubernetes.client.ApiClient;
import io.kubernetes.client.ApiException;
import io.kubernetes.client.Configuration;
import io.kubernetes.client.apis.CoreV1Api;
import io.kubernetes.client.models.V1Pod;
import io.kubernetes.client.models.V1PodList;
import io.kubernetes.client.util.ClientBuilder;
import io.kubernetes.client.util.KubeConfig;
import java.io.FileReader;
import java.io.IOException;

/**
 * A simple example of how to use the Java API from an application outside a kubernetes cluster
 *
 * <p>Easiest way to run this: mvn exec:java
 * -Dexec.mainClass="io.kubernetes.client.examples.KubeConfigFileClientExample"
 */
public class KubeConfigFileClientExample {
  public static void main(String[] args) throws IOException, ApiException {
    // file path to your KubeConfig
    String kubeConfigPath = "~/.kube/config";

    // loading the out-of-cluster config, a kubeconfig from file-system
    ApiClient client =
        ClientBuilder.kubeconfig(KubeConfig.loadKubeConfig(new FileReader(kubeConfigPath))).build();

    // set the global default api-client to the in-cluster one from above
    Configuration.setDefaultApiClient(client);

    // the CoreV1Api loads default api-client from global configuration.
    CoreV1Api api = new CoreV1Api();

    // invokes the CoreV1Api client
    V1PodList list = api.listPodForAllNamespaces(null, null, null, null, null, null, null, null, null);
    System.out.println("Listing all pods: ");
    for (V1Pod item : list.getItems()) {
      System.out.println(item.getMetadata().getName());
    }
  }
}
This page shows how to specify extended resources for a Node.
Extended resources allow cluster administrators to advertise node-level
resources that would otherwise be unknown to Kubernetes.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
Choose one of your Nodes to use for this exercise.
Advertise a new extended resource on one of your Nodes
To advertise a new extended resource on a Node, send an HTTP PATCH request to
the Kubernetes API server. For example, suppose one of your Nodes has four dongles
attached. Here's an example of a PATCH request that advertises four dongle resources
for your Node.
Note that Kubernetes does not need to know what a dongle is or what a dongle is for.
The preceding PATCH request tells Kubernetes that your Node has four things that
you call dongles.
Start a proxy, so that you can easily send requests to the Kubernetes API server:
kubectl proxy
In another command window, send the HTTP PATCH request.
Replace <your-node-name> with the name of your Node:
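The request body was omitted above; a sketch of it, assuming the kubectl proxy from the previous step is listening on its default port 8001 and that you call the resource example.com/dongle, looks like this:
curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data '[{"op": "add", "path": "/status/capacity/example.com~1dongle", "value": "4"}]' \
  http://localhost:8001/api/v1/nodes/<your-node-name>/status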
Note: In the preceding request, ~1 is the encoding for the character / in
the patch path. The operation path value in JSON-Patch is interpreted as a
JSON-Pointer. For more details, see
IETF RFC 6901, section 3.
The output shows that the Node has a capacity of 4 dongles:
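The relevant portion of the response might look roughly like this (the cpu and memory values are only illustrative):
"capacity": {
  "cpu": "2",
  "memory": "2049008Ki",
  "example.com/dongle": "4",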
Extended resources are similar to memory and CPU resources. For example,
just as a Node has a certain amount of memory and CPU to be shared by all components
running on the Node, it can have a certain number of dongles to be shared
by all components running on the Node. And just as application developers
can create Pods that request a certain amount of memory and CPU, they can
create Pods that request a certain number of dongles.
Extended resources are opaque to Kubernetes; Kubernetes does not
know anything about what they are. Kubernetes knows only that a Node
has a certain number of them. Extended resources must be advertised in integer
amounts. For example, a Node can advertise four dongles, but not 4.5 dongles.
Storage example
Suppose a Node has 800 GiB of a special kind of disk storage. You could
create a name for the special storage, say example.com/special-storage.
Then you could advertise it in chunks of a certain size, say 100 GiB. In that case,
your Node would advertise that it has eight resources of type
example.com/special-storage.
Capacity:
  ...
  example.com/special-storage: 8
If you want to allow arbitrary requests for special storage, you
could advertise special storage in chunks of size 1 byte. In that case, you would advertise
800Gi resources of type example.com/special-storage.
Capacity:
  ...
  example.com/special-storage: 800Gi
Then a Container could request any number of bytes of special storage, up to 800Gi.
Clean up
Here is a PATCH request that removes the dongle advertisement from a Node.
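A sketch of such a request, again assuming the proxy listens on localhost:8001 and the resource name used earlier:
curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data '[{"op": "remove", "path": "/status/capacity/example.com~1dongle"}]' \
  http://localhost:8001/api/v1/nodes/<your-node-name>/status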
This page shows how to enable and configure autoscaling of the DNS service in
your Kubernetes cluster.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
NAME READY UP-TO-DATE AVAILABLE AGE
...
dns-autoscaler 1/1 1 1 ...
...
If you see "dns-autoscaler" in the output, DNS horizontal autoscaling is
already enabled, and you can skip to
Tuning autoscaling parameters.
Get the name of your DNS Deployment
List the DNS deployments in your cluster in the kube-system namespace:
kubectl get deployment -l k8s-app=kube-dns --namespace=kube-system
The output is similar to this:
NAME READY UP-TO-DATE AVAILABLE AGE
...
coredns 2/2 2 2 ...
...
If you don't see a Deployment for DNS services, you can also look for it by name:
kubectl get deployment --namespace=kube-system
and look for a deployment named coredns or kube-dns.
Your scale target is
Deployment/<your-deployment-name>
where <your-deployment-name> is the name of your DNS Deployment. For example, if
the name of your Deployment for DNS is coredns, your scale target is Deployment/coredns.
Note: CoreDNS is the default DNS service for Kubernetes. CoreDNS sets the label
k8s-app=kube-dns so that it can work in clusters that originally used
kube-dns.
Enable DNS horizontal autoscaling
In this section, you create a new Deployment. The Pods in the Deployment run a
container based on the cluster-proportional-autoscaler-amd64 image.
Create a file named dns-horizontal-autoscaler.yaml with this content:
kind: ServiceAccount
apiVersion: v1
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: system:kube-dns-autoscaler
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["replicationcontrollers/scale"]
    verbs: ["get", "update"]
  - apiGroups: ["apps"]
    resources: ["deployments/scale", "replicasets/scale"]
    verbs: ["get", "update"]
  # Remove the configmaps rule once below issue is fixed:
  # kubernetes-incubator/cluster-proportional-autoscaler#16
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "create"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: system:kube-dns-autoscaler
subjects:
  - kind: ServiceAccount
    name: kube-dns-autoscaler
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:kube-dns-autoscaler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
  labels:
    k8s-app: kube-dns-autoscaler
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: kube-dns-autoscaler
  template:
    metadata:
      labels:
        k8s-app: kube-dns-autoscaler
    spec:
      priorityClassName: system-cluster-critical
      securityContext:
        seccompProfile:
          type: RuntimeDefault
        supplementalGroups: [ 65534 ]
        fsGroup: 65534
      nodeSelector:
        kubernetes.io/os: linux
      containers:
      - name: autoscaler
        image: k8s.gcr.io/cpa/cluster-proportional-autoscaler:1.8.4
        resources:
          requests:
            cpu: "20m"
            memory: "10Mi"
        command:
          - /cluster-proportional-autoscaler
          - --namespace=kube-system
          - --configmap=kube-dns-autoscaler
          # Should keep target in sync with cluster/addons/dns/kube-dns.yaml.base
          - --target=<SCALE_TARGET>
          # When cluster is using large nodes(with more cores), "coresPerReplica" should dominate.
          # If using small nodes, "nodesPerReplica" should dominate.
          - --default-params={"linear":{"coresPerReplica":256,"nodesPerReplica":16,"preventSinglePointFailure":true,"includeUnschedulableNodes":true}}
          - --logtostderr=true
          - --v=2
      tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      serviceAccountName: kube-dns-autoscaler
In the file, replace <SCALE_TARGET> with your scale target.
Go to the directory that contains your configuration file, and enter this
command to create the Deployment:
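Assuming the file name used above, the command would be:
kubectl apply -f dns-horizontal-autoscaler.yaml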
Modify the fields according to your needs. The "min" field indicates the
minimum number of DNS backends. The actual number of backends is
calculated using this equation:
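The equation itself was omitted here; the linear mode of the cluster-proportional-autoscaler calculates the desired replica count roughly as:
replicas = max( ceil( cores * 1 / coresPerReplica ), ceil( nodes * 1 / nodesPerReplica ) )
replicas = min( replicas, max )
replicas = max( replicas, min )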
Note that the values of both coresPerReplica and nodesPerReplica are
floats.
The idea is that when a cluster is using nodes that have many cores,
coresPerReplica dominates. When a cluster is using nodes that have fewer
cores, nodesPerReplica dominates.
After the manifest file is deleted, the Addon Manager will delete the
dns-autoscaler Deployment.
Understanding how DNS horizontal autoscaling works
The cluster-proportional-autoscaler application is deployed separately from
the DNS service.
An autoscaler Pod runs a client that polls the Kubernetes API server for the
number of nodes and cores in the cluster.
A desired replica count is calculated and applied to the DNS backends based on
the current schedulable nodes and cores and the given scaling parameters.
The scaling parameters and data points are provided via a ConfigMap to the
autoscaler, and it refreshes its parameters table every poll interval to be up
to date with the latest desired scaling parameters.
Changes to the scaling parameters are allowed without rebuilding or restarting
the autoscaler Pod.
The autoscaler provides a controller interface to support two control
patterns: linear and ladder.
This page shows how to change the default Storage Class that is used to
provision volumes for PersistentVolumeClaims that have no special requirements.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
Depending on the installation method, your Kubernetes cluster may be deployed with
an existing StorageClass that is marked as default. This default StorageClass
is then used to dynamically provision storage for PersistentVolumeClaims
that do not require any specific storage class. See
PersistentVolumeClaim documentation
for details.
The pre-installed default StorageClass may not fit well with your expected workload;
for example, it might provision storage that is too expensive. If this is the case,
you can either change the default StorageClass or disable it completely to avoid
dynamic provisioning of storage.
Deleting the default StorageClass may not work, as it may be re-created
automatically by the addon manager running in your cluster. Please consult the docs for your installation
for details about addon manager and how to disable individual addons.
Changing the default StorageClass
List the StorageClasses in your cluster:
kubectl get storageclass
The output is similar to this:
NAME PROVISIONER AGE
standard (default) kubernetes.io/gce-pd 1d
gold kubernetes.io/gce-pd 1d
The default StorageClass is marked by (default).
Mark the default StorageClass as non-default:
The default StorageClass has an annotation
storageclass.kubernetes.io/is-default-class set to true. Any other value
or absence of the annotation is interpreted as false.
To mark a StorageClass as non-default, you need to change its value to false:
kubectl patch storageclass standard -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
where standard is the name of your chosen StorageClass.
Mark a StorageClass as default:
Similar to the previous step, you need to add/set the annotation
storageclass.kubernetes.io/is-default-class=true.
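For example, to mark the gold StorageClass from the earlier listing as default:
kubectl patch storageclass gold -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'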
Please note that at most one StorageClass can be marked as default. If two
or more of them are marked as default, a PersistentVolumeClaim without storageClassName explicitly specified cannot be created.
Verify that your chosen StorageClass is default:
kubectl get storageclass
The output is similar to this:
NAME PROVISIONER AGE
standard kubernetes.io/gce-pd 1d
gold (default) kubernetes.io/gce-pd 1d
2.10 - Change the Reclaim Policy of a PersistentVolume
This page shows how to change the reclaim policy of a Kubernetes
PersistentVolume.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
PersistentVolumes can have various reclaim policies, including "Retain",
"Recycle", and "Delete". For dynamically provisioned PersistentVolumes,
the default reclaim policy is "Delete". This means that a dynamically provisioned
volume is automatically deleted when a user deletes the corresponding
PersistentVolumeClaim. This automatic behavior might be inappropriate if the volume
contains precious data. In that case, it is more appropriate to use the "Retain"
policy. With the "Retain" policy, if a user deletes a PersistentVolumeClaim,
the corresponding PersistentVolume will not be deleted. Instead, it is moved to the
Released phase, where all of its data can be manually recovered.
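The patch command itself was omitted above; a sketch of it looks like this:
kubectl patch pv <your-pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'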
where <your-pv-name> is the name of your chosen PersistentVolume.
Note:
On Windows, you must double quote any JSONPath template that contains spaces (not single
quote as shown above for bash). This in turn means that you must use a single quote or escaped
double quote around any literals in the template. For example:
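An illustrative command (the fields queried here are only examples) showing double quotes around the template and single quotes around the literals:
kubectl get pv -o jsonpath="{range .items[*]}{.metadata.name}{'\t'}{.spec.persistentVolumeReclaimPolicy}{'\n'}{end}"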
In the preceding output, you can see that the volume bound to claim
default/claim3 has reclaim policy Retain. It will not be automatically
deleted when a user deletes claim default/claim3.
Since cloud providers develop and release at a different pace compared to the Kubernetes project, abstracting the provider-specific code to the cloud-controller-manager binary allows cloud vendors to evolve independently from the core Kubernetes code.
The cloud-controller-manager can be linked to any cloud provider that satisfies cloudprovider.Interface. For backwards compatibility, the cloud-controller-manager provided in the core Kubernetes project uses the same cloud libraries as kube-controller-manager. Cloud providers already supported in Kubernetes core are expected to use the in-tree cloud-controller-manager to transition out of Kubernetes core.
Administration
Requirements
Every cloud has its own set of requirements for running its cloud provider integration; these should not be too different from the requirements for running kube-controller-manager. As a general rule of thumb, you'll need:
cloud authentication/authorization: your cloud may require a token or IAM rules to allow access to their APIs
kubernetes authentication/authorization: cloud-controller-manager may need RBAC rules set to speak to the kubernetes apiserver
high availability: like kube-controller-manager, you may want a highly available setup for the cloud controller manager, using leader election (on by default).
Running cloud-controller-manager
Successfully running cloud-controller-manager requires some changes to your cluster configuration.
kube-apiserver and kube-controller-manager MUST NOT specify the --cloud-provider flag. This ensures that they do not run any cloud-specific loops that would be run by the cloud controller manager. In the future, this flag will be deprecated and removed.
kubelet must run with --cloud-provider=external. This is to ensure that the kubelet is aware that it must be initialized by the cloud controller manager before it is scheduled any work.
Keep in mind that setting up your cluster to use cloud controller manager will change your cluster behaviour in a few ways:
kubelets specifying --cloud-provider=external will add a taint node.cloudprovider.kubernetes.io/uninitialized with an effect NoSchedule during initialization. This marks the node as needing a second initialization from an external controller before it can be scheduled work. Note that in the event that cloud controller manager is not available, new nodes in the cluster will be left unschedulable. The taint is important since the scheduler may require cloud specific information about nodes such as their region or type (high cpu, gpu, high memory, spot instance, etc).
cloud information about nodes in the cluster will no longer be retrieved using local metadata, but instead all API calls to retrieve node information will go through cloud controller manager. This may mean you can restrict access to your cloud API on the kubelets for better security. For larger clusters you may want to consider if cloud controller manager will hit rate limits since it is now responsible for almost all API calls to your cloud from within the cluster.
The cloud controller manager can implement:
Node controller - responsible for updating kubernetes nodes using cloud APIs and deleting kubernetes nodes that were deleted on your cloud.
Service controller - responsible for loadbalancers on your cloud against services of type LoadBalancer.
Route controller - responsible for setting up network routes on your cloud
any other features you would like to implement if you are running an out-of-tree provider.
Examples
If you are using a cloud that is currently supported in Kubernetes core and would like to adopt cloud controller manager, see the cloud controller manager in kubernetes core.
For cloud controller managers not in Kubernetes core, you can find the respective projects in repositories maintained by cloud vendors or by SIGs.
For providers already in Kubernetes core, you can run the in-tree cloud controller manager as a DaemonSet in your cluster, use the following as a guideline:
# This is an example of how to setup cloud-controller-manager as a Daemonset in your cluster.
# It assumes that your masters can run pods and has the role node-role.kubernetes.io/master
# Note that this Daemonset will not work straight out of the box for your cloud, this is
# meant to be a guideline.
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cloud-controller-manager
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:cloud-controller-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: cloud-controller-manager
  namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: cloud-controller-manager
  name: cloud-controller-manager
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: cloud-controller-manager
  template:
    metadata:
      labels:
        k8s-app: cloud-controller-manager
    spec:
      serviceAccountName: cloud-controller-manager
      containers:
      - name: cloud-controller-manager
        # for in-tree providers we use k8s.gcr.io/cloud-controller-manager
        # this can be replaced with any other image for out-of-tree providers
        image: k8s.gcr.io/cloud-controller-manager:v1.8.0
        command:
        - /usr/local/bin/cloud-controller-manager
        - --cloud-provider=[YOUR_CLOUD_PROVIDER]  # Add your own cloud provider here!
        - --leader-elect=true
        - --use-service-account-credentials
        # these flags will vary for every cloud provider
        - --allocate-node-cidrs=true
        - --configure-cloud-routes=true
        - --cluster-cidr=172.17.0.0/16
      tolerations:
      # this is required so CCM can bootstrap itself
      - key: node.cloudprovider.kubernetes.io/uninitialized
        value: "true"
        effect: NoSchedule
      # these tolerations are to have the daemonset runnable on control plane nodes
      # remove them if your control plane nodes should not run pods
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      # this is to restrict CCM to only run on master nodes
      # the node selector may vary depending on your cluster setup
      nodeSelector:
        node-role.kubernetes.io/master: ""
Limitations
Running cloud controller manager comes with a few possible limitations. Although these limitations are being addressed in upcoming releases, it's important that you are aware of these limitations for production workloads.
Support for Volumes
Cloud controller manager does not implement any of the volume controllers found in kube-controller-manager as the volume integrations also require coordination with kubelets. As we evolve CSI (container storage interface) and add stronger support for flex volume plugins, necessary support will be added to cloud controller manager so that clouds can fully integrate with volumes. Learn more about out-of-tree CSI volume plugins here.
Scalability
The cloud-controller-manager queries your cloud provider's APIs to retrieve information for all nodes. For very large clusters, consider possible bottlenecks such as resource requirements and API rate limiting.
Chicken and Egg
The goal of the cloud controller manager project is to decouple development of cloud features from the core Kubernetes project. Unfortunately, many aspects of the Kubernetes project have assumptions that cloud provider features are tightly integrated into the project. As a result, adopting this new architecture can create several situations where a request is made for information from a cloud provider, but the cloud controller manager may not be able to return that information without the original request being complete.
A good example of this is the TLS bootstrapping feature in the Kubelet. TLS bootstrapping assumes that the Kubelet has the ability to ask the cloud provider (or a local metadata service) for all its address types (private, public, etc) but cloud controller manager cannot set a node's address types without being initialized in the first place which requires that the kubelet has TLS certificates to communicate with the apiserver.
As this initiative evolves, changes will be made to address these issues in upcoming releases.
This page shows how to configure quotas for API objects, including
PersistentVolumeClaims and Services. A quota restricts the number of
objects, of a particular type, that can be created in a namespace.
You specify quotas in a
ResourceQuota
object.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
View detailed information about the ResourceQuota:
kubectl get resourcequota object-quota-demo --namespace=quota-object-example --output=yaml
The output shows that in the quota-object-example namespace, there can be at most
one PersistentVolumeClaim, at most two Services of type LoadBalancer, and no Services
of type NodePort.
2.13 - Control CPU Management Policies on the Node
FEATURE STATE: Kubernetes v1.12 [beta]
Kubernetes keeps many aspects of how pods execute on nodes abstracted
from the user. This is by design. However, some workloads require
stronger guarantees in terms of latency and/or performance in order to operate
acceptably. The kubelet provides methods to enable more complex workload
placement policies while keeping the abstraction free from explicit placement
directives.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
By default, the kubelet uses CFS quota
to enforce pod CPU limits. When the node runs many CPU-bound pods,
the workload can move to different CPU cores depending on
whether the pod is throttled and which CPU cores are available at
scheduling time. Many workloads are not sensitive to this migration and thus
work fine without any intervention.
However, in workloads where CPU cache affinity and scheduling latency
significantly affect workload performance, the kubelet allows alternative CPU
management policies to determine some placement preferences on the node.
Configuration
The CPU Manager policy is set with the --cpu-manager-policy kubelet
flag or the cpuManagerPolicy field in KubeletConfiguration.
There are two supported policies:
none: the default policy, which provides no CPU affinity beyond what the OS scheduler does automatically.
static: allows pods with certain resource characteristics to be
granted increased CPU affinity and exclusivity on the node.
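As a minimal sketch, assuming you manage the kubelet through a configuration file rather than flags, selecting the static policy could look like this; the reservedSystemCPUs value is only illustrative:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
# the static policy requires a non-zero CPU reservation (see the note later in this page)
reservedSystemCPUs: "0,1"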
The CPU manager periodically writes resource updates through the CRI in
order to reconcile in-memory CPU assignments with cgroupfs. The reconcile
frequency is set through a new Kubelet configuration value
--cpu-manager-reconcile-period. If not specified, it defaults to the same
duration as --node-status-update-frequency.
The behavior of the static policy can be fine-tuned using the --cpu-manager-policy-options flag.
The flag takes a comma-separated list of key=value policy options.
This feature can be disabled completely using the CPUManagerPolicyOptions feature gate.
The policy options are split into two groups: alpha quality (hidden by default) and beta quality
(visible by default). The groups are guarded respectively by the CPUManagerPolicyAlphaOptions
and CPUManagerPolicyBetaOptions feature gates. Diverging from the Kubernetes standard, these
feature gates guard groups of options, because it would have been too cumbersome to add a feature
gate for each individual option.
Changing the CPU Manager Policy
Since the CPU manager policy can only be applied when the kubelet spawns new pods, simply changing from
"none" to "static" won't apply to existing pods. So in order to properly change the CPU manager
policy on a node, perform the following steps:
Remove the old CPU manager state file. The path to this file is
/var/lib/kubelet/cpu_manager_state by default. This clears the state maintained by the
CPUManager so that the cpu-sets set up by the new policy won’t conflict with it.
Edit the kubelet configuration to change the CPU manager policy to the desired value.
Start kubelet.
Repeat this process for every node that needs its CPU manager policy changed. Skipping this
process will result in kubelet crashlooping with the following error:
could not restore state from checkpoint: configured policy "static" differs from state checkpoint policy "none", please drain this node and delete the CPU manager checkpoint file "/var/lib/kubelet/cpu_manager_state" before restarting Kubelet
None policy
The none policy explicitly enables the existing default CPU
affinity scheme, providing no affinity beyond what the OS scheduler does
automatically. Limits on CPU usage for
Guaranteed pods and
Burstable pods
are enforced using CFS quota.
Static policy
The static policy allows containers in Guaranteed pods with integer CPU
requests access to exclusive CPUs on the node. This exclusivity is enforced
using the cpuset cgroup controller.
Note: System services such as the container runtime and the kubelet itself can continue to run on these exclusive CPUs. The exclusivity only extends to other pods.
Note: CPU Manager doesn't support offlining and onlining of
CPUs at runtime. Also, if the set of online CPUs changes on the node,
the node must be drained and CPU manager manually reset by deleting the
state file cpu_manager_state in the kubelet root directory.
This policy manages a shared pool of CPUs that initially contains all CPUs in the
node. The amount of exclusively allocatable CPUs is equal to the total
number of CPUs in the node minus any CPU reservations by the kubelet --kube-reserved or
--system-reserved options. From 1.17, the CPU reservation list can be specified
explicitly by kubelet --reserved-cpus option. The explicit CPU list specified by
--reserved-cpus takes precedence over the CPU reservation specified by
--kube-reserved and --system-reserved. CPUs reserved by these options are taken, in
integer quantity, from the initial shared pool in ascending order by physical
core ID. This shared pool is the set of CPUs on which any containers in
BestEffort and Burstable pods run. Containers in Guaranteed pods with fractional
CPU requests also run on CPUs in the shared pool. Only containers that are
both part of a Guaranteed pod and have integer CPU requests are assigned
exclusive CPUs.
Note: The kubelet requires a CPU reservation greater than zero be made
using either --kube-reserved and/or --system-reserved or --reserved-cpus when
the static policy is enabled. This is because zero CPU reservation would allow the shared
pool to become empty.
As Guaranteed pods whose containers fit the requirements for being statically
assigned are scheduled to the node, CPUs are removed from the shared pool and
placed in the cpuset for the container. CFS quota is not used to bound
the CPU usage of these containers as their usage is bound by the scheduling domain
itself. In other words, the number of CPUs in the container cpuset is equal to the integer
CPU limit specified in the pod spec. This static assignment increases CPU
affinity and decreases context switches due to throttling for the CPU-bound
workload.
Consider the containers in the following pod specs:
spec:
  containers:
  - name: nginx
    image: nginx
This pod runs in the BestEffort QoS class because no resource requests or
limits are specified. It runs in the shared pool.
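The next manifest was omitted here; an illustrative spec whose requests do not equal limits (the values are examples) might be:
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"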
This pod runs in the Burstable QoS class because resource requests do not
equal limits and the cpu quantity is not specified. It runs in the shared
pool.
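An illustrative spec with integer CPU requests equal to limits (the values are examples):
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "200Mi"
        cpu: "2"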
This pod runs in the Guaranteed QoS class because requests are equal to limits.
And the container's resource limit for the CPU resource is an integer greater than
or equal to one. The nginx container is granted 2 exclusive CPUs.
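An illustrative spec with a fractional CPU request equal to its limit (values are examples):
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "1.5"
      requests:
        memory: "200Mi"
        cpu: "1.5"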
This pod runs in the Guaranteed QoS class because requests are equal to limits.
But the container's resource limit for the CPU resource is a fraction. It runs in
the shared pool.
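An illustrative spec with only limits specified (values are examples):
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"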
This pod runs in the Guaranteed QoS class because only limits are specified
and requests are set equal to limits when not explicitly specified. And the
container's resource limit for the CPU resource is an integer greater than or
equal to one. The nginx container is granted 2 exclusive CPUs.
Static policy options
You can toggle groups of options on and off based upon their maturity level
using the following feature gates:
CPUManagerPolicyBetaOptions default enabled. Disable to hide beta-level options.
CPUManagerPolicyAlphaOptions default disabled. Enable to show alpha-level options.
You will still have to enable each option using the CPUManagerPolicyOptions kubelet option.
The following policy options exist for the static CPUManager policy:
full-pcpus-only (beta, visible by default)
distribute-cpus-across-numa (alpha, hidden by default)
If the full-pcpus-only policy option is specified, the static policy will always allocate full physical cores.
By default, without this option, the static policy allocates CPUs using a topology-aware best-fit allocation.
On SMT enabled systems, the policy can allocate individual virtual cores, which correspond to hardware threads.
This can lead to different containers sharing the same physical cores; this behaviour in turn contributes
to the noisy neighbours problem.
With the option enabled, the pod will be admitted by the kubelet only if the CPU request of all its containers
can be fulfilled by allocating full physical cores.
If the pod does not pass admission, it will be put in a Failed state with the message SMTAlignmentError.
If the distribute-cpus-across-numa policy option is specified, the static
policy will evenly distribute CPUs across NUMA nodes in cases where more than
one NUMA node is required to satisfy the allocation.
By default, the CPUManager will pack CPUs onto one NUMA node until it is
filled, with any remaining CPUs simply spilling over to the next NUMA node.
This can cause undesired bottlenecks in parallel code relying on barriers (and
similar synchronization primitives), as this type of code tends to run only as
fast as its slowest worker (which is slowed down by the fact that fewer CPUs
are available on at least one NUMA node).
By distributing CPUs evenly across NUMA nodes, application developers can more
easily ensure that no single worker suffers from NUMA effects more than any
other, improving the overall performance of these types of applications.
The full-pcpus-only option can be enabled by adding full-pcpus-only=true to
the CPUManager policy options.
Likewise, the distribute-cpus-across-numa option can be enabled by adding
distribute-cpus-across-numa=true to the CPUManager policy options.
When both are set, they are "additive" in the sense that CPUs will be
distributed across NUMA nodes in chunks of full-pcpus rather than individual
cores.
2.14 - Control Topology Management Policies on a node
FEATURE STATE: Kubernetes v1.18 [beta]
An increasing number of systems leverage a combination of CPUs and hardware accelerators to support latency-critical execution and high-throughput parallel computation. These include workloads in fields such as telecommunications, scientific computing, machine learning, financial services and data analytics. Such hybrid systems comprise a high performance environment.
In order to extract the best performance, optimizations related to CPU isolation, memory and device locality are required. However, in Kubernetes, these optimizations are handled by a disjoint set of components.
Topology Manager is a Kubelet component that aims to coordinate the set of components that are responsible for these optimizations.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version v1.18.
To check the version, enter kubectl version.
How Topology Manager Works
Prior to the introduction of Topology Manager, the CPU and Device Manager in Kubernetes made resource allocation decisions independently of each other.
This can result in undesirable allocations on multi-socket systems; performance- and latency-sensitive applications will suffer due to these undesirable allocations.
Undesirable in this case means, for example, CPUs and devices being allocated from different NUMA nodes, thus incurring additional latency.
The Topology Manager is a Kubelet component, which acts as a source of truth so that other Kubelet components can make topology aligned resource allocation choices.
The Topology Manager provides an interface for components, called Hint Providers, to send and receive topology information. Topology Manager has a set of node level policies which are explained below.
The Topology Manager receives topology information from the Hint Providers as a bitmask denoting the NUMA Nodes available and a preferred allocation indication. The Topology Manager policies perform a set of operations on the hints provided and converge on the hint determined by the policy to give the optimal result; if an undesirable hint is stored, the preferred field for the hint is set to false. In the current policies, preferred is the narrowest preferred mask.
The selected hint is stored as part of the Topology Manager. Depending on the policy configured the pod can be accepted or rejected from the node based on the selected hint.
The hint is then stored in the Topology Manager for use by the Hint Providers when making the resource allocation decisions.
Enable the Topology Manager feature
Support for the Topology Manager requires the TopologyManager feature gate to be enabled. It is enabled by default starting with Kubernetes 1.18.
Topology Manager Scopes and Policies
The Topology Manager currently:
Aligns Pods of all QoS classes.
Aligns the requested resources that Hint Providers provide topology hints for.
If these conditions are met, the Topology Manager will align the requested resources.
In order to customise how this alignment is carried out, the Topology Manager provides two distinct knobs: scope and policy.
The scope defines the granularity at which you would like resource alignment to be performed (e.g. at the pod or container level). And the policy defines the actual strategy used to carry out the alignment (e.g. best-effort, restricted, single-numa-node, etc.).
Details on the various scopes and policies available today can be found below.
Note: To align CPU resources with other requested resources in a Pod Spec, the CPU Manager should be enabled and proper CPU Manager policy should be configured on a Node. See control CPU Management Policies.
Note: To align memory (and hugepages) resources with other requested resources in a Pod Spec, the Memory Manager should be enabled and proper Memory Manager policy should be configured on a Node. Examine Memory Manager documentation.
Topology Manager Scopes
The Topology Manager can deal with the alignment of resources in a couple of distinct scopes:
container (default)
pod
Either option can be selected at kubelet startup, with the --topology-manager-scope flag.
container scope
The container scope is used by default.
Within this scope, the Topology Manager performs a number of sequential resource alignments, i.e., for each container (in a pod) a separate alignment is computed. In other words, there is no notion of grouping the containers to a specific set of NUMA nodes, for this particular scope. In effect, the Topology Manager performs an arbitrary alignment of individual containers to NUMA nodes.
The notion of grouping containers is, by design, implemented in the following scope: the pod scope.
pod scope
To select the pod scope, start the kubelet with the command line option --topology-manager-scope=pod.
This scope allows for grouping all containers in a pod to a common set of NUMA nodes. That is, the Topology Manager treats a pod as a whole and attempts to allocate the entire pod (all containers) to either a single NUMA node or a common set of NUMA nodes. The following examples illustrate the alignments produced by the Topology Manager on different occasions:
all containers can be and are allocated to a single NUMA node;
all containers can be and are allocated to a shared set of NUMA nodes.
The total amount of a particular resource demanded for the entire pod is calculated according to the effective requests/limits formula; for each resource, this total value is equal to the maximum of:
the sum of all app container requests,
the maximum of init container requests.
Using the pod scope in tandem with single-numa-node Topology Manager policy is specifically valuable for workloads that are latency sensitive or for high-throughput applications that perform IPC. By combining both options, you are able to place all containers in a pod onto a single NUMA node; hence, the inter-NUMA communication overhead can be eliminated for that pod.
In the case of single-numa-node policy, a pod is accepted only if a suitable set of NUMA nodes is present among possible allocations. Reconsider the example above:
a set containing only a single NUMA node leads to the pod being admitted,
whereas a set containing more than one NUMA node results in pod rejection (because instead of one NUMA node, two or more NUMA nodes are required to satisfy the allocation).
To recap, Topology Manager first computes a set of NUMA nodes and then tests it against Topology Manager policy, which either leads to the rejection or admission of the pod.
Topology Manager Policies
Topology Manager supports four allocation policies. You can set a policy via a Kubelet flag, --topology-manager-policy.
There are four supported policies:
none (default)
best-effort
restricted
single-numa-node
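As an illustrative sketch, selecting a policy (and, optionally, the pod scope) via the kubelet configuration file might look like this; the particular values are examples:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod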
Note: If Topology Manager is configured with the pod scope, the container that is considered by the policy reflects the requirements of the entire pod, and thus each container from the pod will result in the same topology alignment decision.
none policy
This is the default policy and does not perform any topology alignment.
best-effort policy
For each container in a Pod, the kubelet, with best-effort topology
management policy, calls each Hint Provider to discover their resource availability.
Using this information, the Topology Manager stores the
preferred NUMA Node affinity for that container. If the affinity is not preferred,
Topology Manager will store this and admit the pod to the node anyway.
The Hint Providers can then use this information when making the
resource allocation decision.
restricted policy
For each container in a Pod, the kubelet, with restricted topology
management policy, calls each Hint Provider to discover their resource availability.
Using this information, the Topology Manager stores the
preferred NUMA Node affinity for that container. If the affinity is not preferred,
Topology Manager will reject this pod from the node. This will result in a pod in a Terminated state with a pod admission failure.
Once the pod is in a Terminated state, the Kubernetes scheduler will not attempt to reschedule the pod. It is recommended to use a ReplicaSet or Deployment to trigger a redeploy of the pod.
An external control loop could be also implemented to trigger a redeployment of pods that have the Topology Affinity error.
If the pod is admitted, the Hint Providers can then use this information when making the
resource allocation decision.
single-numa-node policy
For each container in a Pod, the kubelet, with single-numa-node topology
management policy, calls each Hint Provider to discover their resource availability.
Using this information, the Topology Manager determines if a single NUMA Node affinity is possible.
If it is, Topology Manager will store this and the Hint Providers can then use this information when making the
resource allocation decision.
If, however, this is not possible then the Topology Manager will reject the pod from the node. This will result in a pod in a Terminated state with a pod admission failure.
Once the pod is in a Terminated state, the Kubernetes scheduler will not attempt to reschedule the pod. It is recommended to use a Deployment with replicas to trigger a redeploy of the Pod.
An external control loop could be also implemented to trigger a redeployment of pods that have the Topology Affinity error.
Pod Interactions with Topology Manager Policies
Consider the containers in the following pod specs:
spec:
  containers:
  - name: nginx
    image: nginx
This pod runs in the BestEffort QoS class because no resource requests or
limits are specified.
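An illustrative manifest for such a pod (the values are examples):
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"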
This pod runs in the Burstable QoS class because requests are less than limits.
If the selected policy is anything other than none, Topology Manager would consider these Pod specifications. The Topology Manager would consult the Hint Providers to get topology hints. In the case of the static CPU Manager policy, a default topology hint would be returned, because these Pods do not explicitly request CPU resources.
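As another illustration (the device name example.com/device is hypothetical), consider a pod that requests only an extended resource:
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        example.com/device: "1"
      requests:
        example.com/device: "1"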
This pod runs in the BestEffort QoS class because there are no CPU and memory requests.
The Topology Manager would consider the above pods. The Topology Manager would consult the Hint Providers, which are CPU and Device Manager to get topology hints for the pods.
In the case of the Guaranteed pod with integer CPU request, the static CPU Manager policy would return topology hints relating to the exclusive CPU and the Device Manager would send back hints for the requested device.
In the case of the Guaranteed pod with sharing CPU request, the static CPU Manager policy would return default topology hint as there is no exclusive CPU request and the Device Manager would send back hints for the requested device.
In the above two cases of the Guaranteed pod, the none CPU Manager policy would return default topology hint.
In the case of the BestEffort pod, the static CPU Manager policy would send back the default topology hint as there is no CPU request and the Device Manager would send back the hints for each of the requested devices.
Using this information the Topology Manager calculates the optimal hint for the pod and stores this information, which will be used by the Hint Providers when they are making their resource assignments.
Known Limitations
The maximum number of NUMA nodes that Topology Manager allows is 8. With more than 8 NUMA nodes there will be a state explosion when trying to enumerate the possible NUMA affinities and generating their hints.
The scheduler is not topology-aware, so a pod may be scheduled on a node and then fail on that node due to the Topology Manager.
2.15 - Customizing DNS Service
This page explains how to configure your DNS
Pod(s) and customize the
DNS resolution process in your cluster.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
Your cluster must be running the CoreDNS add-on.
Migrating to CoreDNS
explains how to use kubeadm to migrate from kube-dns.
Your Kubernetes server must be at or later than version v1.12.
To check the version, enter kubectl version.
Introduction
DNS is a built-in Kubernetes service launched automatically
using the addon manager cluster add-on.
As of Kubernetes v1.12, CoreDNS is the recommended DNS Server, replacing kube-dns. If your cluster
originally used kube-dns, you may still have kube-dns deployed rather than CoreDNS.
Note: The CoreDNS Service is named kube-dns in the metadata.name field.
This is so that there is greater interoperability with workloads that relied on the legacy kube-dns Service name to resolve addresses internal to the cluster. Using a Service named kube-dns abstracts away the implementation detail of which DNS provider is running behind that common name.
If you are running CoreDNS as a Deployment, it will typically be exposed as a Kubernetes Service with a static IP address.
The kubelet passes DNS resolver information to each container with the --cluster-dns=<dns-service-ip> flag.
DNS names also need domains. You configure the local domain in the kubelet
with the flag --cluster-domain=<default-local-domain>.
The DNS server supports forward lookups (A and AAAA records), port lookups (SRV records), reverse IP address lookups (PTR records),
and more. For more information, see DNS for Services and Pods.
If a Pod's dnsPolicy is set to default, it inherits the name resolution
configuration from the node that the Pod runs on. The Pod's DNS resolution
should behave the same as the node.
But see Known issues.
If you don't want this, or if you want a different DNS config for pods, you can
use the kubelet's --resolv-conf flag. Set this flag to "" to prevent Pods from
inheriting DNS. Set it to a valid file path to specify a file other than
/etc/resolv.conf for DNS inheritance.
CoreDNS
CoreDNS is a general-purpose authoritative DNS server that can serve as cluster DNS, complying with the dns specifications.
CoreDNS ConfigMap options
CoreDNS is a DNS server that is modular and pluggable, and each plugin adds new functionality to CoreDNS.
This can be configured by maintaining a Corefile, which is the CoreDNS
configuration file. As a cluster administrator, you can modify the
ConfigMap for the CoreDNS Corefile to change how DNS service discovery
behaves for that cluster.
In Kubernetes, CoreDNS is installed with the following default Corefile configuration:
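The Corefile itself was omitted here; a representative default (details may differ slightly between releases) looks roughly like this:
.:53 {
    errors
    health {
       lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       fallthrough in-addr.arpa ip6.arpa
       ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}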
health: Health of CoreDNS is reported to http://localhost:8080/health. In this extended syntax lameduck will make the process unhealthy then wait for 5 seconds before the process is shut down.
ready: An HTTP endpoint on port 8181 will return 200 OK, when all plugins that are able to signal readiness have done so.
kubernetes: CoreDNS will reply to DNS queries based on the IPs of the Services and Pods of Kubernetes. You can find more details about that plugin on the CoreDNS website. ttl allows you to set a custom TTL for responses. The default is 5 seconds. The minimum TTL allowed is 0 seconds, and the maximum is capped at 3600 seconds. Setting TTL to 0 will prevent records from being cached.
The pods insecure option is provided for backward compatibility with kube-dns. You can use the pods verified option, which returns an A record only if there exists a pod in the same namespace with a matching IP. The pods disabled option can be used if you don't use pod records.
prometheus: Metrics of CoreDNS are available at http://localhost:9153/metrics in Prometheus format (also known as OpenMetrics).
forward: Any queries that are not within the cluster domain of Kubernetes will be forwarded to predefined resolvers (/etc/resolv.conf).
loop: Detects simple forwarding loops and halts the CoreDNS process if a loop is found.
reload: Allows automatic reload of a changed Corefile. After you edit the ConfigMap configuration, allow two minutes for your changes to take effect.
loadbalance: This is a round-robin DNS loadbalancer that randomizes the order of A, AAAA, and MX records in the answer.
You can modify the default CoreDNS behavior by modifying the ConfigMap.
Configuration of Stub-domain and upstream nameserver using CoreDNS
CoreDNS has the ability to configure stubdomains and upstream nameservers using the forward plugin.
Example
Suppose a cluster operator has a Consul domain server located at 10.150.0.1, and all Consul names have the suffix .consul.local. To configure it in CoreDNS, the cluster administrator creates the following stanza in the CoreDNS ConfigMap.
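The stanza itself was omitted above; it would look roughly like this:
consul.local:53 {
    errors
    cache 30
    forward . 10.150.0.1
}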
To explicitly force all non-cluster DNS lookups to go through a specific nameserver at 172.16.0.1, point the forward to the nameserver instead of /etc/resolv.conf
forward . 172.16.0.1
The final ConfigMap along with the default Corefile configuration looks like:
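A sketch of that combined ConfigMap (abbreviated, and assuming the defaults and the values shown above) might be:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . 172.16.0.1
        cache 30
        loop
        reload
        loadbalance
    }
    consul.local:53 {
        errors
        cache 30
        forward . 10.150.0.1
    }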
The kubeadm tool supports automatic translation from the kube-dns ConfigMap
to the equivalent CoreDNS ConfigMap.
Note: While kube-dns accepts an FQDN for stubdomain and nameserver (eg: ns.foo.com), CoreDNS does not support this feature.
During translation, all FQDN nameservers will be omitted from the CoreDNS config.
CoreDNS configuration equivalent to kube-dns
CoreDNS supports the features of kube-dns and more.
A ConfigMap created for kube-dns to support StubDomains and upstreamNameservers translates to the forward plugin in CoreDNS.
Example
This example ConfigMap for kube-dns specifies stubdomains and upstreamnameservers:
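An illustrative kube-dns ConfigMap of that shape (the domains and IP addresses are examples):
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  stubDomains: |
    {"abc.com" : ["1.2.3.4"], "my.cluster.local" : ["2.3.4.5"]}
  upstreamNameservers: |
    ["8.8.8.8", "8.8.4.4"]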
This page provides hints on diagnosing DNS problems.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
Note: This example creates a pod in the default namespace. DNS name resolution for
services depends on the namespace of the pod. For more information, review
DNS for Services and Pods.
Use the kubectl get pods command to verify that the DNS pod is running.
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
NAME READY STATUS RESTARTS AGE
...
coredns-7b96bf9f76-5hsxb 1/1 Running 0 1h
coredns-7b96bf9f76-mvmmt 1/1 Running 0 1h
...
Note: The value for label k8s-app is kube-dns for both CoreDNS and kube-dns deployments.
If you see that no CoreDNS Pod is running or that the Pod has failed/completed,
the DNS add-on may not be deployed by default in your current environment and you
will have to deploy it manually.
Check for errors in the DNS pod
Use the kubectl logs command to see logs for the DNS containers.
See if there are any suspicious or unexpected messages in the logs.
Is DNS service up?
Verify that the DNS service is up by using the kubectl get service command.
kubectl get svc --namespace=kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
...
kube-dns ClusterIP 10.0.0.10 <none> 53/UDP,53/TCP 1h
...
Note: The service name is kube-dns for both CoreDNS and kube-dns deployments.
If you have created the Service, or if it should have been created by default
but it does not appear, see
debugging Services for
more information.
Are DNS endpoints exposed?
You can verify that DNS endpoints are exposed by using the kubectl get endpoints
command.
kubectl get endpoints kube-dns --namespace=kube-system
NAME ENDPOINTS AGE
kube-dns 10.180.3.17:53,10.180.3.17:53 1h
If you do not see the endpoints, see the endpoints section in the
debugging Services documentation.
For additional Kubernetes DNS examples, see the
cluster-dns examples
in the Kubernetes GitHub repository.
Are DNS queries being received/processed?
You can verify if queries are being received by CoreDNS by adding the log plugin to the CoreDNS configuration (aka Corefile).
The CoreDNS Corefile is held in a ConfigMap named coredns. To edit it, use the command:
kubectl -n kube-system edit configmap coredns
Then add log in the Corefile section per the example below:
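As a sketch, the Corefile in the coredns ConfigMap gains a log line near the top of the cluster zone; the rest of the stanza shown here is a typical default and may differ in your cluster:
.:53 {
    log
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}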
After saving the changes, it may take a minute or two for Kubernetes to propagate these changes to the CoreDNS pods.
Next, make some queries and view the logs per the sections above in this document. If CoreDNS pods are receiving the queries, you should see them in the logs.
Some Linux distributions (e.g. Ubuntu) use a local DNS resolver by default (systemd-resolved).
Systemd-resolved moves and replaces /etc/resolv.conf with a stub file that can cause a fatal forwarding
loop when resolving names in upstream servers. This can be fixed manually by using kubelet's --resolv-conf flag
to point to the correct resolv.conf (with systemd-resolved, this is /run/systemd/resolve/resolv.conf).
kubeadm automatically detects systemd-resolved, and adjusts the kubelet flags accordingly.
Kubernetes installs do not configure the nodes' resolv.conf files to use the
cluster DNS by default, because that process is inherently distribution-specific.
This should probably be implemented eventually.
Linux's libc (a.k.a. glibc) limits the number of DNS nameserver records to 3 by default. What's more, for glibc versions older than glibc-2.17-222 (see this issue for the newer versions), the allowed number of DNS search records is limited to 6 (see this bug from 2005). Kubernetes needs to consume 1 nameserver record and 3 search records. This means that if a local installation already uses 3 nameservers or uses more than 3 searches while your glibc version is in the affected list, some of those settings will be lost. To work around the DNS nameserver records limit, the node can run dnsmasq, which will provide more nameserver entries. You can also use kubelet's --resolv-conf flag. To fix the DNS search records limit, consider upgrading your Linux distribution or upgrading to an unaffected version of glibc.
If you are using Alpine version 3.3 or earlier as your base image, DNS may not
work properly due to a known issue with Alpine.
Kubernetes issue 30215
details more information on this.
This document helps you get started using the Kubernetes NetworkPolicy API to declare network policies that govern how pods communicate with each other.
Note:
This section links to third party projects that provide functionality required by Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are listed alphabetically. To add a project to this list, read the content guide before submitting a change. More information.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version v1.8.
To check the version, enter kubectl version.
Make sure you've configured a network provider with network policy support. There are a number of network providers that support NetworkPolicy, including:
Create an nginx deployment and expose it via a service
To see how Kubernetes network policy works, start off by creating an nginx Deployment.
kubectl create deployment nginx --image=nginx
deployment.apps/nginx created
Expose the Deployment through a Service called nginx.
kubectl expose deployment nginx --port=80
service/nginx exposed
The above commands create a Deployment with an nginx Pod and expose the Deployment through a Service named nginx. The nginx Pod and Deployment are found in the default namespace.
kubectl get svc,pod
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes 10.100.0.1 <none> 443/TCP 46m
service/nginx 10.100.0.16 <none> 80/TCP 33s
NAME READY STATUS RESTARTS AGE
pod/nginx-701339712-e0qfq 1/1 Running 0 35s
Test the service by accessing it from another Pod
You should be able to access the new nginx service from other Pods. To access the nginx Service from another Pod in the default namespace, start a busybox container:
kubectl run busybox --rm -ti --image=busybox:1.28 -- /bin/sh
In your shell, run the following command:
wget --spider --timeout=1 nginx
Connecting to nginx (10.100.0.16:80)
remote file exists
Limit access to the nginx service
To limit the access to the nginx service so that only Pods with the label access: true can query it, create a NetworkPolicy object as follows:
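A sketch of such a policy, saved for example as nginx-policy.yaml (the name access-nginx matches the output shown further below):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: access-nginx
spec:
  podSelector:
    matchLabels:
      app: nginx
  ingress:
  - from:
    - podSelector:
        matchLabels:
          access: "true"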
Note: NetworkPolicy includes a podSelector which selects the grouping of Pods to which the policy applies. You can see this policy selects Pods with the label app=nginx. The label was automatically added to the Pod in the nginx Deployment. An empty podSelector selects all pods in the namespace.
Assign the policy to the service
Use kubectl to create a NetworkPolicy from the above nginx-policy.yaml file:
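Assuming the manifest above was saved locally as nginx-policy.yaml:
kubectl apply -f nginx-policy.yaml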
networkpolicy.networking.k8s.io/access-nginx created
Test access to the service when access label is not defined
When you attempt to access the nginx Service from a Pod without the correct labels, the request times out:
kubectl run busybox --rm -ti --image=busybox:1.28 -- /bin/sh
In your shell, run the command:
wget --spider --timeout=1 nginx
Connecting to nginx (10.100.0.16:80)
wget: download timed out
Define access label and test again
You can create a Pod with the correct labels to see that the request is allowed:
kubectl run busybox --rm -ti --labels="access=true" --image=busybox:1.28 -- /bin/sh
In your shell, run the command:
wget --spider --timeout=1 nginx
Connecting to nginx (10.100.0.16:80)
remote file exists
2.18 - Developing Cloud Controller Manager
FEATURE STATE:Kubernetes v1.11 [beta]
The cloud-controller-manager is a Kubernetes control plane component
that embeds cloud-specific control logic. The cloud controller manager lets you link your
cluster into your cloud provider's API, and separates out the components that interact
with that cloud platform from components that only interact with your cluster.
By decoupling the interoperability logic between Kubernetes and the underlying cloud
infrastructure, the cloud-controller-manager component enables cloud providers to release
features at a different pace compared to the main Kubernetes project.
Background
Since cloud providers develop and release at a different pace compared to the Kubernetes project, abstracting the provider-specific code to the cloud-controller-manager binary allows cloud vendors to evolve independently from the core Kubernetes code.
The Kubernetes project provides skeleton cloud-controller-manager code with Go interfaces to allow you (or your cloud provider) to plug in your own implementations. This means that a cloud provider can implement a cloud-controller-manager by importing packages from Kubernetes core; each cloudprovider will register their own code by calling cloudprovider.RegisterCloudProvider to update a global variable of available cloud providers.
Developing
Out of tree
To build an out-of-tree cloud-controller-manager for your cloud:
Use main.go in cloud-controller-manager from Kubernetes core as a template for your main.go. As mentioned above, the only difference should be the cloud package that will be imported.
Many cloud providers publish their controller manager code as open source. If you are creating
a new cloud-controller-manager from scratch, you could take an existing out-of-tree cloud
controller manager as your starting point.
This page shows how to enable or disable an API version from your cluster's
control plane.
Specific API versions can be turned on or off by passing --runtime-config=api/<version> as a
command line argument to the API server. The values for this argument are a comma-separated
list of API versions. Later values override earlier values.
The runtime-config command line argument also supports 2 special keys:
api/all, representing all known APIs
api/legacy, representing only legacy APIs. Legacy APIs are any APIs that have been
explicitly deprecated.
For example, to turn off all API versions except v1, pass --runtime-config=api/all=false,api/v1=true
to the kube-apiserver.
This feature, specifically the alpha topologyKeys field, is deprecated since
Kubernetes v1.21.
Topology Aware Hints,
introduced in Kubernetes v1.21, provide similar functionality.
Service Topology enables a Service to route traffic based upon the Node
topology of the cluster. For example, a service can specify that traffic be
preferentially routed to endpoints that are on the same Node as the client, or
in the same availability zone.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
This page shows how to enable and configure encryption of secret data at rest.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version 1.13.
To check the version, enter kubectl version.
etcd v3.0 or later is required
Configuration and determining whether encryption at rest is already enabled
The kube-apiserver process accepts an argument --encryption-provider-config
that controls how API data is encrypted in etcd.
The configuration is provided as an API named
EncryptionConfiguration.
An example configuration is provided below.
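A minimal sketch of such a configuration, encrypting Secrets with a single aescbc key and falling back to the identity provider for reading unencrypted data (the key name and placeholder value are illustrative):
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <BASE64-ENCODED 32-BYTE KEY>
      - identity: {}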
Caution: For high-availability configurations (with two or more control plane nodes), the
encryption configuration file must be the same on every control plane node. Otherwise, the kube-apiserver component cannot
decrypt data stored in etcd.
Understanding the encryption at rest configuration
Each resources array item is a separate config and contains a complete configuration. The
resources.resources field is an array of Kubernetes resource names (resource or resource.group)
that should be encrypted. The providers array is an ordered list of the possible encryption
providers.
Only one provider type may be specified per entry (identity or aescbc may be provided,
but not both in the same item).
The first provider in the list is used to encrypt resources written into the storage. When reading
resources from storage, each provider that matches the stored data attempts in order to decrypt the
data. If no provider can read the stored data due to a mismatch in format or secret key, an error
is returned which prevents clients from accessing that resource.
For more detailed information about the EncryptionConfiguration struct, please refer to the
encryption configuration API.
Caution: If any resource is not readable via the encryption config (because keys were changed),
the only recourse is to delete that key from the underlying etcd directly. Calls that attempt to
read that resource will fail until it is deleted or a valid decryption key is provided.
Providers for Kubernetes encryption at rest:
identity: No encryption (strength, speed, and key length are not applicable). Resources are written as-is without encryption. When set as the first provider, the resource will be decrypted as new values are written.
secretbox: XSalsa20 and Poly1305. Strength: strong. Speed: faster. Key length: 32-byte. A newer standard and may not be considered acceptable in environments that require high levels of review.
aesgcm: AES-GCM with random nonce. Strength: must be rotated every 200k writes. Speed: fastest. Key length: 16, 24, or 32-byte. Not recommended for use except when an automated key rotation scheme is implemented.
aescbc: AES-CBC with PKCS#7 padding. Strength: weak. Speed: fast. Key length: 32-byte. Not recommended due to CBC's vulnerability to padding oracle attacks.
kms: Envelope encryption scheme: data is encrypted by data encryption keys (DEKs) using AES-CBC with PKCS#7 padding, and DEKs are encrypted by key encryption keys (KEKs) according to configuration in a Key Management Service (KMS). Strength: strongest. Speed: fast. Key length: 32-byte. The recommended choice for using a third party tool for key management. Simplifies key rotation, with a new DEK generated for each encryption, and KEK rotation controlled by the user. See Configure the KMS provider.
Each provider supports multiple keys - the keys are tried in order for decryption, and if the provider
is the first provider, the first key is used for encryption.
Caution: Storing the raw encryption key in the EncryptionConfig only moderately improves your security
posture, compared to no encryption. Please use kms provider for additional security.
By default, the identity provider is used to protect Secrets in etcd, which provides no
encryption. EncryptionConfiguration was introduced to encrypt Secrets locally, with a locally
managed key.
Encrypting Secrets with a locally managed key protects against an etcd compromise, but it fails to
protect against a host compromise. Since the encryption keys are stored on the host in the
EncryptionConfiguration YAML file, a skilled attacker can access that file and extract the encryption
keys.
Envelope encryption creates dependence on a separate key, not stored in Kubernetes. In this case,
an attacker would need to compromise etcd, the kube-apiserver, and the third-party KMS provider to
retrieve the plaintext values, providing a higher level of security than locally stored encryption keys.
To set up encryption with a new key, perform the following steps:
Generate a 32-byte random key and base64 encode it. If you're on Linux or macOS, run the following command:
head -c 32 /dev/urandom | base64
Place that value in the secret field of the EncryptionConfiguration struct.
Set the --encryption-provider-config flag on the kube-apiserver to point to
the location of the config file.
Restart your API server.
Caution: Your config file contains keys that can decrypt the contents in etcd, so you must properly restrict
permissions on your control-plane nodes so only the user who runs the kube-apiserver can read it.
Verifying that data is encrypted
Data is encrypted when written to etcd. After restarting your kube-apiserver, any newly created or
updated Secret should be encrypted when stored. To check this, you can use the etcdctl command line
program to retrieve the contents of your Secret.
Create a new Secret called secret1 in the default namespace:
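For example (the key/value pair is illustrative, and the etcd connection flags are elided here; reuse the certificate flags shown for etcdctl elsewhere in this documentation):
kubectl create secret generic secret1 -n default --from-literal=mykey=mydata
# Read the stored object directly from etcd; with the aescbc provider active,
# the stored value should be prefixed with k8s:enc:aescbc:v1: rather than plain text.
ETCDCTL_API=3 etcdctl get /registry/secrets/default/secret1 [...] | hexdump -C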
Running kubectl get secrets --all-namespaces -o json | kubectl replace -f - reads all Secrets and then updates them to apply server-side encryption.
Note: If an error occurs due to a conflicting write, retry the command.
For larger clusters, you may wish to subdivide the secrets by namespace or script an update.
Rotating a decryption key
Changing a Secret without incurring downtime requires a multi-step operation, especially in
the presence of a highly-available deployment where multiple kube-apiserver processes are running.
Generate a new key and add it as the second key entry for the current provider on all servers
Restart all kube-apiserver processes to ensure each server can decrypt using the new key
Make the new key the first entry in the keys array so that it is used for encryption in the config
Restart all kube-apiserver processes to ensure each server now encrypts using the new key
Run kubectl get secrets --all-namespaces -o json | kubectl replace -f - to encrypt all
existing Secrets with the new key
Remove the old decryption key from the config after you have backed up etcd with the new key in use
and updated all Secrets
When running a single kube-apiserver instance, step 2 may be skipped.
Decrypting all data
To disable encryption at rest, place the identity provider as the first entry in the config
and restart all kube-apiserver processes.
2.22 - Guaranteed Scheduling For Critical Add-On Pods
Kubernetes core components such as the API server, scheduler, and controller-manager run on a control plane node. However, add-ons must run on a regular cluster node.
Some of these add-ons are critical to a fully functional cluster, such as metrics-server, DNS, and UI.
A cluster may stop working properly if a critical add-on is evicted (either manually or as a side effect of another operation like upgrade)
and becomes pending (for example when the cluster is highly utilized and either there are other pending pods that schedule into the space
vacated by the evicted critical add-on pod or the amount of resources available on the node changed for some other reason).
Note that marking a pod as critical is not meant to prevent evictions entirely; it only prevents the pod from becoming permanently unavailable.
A static pod marked as critical can't be evicted. However, non-static pods marked as critical are always rescheduled.
Marking pod as critical
To mark a Pod as critical, set priorityClassName for that Pod to system-cluster-critical or system-node-critical. system-node-critical is the highest available priority, even higher than system-cluster-critical.
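A minimal sketch of a critical add-on Pod (the name and image are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: my-critical-addon
  namespace: kube-system
spec:
  priorityClassName: system-cluster-critical
  containers:
  - name: addon
    image: example.com/my-addon:1.0   # placeholder image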
2.23 - IP Masquerade Agent User Guide
This page shows how to configure and enable the ip-masq-agent.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
The ip-masq-agent configures iptables rules to hide a pod's IP address behind the cluster node's IP address. This is typically done when sending traffic to destinations outside the cluster's pod CIDR range.
Key Terms
NAT (Network Address Translation)
A method of remapping one IP address to another by modifying the source and/or destination address information in the IP header. Typically performed by a device doing IP routing.
Masquerading
A form of NAT that is typically used to perform a many to one address translation, where multiple source IP addresses are masked behind a single address, which is typically the device doing the IP routing. In Kubernetes this is the Node's IP address.
CIDR (Classless Inter-Domain Routing)
Based on variable-length subnet masking, CIDR allows specifying arbitrary-length prefixes. CIDR introduced a new method of representation for IP addresses, now commonly known as CIDR notation, in which an address or routing prefix is written with a suffix indicating the number of bits of the prefix, such as 192.168.2.0/24.
Link Local
A link-local address is a network address that is valid only for communications within the network segment or the broadcast domain that the host is connected to. Link-local addresses for IPv4 are defined in the address block 169.254.0.0/16 in CIDR notation.
The ip-masq-agent configures iptables rules to handle masquerading node/pod IP addresses when sending traffic to destinations outside the cluster node's IP and the Cluster IP range. This essentially hides pod IP addresses behind the cluster node's IP address. In some environments, traffic to "external" addresses must come from a known machine address. For example, in Google Cloud, any traffic to the internet must come from a VM's IP. When containers are used, as in Google Kubernetes Engine, the Pod IP will be rejected for egress. To avoid this, we must hide the Pod IP behind the VM's own IP address - generally known as "masquerade". By default, the agent is configured to treat the three private IP ranges specified by RFC 1918 as non-masquerade CIDR. These ranges are 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16. The agent will also treat link-local (169.254.0.0/16) as a non-masquerade CIDR by default. The agent is configured to reload its configuration from the location /etc/config/ip-masq-agent every 60 seconds, which is also configurable.
The agent configuration file must be written in YAML or JSON syntax, and may contain three optional keys:
nonMasqueradeCIDRs: A list of strings in
CIDR notation that specify the non-masquerade ranges.
masqLinkLocal: A Boolean (true/false) which indicates whether to masquerade traffic to the
link local prefix 169.254.0.0/16. False by default.
resyncInterval: A time interval at which the agent attempts to reload config from disk.
For example: '30s', where 's' means seconds, 'ms' means milliseconds.
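Putting the three keys together, an agent configuration file might look like the following sketch (the CIDR list is illustrative):
nonMasqueradeCIDRs:
  - 10.0.0.0/8
  - 192.168.0.0/16
masqLinkLocal: false
resyncInterval: 60s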
Traffic to the 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16 ranges will NOT be masqueraded; any other traffic (assumed to be internet-bound) will be masqueraded by default. An example of a local destination from a pod could be its node's IP address, another node's address, or one of the IP addresses in the cluster's IP range. The entries below show the default set of rules applied by the ip-masq-agent:
iptables -t nat -L IP-MASQ-AGENT
RETURN all -- anywhere 169.254.0.0/16 /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
RETURN all -- anywhere 10.0.0.0/8 /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
RETURN all -- anywhere 172.16.0.0/12 /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
RETURN all -- anywhere 192.168.0.0/16 /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
MASQUERADE all -- anywhere anywhere /* ip-masq-agent: outbound traffic should be subject to MASQUERADE (this match must come after cluster-local CIDR matches) */ ADDRTYPE match dst-type !LOCAL
By default, in GCE/Google Kubernetes Engine, if network policy is enabled or
you are using a cluster CIDR not in the 10.0.0.0/8 range, the ip-masq-agent
will run in your cluster. If you are running in another environment,
you can add the ip-masq-agent DaemonSet
to your cluster.
Create an ip-masq-agent
To create an ip-masq-agent, run the following kubectl command:
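For example, applying the upstream DaemonSet manifest (the URL is an assumption; check the ip-masq-agent documentation for the current location):
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/ip-masq-agent/master/ip-masq-agent.yaml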
More information can be found in the ip-masq-agent documentation.
In most cases, the default set of rules should be sufficient; however, if this is not the case for your cluster, you can create and apply a ConfigMap to customize the IP ranges that are affected. For example, to allow only 10.0.0.0/8 to be considered by the ip-masq-agent, you can create the following ConfigMap in a file called "config".
Note:
It is important that the file is called config since, by default, that will be used as the key for lookup by the ip-masq-agent:
nonMasqueradeCIDRs:
  - 10.0.0.0/8
resyncInterval: 60s
Run the following command to add the config map to your cluster:
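Assuming the file above is named config, something like the following should work:
kubectl create configmap ip-masq-agent --from-file=config --namespace=kube-system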
This will update a file located at /etc/config/ip-masq-agent which is periodically checked every resyncInterval and applied to the cluster node.
After the resync interval has expired, you should see the iptables rules reflect your changes:
iptables -t nat -L IP-MASQ-AGENT
Chain IP-MASQ-AGENT (1 references)
target prot opt source destination
RETURN all -- anywhere 169.254.0.0/16 /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
RETURN all -- anywhere 10.0.0.0/8 /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
MASQUERADE all -- anywhere anywhere /* ip-masq-agent: outbound traffic should be subject to MASQUERADE (this match must come after cluster-local CIDR matches) */ ADDRTYPE match dst-type !LOCAL
By default, the link-local range (169.254.0.0/16) is also treated as a non-masquerade CIDR by the ip-masq-agent, which sets up the corresponding iptables rule. To have the ip-masq-agent masquerade traffic to the link-local range as well, set masqLinkLocal to true in the ConfigMap.
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
The cluster-admin is operating a cluster on behalf of a user population and the admin wants to control
how much storage a single namespace can consume in order to control cost.
The admin would like to limit:
The number of persistent volume claims in a namespace
The amount of storage each claim can request
The amount of cumulative storage the namespace can have
LimitRange to limit requests for storage
Adding a LimitRange to a namespace enforces storage request sizes to a minimum and maximum. Storage is requested
via PersistentVolumeClaim. The admission controller that enforces limit ranges will reject any PVC that is above or below
the values set by the admin.
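A sketch of such a LimitRange, matching the 2Gi maximum discussed below (the 1Gi minimum is illustrative):
apiVersion: v1
kind: LimitRange
metadata:
  name: storagelimits
spec:
  limits:
  - type: PersistentVolumeClaim
    max:
      storage: 2Gi
    min:
      storage: 1Gi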
In this example, a PVC requesting 10Gi of storage would be rejected because it exceeds the 2Gi max.
Minimum storage requests are used when the underlying storage provider requires certain minimums. For example,
AWS EBS volumes have a 1Gi minimum requirement.
StorageQuota to limit PVC count and cumulative storage capacity
Admins can limit the number of PVCs in a namespace as well as the cumulative capacity of those PVCs. New PVCs that exceed
either maximum value will be rejected.
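A sketch of such a ResourceQuota, matching the limits discussed below:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storagequota
spec:
  hard:
    persistentvolumeclaims: "5"
    requests.storage: "5Gi"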
In this example, a 6th PVC in the namespace would be rejected because it exceeds the maximum count of 5. Alternatively,
a 5Gi maximum quota when combined with the 2Gi max limit above, cannot have 3 PVCs where each has 2Gi. That would be 6Gi requested
for a namespace capped at 5Gi.
A limit range can put a ceiling on how much storage is requested while a resource quota can effectively cap the storage
consumed by a namespace through claim counts and cumulative storage capacity. This allows a cluster-admin to plan their
cluster's storage budget without risk of any one project going over their allotment.
2.25 - Migrate Replicated Control Plane To Use Cloud Controller Manager
FEATURE STATE:Kubernetes v1.24 [stable]
The cloud-controller-manager is a Kubernetes control plane component
that embeds cloud-specific control logic. The cloud controller manager lets you link your
cluster into your cloud provider's API, and separates out the components that interact
with that cloud platform from components that only interact with your cluster.
By decoupling the interoperability logic between Kubernetes and the underlying cloud
infrastructure, the cloud-controller-manager component enables cloud providers to release
features at a different pace compared to the main Kubernetes project.
Background
As part of the cloud provider extraction effort, all cloud specific controllers must be moved out of the kube-controller-manager. All existing clusters that run cloud controllers in the kube-controller-manager must migrate to instead run the controllers in a cloud provider specific cloud-controller-manager.
Leader Migration provides a mechanism in which HA clusters can safely migrate "cloud specific" controllers between the kube-controller-manager and the cloud-controller-manager via a shared resource lock between the two components while upgrading the replicated control plane. For a single-node control plane, or if unavailability of controller managers can be tolerated during the upgrade, Leader Migration is not needed and this guide can be ignored.
Leader Migration can be enabled by setting --enable-leader-migration on kube-controller-manager or cloud-controller-manager. Leader Migration only applies during the upgrade and can be safely disabled or left enabled after the upgrade is complete.
This guide walks you through the manual process of upgrading the control plane from kube-controller-manager with built-in cloud provider to running both kube-controller-manager and cloud-controller-manager. If you use a tool to deploy and manage the cluster, please refer to the documentation of the tool and the cloud provider for specific instructions of the migration.
Before you begin
It is assumed that the control plane is running Kubernetes version N and to be upgraded to version N + 1. Although it is possible to migrate within the same version, ideally the migration should be performed as part of an upgrade so that changes of configuration can be aligned to each release. The exact versions of N and N + 1 depend on each cloud provider. For example, if a cloud provider builds a cloud-controller-manager to work with Kubernetes 1.24, then N can be 1.23 and N + 1 can be 1.24.
The control plane nodes should run kube-controller-manager with Leader Election enabled, which is the default. As of version N, an in-tree cloud provider must be set with --cloud-provider flag and cloud-controller-manager should not yet be deployed.
The out-of-tree cloud provider must have built a cloud-controller-manager with Leader Migration implementation. If the cloud provider imports k8s.io/cloud-provider and k8s.io/controller-manager of version v0.21.0 or later, Leader Migration will be available. However, for version before v0.22.0, Leader Migration is alpha and requires feature gate ControllerManagerLeaderMigration to be enabled in cloud-controller-manager.
This guide assumes that kubelet of each control plane node starts kube-controller-manager and cloud-controller-manager as static pods defined by their manifests. If the components run in a different setting, please adjust the steps accordingly.
For authorization, this guide assumes that the cluster uses RBAC. If another authorization mode grants permissions to kube-controller-manager and cloud-controller-manager components, please grant the needed access in a way that matches the mode.
Grant access to Migration Lease
The default permissions of the controller managers allow only access to their main Lease. In order for the migration to work, access to another Lease is required.
You can grant kube-controller-manager full access to the leases API by modifying the system::leader-locking-kube-controller-manager role. This task guide assumes that the name of the migration lease is cloud-provider-extraction-migration.
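As a sketch, the rule to add to that Role in the kube-system namespace would look roughly like the following (the verb list is an assumption; grant whatever your controller managers need to create and renew the Lease). A matching rule is needed on system::leader-locking-cloud-controller-manager for the cloud-controller-manager:
- apiGroups:
    - coordination.k8s.io
  resources:
    - leases
  resourceNames:
    - cloud-provider-extraction-migration
  verbs:
    - create
    - list
    - get
    - update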
Leader Migration optionally takes a configuration file representing the state of controller-to-manager assignment. At this moment, with in-tree cloud provider, kube-controller-manager runs route, service, and cloud-node-lifecycle. The following example configuration shows the assignment.
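A sketch of that configuration, with all three controllers still assigned to kube-controller-manager:
kind: LeaderMigrationConfiguration
apiVersion: controllermanager.config.k8s.io/v1
leaderName: cloud-provider-extraction-migration
controllerLeaders:
  - name: route
    component: kube-controller-manager
  - name: service
    component: kube-controller-manager
  - name: cloud-node-lifecycle
    component: kube-controller-manager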
Leader Migration can be enabled without a configuration. Please see Default Configuration for details.
Alternatively, because the controllers can run under either controller manager, setting component to *
for both sides makes the configuration file consistent between both parties of the migration.
On each control plane node, save the content to /etc/leadermigration.conf, and update the manifest of kube-controller-manager so that the file is mounted inside the container at the same location. Also, update the same manifest to add the following arguments:
--enable-leader-migration to enable Leader Migration on the controller manager
--leader-migration-config=/etc/leadermigration.conf to set configuration file
Restart kube-controller-manager on each node. At this moment, kube-controller-manager has leader migration enabled and is ready for the migration.
Deploy Cloud Controller Manager
In version N + 1, the desired state of controller-to-manager assignment can be represented by a new configuration file, shown as follows. Please note component field of each controllerLeaders changing from kube-controller-manager to cloud-controller-manager. Alternatively, use the wildcard version mentioned above, which has the same effect.
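A sketch of the version N + 1 configuration, with the controllers now assigned to cloud-controller-manager:
kind: LeaderMigrationConfiguration
apiVersion: controllermanager.config.k8s.io/v1
leaderName: cloud-provider-extraction-migration
controllerLeaders:
  - name: route
    component: cloud-controller-manager
  - name: service
    component: cloud-controller-manager
  - name: cloud-node-lifecycle
    component: cloud-controller-manager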
When creating control plane nodes of version N + 1, the content should be deployed to /etc/leadermigration.conf. The manifest of cloud-controller-manager should be updated to mount the configuration file in the same manner as kube-controller-manager of version N. Similarly, add --enable-leader-migration and --leader-migration-config=/etc/leadermigration.conf to the arguments of cloud-controller-manager.
Create a new control plane node of version N + 1 with the updated cloud-controller-manager manifest, and with the --cloud-provider flag set to external for kube-controller-manager. kube-controller-manager of version N + 1 MUST NOT have Leader Migration enabled because, with an external cloud provider, it does not run the migrated controllers anymore, and thus it is not involved in the migration.
The control plane now contains nodes of both version N and N + 1. The nodes of version N run kube-controller-manager only, and those of version N + 1 run both kube-controller-manager and cloud-controller-manager. The migrated controllers, as specified in the configuration, are running under either kube-controller-manager of version N or cloud-controller-manager of version N + 1 depending on which controller manager holds the migration lease. No controller will ever be running under both controller managers at any time.
In a rolling manner, create a new control plane node of version N + 1 and bring down one of version N until the control plane contains only nodes of version N + 1.
If a rollback from version N + 1 to N is required, add nodes of version N with Leader Migration enabled for kube-controller-manager back to the control plane, replacing one of version N + 1 each time until there are only nodes of version N.
(Optional) Disable Leader Migration
Now that the control plane has been upgraded to run both kube-controller-manager and cloud-controller-manager of version N + 1, Leader Migration has finished its job and can be safely disabled to save one Lease resource. It is safe to re-enable Leader Migration for the rollback in the future.
In a rolling manner, update the manifest of cloud-controller-manager to unset both the --enable-leader-migration and --leader-migration-config= flags, remove the mount of /etc/leadermigration.conf, and finally remove /etc/leadermigration.conf itself. To re-enable Leader Migration, recreate the configuration file and add its mount and the flags that enable Leader Migration back to cloud-controller-manager.
Default Configuration
Starting with Kubernetes 1.22, Leader Migration provides a default configuration suitable for the default controller-to-manager assignment.
The default configuration can be enabled by setting --enable-leader-migration but without --leader-migration-config=.
For kube-controller-manager and cloud-controller-manager, if there are no flags that enable any in-tree cloud provider or change ownership of controllers, the default configuration can be used to avoid manual creation of the configuration file.
Special case: migrating the Node IPAM controller
If your cloud provider provides an implementation of Node IPAM controller, you should switch to the implementation in cloud-controller-manager. Disable Node IPAM controller in kube-controller-manager of version N + 1 by adding --controllers=*,-nodeipam to its flags. Then add nodeipam to the list of migrated controllers.
# wildcard version, with nodeipam
kind: LeaderMigrationConfiguration
apiVersion: controllermanager.config.k8s.io/v1
leaderName: cloud-provider-extraction-migration
controllerLeaders:
  - name: route
    component: "*"
  - name: service
    component: "*"
  - name: cloud-node-lifecycle
    component: "*"
  - name: nodeipam
    component: "*"
A mechanism to attach authorization and policy to a subsection of the cluster.
Use of multiple namespaces is optional.
This example demonstrates how to use Kubernetes namespaces to subdivide your cluster.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
By default, a Kubernetes cluster will instantiate a default namespace when provisioning the cluster to hold the default set of Pods,
Services, and Deployments used by the cluster.
Assuming you have a fresh cluster, you can inspect the available namespaces by doing the following:
kubectl get namespaces
NAME STATUS AGE
default Active 13m
Create new namespaces
For this exercise, we will create two additional Kubernetes namespaces to hold our content.
Let's imagine a scenario where an organization is using a shared Kubernetes cluster for development and production use cases.
The development team would like to maintain a space in the cluster where they can get a view on the list of Pods, Services, and Deployments
they use to build and run their application. In this space, Kubernetes resources come and go, and the restrictions on who can or cannot modify resources
are relaxed to enable agile development.
The operations team would like to maintain a space in the cluster where they can enforce strict procedures on who can or cannot manipulate the set of
Pods, Services, and Deployments that run the production site.
One pattern this organization could follow is to partition the Kubernetes cluster into two namespaces: development and production.
Let's create two new namespaces to hold our work.
Use the file namespace-dev.json which describes a development namespace:
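A sketch of namespace-dev.json, followed by the command to create the namespace (create a matching namespace-prod.json for production in the same way):
{
  "apiVersion": "v1",
  "kind": "Namespace",
  "metadata": {
    "name": "development",
    "labels": {
      "name": "development"
    }
  }
}

kubectl create -f namespace-dev.json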
The next step is to define a context for the kubectl client to work in each namespace. The values of the "cluster" and "user" fields are copied from the current context.
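For example (the cluster and user names are placeholders; copy them from kubectl config view):
kubectl config set-context dev --namespace=development --cluster=<cluster-name> --user=<user-name>
kubectl config set-context prod --namespace=production --cluster=<cluster-name> --user=<user-name>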
By default, the above commands add two contexts that are saved into the file
.kube/config. You can now view the contexts and switch between the two
new request contexts depending on which namespace you wish to work against.
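For example, switch to the development context and create the snowflake Deployment described below (the image matches the one used later for the cattle Deployment):
kubectl config use-context dev
kubectl create deployment snowflake --image=k8s.gcr.io/serve_hostname --replicas=2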
We have created a Deployment with a replica count of 2, running Pods called snowflake with a basic container that serves the hostname.
kubectl get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
snowflake 2/2 2 2 2m
kubectl get pods -l app=snowflake
NAME READY STATUS RESTARTS AGE
snowflake-3968820950-9dgr8 1/1 Running 0 2m
snowflake-3968820950-vgc4n 1/1 Running 0 2m
This is great: developers are able to do what they want, and they do not have to worry about affecting content in the production namespace.
Let's switch to the production namespace and show how resources in one namespace are hidden from the other.
kubectl config use-context prod
The production namespace should be empty, and the following commands should return nothing.
kubectl get deployment
kubectl get pods
Production likes to run cattle, so let's create some cattle pods.
kubectl create deployment cattle --image=k8s.gcr.io/serve_hostname --replicas=5
kubectl get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
cattle 5/5 5 5 10s
kubectl get pods -l app=cattle
NAME READY STATUS RESTARTS AGE
cattle-2263376956-41xy6 1/1 Running 0 34s
cattle-2263376956-kw466 1/1 Running 0 34s
cattle-2263376956-n4v97 1/1 Running 0 34s
cattle-2263376956-p5p3i 1/1 Running 0 34s
cattle-2263376956-sxpth 1/1 Running 0 34s
At this point, it should be clear that the resources users create in one namespace are hidden from the other namespace.
As the policy support in Kubernetes evolves, we will extend this scenario to show how you can provide different
authorization rules for each namespace.
2.27 - Operating etcd clusters for Kubernetes
etcd is a consistent and highly-available key value store used as Kubernetes' backing store for all cluster data.
If your Kubernetes cluster uses etcd as its backing store, make sure you have a
backup plan for the data.
You can find in-depth information about etcd in the official documentation.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
etcd is a leader-based distributed system. Ensure that the leader
periodically sends heartbeats on time to all followers to keep the cluster
stable.
Ensure that no resource starvation occurs.
Performance and stability of the cluster is sensitive to network and disk
I/O. Any resource starvation can lead to heartbeat timeout, causing instability
of the cluster. An unstable etcd indicates that no leader is elected. Under
such circumstances, a cluster cannot make any changes to its current state,
which implies no new pods can be scheduled.
Keeping etcd clusters stable is critical to the stability of Kubernetes
clusters. Therefore, run etcd clusters on dedicated machines or isolated
environments for guaranteed resource requirements.
The minimum recommended version of etcd to run in production is 3.2.10+.
Resource requirements
Operating etcd with limited resources is suitable only for testing purposes.
For deploying in production, advanced hardware configuration is required.
Before deploying etcd in production, see
resource requirement reference.
Starting etcd clusters
This section covers starting a single-node and multi-node etcd cluster.
Single-node etcd cluster
Use a single-node etcd cluster only for testing purposes.
Start the Kubernetes API server with the flag
--etcd-servers=$PRIVATE_IP:2379.
Make sure PRIVATE_IP is set to your etcd client IP.
Multi-node etcd cluster
For durability and high availability, run etcd as a multi-node cluster in
production and back it up periodically. A five-member cluster is recommended
in production. For more information, see
FAQ documentation.
Configure an etcd cluster either by static member information or by dynamic
discovery. For more information on clustering, see
etcd clustering documentation.
For an example, consider a five-member etcd cluster running with the following
client URLs: http://$IP1:2379, http://$IP2:2379, http://$IP3:2379,
http://$IP4:2379, and http://$IP5:2379. To start a Kubernetes API server:
Start the Kubernetes API servers with the flag
--etcd-servers=$IP1:2379,$IP2:2379,$IP3:2379,$IP4:2379,$IP5:2379.
Make sure the IP<n> variables are set to your client IP addresses.
Multi-node etcd cluster with load balancer
To run a load balancing etcd cluster:
Set up an etcd cluster.
Configure a load balancer in front of the etcd cluster.
For example, let the address of the load balancer be $LB.
Start Kubernetes API Servers with the flag --etcd-servers=$LB:2379.
Securing etcd clusters
Access to etcd is equivalent to root permission in the cluster so ideally only
the API server should have access to it. Considering the sensitivity of the
data, it is recommended to grant permission to only those nodes that require
access to etcd clusters.
To secure etcd, either set up firewall rules or use the security features
provided by etcd. etcd security features depend on x509 Public Key
Infrastructure (PKI). To begin, establish secure communication channels by
generating a key and certificate pair. For example, use key pairs peer.key
and peer.cert for securing communication between etcd members, and
client.key and client.cert for securing communication between etcd and its
clients. See the example scripts
provided by the etcd project to generate key pairs and CA files for client
authentication.
Securing communication
To configure etcd with secure peer communication, specify flags
--peer-key-file=peer.key and --peer-cert-file=peer.cert, and use HTTPS as
the URL schema.
Similarly, to configure etcd with secure client communication, specify flags
--key-file=k8sclient.key and --cert-file=k8sclient.cert, and use HTTPS as
the URL schema. Here is an example of a client command that uses secure
communication:
ETCDCTL_API=3 etcdctl --endpoints 10.2.0.9:2379 \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
member list
Limiting access of etcd clusters
After configuring secure communication, restrict the access of etcd cluster to
only the Kubernetes API servers. Use TLS authentication to do so.
For example, consider key pairs k8sclient.key and k8sclient.cert that are
trusted by the CA etcd.ca. When etcd is configured with --client-cert-auth
along with TLS, it verifies the certificates from clients by using system CAs
or the CA passed in by --trusted-ca-file flag. Specifying flags
--client-cert-auth=true and --trusted-ca-file=etcd.ca will restrict the
access to clients with the certificate k8sclient.cert.
Once etcd is configured correctly, only clients with valid certificates can
access it. To give Kubernetes API servers the access, configure them with the
flags --etcd-certfile=k8sclient.cert, --etcd-keyfile=k8sclient.key and
--etcd-cafile=ca.cert.
Note: etcd authentication is not currently supported by Kubernetes. For more
information, see the related issue
Support Basic Auth for Etcd v2.
Replacing a failed etcd member
etcd cluster achieves high availability by tolerating minor member failures.
However, to improve the overall health of the cluster, replace failed members
immediately. When multiple members fail, replace them one by one. Replacing a
failed member involves two steps: removing the failed member and adding a new
member.
Though etcd keeps unique member IDs internally, it is recommended to use a
unique name for each member to avoid human errors. For example, consider a
three-member etcd cluster. Let the URLs be, member1=http://10.0.0.1,
member2=http://10.0.0.2, and member3=http://10.0.0.3. When member1 fails,
replace it with member4=http://10.0.0.4.
Get the member ID of the failed member1:
etcdctl --endpoints=http://10.0.0.2,http://10.0.0.3 member list
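The remaining replacement steps, sketched with a placeholder member ID taken from the member list output:
# remove the failed member1 using the ID printed by 'member list'
etcdctl --endpoints=http://10.0.0.2,http://10.0.0.3 member remove <member1-ID>
# add the replacement member4
etcdctl --endpoints=http://10.0.0.2,http://10.0.0.3 member add member4 --peer-urls=http://10.0.0.4:2380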
Update the --etcd-servers flag for the Kubernetes API servers to make
Kubernetes aware of the configuration changes, then restart the
Kubernetes API servers.
Update the load balancer configuration if a load balancer is used in the
deployment.
All Kubernetes objects are stored on etcd. Periodically backing up the etcd
cluster data is important to recover Kubernetes clusters under disaster
scenarios, such as losing all control plane nodes. The snapshot file contains
all the Kubernetes states and critical information. In order to keep the
sensitive Kubernetes data safe, encrypt the snapshot files.
Backing up an etcd cluster can be accomplished in two ways: etcd built-in
snapshot and volume snapshot.
Built-in snapshot
etcd supports built-in snapshot. A snapshot may either be taken from a live
member with the etcdctl snapshot save command or by copying the
member/snap/db file from an etcd
data directory
that is not currently used by an etcd process. Taking the snapshot will
not affect the performance of the member.
Below is an example for taking a snapshot of the keyspace served by
$ENDPOINT to the file snapshotdb:
ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT snapshot save snapshotdb
Verify the snapshot:
ETCDCTL_API=3 etcdctl --write-out=table snapshot status snapshotdb
If etcd is running on a storage volume that supports backup, such as Amazon
Elastic Block Store, back up etcd data by taking a snapshot of the storage
volume.
Snapshot using etcdctl options
We can also take the snapshot using the various options given by etcdctl. For example,
ETCDCTL_API=3 etcdctl -h
lists the options available from etcdctl. You can take a snapshot by specifying
the endpoint, certificates, and so on, as shown below:
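A sketch of such a command (the endpoint and file paths are placeholders):
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=<trusted-ca-file> --cert=<cert-file> --key=<key-file> \
  snapshot save <backup-file-location>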
where trusted-ca-file, cert-file and key-file can be obtained from the description of the etcd Pod.
Scaling up etcd clusters
Scaling up etcd clusters increases availability by trading off performance.
Scaling does not increase cluster performance nor capability. A general rule
is not to scale up or down etcd clusters. Do not configure any auto scaling
groups for etcd clusters. It is highly recommended to always run a static
five-member etcd cluster for production Kubernetes clusters at any officially
supported scale.
A reasonable scaling is to upgrade a three-member cluster to a five-member
one, when more reliability is desired. See
etcd reconfiguration documentation
for information on how to add members into an existing cluster.
Restoring an etcd cluster
etcd supports restoring from snapshots that are taken from an etcd process of
the major.minor version. Restoring a version from a
different patch version of etcd also is supported. A restore operation is
employed to recover the data of a failed cluster.
Before starting the restore operation, a snapshot file must be present. It can
either be a snapshot file from a previous backup operation, or from a remaining
data directory.
Here is an example:
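A sketch of a restore from the snapshotdb file taken earlier (the data directory is a placeholder):
ETCDCTL_API=3 etcdctl --data-dir <data-dir-location> snapshot restore snapshotdb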
If the access URLs of the restored cluster are changed from the previous
cluster, the Kubernetes API server must be reconfigured accordingly. In this
case, restart Kubernetes API servers with the flag
--etcd-servers=$NEW_ETCD_CLUSTER instead of the flag
--etcd-servers=$OLD_ETCD_CLUSTER. Replace $NEW_ETCD_CLUSTER and
$OLD_ETCD_CLUSTER with the respective IP addresses. If a load balancer is
used in front of an etcd cluster, you might need to update the load balancer
instead.
If the majority of etcd members have permanently failed, the etcd cluster is
considered failed. In this scenario, Kubernetes cannot make any changes to its
current state. Although the scheduled pods might continue to run, no new pods
can be scheduled. In such cases, recover the etcd cluster and potentially
reconfigure Kubernetes API servers to fix the issue.
Note:
If any API servers are running in your cluster, you should not attempt to
restore instances of etcd. Instead, follow these steps to restore etcd:
stop all API server instances
restore state in all etcd instances
restart all API server instances
We also recommend restarting any components (e.g. kube-scheduler,
kube-controller-manager, kubelet) to ensure that they don't rely on some
stale data. Note that in practice, the restore takes a bit of time. During the
restoration, critical components will lose leader lock and restart themselves.
Upgrading etcd clusters
For more details on etcd upgrade, please refer to the etcd upgrades documentation.
Note: Before you start an upgrade, please back up your etcd cluster first.
2.28 - Reconfigure a Node's Kubelet in a Live Cluster
FEATURE STATE:Kubernetes v1.22 [deprecated]
Caution: The Dynamic Kubelet Configuration
feature is deprecated in 1.22 and removed in 1.24.
Please switch to alternative means of distributing configuration to the Nodes of your cluster.
Migrating from using Dynamic Kubelet Configuration
There is no recommended replacement for this feature that works generically
across various Kubernetes distributions. If you are using a managed Kubernetes
offering, please consult the vendor hosting your Kubernetes for the best
practices for customizing your Kubernetes. If you are using kubeadm, refer to
Configuring each kubelet in your cluster using kubeadm.
In order to migrate off the Dynamic Kubelet Configuration feature, an
alternative mechanism should be used to distribute kubelet configuration files.
To apply a configuration change, the config file must be updated and the kubelet restarted.
See the Set Kubelet parameters via a config file
for information.
Please note, the DynamicKubeletConfig feature gate cannot be set on a kubelet
starting with v1.24, as it has no effect. However, the feature gate is not removed
from the API server or the controller manager before v1.26. This is designed for
the control plane to support nodes with older versions of kubelets and for
satisfying the Kubernetes version skew policy.
2.29 - Reserve Compute Resources for System Daemons
Kubernetes nodes can be scheduled to Capacity. Pods can consume all the
available capacity on a node by default. This is an issue because nodes
typically run quite a few system daemons that power the OS and Kubernetes
itself. Unless resources are set aside for these system daemons, pods and system
daemons compete for resources and lead to resource starvation issues on the
node.
The kubelet exposes a feature named 'Node Allocatable' that helps to reserve
compute resources for system daemons. Kubernetes recommends cluster
administrators to configure 'Node Allocatable' based on their workload density
on each node.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version 1.8.
To check the version, enter kubectl version.
Your Kubernetes server must be at or later than version 1.17 to use
the kubelet command line option --reserved-cpus to set an
explicitly reserved CPU list.
Node Allocatable
'Allocatable' on a Kubernetes node is defined as the amount of compute resources
that are available for pods. The scheduler does not over-subscribe
'Allocatable'. 'CPU', 'memory' and 'ephemeral-storage' are supported as of now.
Node Allocatable is exposed as part of v1.Node object in the API and as part
of kubectl describe node in the CLI.
Resources can be reserved for two categories of system daemons in the kubelet.
Enabling QoS and Pod level cgroups
To properly enforce node allocatable constraints on the node, you must
enable the new cgroup hierarchy via the --cgroups-per-qos flag. This flag is
enabled by default. When enabled, the kubelet will parent all end-user pods
under a cgroup hierarchy managed by the kubelet.
Configuring a cgroup driver
The kubelet supports manipulation of the cgroup hierarchy on
the host using a cgroup driver. The driver is configured via the
--cgroup-driver flag.
The supported values are the following:
cgroupfs is the default driver that performs direct manipulation of the
cgroup filesystem on the host in order to manage cgroup sandboxes.
systemd is an alternative driver that manages cgroup sandboxes using
transient slices for resources that are supported by that init system.
Depending on the configuration of the associated container runtime,
operators may have to choose a particular cgroup driver to ensure
proper system behavior. For example, if operators use the systemd
cgroup driver provided by the containerd runtime, the kubelet must
be configured to use the systemd cgroup driver.
kube-reserved is meant to capture resource reservation for kubernetes system
daemons like the kubelet, container runtime, node problem detector, etc.
It is not meant to reserve resources for system daemons that are run as pods.
kube-reserved is typically a function of pod density on the nodes.
In addition to cpu, memory, and ephemeral-storage, pid may be
specified to reserve the specified number of process IDs for
kubernetes system daemons.
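For example, a kubelet might be started with a reservation like the following (the values are illustrative and depend on your pod density):
--kube-reserved=cpu=500m,memory=1Gi,ephemeral-storage=1Gi,pid=1000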
To optionally enforce kube-reserved on kubernetes system daemons, specify the parent
control group for kube daemons as the value for --kube-reserved-cgroup kubelet
flag.
It is recommended that the kubernetes system daemons are placed under a top
level control group (runtime.slice on systemd machines for example). Each
system daemon should ideally run within its own child control group. Refer to
the design proposal
for more details on recommended control group hierarchy.
Note that Kubelet does not create --kube-reserved-cgroup if it doesn't
exist. Kubelet will fail if an invalid cgroup is specified.
system-reserved is meant to capture resource reservation for OS system daemons
like sshd, udev, etc. system-reserved should reserve memory for the
kernel too since kernel memory is not accounted to pods in Kubernetes at this time.
Reserving resources for user login sessions is also recommended (user.slice in
systemd world).
In addition to cpu, memory, and ephemeral-storage, pid may be
specified to reserve the specified number of process IDs for OS system
daemons.
To optionally enforce system-reserved on system daemons, specify the parent
control group for OS system daemons as the value for --system-reserved-cgroup
kubelet flag.
It is recommended that the OS system daemons are placed under a top level
control group (system.slice on systemd machines for example).
Note that the kubelet does not create --system-reserved-cgroup if it doesn't
exist. The kubelet will fail if an invalid cgroup is specified.
Explicitly Reserved CPU List
FEATURE STATE: Kubernetes v1.17 [stable]
Kubelet Flag: --reserved-cpus=0-3
reserved-cpus is meant to define an explicit CPU set for OS system daemons and
kubernetes system daemons. reserved-cpus is for systems that do not intend to
define separate top level cgroups for OS system daemons and kubernetes system daemons
with regard to cpuset resource.
If the Kubelet does not have --system-reserved-cgroup and --kube-reserved-cgroup,
the explicit cpuset provided by reserved-cpus will take precedence over the CPUs
defined by --kube-reserved and --system-reserved options.
This option is specifically designed for Telco/NFV use cases where uncontrolled
interrupts/timers may impact workload performance. You can use this option
to define the explicit cpuset for the system/kubernetes daemons as well as the
interrupts/timers, so that the remaining CPUs on the system can be used exclusively for
workloads, with less impact from uncontrolled interrupts/timers. To move the
system daemons, kubernetes daemons and interrupts/timers to the explicit cpuset
defined by this option, mechanisms outside Kubernetes should be used.
For example, on CentOS you can do this using the tuned toolset.
Memory pressure at the node level leads to System OOMs which affects the entire
node and all pods running on it. Nodes can go offline temporarily until memory
has been reclaimed. To avoid (or reduce the probability of) system OOMs, the kubelet
provides out-of-resource
management. Evictions are
supported for memory and ephemeral-storage only. By reserving some memory via
--eviction-hard flag, the kubelet attempts to evict pods whenever memory
availability on the node drops below the reserved value. Hypothetically, if
system daemons did not exist on a node, pods cannot use more than capacity - eviction-hard. For this reason, resources reserved for evictions are not
available for pods.
The scheduler treats 'Allocatable' as the available capacity for pods.
The kubelet enforces 'Allocatable' across pods by default. Enforcement is performed
by evicting pods whenever the overall usage across all pods exceeds
'Allocatable'. More details on eviction policy can be found
on the node pressure eviction
page. This enforcement is controlled by
specifying pods as a value for the kubelet flag --enforce-node-allocatable.
Optionally, kubelet can be made to enforce kube-reserved and
system-reserved by specifying kube-reserved & system-reserved values in
the same flag. Note that to enforce kube-reserved or system-reserved,
--kube-reserved-cgroup or --system-reserved-cgroup needs to be specified
respectively.
General Guidelines
System daemons are expected to be treated similar to
Guaranteed pods.
System daemons can burst within their bounding control groups and this behavior needs
to be managed as part of kubernetes deployments. For example, kubelet should
have its own control group and share kube-reserved resources with the
container runtime. However, Kubelet cannot burst and use up all available Node
resources if kube-reserved is enforced.
Be extra careful while enforcing system-reserved reservation since it can lead
to critical system services being CPU starved, OOM killed, or unable
to fork on the node. The
recommendation is to enforce system-reserved only if a user has profiled their
nodes exhaustively to come up with precise estimates and is confident in their
ability to recover if any process in that group is oom-killed.
To begin with, enforce 'Allocatable' on pods.
Once adequate monitoring and alerting is in place to track kube system
daemons, attempt to enforce kube-reserved based on usage heuristics.
If absolutely necessary, enforce system-reserved over time.
The resource requirements of kube system daemons may grow over time as more and
more features are added. Over time, the Kubernetes project will attempt to bring
down the utilization of node system daemons, but that is not a priority as of now.
So expect a drop in Allocatable capacity in future releases.
Example Scenario
Here is an example to illustrate Node Allocatable computation:
Node has 32Gi of memory, 16 CPUs and 100Gi of Storage
--kube-reserved is set to cpu=1,memory=2Gi,ephemeral-storage=1Gi
--system-reserved is set to cpu=500m,memory=1Gi,ephemeral-storage=1Gi
--eviction-hard is set to memory.available<500Mi,nodefs.available<10%
Under this scenario, 'Allocatable' will be 14.5 CPUs, 28.5Gi of memory and
88Gi of local storage.
Scheduler ensures that the total memory requests across all pods on this node does
not exceed 28.5Gi and storage doesn't exceed 88Gi.
Kubelet evicts pods whenever the overall memory usage across pods exceeds 28.5Gi,
or if overall disk usage exceeds 88Gi. If all processes on the node consume as
much CPU as they can, pods together cannot consume more than 14.5 CPUs.
If kube-reserved and/or system-reserved is not enforced and system daemons
exceed their reservation, kubelet evicts pods whenever the overall node memory
usage is higher than 31.5Gi or storage is greater than 90Gi.
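The reservations in this scenario could also be expressed in a kubelet configuration file rather than as flags; a minimal sketch, assuming the KubeletConfiguration v1beta1 API:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  cpu: "1"
  memory: "2Gi"
  ephemeral-storage: "1Gi"
systemReserved:
  cpu: "500m"
  memory: "1Gi"
  ephemeral-storage: "1Gi"
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"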
2.30 - Running Kubernetes Node Components as a Non-root User
FEATURE STATE: Kubernetes v1.22 [alpha]
This document describes how to run Kubernetes Node components such as kubelet, CRI, OCI, and CNI
without root privileges, by using a user namespace.
This technique is also known as rootless mode.
Note:
This document describes how to run Kubernetes Node components (and hence pods) as a non-root user.
If you are just looking for how to run a pod as a non-root user, see SecurityContext.
Before you begin
Your Kubernetes server must be at or later than version 1.22.
To check the version, enter kubectl version.
minikube also supports running Kubernetes inside Rootless Docker.
See the page about the docker driver in the Minikube documentation.
Rootless Podman is not supported.
Running Kubernetes inside Unprivileged Containers
Note:
This section links to third party projects that provide functionality required by Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are listed alphabetically. To add a project to this list, read the content guide before submitting a change. More information.
sysbox
Sysbox is an open-source container runtime
(similar to "runc") that supports running system-level workloads such as Docker
and Kubernetes inside unprivileged containers isolated with the Linux user
namespace.
Sysbox supports running Kubernetes inside unprivileged containers without
requiring Cgroup v2 and without the KubeletInUserNamespace feature gate. It
does this by exposing specially crafted /proc and /sys filesystems inside
the container plus several other advanced OS virtualization techniques.
Running Rootless Kubernetes directly on a host
Note:
This section links to third party projects that provide functionality required by Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are listed alphabetically. To add a project to this list, read the content guide before submitting a change. More information.
If you are trying to run Kubernetes in a user-namespaced container such as
Rootless Docker/Podman or LXC/LXD, you are all set, and you can go to the next subsection.
Otherwise you have to create a user namespace by yourself, by calling unshare(2) with CLONE_NEWUSER.
A user namespace can also be unshared by using command line tools (see the sketch after the directory list below).
After unsharing the user namespace, you will also have to unshare other namespaces, such as the mount namespace.
You do not need to call chroot() or pivot_root() after unsharing the mount namespace;
however, you have to mount writable filesystems on several directories in the namespace.
At least, the following directories need to be writable in the namespace (not outside the namespace):
/etc
/run
/var/log
/var/lib/kubelet
/var/lib/cni
/var/lib/containerd (for containerd)
/var/lib/containers (for CRI-O)
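A minimal sketch of setting this up by hand with util-linux tools (the paths and the exact set of bind mounts are assumptions; real setups usually script this):
# create a user namespace (mapped to root inside) plus a private mount namespace
unshare --user --map-root-user --mount --propagation=private bash
# inside the namespace, bind-mount writable directories over the host paths, for example:
mount --bind /home/user/rootless/etc /etc
mount --bind /home/user/rootless/run /run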
Creating a delegated cgroup tree
In addition to the user namespace, you also need to have a writable cgroup tree with cgroup v2.
Note: Kubernetes support for running Node components in user namespaces requires cgroup v2.
Cgroup v1 is not supported.
If you are trying to run Kubernetes in Rootless Docker/Podman or LXC/LXD on a systemd-based host, you are all set.
Otherwise you have to create a systemd unit with Delegate=yes property to delegate a cgroup tree with writable permission.
On your node, systemd must already be configured to allow delegation; for more details, see
cgroup v2 in the Rootless
Containers documentation.
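One common way to enable delegation (an assumption about your distribution's layout; see the linked documentation for the authoritative steps) is a systemd drop-in for the user service:
# /etc/systemd/system/user@.service.d/delegate.conf
[Service]
Delegate=cpu cpuset io memory pids
Then reload systemd so the drop-in takes effect:
sudo systemctl daemon-reload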
Configuring network
Note:
This section links to third party projects that provide functionality required by Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are listed alphabetically. To add a project to this list, read the content guide before submitting a change. More information.
The network namespace of the Node components has to have a non-loopback interface, which can be configured with, for example,
slirp4netns,
VPNKit, or
lxc-user-nic(1).
The network namespaces of the Pods can be configured with regular CNI plugins.
For multi-node networking, Flannel (VXLAN, 8472/UDP) is known to work.
Ports such as the kubelet port (10250/TCP) and NodePort service ports have to be exposed from the Node network namespace to
the host with an external port forwarder, such as RootlessKit, slirp4netns, or
socat(1).
The kubelet relies on a container runtime. You should deploy a container runtime such as
containerd or CRI-O and ensure that it is running within the user namespace before the kubelet starts.
Running the CRI plugin of containerd in a user namespace is supported since containerd 1.4.
Running containerd within a user namespace requires the following configurations.
version = 2

[plugins."io.containerd.grpc.v1.cri"]
  # Disable AppArmor
  disable_apparmor = true
  # Ignore an error during setting oom_score_adj
  restrict_oom_score_adj = true
  # Disable hugetlb cgroup v2 controller (because systemd does not support delegating hugetlb controller)
  disable_hugetlb_controller = true

[plugins."io.containerd.grpc.v1.cri".containerd]
  # Using non-fuse overlayfs is also possible for kernel >= 5.11, but requires SELinux to be disabled
  snapshotter = "fuse-overlayfs"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  # We use cgroupfs that is delegated by systemd, so we do not use the SystemdCgroup driver
  # (unless you run another systemd in the namespace)
  SystemdCgroup = false
The default path of the configuration file is /etc/containerd/config.toml.
The path can be specified with containerd -c /path/to/containerd/config.toml.
Running CRI-O in a user namespace is supported since CRI-O 1.22.
CRI-O requires an environment variable _CRIO_ROOTLESS=1 to be set.
The following configurations are also recommended:
[crio]
  storage_driver = "overlay"
  # Using non-fuse overlayfs is also possible for kernel >= 5.11, but requires SELinux to be disabled
  storage_option = ["overlay.mount_program=/usr/local/bin/fuse-overlayfs"]

[crio.runtime]
  # We use cgroupfs that is delegated by systemd, so we do not use the "systemd" driver
  # (unless you run another systemd in the namespace)
  cgroup_manager = "cgroupfs"
The default path of the configuration file is /etc/crio/crio.conf.
The path can be specified with crio --config /path/to/crio/crio.conf.
Configuring kubelet
Running kubelet in a user namespace requires the following configuration:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletInUserNamespace: true
# We use cgroupfs that is delegated by systemd, so we do not use the "systemd" driver
# (unless you run another systemd in the namespace)
cgroupDriver: "cgroupfs"
When the KubeletInUserNamespace feature gate is enabled, the kubelet ignores errors
that may happen during setting the following sysctl values on the node.
vm.overcommit_memory
vm.panic_on_oom
kernel.panic
kernel.panic_on_oops
kernel.keys.root_maxkeys
kernel.keys.root_maxbytes
Within a user namespace, the kubelet also ignores any error raised from trying to open /dev/kmsg.
This feature gate also allows kube-proxy to ignore an error during setting RLIMIT_NOFILE.
The KubeletInUserNamespace feature gate was introduced in Kubernetes v1.22 with "alpha" status.
Running kubelet in a user namespace without using this feature gate is also possible
by mounting a specially crafted proc filesystem (as done by Sysbox), but not officially supported.
Configuring kube-proxy
Running kube-proxy in a user namespace requires the following configuration:
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables" # or "userspace"
conntrack:
  # Skip setting sysctl value "net.netfilter.nf_conntrack_max"
  maxPerCore: 0
  # Skip setting "net.netfilter.nf_conntrack_tcp_timeout_established"
  tcpEstablishedTimeout: 0s
  # Skip setting "net.netfilter.nf_conntrack_tcp_timeout_close"
  tcpCloseWaitTimeout: 0s
Caveats
Most of "non-local" volume drivers such as nfs and iscsi do not work.
Local volumes like local, hostPath, emptyDir, configMap, secret, and downwardAPI are known to work.
Some CNI plugins may not work. Flannel (VXLAN) is known to work.
To ensure that your workloads remain available during maintenance, you can
configure a PodDisruptionBudget.
If availability is important for any applications that run or could run on the node(s)
that you are draining, configure a PodDisruptionBudget
first and then continue following this guide.
Use kubectl drain to remove a node from service
You can use kubectl drain to safely evict all of your pods from a
node before you perform maintenance on the node (e.g. kernel upgrade,
hardware maintenance, etc.). Safe evictions allow the pod's containers
to gracefully terminate
and will respect the PodDisruptionBudgets you have specified.
Note: By default kubectl drain ignores certain system pods on the node
that cannot be killed; see
the kubectl drain
documentation for more details.
When kubectl drain returns successfully, that indicates that all of
the pods (except the ones excluded as described in the previous paragraph)
have been safely evicted (respecting the desired graceful termination period,
and respecting the PodDisruptionBudget you have defined). It is then safe to
bring down the node by powering down its physical machine or, if running on a
cloud platform, deleting its virtual machine.
First, identify the name of the node you wish to drain. You can list all of the nodes in your cluster with
kubectl get nodes
Next, tell Kubernetes to drain the node:
kubectl drain <node name>
Once it returns (without giving an error), you can power down the node
(or equivalently, if on a cloud platform, delete the virtual machine backing the node).
If you leave the node in the cluster during the maintenance operation, you need to run
kubectl uncordon <node name>
afterwards to tell Kubernetes that it can resume scheduling new pods onto the node.
Draining multiple nodes in parallel
The kubectl drain command should only be issued to a single node at a
time. However, you can run multiple kubectl drain commands for
different nodes in parallel, in different terminals or in the
background. Multiple drain commands running concurrently will still
respect the PodDisruptionBudget you specify.
For example, if you have a StatefulSet with three replicas and have
set a PodDisruptionBudget for that set specifying minAvailable: 2,
kubectl drain only evicts a pod from the StatefulSet if all three
replica pods are ready; if you then issue multiple drain commands in
parallel, Kubernetes respects the PodDisruptionBudget and ensures
that only 1 (calculated as replicas - minAvailable) Pod is unavailable
at any given time. Any drains that would cause the number of ready
replicas to fall below the specified budget are blocked.
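A PodDisruptionBudget for that example might look like the following sketch (the name and label selector are placeholders for whatever your StatefulSet's Pods actually use):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb        # placeholder name
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app          # assumes the StatefulSet's Pods carry this label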
The Eviction API
If you prefer not to use kubectl drain (such as
to avoid calling to an external command, or to get finer control over the pod
eviction process), you can also programmatically cause evictions using the
eviction API.
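As a sketch, an eviction can be requested by POSTing an Eviction object to the Pod's eviction subresource; the pod name, namespace, and use of kubectl proxy here are illustrative:
{
  "apiVersion": "policy/v1",
  "kind": "Eviction",
  "metadata": {
    "name": "my-pod",
    "namespace": "default"
  }
}
kubectl proxy &
curl -v -H 'Content-Type: application/json' \
  http://127.0.0.1:8001/api/v1/namespaces/default/pods/my-pod/eviction \
  -d @eviction.json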
This document covers topics related to protecting a cluster from accidental or malicious access
and provides recommendations on overall security.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
As Kubernetes is entirely API-driven, controlling and limiting who can access the cluster and what actions
they are allowed to perform is the first line of defense.
Use Transport Layer Security (TLS) for all API traffic
Kubernetes expects that all API communication in the cluster is encrypted by default with TLS, and the
majority of installation methods will allow the necessary certificates to be created and distributed to
the cluster components. Note that some components and installation methods may enable local ports over
HTTP and administrators should familiarize themselves with the settings of each component to identify
potentially unsecured traffic.
API Authentication
Choose an authentication mechanism for the API servers to use that matches the common access patterns
when you install a cluster. For instance, small, single-user clusters may wish to use a simple certificate
or static Bearer token approach. Larger clusters may wish to integrate an existing OIDC or LDAP server that
allows users to be subdivided into groups.
All API clients must be authenticated, even those that are part of the infrastructure like nodes,
proxies, the scheduler, and volume plugins. These clients are typically service accounts or use x509 client certificates, and they are created automatically at cluster startup or are set up as part of the cluster installation.
Once authenticated, every API call is also expected to pass an authorization check. Kubernetes ships
an integrated Role-Based Access Control (RBAC) component that matches an incoming user or group to a
set of permissions bundled into roles. These permissions combine verbs (get, create, delete) with
resources (pods, services, nodes) and can be namespace-scoped or cluster-scoped. A set of out-of-the-box
roles are provided that offer reasonable default separation of responsibility depending on what
actions a client might want to perform. It is recommended that you use the
Node and
RBAC authorizers together, in combination with the
NodeRestriction admission plugin.
As with authentication, simple and broad roles may be appropriate for smaller clusters, but as
more users interact with the cluster, it may become necessary to separate teams into separate
namespaces with more limited roles.
With authorization, it is important to understand how updates on one object may cause actions in
other places. For instance, a user may not be able to create pods directly, but allowing them to
create a deployment, which creates pods on their behalf, will let them create those pods
indirectly. Likewise, deleting a node from the API will result in the pods scheduled to that node
being terminated and recreated on other nodes. The out-of-the-box roles represent a balance
between flexibility and common use cases, but more limited roles should be carefully reviewed
to prevent accidental escalation. You can make roles specific to your use case if the out-of-box ones don't meet your needs.
Kubelets expose HTTPS endpoints which grant powerful control over the node and containers. By default, kubelets allow unauthenticated access to this API.
Production clusters should enable Kubelet authentication and authorization.
Controlling the capabilities of a workload or user at runtime
Authorization in Kubernetes is intentionally high level, focused on coarse actions on resources.
More powerful controls exist as policies to limit by use case how those objects act on the
cluster, themselves, and other resources.
Limiting resource usage on a cluster
Resource quota limits the number or capacity of
resources granted to a namespace. This is most often used to limit the amount of CPU, memory,
or persistent disk a namespace can allocate, but can also control how many pods, services, or
volumes exist in each namespace.
Limit ranges restrict the maximum or minimum size of some of the
resources above, to prevent users from requesting unreasonably high or low values for commonly
reserved resources like memory, or to provide default limits when none are specified.
Controlling what privileges containers run with
A pod definition contains a security context
that allows it to request access to run as a specific Linux user on a node (like root),
access to run privileged or access the host network, and other controls that would otherwise
allow it to run unfettered on a hosting node.
Generally, most application workloads need limited access to host resources so they can
successfully run as a root process (uid 0) without access to host information. However,
considering the privileges associated with the root user, you should write application
containers to run as a non-root user. Similarly, administrators who wish to prevent
client applications from escaping their containers should apply the Baseline
or Restricted Pod Security Standard.
Preventing containers from loading unwanted kernel modules
The Linux kernel automatically loads kernel modules from disk if needed in certain
circumstances, such as when a piece of hardware is attached or a filesystem is mounted. Of
particular relevance to Kubernetes, even unprivileged processes can cause certain
network-protocol-related kernel modules to be loaded, just by creating a socket of the
appropriate type. This may allow an attacker to exploit a security hole in a kernel module
that the administrator assumed was not in use.
To prevent specific modules from being automatically loaded, you can uninstall them from
the node, or add rules to block them. On most Linux distributions, you can do that by
creating a file such as /etc/modprobe.d/kubernetes-blacklist.conf with contents like:
# DCCP is unlikely to be needed, has had multiple serious
# vulnerabilities, and is not well-maintained.
blacklist dccp
# SCTP is not used in most Kubernetes clusters, and has also had
# vulnerabilities in the past.
blacklist sctp
To block module loading more generically, you can use a Linux Security Module (such as
SELinux) to completely deny the module_request permission to containers, preventing the
kernel from loading modules for containers under any circumstances. (Pods would still be
able to use modules that had been loaded manually, or modules that were loaded by the
kernel on behalf of some more-privileged process.)
Restricting network access
The network policies for a namespace
allow application authors to restrict which pods in other namespaces may access pods and ports
within their namespaces. Many of the supported Kubernetes networking providers
now respect network policy.
Quota and limit ranges can also be used to control whether users may request node ports or
load-balanced services, which on many clusters can control whether those users' applications
are visible outside of the cluster.
Additional protections may be available that control network rules on a per-plugin or
per-environment basis, such as per-node firewalls, physically separating cluster nodes to
prevent cross talk, or advanced networking policy.
Restricting cloud metadata API access
Cloud platforms (AWS, Azure, GCE, etc.) often expose metadata services locally to instances.
By default these APIs are accessible by pods running on an instance and can contain cloud
credentials for that node, or provisioning data such as kubelet credentials. These credentials
can be used to escalate within the cluster or to other cloud services under the same account.
When running Kubernetes on a cloud platform, limit permissions given to instance credentials, use
network policies to restrict pod access
to the metadata API, and avoid using provisioning data to deliver secrets.
As an administrator, you can use the beta admission plugin PodNodeSelector to force pods
within a namespace to default to or require a specific node selector. If end users cannot
alter namespaces, this can strongly limit the placement of all of the pods in a specific workload.
Protecting cluster components from compromise
This section describes some common patterns for protecting clusters from compromise.
Restrict access to etcd
Write access to the etcd backend for the API is equivalent to gaining root on the entire cluster,
and read access can be used to escalate fairly quickly. Administrators should always use strong
credentials from the API servers to their etcd server, such as mutual auth via TLS client certificates,
and it is often recommended to isolate the etcd servers behind a firewall that only the API servers
may access.
Caution: Allowing other components within the cluster to access the master etcd instance with
read or write access to the full keyspace is equivalent to granting cluster-admin access. Using
separate etcd instances for non-master components or using etcd ACLs to restrict read and write
access to a subset of the keyspace is strongly recommended.
Enable audit logging
The audit logger is a beta feature that records actions taken by the
API for later analysis in the event of a compromise. It is recommended to enable audit logging
and archive the audit file on a secure server.
Restrict access to alpha or beta features
Alpha and beta Kubernetes features are in active development and may have limitations or bugs
that result in security vulnerabilities. Always assess the value an alpha or beta feature may
provide against the possible risk to your security posture. When in doubt, disable features you
do not use.
Rotate infrastructure credentials frequently
The shorter the lifetime of a secret or credential the harder it is for an attacker to make
use of that credential. Set short lifetimes on certificates and automate their rotation. Use
an authentication provider that can control how long issued tokens are available and use short
lifetimes where possible. If you use service-account tokens in external integrations, plan to
rotate those tokens frequently. For example, once the bootstrap phase is complete, a bootstrap
token used for setting up nodes should be revoked or its authorization removed.
Review third party integrations before enabling them
Many third party integrations to Kubernetes may alter the security profile of your cluster. When
enabling an integration, always review the permissions that an extension requests before granting
it access. For example, many security integrations may request access to view all secrets on
your cluster, which effectively makes that component a cluster admin. When in doubt,
restrict the integration to functioning in a single namespace if possible.
Components that create pods may also be unexpectedly powerful if they can do so inside namespaces
like the kube-system namespace, because those pods can gain access to service account secrets
or run with elevated permissions if those service accounts are granted access to permissive
PodSecurityPolicies.
If you use Pod Security admission and allow
any component to create Pods within a namespace that permits privileged Pods, those Pods may
be able to escape their containers and use this widened access to elevate their privileges.
You should not allow untrusted components to create Pods in any system namespace (those with
names that start with kube-) nor in any namespace where that access grant allows the possibility
of privilege escalation.
Encrypt secrets at rest
In general, the etcd database will contain any information accessible via the Kubernetes API
and may grant an attacker significant visibility into the state of your cluster. Always encrypt
your backups using a well reviewed backup and encryption solution, and consider using full disk
encryption where possible.
Kubernetes supports encryption at rest, a feature
introduced in 1.7, and beta since 1.13. This will encrypt Secret resources in etcd, preventing
parties that gain access to your etcd backups from viewing the content of those secrets. While
this feature is currently beta, it offers an additional level of defense when backups
are not encrypted or an attacker gains read access to etcd.
Receiving alerts for security updates and reporting vulnerabilities
Join the kubernetes-announce
group for emails about security announcements. See the
security reporting
page for more on how to report vulnerabilities.
2.33 - Set Kubelet parameters via a config file
A subset of the Kubelet's configuration parameters may be
set via an on-disk config file, as a substitute for command-line flags.
Providing parameters via a config file is the recommended approach because
it simplifies node deployment and configuration management.
Create the config file
The subset of the Kubelet's configuration that can be configured via a file
is defined by the
KubeletConfiguration
struct.
The configuration file must be a JSON or YAML representation of the parameters
in this struct. Make sure the Kubelet has read permissions on the file.
Here is an example of what this file might look like (a minimal sketch matching the description below):
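apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
address: "192.168.0.8"
port: 20250
serializeImagePulls: false
evictionHard:
  memory.available: "200Mi"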
In the example, the Kubelet is configured to serve on IP address 192.168.0.8 and port 20250, pull images in parallel,
and evict Pods when available memory drops below 200Mi.
All other Kubelet configuration values are left at their built-in defaults, unless overridden
by flags. Command line flags which target the same value as a config file will override that value.
Start a Kubelet process configured via the config file
Note: If you use kubeadm to initialize your cluster, use the kubelet-config while creating your cluster with kubeadm init.
See configuring kubelet using kubeadm for details.
Start the Kubelet with the --config flag set to the path of the Kubelet's config file.
The Kubelet will then load its config from this file.
Note that command line flags which target the same value as a config file will override that value.
This helps ensure backwards compatibility with the command-line API.
Note that relative file paths in the Kubelet config file are resolved relative to the
location of the Kubelet config file, whereas relative paths in command line flags are resolved
relative to the Kubelet's current working directory.
Note that some default values differ between command-line flags and the Kubelet config file.
If --config is provided and the values are not specified via the command line, the
defaults for the KubeletConfiguration version apply.
In the above example, this version is kubelet.config.k8s.io/v1beta1.
What's next
Learn more about kubelet configuration by checking the
KubeletConfiguration
reference.
2.34 - Share a Cluster with Namespaces
This page shows how to view, work in, and delete namespaces. The page also shows how to use Kubernetes namespaces to subdivide your cluster.
You can list the current namespaces in a cluster using:
kubectl get namespaces
NAME STATUS AGE
default Active 11d
kube-system Active 11d
kube-public Active 11d
Kubernetes starts with three initial namespaces:
default The default namespace for objects with no other namespace
kube-system The namespace for objects created by the Kubernetes system
kube-public This namespace is created automatically and is readable by all users (including those not authenticated). This namespace is mostly reserved for cluster usage, in case that some resources should be visible and readable publicly throughout the whole cluster. The public aspect of this namespace is only a convention, not a requirement.
You can also get the summary of a specific namespace using:
kubectl get namespaces <name>
Or you can get detailed information with:
kubectl describe namespaces <name>
Name: default
Labels: <none>
Annotations: <none>
Status: Active
No resource quota.
Resource Limits
Type Resource Min Max Default
---- -------- --- --- ---
Container cpu - - 100m
Note that these details show both resource quota (if present) as well as resource limit ranges.
Resource quota tracks aggregate usage of resources in the Namespace and allows cluster operators
to define Hard resource usage limits that a Namespace may consume.
A limit range defines min/max constraints on the amount of resources a single entity can consume in
a Namespace.
The name of your namespace must be a valid
DNS label.
There's an optional field finalizers, which allows observables to purge resources whenever the namespace is deleted. Keep in mind that if you specify a nonexistent finalizer, the namespace will be created but will get stuck in the Terminating state if the user tries to delete it.
More information on finalizers can be found in the namespace design doc.
Warning: This deletes everything under the namespace!
This delete is asynchronous, so for a time you will see the namespace in the Terminating state.
Subdividing your cluster using Kubernetes namespaces
Understand the default namespace
By default, a Kubernetes cluster will instantiate a default namespace when provisioning the cluster to hold the default set of Pods,
Services, and Deployments used by the cluster.
Assuming you have a fresh cluster, you can introspect the available namespaces by doing the following:
kubectl get namespaces
NAME STATUS AGE
default Active 13m
Create new namespaces
For this exercise, we will create two additional Kubernetes namespaces to hold our content.
In a scenario where an organization is using a shared Kubernetes cluster for development and production use cases:
The development team would like to maintain a space in the cluster where they can get a view on the list of Pods, Services, and Deployments
they use to build and run their application. In this space, Kubernetes resources come and go, and the restrictions on who can or cannot modify resources
are relaxed to enable agile development.
The operations team would like to maintain a space in the cluster where they can enforce strict procedures on who can or cannot manipulate the set of
Pods, Services, and Deployments that run the production site.
One pattern this organization could follow is to partition the Kubernetes cluster into two namespaces: development and production.
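For example, the two namespaces could be created directly with kubectl (or from equivalent Namespace manifests):
kubectl create namespace development
kubectl create namespace production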
NAME READY UP-TO-DATE AVAILABLE AGE
cattle 5/5 5 5 10s
kubectl get pods -l app=cattle -n=production
NAME READY STATUS RESTARTS AGE
cattle-2263376956-41xy6 1/1 Running 0 34s
cattle-2263376956-kw466 1/1 Running 0 34s
cattle-2263376956-n4v97 1/1 Running 0 34s
cattle-2263376956-p5p3i 1/1 Running 0 34s
cattle-2263376956-sxpth 1/1 Running 0 34s
At this point, it should be clear that the resources users create in one namespace are hidden from the other namespace.
As the policy support in Kubernetes evolves, we will extend this scenario to show how you can provide different
authorization rules for each namespace.
Understanding the motivation for using namespaces
A single cluster should be able to satisfy the needs of multiple users or groups of users (henceforth a 'user community').
Kubernetes namespaces help different projects, teams, or customers to share a Kubernetes cluster.
policies (who can or cannot perform actions in their community)
constraints (this community is allowed this much quota, etc.)
A cluster operator may create a Namespace for each unique user community.
The Namespace provides a unique scope for:
named resources (to avoid basic naming collisions)
delegated management authority to trusted users
ability to limit community resource consumption
Use cases include:
As a cluster operator, I want to support multiple user communities on a single cluster.
As a cluster operator, I want to delegate authority to partitions of the cluster to trusted users
in those communities.
As a cluster operator, I want to limit the amount of resources each community can consume in order
to limit the impact to other communities using the cluster.
As a cluster user, I want to interact with resources that are pertinent to my user community in
isolation of what other user communities are doing on the cluster.
Understanding namespaces and DNS
When you create a Service, it creates a corresponding DNS entry.
This entry is of the form <service-name>.<namespace-name>.svc.cluster.local, which means
that if a container uses <service-name> it will resolve to the service which
is local to a namespace. This is useful for using the same configuration across
multiple namespaces such as Development, Staging and Production. If you want to reach
across namespaces, you need to use the fully qualified domain name (FQDN).
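As an illustration, from a Pod in the development namespace (the Service name db is hypothetical):
nslookup db              # resolves db.development.svc.cluster.local
nslookup db.production   # reaches the db Service in the production namespace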
Adjust manifests and other resources based on the API changes that accompany the
new Kubernetes version
Before you begin
You must have an existing cluster. This page is about upgrading from Kubernetes
1.23 to Kubernetes 1.24. If your cluster
is not currently running Kubernetes 1.23 then please check
the documentation for the version of Kubernetes that you plan to upgrade to.
Upgrade approaches
kubeadm
If your cluster was deployed using the kubeadm tool, refer to
Upgrading kubeadm clusters
for detailed information on how to upgrade the cluster.
For each node in your cluster, drain
that node and then either replace it with a new node that uses the 1.24
kubelet, or upgrade the kubelet on that node and bring the node back into service.
Other deployments
Refer to the documentation for your cluster deployment tool to learn the recommended
setup steps for maintenance.
Post-upgrade tasks
Switch your cluster's storage API version
The objects that are serialized into etcd for a cluster's internal
representation of the Kubernetes resources active in the cluster are
written using a particular version of the API.
When the supported API changes, these objects may need to be rewritten
in the newer API. Failure to do this will eventually result in resources
that are no longer decodable or usable by the Kubernetes API server.
For each affected object, fetch it using the latest supported API and then
write it back also using the latest supported API.
Update manifests
Upgrading to a new Kubernetes version can provide new APIs.
You can use the kubectl convert command to convert manifests between different API versions.
For example:
kubectl convert -f pod.yaml --output-version v1
The kubectl tool replaces the contents of pod.yaml with a manifest that sets kind to
Pod (unchanged), but with a revised apiVersion.
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
You also need to create a sample Deployment
to experiment with the different types of cascading deletion. You will need to
recreate the Deployment for each type.
Check owner references on your pods
Check that the ownerReferences field is present on your pods:
kubectl get pods -l app=nginx --output=yaml
The output has an ownerReferences field similar to this (the ReplicaSet name and uid below are placeholders):
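ownerReferences:
- apiVersion: apps/v1
  blockOwnerDeletion: true
  controller: true
  kind: ReplicaSet
  name: nginx-deployment-6b474476c4      # placeholder ReplicaSet name
  uid: 517c2e63-1cb4-42d7-b1d2-0123456789ab   # placeholder uid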
By default, Kubernetes uses background cascading deletion
to delete dependents of an object. You can switch to foreground cascading deletion
using either kubectl or the Kubernetes API, depending on the Kubernetes
version your cluster runs.
To check the version, enter kubectl version.
Use either kubectl or the Kubernetes API to delete the Deployment,
depending on the Kubernetes version your cluster runs.
To check the version, enter kubectl version.
You can delete objects with background cascading deletion using kubectl
or the Kubernetes API.
Kubernetes uses background cascading deletion by default, and does so
even if you request a deletion without the --cascade flag (for kubectl) or the
propagationPolicy: Background argument (for the API).
By default, when you tell Kubernetes to delete an object, the
controller also deletes
dependent objects. You can make Kubernetes orphan these dependents using
kubectl or the Kubernetes API, depending on the Kubernetes version your
cluster runs.
To check the version, enter kubectl version.
This page shows how to configure a Key Management Service (KMS) provider and plugin to enable secret data encryption.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
The KMS encryption provider uses an envelope encryption scheme to encrypt data in etcd. The data is encrypted using a data encryption key (DEK); a new DEK is generated for each encryption. The DEKs are encrypted with a key encryption key (KEK) that is stored and managed in a remote KMS. The KMS provider uses gRPC to communicate with a specific KMS
plugin. The KMS plugin, which is implemented as a gRPC server and deployed on the same host(s) as the Kubernetes master(s), is responsible for all communication with the remote KMS.
Configuring the KMS provider
To configure a KMS provider on the API server, include a provider of type kms in the providers array in the encryption configuration file and set the following properties:
name: Display name of the KMS plugin.
endpoint: Listen address of the gRPC server (KMS plugin). The endpoint is a UNIX domain socket.
cachesize: Number of data encryption keys (DEKs) to be cached in the clear.
When cached, DEKs can be used without another call to the KMS;
whereas DEKs that are not cached require a call to the KMS to unwrap.
timeout: How long kube-apiserver should wait for the KMS plugin to respond before returning an error (default is 3 seconds).
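A sketch of such a configuration file (the plugin name and socket path are placeholders):
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - kms:
          name: myKmsPlugin                      # placeholder plugin name
          endpoint: unix:///tmp/socketfile.sock  # placeholder socket path
          cachesize: 100
          timeout: 3s
      - identity: {}   # fallback so existing plaintext data can still be read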
To implement a KMS plugin, you can develop a new plugin gRPC server or enable a KMS plugin already provided by your cloud provider. You then integrate the plugin with the remote KMS and deploy it on the Kubernetes master.
Enabling the KMS supported by your cloud provider
Refer to your cloud provider for instructions on enabling the cloud provider-specific KMS plugin.
Developing a KMS plugin gRPC server
You can develop a KMS plugin gRPC server using a stub file available for Go. For other languages, you use a proto file to create a stub file that you can use to develop the gRPC server code.
Using Go: Use the functions and data structures in the stub file: service.pb.go to develop the gRPC server code
Using languages other than Go: Use the protoc compiler with the proto file: service.proto to generate a stub file for the specific language
Then use the functions and data structures in the stub file to develop the server code.
Notes:
kms plugin version: v1beta1
In response to procedure call Version, a compatible KMS plugin should return v1beta1 as VersionResponse.version.
message version: v1beta1
All messages from KMS provider have the version field set to current version v1beta1.
protocol: UNIX domain socket (unix)
The gRPC server should listen at a UNIX domain socket.
Integrating a KMS plugin with the remote KMS
The KMS plugin can communicate with the remote KMS using any protocol supported by the KMS.
All configuration data, including authentication credentials the KMS plugin uses to communicate with the remote KMS,
are stored and managed by the KMS plugin independently. The KMS plugin can encode the ciphertext with additional metadata that may be required before sending it to the KMS for decryption.
Deploying the KMS plugin
Ensure that the KMS plugin runs on the same host(s) as the Kubernetes master(s).
Encrypting your data with the KMS provider
To encrypt the data:
Create a new encryption configuration file using the appropriate properties for the kms provider:
Set the --encryption-provider-config flag on the kube-apiserver to point to the location of the configuration file.
Restart your API server.
Verifying that the data is encrypted
Data is encrypted when written to etcd. After restarting your kube-apiserver,
any newly created or updated secret should be encrypted when stored. To verify,
you can use the etcdctl command line program to retrieve the contents of your secret.
Create a new secret called secret1 in the default namespace:
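For example (the key and value match what is checked later in this task):
kubectl create secret generic secret1 -n default --from-literal=mykey=mydata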
Using the etcdctl command line, read that secret out of etcd:
ETCDCTL_API=3 etcdctl get /kubernetes.io/secrets/default/secret1 [...] | hexdump -C
where [...] must be the additional arguments for connecting to the etcd server.
Verify the stored secret is prefixed with k8s:enc:kms:v1:, which indicates that the kms provider has encrypted the resulting data.
Verify that the secret is correctly decrypted when retrieved via the API:
kubectl describe secret secret1 -n default
should match mykey: mydata
Ensuring all secrets are encrypted
Because secrets are encrypted on write, performing an update on a secret encrypts that content.
The following command reads all secrets and then updates them to apply server side encryption.
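One way to do this is to read every secret and write it back unchanged, for example:
kubectl get secrets --all-namespaces -o json | kubectl replace -f -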
If an error occurs due to a conflicting write, retry the command.
For larger clusters, you may wish to subdivide the secrets by namespace or script an update.
This page describes the CoreDNS upgrade process and how to install CoreDNS instead of kube-dns.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version v1.9.
To check the version, enter kubectl version.
About CoreDNS
CoreDNS is a flexible, extensible DNS server
that can serve as the Kubernetes cluster DNS.
Like Kubernetes, the CoreDNS project is hosted by the
CNCF.
You can use CoreDNS instead of kube-dns in your cluster by replacing
kube-dns in an existing deployment, or by using tools like kubeadm
that will deploy and upgrade the cluster for you.
Installing CoreDNS
For manual deployment or replacement of kube-dns, see the documentation at the
CoreDNS GitHub project.
Migrating to CoreDNS
Upgrading an existing cluster with kubeadm
In Kubernetes version 1.21, kubeadm removed its support for kube-dns as a DNS application.
For kubeadm v1.24, the only supported cluster DNS application
is CoreDNS.
You can move to CoreDNS when you use kubeadm to upgrade a cluster that is
using kube-dns. In this case, kubeadm generates the CoreDNS configuration
("Corefile") based upon the kube-dns ConfigMap, preserving configurations for
stub domains and upstream name servers.
Upgrading CoreDNS
You can check the version of CoreDNS that kubeadm installs for each version of
Kubernetes in the page
CoreDNS version in Kubernetes.
CoreDNS can be upgraded manually in case you want to only upgrade CoreDNS
or use your own custom image.
There is a helpful guideline and walkthrough
available to ensure a smooth upgrade.
Make sure the existing CoreDNS configuration ("Corefile") is retained when
upgrading your cluster.
If you are upgrading your cluster using the kubeadm tool, kubeadm
can take care of retaining the existing CoreDNS configuration automatically.
Tuning CoreDNS
When resource utilisation is a concern, it may be useful to tune the
configuration of CoreDNS. For more details, check out the
documentation on scaling CoreDNS.
What's next
You can configure CoreDNS to support many more use cases than
kube-dns does by modifying the CoreDNS configuration ("Corefile").
For more information, see the documentation
for the kubernetes CoreDNS plugin, or read the
Custom DNS Entries for Kubernetes post in the CoreDNS blog.
2.39 - Using NodeLocal DNSCache in Kubernetes clusters
FEATURE STATE: Kubernetes v1.18 [stable]
This page provides an overview of NodeLocal DNSCache feature in Kubernetes.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
NodeLocal DNSCache improves Cluster DNS performance by running a DNS caching agent
on cluster nodes as a DaemonSet. In today's architecture, Pods in 'ClusterFirst' DNS mode
reach out to a kube-dns serviceIP for DNS queries. This is translated to a
kube-dns/CoreDNS endpoint via iptables rules added by kube-proxy.
With this new architecture, Pods will reach out to the DNS caching agent
running on the same node, thereby avoiding iptables DNAT rules and connection tracking.
The local caching agent will query kube-dns service for cache misses of cluster
hostnames ("cluster.local" suffix by default).
Motivation
With the current DNS architecture, it is possible that Pods with the highest DNS QPS
have to reach out to a different node, if there is no local kube-dns/CoreDNS instance.
Having a local cache will help improve the latency in such scenarios.
Skipping iptables DNAT and connection tracking will help reduce
conntrack races
and avoid UDP DNS entries filling up conntrack table.
Connections from the local caching agent to the kube-dns service can be upgraded to TCP.
TCP conntrack entries will be removed on connection close, in contrast with
UDP entries that have to time out
(default nf_conntrack_udp_timeout is 30 seconds).
Upgrading DNS queries from UDP to TCP would reduce tail latency attributed to
dropped UDP packets and DNS timeouts usually up to 30s (3 retries + 10s timeout).
Since the nodelocal cache listens for UDP DNS queries, applications don't need to be changed.
Metrics & visibility into DNS requests at a node level.
Negative caching can be re-enabled, thereby reducing the number of queries to the kube-dns service.
Architecture Diagram
This is the path followed by DNS queries after NodeLocal DNSCache is enabled (see the NodeLocal DNSCache flow diagram, which shows how NodeLocal DNSCache handles DNS queries).
Configuration
Note: The local listen IP address for NodeLocal DNSCache can be any address that
can be guaranteed to not collide with any existing IP in your cluster.
It's recommended to use an address with a local scope, for example,
from the 'link-local' range '169.254.0.0/16' for IPv4 or from the
'Unique Local Address' range 'fd00::/8' for IPv6.
This feature can be enabled using the following steps:
Prepare a manifest similar to the sample
nodelocaldns.yaml
and save it as nodelocaldns.yaml.
If using IPv6, the CoreDNS configuration file needs to enclose all the IPv6 addresses
in square brackets when used in 'IP:Port' format.
If you are using the sample manifest from the previous point, you will need to modify
the configuration line L70
like this: "health [__PILLAR__LOCAL__DNS__]:8080"
Substitute the variables in the manifest with the right values:
kubedns=`kubectl get svc kube-dns -n kube-system -o jsonpath={.spec.clusterIP}`
domain=<cluster-domain>
localdns=<node-local-address>
<cluster-domain> is "cluster.local" by default. <node-local-address> is the
local listen IP address chosen for NodeLocal DNSCache.
If kube-proxy is running in IPTABLES mode:
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml
__PILLAR__CLUSTER__DNS__ and __PILLAR__UPSTREAM__SERVERS__ will be populated by
the node-local-dns pods.
In this mode, the node-local-dns pods listen on both the kube-dns service IP
as well as <node-local-address>, so pods can lookup DNS records using either IP address.
If kube-proxy is running in IPVS mode:
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/,__PILLAR__DNS__SERVER__//g; s/__PILLAR__CLUSTER__DNS__/$kubedns/g" nodelocaldns.yaml
In this mode, the node-local-dns pods listen only on <node-local-address>.
The node-local-dns interface cannot bind the kube-dns cluster IP since the
interface used for IPVS loadbalancing already uses this address.
__PILLAR__UPSTREAM__SERVERS__ will be populated by the node-local-dns pods.
Run kubectl create -f nodelocaldns.yaml
If using kube-proxy in IPVS mode, the --cluster-dns flag of the kubelet needs to be modified
to use the <node-local-address> that NodeLocal DNSCache is listening on.
Otherwise, there is no need to modify the value of the --cluster-dns flag,
since NodeLocal DNSCache listens on both the kube-dns service IP as well as
<node-local-address>.
Once enabled, the node-local-dns Pods will run in the kube-system namespace
on each of the cluster nodes. This Pod runs CoreDNS
in cache mode, so all CoreDNS metrics exposed by the different plugins will
be available on a per-node basis.
You can disable this feature by removing the DaemonSet, using kubectl delete -f <manifest>.
You should also revert any changes you made to the kubelet configuration.
StubDomains and Upstream server Configuration
StubDomains and upstream servers specified in the kube-dns ConfigMap in the kube-system namespace
are automatically picked up by node-local-dns pods. The ConfigMap contents need to follow the format
shown in the example.
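A sketch of that ConfigMap format (the domain names and IP addresses are illustrative):
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  stubDomains: |
    {"example.org": ["10.150.0.1"]}
  upstreamNameservers: |
    ["8.8.8.8", "8.8.4.4"]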
The node-local-dns ConfigMap can also be modified directly with the stubDomain configuration
in the Corefile format. Some cloud providers might not allow modifying node-local-dns ConfigMap directly.
In those cases, the kube-dns ConfigMap can be updated.
Setting memory limits
The node-local-dns Pods use memory for storing cache entries and processing queries.
Since they do not watch Kubernetes objects, the cluster size or the number of Services/Endpoints
do not directly affect memory usage. Memory usage is influenced by the DNS query pattern.
From CoreDNS docs,
The default cache size is 10000 entries, which uses about 30 MB when completely filled.
This would be the memory usage for each server block (if the cache gets completely filled).
Memory usage can be reduced by specifying smaller cache sizes.
The number of concurrent queries is linked to the memory demand, because each extra
goroutine used for handling a query requires an amount of memory. You can set an upper limit
using the max_concurrent option in the forward plugin.
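As a sketch, the forward block in the node-local-dns Corefile could include such a limit; the surrounding server block is abbreviated and the value 1000 is only an example:
cluster.local:53 {
    cache 10000
    forward . __PILLAR__CLUSTER__DNS__ {
        force_tcp
        max_concurrent 1000
    }
}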
If a node-local-dns Pod attempts to use more memory than is available (because of total system
resources, or because of a configured
resource limit), the operating system
may shut down that pod's container.
If this happens, the container that is terminated (“OOMKilled”) does not clean up the custom
packet filtering rules that it previously added during startup.
The node-local-dns container should get restarted (since it is managed as part of a DaemonSet), but this
will lead to a brief DNS downtime each time that the container fails: the packet filtering rules direct
DNS queries to a local Pod that is unhealthy.
You can determine a suitable memory limit by running node-local-dns pods without a limit and
measuring the peak usage. You can also set up and use a
VerticalPodAutoscaler
in recommender mode, and then check its recommendations.
2.40 - Using sysctls in a Kubernetes Cluster
FEATURE STATE: Kubernetes v1.21 [stable]
This document describes how to configure and use kernel parameters within a
Kubernetes cluster using the sysctl
interface.
Note: Starting from Kubernetes version 1.23, the kubelet supports the use of either / or .
as separators for sysctl names.
For example, you can represent the same sysctl name as kernel.shm_rmid_forced using a
period as the separator, or as kernel/shm_rmid_forced using a slash as a separator.
For more sysctl parameter conversion method details, please refer to
the page sysctl.d(5) from
the Linux man-pages project.
Setting Sysctls for a Pod and PodSecurityPolicy features do not yet support
setting sysctls with slashes.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
For some steps, you also need to be able to reconfigure the command line
options for the kubelets running on your cluster.
Listing all Sysctl Parameters
In Linux, the sysctl interface allows an administrator to modify kernel
parameters at runtime. Parameters are available via the /proc/sys/ virtual
process file system. The parameters cover various subsystems such as the kernel (common prefix: kernel.),
networking (common prefix: net.), virtual memory (common prefix: vm.), and device parameters (common prefix: dev.).
Sysctls are grouped into safe and unsafe sysctls. In addition to proper
namespacing, a safe sysctl must be properly isolated between pods on the
same node. This means that setting a safe sysctl for one pod
must not have any influence on any other pod on the node
must not allow harming the node's health
must not allow gaining CPU or memory resources outside of the resource limits
of a pod.
By far, most of the namespaced sysctls are not necessarily considered safe.
The following sysctls are supported in the safe set:
Note: The example net.ipv4.tcp_syncookies is not namespaced on Linux kernel version 4.4 or lower.
This list will be extended in future Kubernetes versions when the kubelet
supports better isolation mechanisms.
All safe sysctls are enabled by default.
All unsafe sysctls are disabled by default and must be allowed manually by the
cluster admin on a per-node basis. Pods with disabled unsafe sysctls will be
scheduled, but will fail to launch.
With the warning above in mind, the cluster admin can allow certain unsafe
sysctls for very special situations such as high-performance or real-time
application tuning. Unsafe sysctls are enabled on a node-by-node basis with a
flag of the kubelet; for example:
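For example (the sysctl names are illustrative; the trailing dots stand for the rest of the kubelet's flags):
kubelet --allowed-unsafe-sysctls 'kernel.msg*,net.core.somaxconn' ...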
A number of sysctls are namespaced in today's Linux kernels. This means that
they can be set independently for each pod on a node. Only namespaced sysctls
are configurable via the pod securityContext within Kubernetes.
The following sysctls are known to be namespaced. This list could change
in future versions of the Linux kernel.
kernel.shm*,
kernel.msg*,
kernel.sem,
fs.mqueue.*,
The parameters under net.* that can be set in a container's networking
namespace. However, there are exceptions (for example,
net.netfilter.nf_conntrack_max and net.netfilter.nf_conntrack_expect_max
can be set in a container's networking namespace but are unnamespaced).
Sysctls with no namespace are called node-level sysctls. If you need to set
them, you must manually configure them on each node's operating system, or by
using a DaemonSet with privileged containers.
Use the pod securityContext to configure namespaced sysctls. The securityContext
applies to all containers in the same pod.
This example uses the pod securityContext to set a safe sysctl
kernel.shm_rmid_forced and two unsafe sysctls net.core.somaxconn and
kernel.msgmax. There is no distinction between safe and unsafe sysctls in
the specification.
Warning: Only modify sysctl parameters after you understand their effects, to avoid
destabilizing your operating system.
Warning: Due to their nature of being unsafe, the use of unsafe sysctls
is at-your-own-risk and can lead to severe problems like wrong behavior of
containers, resource shortage or complete breakage of a node.
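A minimal sketch of what such a Pod manifest could look like (the values shown are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
spec:
  securityContext:
    sysctls:
    - name: kernel.shm_rmid_forced
      value: "0"
    - name: net.core.somaxconn
      value: "1024"
    - name: kernel.msgmax
      value: "65536"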
It is good practice to consider nodes with special sysctl settings as
tainted within a cluster, and only schedule pods onto them which need those
sysctl settings. It is suggested to use the Kubernetes taints and toleration
feature to implement this.
A pod with the unsafe sysctls will fail to launch on any node which has not
enabled those two unsafe sysctls explicitly. As with node-level sysctls, it is
recommended to use the
taints and tolerations feature or
taints on nodes
to schedule those pods onto the right nodes.
PodSecurityPolicy
FEATURE STATE: Kubernetes v1.21 [deprecated]
You can further control which sysctls can be set in pods by specifying lists of
sysctls or sysctl patterns in the forbiddenSysctls and/or
allowedUnsafeSysctls fields of the PodSecurityPolicy. A sysctl pattern ends
with a * character, such as kernel.*. A * character on its own matches
all sysctls.
By default, all safe sysctls are allowed.
Both forbiddenSysctls and allowedUnsafeSysctls are lists of plain sysctl names
or sysctl patterns (which end with *). The string * matches all sysctls.
The forbiddenSysctls field excludes specific sysctls. You can forbid a
combination of safe and unsafe sysctls in the list. To forbid setting any
sysctls, use * on its own.
If you specify any unsafe sysctl in the allowedUnsafeSysctls field and it is
not present in the forbiddenSysctls field, that sysctl can be used in Pods
using this PodSecurityPolicy. To allow all unsafe sysctls in the
PodSecurityPolicy to be set, use * on its own.
Do not configure these two fields such that there is overlap, meaning that a
given sysctl is both allowed and forbidden.
Warning: If you allow unsafe sysctls via the allowedUnsafeSysctls field
in a PodSecurityPolicy, any pod using such a sysctl will fail to start
if the sysctl is not allowed via the --allowed-unsafe-sysctls kubelet
flag as well on that node.
This example allows unsafe sysctls prefixed with kernel.msg to be set and
disallows setting of the kernel.shm_rmid_forced sysctl.
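A sketch of such a PodSecurityPolicy, showing only the sysctl-related fields (a real PodSecurityPolicy also needs its other mandatory spec fields):
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: sysctl-psp
spec:
  allowedUnsafeSysctls:
  - kernel.msg*
  forbiddenSysctls:
  - kernel.shm_rmid_forced
  # ... other required PodSecurityPolicy fields omitted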
The Kubernetes Memory Manager enables the feature of guaranteed memory (and hugepages)
allocation for pods in the Guaranteed QoS class.
The Memory Manager employs a hint generation protocol to yield the most suitable NUMA affinity for a pod.
The Memory Manager feeds the central manager (Topology Manager) with these affinity hints.
Based on both the hints and Topology Manager policy, the pod is rejected or admitted to the node.
Moreover, the Memory Manager ensures that the memory which a pod requests
is allocated from a minimum number of NUMA nodes.
The Memory Manager is only pertinent to Linux based hosts.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version v1.21.
To check the version, enter kubectl version.
To align memory resources with other requested resources in a Pod spec:
the CPU Manager should be enabled and proper CPU Manager policy should be configured on a Node.
See control CPU Management Policies;
the Topology Manager should be enabled and proper Topology Manager policy should be configured on a Node.
See control Topology Management Policies.
Starting from v1.22, the Memory Manager is enabled by default through the MemoryManager feature gate.
Before v1.22, the kubelet must be started with the following flag:
--feature-gates=MemoryManager=true
in order to enable the Memory Manager feature.
How the Memory Manager operates
The Memory Manager currently offers the guaranteed memory (and hugepages) allocation
for Pods in Guaranteed QoS class.
To immediately put the Memory Manager into operation follow the guidelines in the section
Memory Manager configuration, and subsequently,
prepare and deploy a Guaranteed pod as illustrated in the section
Placing a Pod in the Guaranteed QoS class.
The Memory Manager is a Hint Provider, and it provides topology hints for
the Topology Manager which then aligns the requested resources according to these topology hints.
It also enforces cgroups (i.e. cpuset.mems) for pods.
The complete flow diagram concerning the pod admission and deployment process is illustrated in
Memory Manager KEP: Design Overview.
During this process, the Memory Manager updates its internal counters stored in
Node Map and Memory Maps to manage guaranteed memory allocation.
The Memory Manager updates the Node Map during startup and at runtime, as follows.
Startup
The administrator must provide the --reserved-memory flag when the Static policy is configured;
the Node Map is updated during startup to reflect this reservation.
Runtime
The reference Memory Manager KEP: Memory Maps at runtime (with examples) illustrates
how a successful pod deployment affects the Node Map, and it also relates to
how potential Out-of-Memory (OOM) situations are handled further by Kubernetes or the operating system.
Memory Manager configuration
Other managers should be pre-configured first. Next, the Memory Manager feature should be enabled
and run with the Static policy (section Static policy).
Optionally, some amount of memory can be reserved for system or kubelet processes to increase
node stability (section Reserved memory flag).
Policies
The Memory Manager supports two policies. You can select a policy via the kubelet flag --memory-manager-policy:
None (default)
Static
None policy
This is the default policy and does not affect the memory allocation in any way.
It acts the same as if the Memory Manager is not present at all.
The None policy returns a default topology hint. This special hint denotes that the Hint Provider
(the Memory Manager in this case) has no preference for NUMA affinity with any resource.
Static policy
In the case of the Guaranteed pod, the Static Memory Manager policy returns topology hints
relating to the set of NUMA nodes where the memory can be guaranteed,
and reserves the memory through updating the internal NodeMap object.
In the case of the BestEffort or Burstable pod, the Static Memory Manager policy sends back
the default topology hint as there is no request for the guaranteed memory,
and does not reserve the memory in the internal NodeMap object.
Reserved memory flag
The Node Allocatable mechanism
is commonly used by node administrators to reserve Kubernetes node system resources for the kubelet
or operating system processes in order to enhance node stability.
A dedicated set of flags can be used for this purpose to set the total amount of reserved memory
for a node. This pre-configured value is subsequently utilized to calculate
the real amount of node's "allocatable" memory available to pods.
The Kubernetes scheduler incorporates "allocatable" to optimize the pod scheduling process.
These flags include --kube-reserved, --system-reserved and --eviction-hard (the hard eviction threshold).
The sum of their values will account for the total amount of reserved memory.
A new --reserved-memory flag was added to Memory Manager to allow for this total reserved memory
to be split (by a node administrator) and accordingly reserved across many NUMA nodes.
The flag specifies a comma-separated list of memory reservations of different memory types per NUMA node.
Memory reservations across multiple NUMA nodes can be specified using semicolon as separator.
This parameter is only useful in the context of the Memory Manager feature.
The Memory Manager will not use this reserved memory for the allocation of container workloads.
For example, if you have a NUMA node "NUMA0" with 10Gi of memory available, and
the --reserved-memory was specified to reserve 1Gi of memory at "NUMA0",
the Memory Manager assumes that only 9Gi is available for containers.
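As a sketch of the flag syntax (the amounts, memory types, and NUMA node IDs are illustrative):
--reserved-memory '0:memory=1Gi,hugepages-1Gi=2Gi;1:memory=2Gi'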
You can omit this parameter, however, you should be aware that the quantity of reserved memory
from all NUMA nodes should be equal to the quantity of memory specified by the
Node Allocatable feature.
If at least one node allocatable parameter is non-zero, you will need to specify
--reserved-memory for at least one NUMA node.
In fact, the hard eviction threshold is equal to 100Mi by default, so
if the Static policy is used, --reserved-memory is obligatory.
Also, avoid the following configurations:
duplicates, i.e. the same NUMA node or memory type, but with a different value;
setting a zero limit for any of the memory types;
NUMA node IDs that do not exist in the machine hardware;
memory type names different than memory or hugepages-<size>
(hugepages of particular <size> should also exist).
When you specify values for the --reserved-memory flag, you must comply with the settings that
you previously provided via the Node Allocatable feature flags.
That is, the following rule must be obeyed for each memory type:
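As a sketch (paraphrasing the rule rather than quoting it), for every memory type the reservations must add up across NUMA nodes to the total reserved by the Node Allocatable configuration:
sum over all NUMA nodes i of reserved-memory(i) = kube-reserved + system-reserved + eviction-hard (for that memory type)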
An example of kubelet command-line arguments relevant to the node Allocatable configuration:
--kube-reserved=cpu=500m,memory=50Mi
--system-reserved=cpu=123m,memory=333Mi
--eviction-hard=memory.available<500Mi
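Continuing this example as a sketch, the conventional memory reserved across all NUMA nodes must add up to 50Mi + 333Mi + 500Mi = 883Mi. One hypothetical way to satisfy that (the split between NUMA nodes is illustrative):
--reserved-memory '0:memory=500Mi;1:memory=383Mi'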
Note: The default hard eviction threshold is 100MiB, and not zero.
Remember to increase the quantity of memory that you reserve by setting --reserved-memory
by that hard eviction threshold. Otherwise, the kubelet will not start Memory Manager and
display an error.
If the selected policy is anything other than None, the Memory Manager identifies pods
that are in the Guaranteed QoS class.
The Memory Manager provides specific topology hints to the Topology Manager for each Guaranteed pod.
For pods in a QoS class other than Guaranteed, the Memory Manager provides default topology hints
to the Topology Manager.
The following excerpts from pod manifests assign a pod to the Guaranteed QoS class.
Notice that both CPU and memory requests must be specified for a Pod to be placed in the Guaranteed QoS class.
A Pod with integer CPU(s) runs in the Guaranteed QoS class when requests are equal to limits, as in the excerpt below:
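A minimal excerpt of such a manifest might look like this (the container name and image are illustrative):
spec:
  containers:
  - name: app
    image: registry.example/app:1.0
    resources:
      limits:
        cpu: "2"
        memory: "1Gi"
      requests:
        cpu: "2"
        memory: "1Gi"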
Troubleshooting
The following means can be used to troubleshoot the reason why a pod could not be deployed or
was rejected at a node:
pod status - indicates topology affinity errors
system logs - include valuable information for debugging, e.g., about generated hints
state file - the dump of internal state of the Memory Manager
(includes Node Map and Memory Maps)
starting from v1.22, the device plugin resource API can be used
to retrieve information about the memory reserved for containers
Pod status (TopologyAffinityError)
This error typically occurs in the following situations:
a node has not enough resources available to satisfy the pod's request
the pod's request is rejected due to particular Topology Manager policy constraints
The error appears in the status of a pod:
kubectl get pods
NAME READY STATUS RESTARTS AGE
guaranteed 0/1 TopologyAffinityError 0 113s
Use kubectl describe pod <id> or kubectl get events to obtain detailed error message:
Warning TopologyAffinityError 10m kubelet, dell8 Resources cannot be allocated with Topology locality
System logs
Search system logs with respect to a particular pod.
The set of hints that Memory Manager generated for the pod can be found in the logs.
Also, the set of hints generated by CPU Manager should be present in the logs.
Topology Manager merges these hints to calculate a single best hint.
The best hint should be also present in the logs.
The best hint indicates where to allocate all the resources.
Topology Manager tests this hint against its current policy, and based on the verdict,
it either admits the pod to the node or rejects it.
Also, search the logs for occurrences associated with the Memory Manager,
e.g. to find out information about cgroups and cpuset.mems updates.
Examine the memory manager state on a node
Let us first deploy a sample Guaranteed pod whose specification is as follows:
It can be deduced from the state file that the pod was pinned to both NUMA nodes, i.e.:
"numaAffinity":[
0,
1
],
The term pinned means that the pod's memory consumption is constrained (through the cgroups configuration)
to these NUMA nodes.
This automatically implies that Memory Manager instantiated a new group that
comprises these two NUMA nodes, i.e. 0 and 1 indexed NUMA nodes.
Notice that the management of groups is handled in a relatively complex manner;
further elaboration is provided in the corresponding sections of the Memory Manager KEP.
In order to analyse memory resources available in a group, the corresponding entries from
NUMA nodes belonging to the group must be added up.
For example, the total amount of free "conventional" memory in the group can be computed
by adding up the free memory available at every NUMA node in the group,
i.e., in the "memory" section of NUMA node 0 ("free":0) and NUMA node 1 ("free":103739236352).
So, the total amount of free "conventional" memory in this group is equal to 0 + 103739236352 bytes.
The line "systemReserved":3221225472 indicates that the administrator of this node reserved
3221225472 bytes (i.e. 3Gi) to serve kubelet and system processes at NUMA node 0,
by using --reserved-memory flag.
Device plugin resource API
By employing the API,
the information about reserved memory for each container can be retrieved, which is contained
in the protobuf ContainerMemory message.
This information can be retrieved solely for pods in Guaranteed QoS class.
These instructions are for Kubernetes 1.24. If you want
to check the integrity of components for a different version of Kubernetes,
check the documentation for that Kubernetes release.
You will need to have the following tools installed:
Note: COSIGN_EXPERIMENTAL=1 is used to allow verification of images signed
in KEYLESS mode. To learn more about keyless signing, please refer to
Keyless Signatures.
Verifying images for all control plane components
To verify all signed control plane images, please run this command:
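The exact verification command depends on the release and registry you are checking; as a minimal sketch, a single control plane image can be verified in keyless mode like this (the image name and tag are illustrative):
COSIGN_EXPERIMENTAL=1 cosign verify k8s.gcr.io/kube-apiserver:v1.24.0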
Once you have verified an image, specify that image by its digest in your Pod
manifests, as in this example:
registry-url/image-name@sha256:45b23dee08af5e43a7fea6c4cf9c25ccf269ee113168c19722f87876677c5cb2
Verifying Image Signatures with Admission Controller
For non-control plane images (for example, the conformance image),
signatures can also be verified at deploy time using the cosigned
admission controller. To get started with cosigned, here are a few helpful
resources:
Perform common configuration tasks for Pods and containers.
3.1 - Assign Memory Resources to Containers and Pods
This page shows how to assign a memory request and a memory limit to a
Container. A Container is guaranteed to have as much memory as it requests,
but is not allowed to use more memory than its limit.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
Each node in your cluster must have at least 300 MiB of memory.
A few of the steps on this page require you to run the
metrics-server
service in your cluster. If you have the metrics-server
running, you can skip those steps.
If you are running Minikube, run the following command to enable the
metrics-server:
minikube addons enable metrics-server
To see whether the metrics-server is running, or another provider of the resource metrics
API (metrics.k8s.io), run the following command:
kubectl get apiservices
If the resource metrics API is available, the output includes a
reference to metrics.k8s.io.
NAME
v1beta1.metrics.k8s.io
Create a namespace
Create a namespace so that the resources you create in this exercise are
isolated from the rest of your cluster.
kubectl create namespace mem-example
Specify a memory request and a memory limit
To specify a memory request for a Container, include the resources:requests field
in the Container's resource manifest. To specify a memory limit, include resources:limits.
In this exercise, you create a Pod that has one Container. The Container has a memory
request of 100 MiB and a memory limit of 200 MiB. Here's the configuration file
for the Pod:
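A sketch of that manifest (the stress image and its arguments mirror the pattern referenced later in this task; treat them as illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo
  namespace: mem-example
spec:
  containers:
  - name: memory-demo-ctr
    image: polinux/stress
    resources:
      requests:
        memory: "100Mi"
      limits:
        memory: "200Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]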
The args section in the configuration file provides arguments for the Container when it starts.
The "--vm-bytes", "150M" arguments tell the Container to attempt to allocate 150 MiB of memory.
kubectl top pod memory-demo --namespace=mem-example
The output shows that the Pod is using about 162,900,000 bytes of memory, which
is about 150 MiB. This is greater than the Pod's 100 MiB request, but within the
Pod's 200 MiB limit.
NAME CPU(cores) MEMORY(bytes)
memory-demo <something> 162856960
Delete your Pod:
kubectl delete pod memory-demo --namespace=mem-example
Exceed a Container's memory limit
A Container can exceed its memory request if the Node has memory available. But a Container
is not allowed to use more than its memory limit. If a Container allocates more memory than
its limit, the Container becomes a candidate for termination. If the Container continues to
consume memory beyond its limit, the Container is terminated. If a terminated Container can be
restarted, the kubelet restarts it, as with any other type of runtime failure.
In this exercise, you create a Pod that attempts to allocate more memory than its limit.
Here is the configuration file for a Pod that has one Container with a
memory request of 50 MiB and a memory limit of 100 MiB:
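A sketch of that manifest (again using the stress image as an assumption; the argument values match the description below):
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo-2
  namespace: mem-example
spec:
  containers:
  - name: memory-demo-2-ctr
    image: polinux/stress
    resources:
      requests:
        memory: "50Mi"
      limits:
        memory: "100Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "250M", "--vm-hang", "1"]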
In the args section of the configuration file, you can see that the Container
will attempt to allocate 250 MiB of memory, which is well above the 100 MiB limit.
The Container in this exercise can be restarted, so the kubelet restarts it. Repeat
this command several times to see that the Container is repeatedly killed and restarted:
kubectl get pod memory-demo-2 --namespace=mem-example
The output shows that the Container is killed, restarted, killed again, restarted again, and so on:
kubectl get pod memory-demo-2 --namespace=mem-example
NAME READY STATUS RESTARTS AGE
memory-demo-2 0/1 OOMKilled 1 37s
kubectl get pod memory-demo-2 --namespace=mem-example
NAME READY STATUS RESTARTS AGE
memory-demo-2 1/1 Running 2 40s
View detailed information about the Pod history:
kubectl describe pod memory-demo-2 --namespace=mem-example
The output shows that the Container starts and fails repeatedly:
... Normal Created Created container with id 66a3a20aa7980e61be4922780bf9d24d1a1d8b7395c09861225b0eba1b1f8511
... Warning BackOff Back-off restarting failed container
View detailed information about your cluster's Nodes:
kubectl describe nodes
The output includes a record of the Container being killed because of an out-of-memory condition:
Warning OOMKilling Memory cgroup out of memory: Kill process 4481 (stress) score 1994 or sacrifice child
Delete your Pod:
kubectl delete pod memory-demo-2 --namespace=mem-example
Specify a memory request that is too big for your Nodes
Memory requests and limits are associated with Containers, but it is useful to think
of a Pod as having a memory request and limit. The memory request for the Pod is the
sum of the memory requests for all the Containers in the Pod. Likewise, the memory
limit for the Pod is the sum of the limits of all the Containers in the Pod.
Pod scheduling is based on requests. A Pod is scheduled to run on a Node only if the Node
has enough available memory to satisfy the Pod's memory request.
In this exercise, you create a Pod that has a memory request so big that it exceeds the
capacity of any Node in your cluster. Here is the configuration file for a Pod that has one
Container with a request for 1000 GiB of memory, which likely exceeds the capacity
of any Node in your cluster.
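A sketch of that manifest (the image and its arguments are illustrative; the 1000Gi request is the point of the exercise):
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo-3
  namespace: mem-example
spec:
  containers:
  - name: memory-demo-3-ctr
    image: polinux/stress
    resources:
      requests:
        memory: "1000Gi"
      limits:
        memory: "1000Gi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]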
kubectl get pod memory-demo-3 --namespace=mem-example
The output shows that the Pod status is PENDING. That is, the Pod is not scheduled to run on any Node, and it will remain in the PENDING state indefinitely:
kubectl get pod memory-demo-3 --namespace=mem-example
NAME READY STATUS RESTARTS AGE
memory-demo-3 0/1 Pending 0 25s
View detailed information about the Pod, including events:
kubectl describe pod memory-demo-3 --namespace=mem-example
The output shows that the Container cannot be scheduled because of insufficient memory on the Nodes:
Events:
... Reason Message
------ -------
... FailedScheduling No nodes are available that match all of the following predicates:: Insufficient memory (3).
Memory units
The memory resource is measured in bytes. You can express memory as a plain integer or a
fixed-point integer with one of these suffixes: E, P, T, G, M, K, Ei, Pi, Ti, Gi, Mi, Ki.
For example, the following represent approximately the same value:
128974848, 129e6, 129M, 123Mi
Delete your Pod:
kubectl delete pod memory-demo-3 --namespace=mem-example
If you do not specify a memory limit
If you do not specify a memory limit for a Container, one of the following situations applies:
The Container has no upper bound on the amount of memory it uses. The Container
could use all of the memory available on the Node where it is running, which in turn could invoke the OOM Killer. Further, in case of an OOM kill, a container with no resource limits has a greater chance of being killed.
The Container is running in a namespace that has a default memory limit, and the
Container is automatically assigned the default limit. Cluster administrators can use a
LimitRange
to specify a default value for the memory limit.
Motivation for memory requests and limits
By configuring memory requests and limits for the Containers that run in your
cluster, you can make efficient use of the memory resources available on your cluster's
Nodes. By keeping a Pod's memory request low, you give the Pod a good chance of being
scheduled. By having a memory limit that is greater than the memory request, you accomplish two things:
The Pod can have bursts of activity where it makes use of memory that happens to be available.
The amount of memory a Pod can use during a burst is limited to some reasonable amount.
Clean up
Delete your namespace. This deletes all the Pods that you created for this task:
kubectl delete namespace mem-example
3.2 - Assign CPU Resources to Containers and Pods
This page shows how to assign a CPU request and a CPU limit to
a container. Containers cannot use more CPU than the configured limit.
Provided the system has CPU time free, a container is guaranteed to be
allocated as much CPU as it requests.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
Your cluster must have at least 1 CPU available for use to run the task examples.
A few of the steps on this page require you to run the
metrics-server
service in your cluster. If you have the metrics-server
running, you can skip those steps.
If you are running Minikube, run the
following command to enable metrics-server:
minikube addons enable metrics-server
To see whether metrics-server (or another provider of the resource metrics
API, metrics.k8s.io) is running, type the following command:
kubectl get apiservices
If the resource metrics API is available, the output will include a
reference to metrics.k8s.io.
NAME
v1beta1.metrics.k8s.io
Create a namespace
Create a Namespace so that the resources you
create in this exercise are isolated from the rest of your cluster.
kubectl create namespace cpu-example
Specify a CPU request and a CPU limit
To specify a CPU request for a container, include the resources:requests field
in the Container resource manifest. To specify a CPU limit, include resources:limits.
In this exercise, you create a Pod that has one container. The container has a request
of 0.5 CPU and a limit of 1 CPU. Here is the configuration file for the Pod:
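A sketch of that manifest (the stress image is an assumption; the request, limit, and args match the description below):
apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr
    image: vish/stress
    resources:
      limits:
        cpu: "1"
      requests:
        cpu: "0.5"
    args:
    - -cpus
    - "2"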
The args section of the configuration file provides arguments for the container when it starts.
The -cpus "2" argument tells the Container to attempt to use 2 CPUs.
kubectl get pod cpu-demo --output=yaml --namespace=cpu-example
The output shows that the one container in the Pod has a CPU request of 500 milliCPU
and a CPU limit of 1 CPU.
resources:
  limits:
    cpu: "1"
  requests:
    cpu: 500m
Use kubectl top to fetch the metrics for the pod:
kubectl top pod cpu-demo --namespace=cpu-example
This example output shows that the Pod is using 974 milliCPU, which is
slightly less than the limit of 1 CPU specified in the Pod configuration.
NAME CPU(cores) MEMORY(bytes)
cpu-demo 974m <something>
Recall that by setting -cpus "2", you configured the Container to attempt to use 2 CPUs, but the Container is only being allowed to use about 1 CPU. The container's CPU use is being throttled, because the container is attempting to use more CPU resources than its limit.
Note: Another possible explanation for the CPU use being below 1.0 is that the Node might not have
enough CPU resources available. Recall that the prerequisites for this exercise require your cluster to have at least 1 CPU available for use. If your Container runs on a Node that has only 1 CPU, the Container cannot use more than 1 CPU regardless of the CPU limit specified for the Container.
CPU units
The CPU resource is measured in CPU units. One CPU, in Kubernetes, is equivalent to:
1 AWS vCPU
1 GCP Core
1 Azure vCore
1 Hyperthread on a bare-metal Intel processor with Hyperthreading
Fractional values are allowed. A Container that requests 0.5 CPU is guaranteed half as much
CPU as a Container that requests 1 CPU. You can use the suffix m to mean milli. For example
100m CPU, 100 milliCPU, and 0.1 CPU are all the same. Precision finer than 1m is not allowed.
CPU is always requested as an absolute quantity, never as a relative quantity; 0.1 is the same
amount of CPU on a single-core, dual-core, or 48-core machine.
Delete your Pod:
kubectl delete pod cpu-demo --namespace=cpu-example
Specify a CPU request that is too big for your Nodes
CPU requests and limits are associated with Containers, but it is useful to think
of a Pod as having a CPU request and limit. The CPU request for a Pod is the sum
of the CPU requests for all the Containers in the Pod. Likewise, the CPU limit for
a Pod is the sum of the CPU limits for all the Containers in the Pod.
Pod scheduling is based on requests. A Pod is scheduled to run on a Node only if
the Node has enough CPU resources available to satisfy the Pod CPU request.
In this exercise, you create a Pod that has a CPU request so big that it exceeds
the capacity of any Node in your cluster. Here is the configuration file for a Pod
that has one Container. The Container requests 100 CPU, which is likely to exceed the
capacity of any Node in your cluster.
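A sketch of that manifest (the image and args are illustrative; the 100 CPU request is the point of the exercise):
apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo-2
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr-2
    image: vish/stress
    resources:
      limits:
        cpu: "100"
      requests:
        cpu: "100"
    args:
    - -cpus
    - "2"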
kubectl get pod cpu-demo-2 --namespace=cpu-example
The output shows that the Pod status is Pending. That is, the Pod has not been
scheduled to run on any Node, and it will remain in the Pending state indefinitely:
NAME READY STATUS RESTARTS AGE
cpu-demo-2 0/1 Pending 0 7m
View detailed information about the Pod, including events:
kubectl describe pod cpu-demo-2 --namespace=cpu-example
The output shows that the Container cannot be scheduled because of insufficient
CPU resources on the Nodes:
Events:
Reason Message
------ -------
FailedScheduling No nodes are available that match all of the following predicates:: Insufficient cpu (3).
Delete your Pod:
kubectl delete pod cpu-demo-2 --namespace=cpu-example
If you do not specify a CPU limit
If you do not specify a CPU limit for a Container, then one of these situations applies:
The Container has no upper bound on the CPU resources it can use. The Container
could use all of the CPU resources available on the Node where it is running.
The Container is running in a namespace that has a default CPU limit, and the
Container is automatically assigned the default limit. Cluster administrators can use a
LimitRange
to specify a default value for the CPU limit.
If you specify a CPU limit but do not specify a CPU request
If you specify a CPU limit for a Container but do not specify a CPU request, Kubernetes automatically
assigns a CPU request that matches the limit. Similarly, if a Container specifies its own memory limit,
but does not specify a memory request, Kubernetes automatically assigns a memory request that matches
the limit.
Motivation for CPU requests and limits
By configuring the CPU requests and limits of the Containers that run in your
cluster, you can make efficient use of the CPU resources available on your cluster
Nodes. By keeping a Pod CPU request low, you give the Pod a good chance of being
scheduled. By having a CPU limit that is greater than the CPU request, you accomplish two things:
The Pod can have bursts of activity where it makes use of CPU resources that happen to be available.
The amount of CPU resources a Pod can use during a burst is limited to some reasonable amount.
3.3 - Configure GMSA for Windows Pods and containers
FEATURE STATE: Kubernetes v1.18 [stable]
This page shows how to configure Group Managed Service Accounts (GMSA) for Pods and containers that will run on Windows nodes. Group Managed Service Accounts are a specific type of Active Directory account that provides automatic password management, simplified service principal name (SPN) management, and the ability to delegate the management to other administrators across multiple servers.
In Kubernetes, GMSA credential specs are configured at a Kubernetes cluster-wide scope as Custom Resources. Windows Pods, as well as individual containers within a Pod, can be configured to use a GMSA for domain based functions (e.g. Kerberos authentication) when interacting with other Windows services.
Before you begin
You need to have a Kubernetes cluster and the kubectl command-line tool must be configured to communicate with your cluster. The cluster is expected to have Windows worker nodes. This section covers a set of initial steps required once for each cluster:
Install the GMSACredentialSpec CRD
A CustomResourceDefinition (CRD) for GMSA credential spec resources needs to be configured on the cluster to define the custom resource type GMSACredentialSpec. Download the GMSA CRD YAML and save it as gmsa-crd.yaml.
Next, install the CRD with kubectl apply -f gmsa-crd.yaml
Install webhooks to validate GMSA users
Two webhooks need to be configured on the Kubernetes cluster to populate and validate GMSA credential spec references at the Pod or container level:
A mutating webhook that expands references to GMSAs (by name from a Pod specification) into the full credential spec in JSON form within the Pod spec.
A validating webhook that ensures all references to GMSAs are authorized to be used by the Pod's service account.
Installing the above webhooks and associated objects requires the steps below:
Create a certificate key pair (that will be used to allow the webhook container to communicate to the cluster)
Install a secret with the certificate from above.
Create a deployment for the core webhook logic.
Create the validating and mutating webhook configurations referring to the deployment.
A script can be used to deploy and configure the GMSA webhooks and associated objects mentioned above. The script can be run with a --dry-run=server option to allow you to review the changes that would be made to your cluster.
The YAML template used by the script may also be used to deploy the webhooks and associated objects manually (with appropriate substitutions for the parameters).
Configure GMSAs and Windows nodes in Active Directory
Before Pods in Kubernetes can be configured to use GMSAs, the desired GMSAs need to be provisioned in Active Directory as described in the Windows GMSA documentation. Windows worker nodes (that are part of the Kubernetes cluster) need to be configured in Active Directory to access the secret credentials associated with the desired GMSA as described in the Windows GMSA documentation
Create GMSA credential spec resources
With the GMSACredentialSpec CRD installed (as described earlier), custom resources containing GMSA credential specs can be configured. The GMSA credential spec does not contain secret or sensitive data. It is information that a container runtime can use to describe the desired GMSA of a container to Windows. GMSA credential specs can be generated in YAML format with a utility PowerShell script.
Following are the steps for generating a GMSA credential spec in JSON format manually and then converting it to YAML:
Import the CredentialSpec module: ipmo CredentialSpec.psm1
Create a credential spec in JSON format using New-CredentialSpec. To create a GMSA credential spec named WebApp1, invoke New-CredentialSpec -Name WebApp1 -AccountName WebApp1 -Domain $(Get-ADDomain -Current LocalComputer)
Use Get-CredentialSpec to show the path of the JSON file.
Convert the credspec file from JSON to YAML format and apply the necessary header fields apiVersion, kind, metadata and credspec to make it a GMSACredentialSpec custom resource that can be configured in Kubernetes.
The following YAML configuration describes a GMSA credential spec named gmsa-WebApp1:
apiVersion: windows.k8s.io/v1
kind: GMSACredentialSpec
metadata:
  name: gmsa-WebApp1  # This is an arbitrary name but it will be used as a reference
credspec:
  ActiveDirectoryConfig:
    GroupManagedServiceAccounts:
    - Name: WebApp1    # Username of the GMSA account
      Scope: CONTOSO   # NETBIOS Domain Name
    - Name: WebApp1    # Username of the GMSA account
      Scope: contoso.com # DNS Domain Name
  CmsPlugins:
  - ActiveDirectory
  DomainJoinConfig:
    DnsName: contoso.com      # DNS Domain Name
    DnsTreeName: contoso.com  # DNS Domain Name Root
    Guid: 244818ae-87ac-4fcd-92ec-e79e5252348a  # GUID
    MachineAccountName: WebApp1 # Username of the GMSA account
    NetBiosName: CONTOSO      # NETBIOS Domain Name
    Sid: S-1-5-21-2126449477-2524075714-3094792973 # SID of GMSA
The above credential spec resource may be saved as gmsa-Webapp1-credspec.yaml and applied to the cluster using: kubectl apply -f gmsa-Webapp1-credspec.yaml
Configure cluster role to enable RBAC on specific GMSA credential specs
A cluster role needs to be defined for each GMSA credential spec resource. This authorizes the use verb on a specific GMSA resource by a subject which is typically a service account. The following example shows a cluster role that authorizes usage of the gmsa-WebApp1 credential spec from above. Save the file as gmsa-webapp1-role.yaml and apply using kubectl apply -f gmsa-webapp1-role.yaml
# Create the Role to read the credspec
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: webapp1-role
rules:
- apiGroups: ["windows.k8s.io"]
  resources: ["gmsacredentialspecs"]
  verbs: ["use"]
  resourceNames: ["gmsa-WebApp1"]
Assign role to service accounts to use specific GMSA credspecs
A service account (that Pods will be configured with) needs to be bound to the cluster role created above. This authorizes the service account to use the desired GMSA credential spec resource. The following shows the default service account being bound to the cluster role webapp1-role to use the gmsa-WebApp1 credential spec resource created above.
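A sketch of such a binding (the binding name and namespace are illustrative):
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: allow-default-svc-account-read-on-gmsa-WebApp1
  namespace: default
subjects:
- kind: ServiceAccount
  name: default
  namespace: default
roleRef:
  kind: ClusterRole
  name: webapp1-role
  apiGroup: rbac.authorization.k8s.io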
Configure GMSA credential spec reference in Pod spec
The Pod spec field securityContext.windowsOptions.gmsaCredentialSpecName is used to specify references to desired GMSA credential spec custom resources in Pod specs. This configures all containers in the Pod spec to use the specified GMSA. A sample Pod spec with this field populated to refer to gmsa-WebApp1:
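A sketch of such a Pod spec (the pod name and image are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: with-creds
spec:
  securityContext:
    windowsOptions:
      gmsaCredentialSpecName: gmsa-WebApp1
  containers:
  - name: iis
    image: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
  nodeSelector:
    kubernetes.io/os: windows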
Individual containers in a Pod spec can also specify the desired GMSA credspec using a per-container securityContext.windowsOptions.gmsaCredentialSpecName field. For example:
As Pod specs with GMSA fields populated (as described above) are applied in a cluster, the following sequence of events takes place:
The mutating webhook resolves and expands all references to GMSA credential spec resources to the contents of the GMSA credential spec.
The validating webhook ensures the service account associated with the Pod is authorized for the use verb on the specified GMSA credential spec.
The container runtime configures each Windows container with the specified GMSA credential spec so that the container can assume the identity of the GMSA in Active Directory and access services in the domain using that identity.
Authenticating to network shares using hostname or FQDN
If you are experiencing issues connecting to SMB shares from Pods using hostname or FQDN, but are able to access the shares via their IPv4 address then make sure the following registry key is set on the Windows nodes.
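As a sketch, this can be set on each Windows node with the following command (confirm the key against the reference linked below before applying it):
reg add "HKLM\SYSTEM\CurrentControlSet\Services\hns\State" /v EnableCompartmentNamespace /t REG_DWORD /d 1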
Running Pods will then need to be recreated to pick up the behavior changes.
More information on how this registry key is used can be found here
Troubleshooting
If you are having difficulties getting GMSA to work in your environment, there are a few troubleshooting steps you can take.
First, make sure the credspec has been passed to the Pod. To do this you will need to exec into one of your Pods and check the output of the nltest.exe /parentdomain command.
In the example below the Pod did not get the credspec correctly:
nltest.exe /parentdomain results in the following error:
Getting parent domain failed: Status = 1722 0x6ba RPC_S_SERVER_UNAVAILABLE
If your Pod did get the credspec correctly, then next check communication with the domain. First, from inside of your Pod, quickly do an nslookup to find the root of your domain.
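For example, assuming the domain used elsewhere in this section:
nslookup domain.example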
This will tell us 3 things:
The Pod can reach the DC
The DC can reach the Pod
DNS is working correctly.
If the DNS and communication test passes, next you will need to check if the Pod has established secure channel communication with the domain. To do this, again, exec into your Pod and run the nltest.exe /query command.
nltest.exe /query
Results in the following output:
I_NetLogonControl failed: Status = 1722 0x6ba RPC_S_SERVER_UNAVAILABLE
This tells us that for some reason, the Pod was unable to logon to the domain using the account specified in the credspec. You can try to repair the secure channel by running the following:
nltest /sc_reset:domain.example
If the command is successful, you will see output similar to this:
Flags: 30 HAS_IP HAS_TIMESERV
Trusted DC Name \\dc10.domain.example
Trusted DC Connection Status Status = 0 0x0 NERR_Success
The command completed successfully
If the above corrects the error, you can automate the step by adding the following lifecycle hook to your Pod spec. If it did not correct the error, you will need to examine your credspec again and confirm that it is correct and complete.
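A sketch of what such a postStart hook could look like (the PowerShell loop restarts the netlogon service and re-runs nltest.exe /query until it reports success; adapt it to your container spec):
lifecycle:
  postStart:
    exec:
      command:
      - powershell.exe
      - -command
      - "do { Restart-Service -Name netlogon } while (-not (nltest.exe /query | Select-String 'NERR_Success'))"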
If you add the lifecycle section shown above to your Pod spec, the Pod will execute the commands listed to restart the netlogon service until the nltest.exe /query command exits without error.
3.4 - Configure RunAsUserName for Windows pods and containers
FEATURE STATE: Kubernetes v1.18 [stable]
This page shows how to use the runAsUserName setting for Pods and containers that will run on Windows nodes. This is roughly equivalent to the Linux-specific runAsUser setting, allowing you to run applications in a container as a different username than the default.
Before you begin
You need to have a Kubernetes cluster and the kubectl command-line tool must be configured to communicate with your cluster. The cluster is expected to have Windows worker nodes where pods with containers running Windows workloads will get scheduled.
Set the Username for a Pod
To specify the username with which to execute the Pod's container processes, include the securityContext field (PodSecurityContext) in the Pod specification, and within it, the windowsOptions (WindowsSecurityContextOptions) field containing the runAsUserName field.
The Windows security context options that you specify for a Pod apply to all Containers and init Containers in the Pod.
Here is a configuration file for a Windows Pod that has the runAsUserName field set:
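A sketch of such a manifest (the image and command are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: run-as-username-pod-demo
spec:
  securityContext:
    windowsOptions:
      runAsUserName: "ContainerUser"
  containers:
  - name: run-as-username-demo
    image: mcr.microsoft.com/windows/servercore:ltsc2019
    command: ["ping", "-t", "localhost"]
  nodeSelector:
    kubernetes.io/os: windows
After creating the Pod, open a PowerShell session inside it (for example, with kubectl exec -it run-as-username-pod-demo -- powershell) before running the check below.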
Check that the shell is running using the correct username:
echo $env:USERNAME
The output should be:
ContainerUser
Set the Username for a Container
To specify the username with which to execute a Container's processes, include the securityContext field (SecurityContext) in the Container manifest, and within it, the windowsOptions (WindowsSecurityContextOptions) field containing the runAsUserName field.
The Windows security context options that you specify for a Container apply only to that individual Container, and they override the settings made at the Pod level.
Here is the configuration file for a Pod that has one Container, and the runAsUserName field is set at the Pod level and the Container level:
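A sketch of such a manifest, with ContainerUser set at the Pod level and ContainerAdministrator overriding it at the Container level (the image and command are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: run-as-username-container-demo
spec:
  securityContext:
    windowsOptions:
      runAsUserName: "ContainerUser"
  containers:
  - name: run-as-username-demo
    image: mcr.microsoft.com/windows/servercore:ltsc2019
    command: ["ping", "-t", "localhost"]
    securityContext:
      windowsOptions:
        runAsUserName: "ContainerAdministrator"
  nodeSelector:
    kubernetes.io/os: windows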
Check that the shell is running using the correct username (the one set at the Container level):
echo $env:USERNAME
The output should be:
ContainerAdministrator
Windows Username limitations
In order to use this feature, the value set in the runAsUserName field must be a valid username. It must have the following format: DOMAIN\USER, where DOMAIN\ is optional. Windows user names are case insensitive. Additionally, there are some restrictions regarding the DOMAIN and USER:
The runAsUserName field cannot be empty, and it cannot contain control characters (ASCII values: 0x00-0x1F, 0x7F)
The DOMAIN must be either a NetBios name, or a DNS name, each with their own restrictions:
NetBios names: maximum 15 characters, cannot start with . (dot), and cannot contain the following characters: \ / : * ? " < > |
DNS names: maximum 255 characters, contains only alphanumeric characters, dots, and dashes, and it cannot start or end with a . (dot) or - (dash).
The USER must have at most 20 characters, it cannot contain only dots or spaces, and it cannot contain the following characters: " / \ [ ] : ; | = , + * ? < > @.
Examples of acceptable values for the runAsUserName field: ContainerAdministrator, ContainerUser, NT AUTHORITY\NETWORK SERVICE, NT AUTHORITY\LOCAL SERVICE.
For more information about these limitations, check here and here.
3.5 - Create a Windows HostProcess Pod
Windows HostProcess containers enable you to run containerized
workloads on a Windows host. These containers operate as
normal processes but have access to the host network namespace,
storage, and devices when given the appropriate user privileges.
HostProcess containers can be used to deploy network plugins,
storage configurations, device plugins, kube-proxy, and other
components to Windows nodes without the need for dedicated proxies or
the direct installation of host services.
Administrative tasks such as installation of security patches, event
log collection, and more can be performed without requiring cluster operators to
log onto each Windows node. HostProcess containers can run as any user that is
available on the host or is in the domain of the host machine, allowing administrators
to restrict resource access through user permissions. While neither filesystem nor process
isolation is supported, a new volume is created on the host upon starting the container
to give it a clean and consolidated workspace. HostProcess containers can also be built on
top of existing Windows base images and do not inherit the same
compatibility requirements
as Windows server containers, meaning that the version of the base images does not need
to match that of the host. It is, however, recommended that you use the same base image
version as your Windows Server container workloads to ensure you do not have any unused
images taking up space on the node. HostProcess containers also support
volume mounts within the container volume.
When should I use a Windows HostProcess container?
When you need to perform tasks which require the networking namespace of the host.
HostProcess containers have access to the host's network interfaces and IP addresses.
You need access to resources on the host such as the filesystem, event logs, etc.
Installation of specific device drivers or Windows services.
Consolidation of administrative tasks and security policies. This reduces the degree of
privileges needed by Windows nodes.
Before you begin
This task guide is specific to Kubernetes v1.24.
If you are not running Kubernetes v1.24, check the documentation for
that version of Kubernetes.
In Kubernetes 1.24, the HostProcess container feature is enabled by default. The kubelet will
communicate with containerd directly by passing the hostprocess flag via CRI. You can use the
latest version of containerd (v1.6+) to run HostProcess containers.
How to install containerd.
To disable HostProcess containers you need to pass the following feature gate flag to the
kubelet and kube-apiserver:
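The relevant feature gate is WindowsHostProcessContainers; for example:
--feature-gates=WindowsHostProcessContainers=false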
These limitations are relevant for Kubernetes v1.24:
HostProcess containers require containerd 1.6 or higher
container runtime.
HostProcess pods can only contain HostProcess containers. This is a current limitation
of the Windows OS; non-privileged Windows containers cannot share a vNIC with the host IP namespace.
HostProcess containers run as a process on the host and do not have any degree of
isolation other than resource constraints imposed on the HostProcess user account. Neither
filesystem nor Hyper-V isolation is supported for HostProcess containers.
Volume mounts are supported and are mounted under the container volume. See
Volume Mounts
A limited set of host user accounts are available for HostProcess containers by default.
See Choosing a User Account.
Resource limits (disk, memory, cpu count) are supported in the same fashion as processes
on the host.
Named pipe mounts and Unix domain sockets are not supported; these should instead
be accessed via their path on the host (e.g. \\.\pipe\*).
HostProcess Pod configuration requirements
Enabling a Windows HostProcess pod requires setting the right configurations in the pod security
configuration. Of the policies defined in the Pod Security Standards,
HostProcess pods are disallowed by the baseline and restricted policies. It is therefore recommended
that HostProcess pods run in alignment with the privileged profile.
When running under the privileged policy, here are
the configurations which need to be set to enable the creation of a HostProcess pod:
runAsNonRoot: Because HostProcess containers have privileged access to the host, the runAsNonRoot
field cannot be set to true. Allowed values: Undefined/Nil, false.
Example manifest (excerpt)
spec:
  securityContext:
    windowsOptions:
      hostProcess: true
      runAsUserName: "NT AUTHORITY\\Local service"
  hostNetwork: true
  containers:
  - name: test
    image: image1:latest
    command:
    - ping
    - -t
    - 127.0.0.1
  nodeSelector:
    "kubernetes.io/os": windows
Volume mounts
HostProcess containers support the ability to mount volumes within the container volume space.
Applications running inside the container can access volume mounts directly via relative or
absolute paths. An environment variable $CONTAINER_SANDBOX_MOUNT_POINT is set upon container
creation and provides the absolute host path to the container volume. Relative paths are based
upon the .spec.containers.volumeMounts.mountPath configuration.
Example
To access service account tokens the following path structures are supported within the container:
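As a sketch, both of the following path forms should resolve to the mounted service account token directory inside a HostProcess container:
.\var\run\secrets\kubernetes.io\serviceaccount\
$CONTAINER_SANDBOX_MOUNT_POINT\var\run\secrets\kubernetes.io\serviceaccount\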
Resource limits (disk, memory, cpu count) are applied to the job and are job wide.
For example, with a limit of 10MB set, the memory allocated for any HostProcess job object
will be capped at 10MB. This is the same behavior as other Windows container types.
These limits would be specified the same way they are currently for whatever orchestrator
or runtime is being used. The only difference is in the disk resource usage calculation
used for resource tracking due to the difference in how HostProcess containers are bootstrapped.
Choosing a user account
HostProcess containers support the ability to run as one of three supported Windows service accounts: LocalSystem, LocalService, or NetworkService.
You should select an appropriate Windows service account for each HostProcess
container, aiming to limit the degree of privileges so as to avoid accidental (or even
malicious) damage to the host. The LocalSystem service account has the highest level
of privilege of the three and should be used only if absolutely necessary. Where possible,
use the LocalService service account as it is the least privileged of the three options.
3.6 - Configure Quality of Service for Pods
This page shows how to configure Pods so that they will be assigned particular
Quality of Service (QoS) classes. Kubernetes uses QoS classes to make decisions about
scheduling and evicting Pods.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
When Kubernetes creates a Pod it assigns one of these QoS classes to the Pod:
Guaranteed
Burstable
BestEffort
Create a namespace
Create a namespace so that the resources you create in this exercise are
isolated from the rest of your cluster.
kubectl create namespace qos-example
Create a Pod that gets assigned a QoS class of Guaranteed
For a Pod to be given a QoS class of Guaranteed:
Every Container in the Pod must have a memory limit and a memory request.
For every Container in the Pod, the memory limit must equal the memory request.
Every Container in the Pod must have a CPU limit and a CPU request.
For every Container in the Pod, the CPU limit must equal the CPU request.
These restrictions apply to init containers and app containers equally.
Here is the configuration file for a Pod that has one Container. The Container has a memory limit and a
memory request, both equal to 200 MiB. The Container has a CPU limit and a CPU request, both equal to 700 milliCPU:
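A sketch of that manifest (the image is illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
  namespace: qos-example
spec:
  containers:
  - name: qos-demo-ctr
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "700m"
      requests:
        memory: "200Mi"
        cpu: "700m"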
kubectl get pod qos-demo --namespace=qos-example --output=yaml
The output shows that Kubernetes gave the Pod a QoS class of Guaranteed. The output also
verifies that the Pod Container has a memory request that matches its memory limit, and it has
a CPU request that matches its CPU limit.
Note: If a Container specifies its own memory limit, but does not specify a memory request, Kubernetes
automatically assigns a memory request that matches the limit. Similarly, if a Container specifies its own
CPU limit, but does not specify a CPU request, Kubernetes automatically assigns a CPU request that matches
the limit.
Delete your Pod:
kubectl delete pod qos-demo --namespace=qos-example
Create a Pod that gets assigned a QoS class of Burstable
A Pod is given a QoS class of Burstable if:
The Pod does not meet the criteria for QoS class Guaranteed.
At least one Container in the Pod has a memory or CPU request or limit.
Here is the configuration file for a Pod that has one Container. The Container has a memory limit of 200 MiB
and a memory request of 100 MiB.
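A sketch of the relevant part of such a manifest (the container name and image are illustrative):
spec:
  containers:
  - name: qos-demo-2-ctr
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"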
kubectl delete pod qos-demo-3 --namespace=qos-example
Create a Pod that has two Containers
Here is the configuration file for a Pod that has two Containers. One container specifies a memory
request of 200 MiB. The other Container does not specify any requests or limits.
Notice that this Pod meets the criteria for QoS class Burstable. That is, it does not meet the
criteria for QoS class Guaranteed, and one of its Containers has a memory request.
3.7 - Assign Extended Resources to a Container
This page shows how to assign extended resources to a Container.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must
be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a
cluster, you can create one by using
minikube
or you can use one of these Kubernetes playgrounds:
Before you do this exercise, do the exercise in
Advertise Extended Resources for a Node.
That will configure one of your Nodes to advertise a dongle resource.
Assign an extended resource to a Pod
To request an extended resource, include the resources:requests field in your
Container manifest. Extended resources are fully qualified with any domain outside of
*.kubernetes.io/. Valid extended resource names have the form example.com/foo where
example.com is replaced with your organization's domain and foo is a
descriptive resource name.
Here is the configuration file for a Pod that has one Container:
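A sketch of such a manifest, requesting the dongle resource advertised in the prerequisite task (the pod and container names and the image are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: extended-resource-demo
spec:
  containers:
  - name: extended-resource-demo-ctr
    image: nginx
    resources:
      requests:
        example.com/dongle: 3
      limits:
        example.com/dongle: 3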