Node Reference Information

1: Kubelet Checkpoint API
2: Linux Kernel Version Requirements
3: Articles on dockershim Removal and on Using CRI-compatible Runtimes
4: Kubelet Pod Info gRPC API
5: Node Labels Populated By The Kubelet
6: Kubelet Sync Loop
7: Local Files And Paths Used By The Kubelet
8: Kubelet Configuration Directory Merging
9: Kubelet Device Manager API Versions
10: Kubelet Systemd Watchdog
11: Node Status
12: Seccomp and Kubernetes
13: What Happens After A Node Restart
14: Linux Node Swap Behaviors

This section contains the following reference topics about nodes:

the kubelet's sync loop
the kubelet's checkpoint API
the kubelet's Pod Info gRPC API
a list of Articles on dockershim Removal and on Using CRI-compatible Runtimes
Kubelet Device Manager API Versions
Node Labels Populated By The Kubelet
Local Files And Paths Used By The Kubelet
Node .status information
Linux Node Swap Behaviors
Seccomp information
What happens after a node restart

You can also read node reference details from elsewhere in the Kubernetes documentation, including:

1 - Kubelet Checkpoint API

FEATURE STATE: Kubernetes v1.30 [beta] (enabled by default)

Checkpointing a container is the functionality to create a stateful copy of a running container. Once you have a stateful copy of a container, you could move it to a different computer for debugging or similar purposes.

If you move the checkpointed container data to a computer that's able to restore it, that restored container continues to run at exactly the same point it was checkpointed. You can also inspect the saved data, provided that you have suitable tools for doing so.

Creating a checkpoint of a container might have security implications. Typically a checkpoint contains all memory pages of all processes in the checkpointed container. This means that everything that used to be in memory is now available on the local disk, including all private data and possibly keys used for encryption. The underlying CRI implementation (the container runtime on that node) should create the checkpoint archive so that it is accessible only by the root user. Even so, remember that if the checkpoint archive is transferred to another system, all memory pages will be readable by the owner of the checkpoint archive.

Operations

`post` checkpoint the specified container

Tell the kubelet to checkpoint a specific container from the specified Pod.

Consult the Kubelet authentication/authorization reference for more information about how access to the kubelet checkpoint interface is controlled.

The kubelet will request a checkpoint from the underlying CRI implementation. In the checkpoint request the kubelet will specify the name of the checkpoint archive as checkpoint-<podFullName>-<containerName>-<timestamp>.tar and also request to store the checkpoint archive in the checkpoints directory below its root directory (as defined by --root-dir). This defaults to /var/lib/kubelet/checkpoints.

The checkpoint archive is in tar format, and you can list its contents using any implementation of tar. The contents of the archive depend on the underlying CRI implementation (the container runtime on that node).
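Because the archive is a plain tar file, any tar reader can enumerate it. The following is a minimal sketch using Python's standard tarfile module; the function name is illustrative, and the member names you see depend entirely on the container runtime that produced the archive.

```python
import tarfile


def list_checkpoint_archive(archive_path):
    """Return the member names stored in a checkpoint tar archive."""
    # "r:*" opens read-only and lets tarfile detect any compression itself.
    with tarfile.open(archive_path, mode="r:*") as archive:
        return archive.getnames()
```

Since the runtime should make the archive readable only by root, you would typically need to run such a script with root privileges on the node.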

HTTP Request

POST /checkpoint/{namespace}/{pod}/{container}

Parameters

namespace (in path): string, required
Namespace
pod (in path): string, required
Pod
container (in path): string, required
Container
timeout (in query): integer
Timeout in seconds to wait until the checkpoint creation is finished. If zero or no timeout is specified the default CRI timeout value will be used. Checkpoint creation time depends directly on the used memory of the container. The more memory a container uses the more time is required to create the corresponding checkpoint.

Response

200: OK

401: Unauthorized

404: Not Found (if the ContainerCheckpoint feature gate is disabled)

404: Not Found (if the specified namespace, pod or container cannot be found)

500: Internal Server Error (if the CRI implementation encounters an error during checkpointing (see error message for further details))

500: Internal Server Error (if the CRI implementation does not implement the checkpoint CRI API (see error message for further details))
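As a rough sketch of how a client might assemble the request URL described above, the helper below fills in the path parameters and the optional timeout query parameter. The function name and the node name in the test are hypothetical. Port 10250 is assumed here as the kubelet's usual HTTPS port, which your cluster may override. Sending the request also requires credentials that the kubelet accepts, which this sketch does not cover.

```python
from urllib.parse import quote, urlencode


def checkpoint_url(node, namespace, pod, container, timeout=None, port=10250):
    """Build the URL for POST /checkpoint/{namespace}/{pod}/{container}."""
    # Percent-encode each path segment so unusual names cannot alter the path.
    path = "/checkpoint/{}/{}/{}".format(
        quote(namespace, safe=""), quote(pod, safe=""), quote(container, safe="")
    )
    url = "https://{}:{}{}".format(node, port, path)
    # timeout is optional; omitting it (or passing 0) leaves the CRI default in effect.
    if timeout:
        url += "?" + urlencode({"timeout": int(timeout)})
    return url
```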

2 - Linux Kernel Version Requirements

Note: This section links to third party projects that provide functionality required by Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are listed alphabetically. To add a project to this list, read the content guide before submitting a change.

Many features rely on specific kernel functionalities and have minimum kernel version requirements. However, relying solely on kernel version numbers may not be sufficient for certain operating system distributions, as maintainers for distributions such as RHEL, Ubuntu and SUSE often backport selected features to older kernel releases (retaining the older kernel version).

Pod sysctls

On Linux, the sysctl() system call configures kernel parameters at run time. There is a command line tool named sysctl that you can use to configure these parameters, and many are exposed via the proc filesystem.

Some sysctls are only available if you have a modern enough kernel.

The following sysctls have a minimal kernel version requirement, and are supported in the safe set:

net.ipv4.ip_local_reserved_ports (since Kubernetes 1.27, needs kernel 3.16+);
net.ipv4.tcp_keepalive_time (since Kubernetes 1.29, needs kernel 4.5+);
net.ipv4.tcp_fin_timeout (since Kubernetes 1.29, needs kernel 4.6+);
net.ipv4.tcp_keepalive_intvl (since Kubernetes 1.29, needs kernel 4.5+);
net.ipv4.tcp_keepalive_probes (since Kubernetes 1.29, needs kernel 4.5+);
net.ipv4.tcp_syncookies (namespaced since kernel 4.6+);
net.ipv4.tcp_rmem (since Kubernetes 1.32, needs kernel 4.15+);
net.ipv4.tcp_wmem (since Kubernetes 1.32, needs kernel 4.15+);
net.ipv4.vs.conn_reuse_mode (used in ipvs proxy mode, needs kernel 4.1+).
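The version numbers in the list above can be compared against a node's kernel release with a few lines of code. This is only a sketch: the table is copied from the list, the helper names are illustrative, and a plain version comparison can report false negatives on distributions that backport features to older kernels.

```python
import re

# Minimum kernel (major, minor) per sysctl, copied from the list above.
SYSCTL_MIN_KERNEL = {
    "net.ipv4.ip_local_reserved_ports": (3, 16),
    "net.ipv4.tcp_keepalive_time": (4, 5),
    "net.ipv4.tcp_fin_timeout": (4, 6),
    "net.ipv4.tcp_keepalive_intvl": (4, 5),
    "net.ipv4.tcp_keepalive_probes": (4, 5),
    "net.ipv4.tcp_syncookies": (4, 6),
    "net.ipv4.tcp_rmem": (4, 15),
    "net.ipv4.tcp_wmem": (4, 15),
    "net.ipv4.vs.conn_reuse_mode": (4, 1),
}


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def sysctls_below_minimum(release):
    """List sysctls whose documented minimum kernel is newer than `release`."""
    version = parse_kernel_release(release)
    return sorted(
        name for name, minimum in SYSCTL_MIN_KERNEL.items() if version < minimum
    )
```

On a node, you could pass the value of platform.release() from the Python standard library as the release argument.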

kube-proxy `nftables` proxy mode

For Kubernetes 1.36, the nftables mode of kube-proxy requires version 1.0.1 or later of the nft command-line tool, as well as kernel 5.13 or later.

For testing/development purposes, you can use older kernels, as far back as 5.4, if you set the nftables.skipKernelVersionCheck option in the kube-proxy config. However, this is not recommended in production, since it may cause problems with other nftables users on the system.

Version 2 control groups

Kubernetes cgroup v1 support is in maintenance mode starting from Kubernetes v1.31; using cgroup v2 is recommended. In Linux 5.8, the system-level cpu.stat file was added to the root cgroup for convenience.

According to the runc documentation, kernels older than 5.2 are not recommended because they lack the cgroup v2 freezer.

Pressure Stall Information (PSI)

Pressure Stall Information is supported in Linux kernel versions 4.20 and up, but requires the following configuration:

The kernel must be compiled with the CONFIG_PSI=y option. Most modern distributions enable this by default. You can check your kernel's configuration by running zgrep CONFIG_PSI /proc/config.gz.
Some Linux distributions may compile PSI into the kernel but disable it by default. If so, you need to enable it at boot time by adding the psi=1 parameter to the kernel command line.
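The two checks above can be sketched as small text tests. The function names are illustrative, and the inputs are assumed to be the decompressed kernel configuration (for example, the contents of /proc/config.gz where the kernel exposes it) and the kernel command line (for example, the contents of /proc/cmdline).

```python
def psi_compiled_in(kernel_config_text):
    """True if the kernel configuration text contains the line CONFIG_PSI=y."""
    return any(
        line.strip() == "CONFIG_PSI=y" for line in kernel_config_text.splitlines()
    )


def psi_enabled_on_cmdline(cmdline_text):
    """True if the kernel command line carries the psi=1 parameter."""
    return "psi=1" in cmdline_text.split()
```

A missing psi=1 parameter is not necessarily a problem: it only matters on distributions that compile PSI in but leave it disabled by default.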

Other kernel requirements

Some features may depend on new kernel functionalities and have specific kernel requirements:

Recursive read only mount: This is implemented by applying the MOUNT_ATTR_RDONLY attribute with the AT_RECURSIVE flag using mount_setattr(2), which was added in Linux kernel v5.12.
Pod user namespace support requires kernel version 6.5 or later, according to KEP-127.
For node system swap, setting tmpfs to noswap is not supported until kernel 6.3.

Linux kernel long term maintenance

Active kernel releases can be found on kernel.org.

There are usually several long term maintenance kernel releases provided for the purposes of backporting bug fixes for older kernel trees. Only important bug fixes are applied to such kernels and they don't usually see very frequent releases, especially for older trees. See the Linux kernel website for the list of releases in the Longterm category.

What's next

See sysctls for more details.
Read about running kube-proxy in nftables mode.
Read more about cgroup v2.

3 - Articles on dockershim Removal and on Using CRI-compatible Runtimes

This is a list of articles and other pages that are either about Kubernetes' deprecation and removal of dockershim, or about using CRI-compatible container runtimes in connection with that removal.

Kubernetes project

Kubernetes blog: Dockershim Removal FAQ (originally published 2020/12/02)
Kubernetes blog: Updated: Dockershim Removal FAQ (updated version published 2022/02/17)
Kubernetes blog: Kubernetes is Moving on From Dockershim: Commitments and Next Steps (published 2022/01/07)
Kubernetes blog: Dockershim removal is coming. Are you ready? (published 2021/11/12)
Kubernetes documentation: Migrating from dockershim
Kubernetes documentation: Container Runtimes
Kubernetes enhancement proposal: KEP-2221: Removing dockershim from kubelet
Kubernetes enhancement proposal issue: Removing dockershim from kubelet (k/enhancements#2221)

You can provide feedback via the GitHub issue Dockershim removal feedback & issues (k/kubernetes#106917).

External sources

Amazon Web Services EKS documentation: Amazon EKS is ending support for Dockershim
CNCF conference video: Lessons Learned Migrating Kubernetes from Docker to containerd Runtime (Ana Caylin, at KubeCon Europe 2019)
Docker.com blog: What developers need to know about Docker, Docker Engine, and Kubernetes v1.20 (published 2020/12/04)
"Google Open Source" channel on YouTube: Learn Kubernetes with Google - Migrating from Dockershim to Containerd
Microsoft Apps on Azure blog: Dockershim deprecation and AKS (published 2022/01/21)
Mirantis blog: The Future of Dockershim is cri-dockerd (published 2021/04/21)
Mirantis: Mirantis/cri-dockerd Official Documentation
Tripwire: How Dockershim’s Forthcoming Deprecation Affects Your Kubernetes (published 2021/07/01)

4 - Kubelet Pod Info gRPC API

FEATURE STATE: Kubernetes v1.35 [alpha] (disabled by default)

The Kubelet Pod Info gRPC API provides a way for node-local components to query information about pods running on the node directly from the kubelet. This increases reliability by removing the dependency on the Kubernetes API server for node-local information and reduces load on the control plane.

Access to this API is restricted to local admin users (typically root) through file permissions on the UNIX socket.

Endpoint

The API listens on a UNIX socket at: /var/lib/kubelet/pods/kubelet.sock

Note:

This API is not supported on Windows nodes.

Operations

The API provides the following gRPC methods:

`ListPods`

Returns a list of all pods currently managed by the kubelet on the node.

`WatchPods`

Returns a stream of pod updates. Whenever a pod's state changes locally, the kubelet sends the updated pod information through the stream.

`GetPod`

Returns information for a specific pod identified by its UID.

API Definition

The API uses the following protobuf definition:

import "google/protobuf/field_mask.proto";
import "k8s.io/api/core/v1/generated.proto";

service Pods {
    // ListPods returns a list of v1.Pod, optionally filtered by field mask.
    rpc ListPods(PodListRequest) returns (PodListResponse) {}
    // WatchPods returns a stream of Pod updates, optionally filtered by field mask.
    rpc WatchPods(PodWatchRequest) returns (stream PodWatchResponse) {}
    // GetPod returns a v1.Pod for a given pod's UID, optionally filtered by field mask.
    rpc GetPod(PodGetRequest) returns (PodGetResponse) {}
}

message PodListRequest {
    // Optional field mask in the gRPC metadata, to specify which pod fields to return.
}

message PodListResponse {
    repeated v1.Pod pods = 1;
}

message PodWatchRequest {
    // Optional field mask in the gRPC metadata, to specify which pod fields to return.
}

message PodWatchResponse {
    v1.Pod pod = 1;
}

message PodGetRequest {
    string podUID = 1;
    // Optional field mask in the gRPC metadata, to specify which pod fields to return.
}

message PodGetResponse {
    v1.Pod pod = 1;
}

Field selection

The API supports google.protobuf.FieldMask to allow clients to request only the specific fields they need (e.g., status.phase, status.podIPs). This enables lean and efficient data transfer. If no field mask is provided, the full v1.Pod object is returned.
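To illustrate what field selection means, the sketch below applies a list of dotted paths to a Pod represented as a plain Python dictionary. It only models the idea of a field mask. It does not use the protobuf library or the kubelet's implementation, and the function name and sample Pod data are hypothetical.

```python
import copy


def apply_field_mask(obj, paths):
    """Return a new dict holding only the dotted `paths` that exist in `obj`."""
    result = {}
    for path in paths:
        keys = path.split(".")
        # Walk down the source object; skip the path if any segment is missing.
        value = obj
        for key in keys:
            if not isinstance(value, dict) or key not in value:
                break
            value = value[key]
        else:
            # Every segment was found: recreate the parent dicts, then copy the value.
            target = result
            for key in keys[:-1]:
                target = target.setdefault(key, {})
            target[keys[-1]] = copy.deepcopy(value)
    return result
```

With a mask of status.phase and status.podIPs, the result contains only those two fields under status; metadata and every other status field are omitted.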

Reliability and availability

The API serves the most up-to-date information known locally by the kubelet, derived from its internal cache and reconciliation with the container runtime. It remains available even if the node loses connectivity to the Kubernetes control plane.

If the kubelet has recently restarted and its internal state is not yet fully initialized, the API returns a gRPC FAILED_PRECONDITION error.

5 - Node Labels Populated By The Kubelet

Kubernetes nodes come pre-populated with a standard set of labels.

You can also set your own labels on nodes, either through the kubelet configuration or using the Kubernetes API.

Preset labels

The preset labels that Kubernetes sets on nodes are:

kubernetes.io/arch
kubernetes.io/hostname
kubernetes.io/os
node.kubernetes.io/instance-type (if known to the kubelet – Kubernetes may not have this information to set the label)
topology.kubernetes.io/region (if known to the kubelet – Kubernetes may not have this information to set the label)
topology.kubernetes.io/zone (if known to the kubelet – Kubernetes may not have this information to set the label)

Note:

The values of these labels are cloud provider specific and are not guaranteed to be reliable. For example, the value of kubernetes.io/hostname may be the same as the node name in some environments and a different value in other environments.

What's next

See Well-Known Labels, Annotations and Taints for a list of common labels.
Learn how to add a label to a node.

6 - Kubelet Sync Loop

The kubelet is the primary "node agent" that creates and watches Pods on each node. The kubelet runs a sync loop that periodically reconciles the desired state (a Pod spec) with the actual state of the running containers.

Sync Loop: The Sync Loop queues work (aggregated from many sources) for the Pods assigned to its node (where nodeName matches the node). Over the course of each loop, subprocesses called pod workers will attempt to reconcile the desired state of these Pods against the current state of the running containers.
Sync Pod: The majority of the kubelet logic is stored in a suite of functions within the podSyncer interface, including the SyncPod function and its variants (like SyncTerminatingPod and SyncTerminatedPod). During each Sync Loop, a relevant podSyncer function will be executed for each Pod in an attempt to drive its state on the node toward the desired state.
Container Runtime Interface (CRI): To actually run the containers, the kubelet uses the CRI to talk to a container runtime (like containerd or CRI-O). The kubelet acts as the client, instructing the runtime to create a "pod sandbox" and then create/start the individual containers defined in the Pod spec.
PLEG (Pod Lifecycle Event Generator): The kubelet needs to know when containers start, stop, or fail. It relies on a component called PLEG to periodically poll the runtime for the standard state of all containers. PLEG generates events that wake up the Sync Loop to update the Pod status.

Because of this polling mechanism, the status seen in the API (like kubectl get pod) might have a slight delay compared to the instant reality on the node.
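The reconciliation idea behind the sync loop can be shown with a toy example. This is not kubelet code: the function, the action names, and the choice to compare containers only by image are assumptions made for illustration.

```python
def plan_sync(desired, actual):
    """List the actions needed to move one Pod's containers toward the desired state.

    `desired` maps container name -> image, as declared in the Pod spec.
    `actual` maps container name -> image, as reported by the container runtime.
    """
    actions = []
    for name, image in sorted(desired.items()):
        if name not in actual:
            actions.append(("start", name))    # declared but not running
        elif actual[name] != image:
            actions.append(("restart", name))  # running with a different image
    for name in sorted(actual):
        if name not in desired:
            actions.append(("stop", name))     # running but no longer declared
    return actions
```

Calling the function again after the actions have been applied returns an empty list, which is the steady state a reconciliation loop aims for.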

7 - Local Files And Paths Used By The Kubelet

The kubelet is mostly a stateless process running on a Kubernetes node. This document outlines files that kubelet reads and writes.

Note:

This document is for informational purposes only and does not describe any guaranteed behaviors or APIs. It lists resources used by the kubelet, which are implementation details and subject to change in any release.

The kubelet typically uses the control plane as the source of truth on what needs to run on the Node, and the container runtime to retrieve the current state of containers. So long as you provide a kubeconfig (API client configuration) to the kubelet, the kubelet does connect to your control plane; otherwise the node operates in standalone mode.

On Linux nodes, the kubelet also relies on reading cgroups and various system files to collect metrics.

On Windows nodes, the kubelet collects metrics via a different mechanism that does not rely on paths.

The kubelet also uses a few other files, because it communicates using local Unix-domain sockets. Some are sockets that the kubelet listens on; others are sockets that the kubelet discovers and then connects to as a client.

Note:

This page lists paths as Linux paths, which map to the Windows paths by adding a root disk C:\ in place of / (unless specified otherwise). For example, /var/lib/kubelet/device-plugins maps to C:\var\lib\kubelet\device-plugins.
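The mapping rule in this note is mechanical, as the following sketch shows. The function name is illustrative, and the rule does not apply to paths that this page marks as exceptions.

```python
def to_windows_path(linux_path, drive="C:"):
    """Map a Linux-style kubelet path to its Windows form by rooting it at `drive`."""
    if not linux_path.startswith("/"):
        raise ValueError("expected an absolute Linux path")
    # Replace every forward slash with a backslash and prefix the drive letter.
    return drive + linux_path.replace("/", "\\")
```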

Configuration

Kubelet configuration files

The path to the kubelet configuration file can be configured using the command line argument --config. The kubelet also supports drop-in configuration files to enhance configuration.

Certificates

Certificates and private keys are typically located at /var/lib/kubelet/pki, but can be configured using the --cert-dir kubelet command line argument. Names of certificate files are also configurable.

Manifests

Manifests for static pods are typically located in /etc/kubernetes/manifests. Location can be configured using the staticPodPath kubelet configuration option.

Systemd unit settings

When the kubelet runs as a systemd unit, some kubelet configuration may be declared in the systemd unit settings file. Typically this includes:

command line arguments to run the kubelet
environment variables, used by the kubelet or to configure the Go runtime

State

Checkpoint files for resource managers

All resource managers keep the mapping of Pods to allocated resources in state files. State files are located in the kubelet's base directory, also termed the root directory (but not the same as /, the node root directory). You can configure the base directory for the kubelet using the kubelet command line argument --root-dir.

Names of files:

memory_manager_state for the Memory Manager
cpu_manager_state for the CPU Manager
dra_manager_state for DRA

Checkpoint file for device manager

The Device Manager creates checkpoints in the same directory as its socket files: /var/lib/kubelet/device-plugins/. This path is hardcoded and is not relative to the kubelet root directory. The name of the Device Manager checkpoint file is kubelet_internal_checkpoint.

Pod resource checkpoints

FEATURE STATE: Kubernetes v1.35 [stable] (enabled by default)

If a node has enabled the InPlacePodVerticalScaling feature gate, the kubelet stores a local record of allocated and actuated Pod resources. See Resize CPU and Memory Resources assigned to Containers for more details on how these records are used.

Names of files:

allocated_pods_state records the resources allocated to each pod running on the node
actuated_pods_state records the resources that have been accepted by the runtime for each pod running on the node

The files are located within the kubelet base directory (/var/lib/kubelet by default on Linux; configurable using --root-dir).
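The difference between paths that follow --root-dir and the hardcoded device plugin directory can be summarized in a short sketch. The function name is illustrative, and the sketch assumes that the state files sit directly in the base directory rather than in a subdirectory.

```python
import posixpath

# Hardcoded by the kubelet; not relative to the base directory.
DEVICE_PLUGIN_DIR = "/var/lib/kubelet/device-plugins"

# State file names listed in this section.
ROOT_DIR_STATE_FILES = [
    "memory_manager_state",
    "cpu_manager_state",
    "dra_manager_state",
    "allocated_pods_state",
    "actuated_pods_state",
]


def kubelet_state_files(root_dir="/var/lib/kubelet"):
    """Map each state file name to its expected path for a given --root-dir value."""
    paths = {name: posixpath.join(root_dir, name) for name in ROOT_DIR_STATE_FILES}
    # The Device Manager checkpoint ignores root_dir entirely.
    paths["kubelet_internal_checkpoint"] = posixpath.join(
        DEVICE_PLUGIN_DIR, "kubelet_internal_checkpoint"
    )
    return paths
```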

Container runtime

The kubelet communicates with the container runtime using sockets configured via the following configuration parameters:

containerRuntimeEndpoint for runtime operations
imageServiceEndpoint for image management operations

The actual values of those endpoints depend on the container runtime being used.

Device plugins

The kubelet exposes a socket at the path /var/lib/kubelet/device-plugins/kubelet.sock for various Device Plugins to register.

When a device plugin registers itself, it provides its socket path for the kubelet to connect.

The device plugin socket must be in the directory /var/lib/kubelet/device-plugins/. This path is hardcoded and is not relative to the kubelet base directory (root directory). On Linux, this path is always /var/lib/kubelet/device-plugins.

Pod resources API

The Pod Resources API is exposed at the path pod-resources within the kubelet base directory (root directory). On a typical Linux node, this means /var/lib/kubelet/pod-resources.

DRA, CSI, and Device plugins

The kubelet looks for socket files created by device plugins managed via DRA, device manager, or storage plugins, and then attempts to connect to these sockets. The directory that the kubelet looks in is plugins_registry within the kubelet base directory, so on a typical Linux node this means /var/lib/kubelet/plugins_registry.

Note that for device plugins there are two alternative registration mechanisms. Only one should be used for a given plugin.

The types of plugins that can place socket files into that directory are:

CSI plugins
DRA plugins
Device Manager plugins


Graceful node shutdown

FEATURE STATE: Kubernetes v1.21 [beta] (enabled by default)

Graceful node shutdown stores state locally at /var/lib/kubelet/graceful_node_shutdown_state.

Image Pull Records

FEATURE STATE: Kubernetes v1.35 [beta] (enabled by default)

The kubelet stores records of attempted and successful image pulls, and uses them to verify that an image was previously pulled successfully with the same credentials.

These records are cached as files in the image_manager directory within the kubelet base directory. On a typical Linux node, this means /var/lib/kubelet/image_manager. There are two subdirectories of image_manager:

pulling - stores records about images the kubelet is attempting to pull.
pulled - stores records about images that were successfully pulled by the kubelet, along with metadata about the credentials used for the pulls.

See Ensure Image Pull Credential Verification for details.

Security profiles & configuration

Seccomp

Seccomp profile files referenced from Pods should be placed in /var/lib/kubelet/seccomp. See the seccomp reference for details.

AppArmor

The kubelet does not load or refer to AppArmor profiles by a Kubernetes-specific path. AppArmor profiles are loaded via the node operating system rather than referenced by their path.

Locking

FEATURE STATE: Kubernetes v1.2 [alpha]

The kubelet uses a lock file, typically /var/run/kubelet.lock, to ensure that two different kubelets don't try to run in conflict with each other. You can configure the path to the lock file using the --lock-file kubelet command line argument.

If two kubelets on the same node use a different value for the lock file path, they will not be able to detect a conflict when both are running.
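The general lock-file technique, and the reason both processes must agree on the path, can be sketched with an advisory flock(2) lock. This illustrates the technique on Linux or another Unix-like system; it is not a description of the kubelet's own locking code, and the function name is hypothetical.

```python
import fcntl
import os


def try_lock(path):
    """Try to take an exclusive, non-blocking lock on `path`.

    Returns an open file descriptor on success (keep it open to hold the lock),
    or None if another holder already has the lock.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else holds the lock; release our descriptor and report failure.
        os.close(fd)
        return None
    return fd
```

A second call with the same path fails while the first descriptor stays open, whereas a call with a different path succeeds. That is why two processes configured with different lock file paths cannot detect each other.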

What's next

Learn about the kubelet command line arguments.
Review the Kubelet Configuration (v1beta1) reference.

8 - Kubelet Configuration Directory Merging

When you use the kubelet's --config-dir flag to specify a drop-in directory for configuration, the way values are merged depends on their data type.

Here are some examples of how different data types behave during configuration merging:

Structure Fields

There are two types of structure fields in a YAML structure: singular (or a scalar type) and embedded (structures that contain scalar types). The configuration merging process handles the overriding of singular and embedded struct fields to create a resulting kubelet configuration.

For instance, you may want a baseline kubelet configuration for all nodes, but you may want to customize the address and authorization fields. This can be done as follows:

Main kubelet configuration file contents:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
port: 20250
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: "5m"
    cacheUnauthorizedTTL: "30s"
serializeImagePulls: false
address: "192.168.0.1"

Contents of a file in --config-dir directory:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
authorization:
  mode: AlwaysAllow
  webhook:
    cacheAuthorizedTTL: "8m"
    cacheUnauthorizedTTL: "45s"
address: "192.168.0.8"

The resulting configuration will be as follows:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
port: 20250
serializeImagePulls: false
authorization:
  mode: AlwaysAllow
  webhook:
    cacheAuthorizedTTL: "8m"
    cacheUnauthorizedTTL: "45s"
address: "192.168.0.8"

Lists

You can override list (slice) values in the kubelet configuration. However, the entire list is replaced during the merging process. For example, you can override the clusterDNS list as follows:

Main kubelet configuration file contents:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
port: 20250
serializeImagePulls: false
clusterDNS:
  - "192.168.0.9"
  - "192.168.0.8"

Contents of a file in --config-dir directory:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clusterDNS:
  - "192.168.0.2"
  - "192.168.0.3"
  - "192.168.0.5"

The resulting configuration will be as follows:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
port: 20250
serializeImagePulls: false
clusterDNS:
  - "192.168.0.2"
  - "192.168.0.3"
  - "192.168.0.5"

Maps, including Nested Structures

Individual fields in maps, regardless of their value types (boolean, string, etc.), can be selectively overridden. However, for map[string][]string, the entire list associated with a specific field gets overridden. Let's understand this better with an example, particularly on fields like featureGates and staticPodURLHeader:

Main kubelet configuration file contents:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
port: 20250
serializeImagePulls: false
featureGates:
  AllAlpha: false
  MemoryQoS: true
staticPodURLHeader:
  kubelet-api-support:
  - "Authorization: 234APSDFA"
  - "X-Custom-Header: 123"
  custom-static-pod:
  - "Authorization: 223EWRWER"
  - "X-Custom-Header: 456"

Contents of a file in --config-dir directory:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: false
  KubeletTracing: true
  DynamicResourceAllocation: true
staticPodURLHeader:
  custom-static-pod:
  - "Authorization: 223EWRWER"
  - "X-Custom-Header: 345"

The resulting configuration will be as follows:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
port: 20250
serializeImagePulls: false
featureGates:
  AllAlpha: false
  MemoryQoS: false
  KubeletTracing: true
  DynamicResourceAllocation: true
staticPodURLHeader:
  kubelet-api-support:
  - "Authorization: 234APSDFA"
  - "X-Custom-Header: 123"
  custom-static-pod:
  - "Authorization: 223EWRWER"
  - "X-Custom-Header: 345"
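The merge rules shown in the examples on this page (structures and maps merge key by key, while scalars and lists are replaced wholesale) can be modeled with a short recursive function. This is a sketch over already-parsed YAML represented as Python dictionaries, with an illustrative function name. It reproduces the results shown above but is not the kubelet's implementation.

```python
import copy


def merge_config(base, drop_in):
    """Merge a drop-in configuration over a base configuration (both plain dicts)."""
    merged = copy.deepcopy(base)
    for key, value in drop_in.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Structures and maps merge key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale.
            merged[key] = copy.deepcopy(value)
    return merged
```

Applied to the featureGates example, the function keeps AllAlpha from the main file, overrides MemoryQoS, and adds the two new gates. Applied to clusterDNS, it returns only the three addresses from the drop-in file.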

9 - Kubelet Device Manager API Versions

This page provides details of version compatibility between the Kubernetes device plugin API and different versions of Kubernetes itself.

Compatibility matrix

	`v1alpha1`	`v1beta1`
Kubernetes 1.21	-	✓
Kubernetes 1.22	-	✓
Kubernetes 1.23	-	✓
Kubernetes 1.24	-	✓
Kubernetes 1.25	-	✓
Kubernetes 1.26	-	✓

Key:

✓ Exactly the same features / API objects in both device plugin API and the Kubernetes version.
+ The device plugin API has features or API objects that may not be present in the Kubernetes cluster, either because the device plugin API has added new API calls, or because the server has removed an old API call. However, everything they have in common (most other APIs) will work. Note that alpha APIs may vanish or change significantly between one minor release and the next.
- The Kubernetes cluster has features the device plugin API can't use, either because the server has added new API calls, or because the device plugin API has removed an old API call. However, everything they have in common (most APIs) will work.

10 - Kubelet Systemd Watchdog

FEATURE STATE: Kubernetes v1.32 [beta] (enabled by default)

On Linux nodes, Kubernetes 1.36 supports integrating with systemd to allow the operating system supervisor to recover a failed kubelet. This integration is not enabled by default. It can be used as an alternative to periodically requesting the kubelet's /healthz endpoint for health checks. If the kubelet does not respond to the watchdog within the timeout period, the watchdog will kill the kubelet.

The systemd watchdog works by requiring the service to periodically send a keep-alive signal to the systemd process. If the signal is not received within a specified timeout period, the service is considered unresponsive and is terminated. The service can then be restarted according to the configuration.

Configuration

Using the systemd watchdog requires configuring the WatchdogSec parameter in the [Service] section of the kubelet service unit file:

[Service]
WatchdogSec=30s

Setting WatchdogSec=30s indicates a service watchdog timeout of 30 seconds. Within the kubelet, the sd_notify() function is invoked at intervals of \( WatchdogSec \div 2 \) to send WATCHDOG=1 (a keep-alive message). If the watchdog is not fed within the timeout period, the kubelet will be killed. Setting Restart to "always", "on-failure", "on-watchdog", or "on-abnormal" will ensure that the service is automatically restarted.

Some details about the systemd configuration:

If you set the systemd value for WatchdogSec to 0, or omit setting it, the systemd watchdog is not enabled for this unit.
The kubelet supports a minimum watchdog period of 1.0 seconds; this is to prevent the kubelet from being killed unexpectedly. You can set the value of WatchdogSec in a systemd unit definition to a period shorter than 1 second, but Kubernetes does not support any shorter interval. The timeout does not have to be a whole integer number of seconds.
The Kubernetes project suggests setting WatchdogSec to approximately a 15s period. Periods longer than 10 minutes are supported but explicitly not recommended.
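For readers unfamiliar with the systemd notification protocol, the sketch below shows the two pieces described above: deriving the keep-alive interval from WatchdogSec, and sending WATCHDOG=1 to the datagram socket that systemd advertises in the NOTIFY_SOCKET environment variable. It is a generic illustration of the protocol with hypothetical function names, not the kubelet's code.

```python
import os
import socket


def keepalive_interval(watchdog_sec):
    """Seconds between keep-alive messages: half of the watchdog timeout."""
    return watchdog_sec / 2


def notify_watchdog(notify_socket=None):
    """Send WATCHDOG=1 to the systemd notification socket; return True if sent."""
    address = notify_socket or os.environ.get("NOTIFY_SOCKET")
    if not address:
        return False  # no notification socket, so not supervised by systemd
    if address.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        address = b"\0" + address[1:].encode()
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"WATCHDOG=1", address)
    return True
```

With WatchdogSec=30s, a service following this pattern would call notify_watchdog() every 15 seconds.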

Example Configuration

[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=https://kubernetes.io/docs/home/
Wants=network-online.target
After=network-online.target

[Service]
ExecStart=/usr/bin/kubelet
# Configures the watchdog timeout
WatchdogSec=30s
Restart=on-failure
StartLimitInterval=0
RestartSec=10

[Install]
WantedBy=multi-user.target

What's next

For more details about systemd configuration, refer to the systemd documentation.

11 - Node Status

The status of a node in Kubernetes is a critical aspect of managing a Kubernetes cluster. In this article, we'll cover the basics of monitoring and maintaining node status to ensure a healthy and stable cluster.

Node status fields

A Node's status contains the following information:

Addresses
Conditions
Capacity and Allocatable
Info
Declared Features

You can use kubectl to view a Node's status and other details:

kubectl describe node <insert-node-name-here>

Each section of the output is described below.

Addresses

The usage of these fields varies depending on your cloud provider or bare metal configuration.

HostName: The hostname as reported by the node's kernel. Can be overridden via the kubelet --hostname-override parameter.
ExternalIP: Typically the IP address of the node that is externally routable (available from outside the cluster).
InternalIP: Typically the IP address of the node that is routable only within the cluster.

Conditions

The conditions field describes the status of all Running nodes. Examples of conditions include:

Node conditions, and a description of when each condition applies.
Node Condition	Description
`Ready`	`True` if the node is healthy and ready to accept pods, `False` if the node is not healthy and is not accepting pods, and `Unknown` if the node controller has not heard from the node in the last `node-monitor-grace-period` (default is 50 seconds)
`DiskPressure`	`True` if pressure exists on the disk size—that is, if the disk capacity is low; otherwise `False`
`MemoryPressure`	`True` if pressure exists on the node memory—that is, if the node memory is low; otherwise `False`
`PIDPressure`	`True` if pressure exists on the processes—that is, if there are too many processes on the node; otherwise `False`
`NetworkUnavailable`	`True` if the network for the node is not correctly configured, otherwise `False`

Note:

If you use command-line tools to print details of a cordoned Node, the Condition includes SchedulingDisabled. SchedulingDisabled is not a Condition in the Kubernetes API; instead, cordoned nodes are marked Unschedulable in their spec.

In the Kubernetes API, a node's condition is represented as part of the .status of the Node resource. For example, the following JSON structure describes a healthy node:

"conditions": [
  {
    "type": "Ready",
    "status": "True",
    "reason": "KubeletReady",
    "message": "kubelet is posting ready status",
    "lastHeartbeatTime": "2019-06-05T18:38:35Z",
    "lastTransitionTime": "2019-06-05T11:41:27Z"
  }
]
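
If you process Node objects in a script, you can look up a condition by its type. The following Python sketch shows one way to do that; the helper name and the sample data are illustrative only.

```python
import json

NODE_STATUS_JSON = """
{
  "conditions": [
    {"type": "MemoryPressure", "status": "False"},
    {"type": "Ready", "status": "True", "reason": "KubeletReady"}
  ]
}
"""


def condition_status(node_status, condition_type):
    """Return the status string for a condition type, or None if absent."""
    for condition in node_status.get("conditions", []):
        if condition.get("type") == condition_type:
            return condition.get("status")
    return None


status = json.loads(NODE_STATUS_JSON)
print(condition_status(status, "Ready"))
```

The helper returns None when the condition is not listed, which is different from a condition whose status is the string "Unknown".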

When problems occur on nodes, the Kubernetes control plane automatically creates taints that match the conditions affecting the node. An example of this is when the status of the Ready condition remains Unknown or False for longer than the kube-controller-manager's NodeMonitorGracePeriod, which defaults to 50 seconds. This causes either a node.kubernetes.io/unreachable taint (for an Unknown status) or a node.kubernetes.io/not-ready taint (for a False status) to be added to the Node.

These taints affect pending pods as the scheduler takes the Node's taints into consideration when assigning a pod to a Node. Existing pods scheduled to the node may be evicted due to the application of NoExecute taints. Pods may also have tolerations that let them schedule to and continue running on a Node even though it has a specific taint.

See Taint Based Evictions and Taint Nodes by Condition for more details.
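
The mapping from the Ready condition to a taint key, as described above, can be written as a short Python sketch. The function name is hypothetical, and the sketch does not model the grace period or the taint effect.

```python
NOT_READY_TAINT = "node.kubernetes.io/not-ready"
UNREACHABLE_TAINT = "node.kubernetes.io/unreachable"


def taint_for_ready_status(ready_status):
    """Map a Ready condition status to the taint key described above."""
    if ready_status == "Unknown":
        return UNREACHABLE_TAINT
    if ready_status == "False":
        return NOT_READY_TAINT
    return None  # "True": no taint is derived from the Ready condition


for value in ("True", "False", "Unknown"):
    print(value, taint_for_ready_status(value))
```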

Capacity and Allocatable

Describes the resources available on the node: CPU, memory, and the maximum number of pods that can be scheduled onto the node.

The fields in the capacity block indicate the total amount of resources that a Node has. The allocatable block indicates the amount of resources on a Node that is available to be consumed by normal Pods.

You may read more about capacity and allocatable resources while learning how to reserve compute resources on a Node.
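
To see how much of a resource is not available to normal Pods, you can subtract allocatable from capacity. The Python sketch below does this for a hypothetical node; the sample values are invented, and parse_quantity is a simplified helper that handles only plain numbers, the m suffix, and the Ki, Mi, Gi, and Ti suffixes.

```python
from decimal import Decimal

_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(text):
    """Parse a small subset of Kubernetes quantities into a Decimal."""
    if text.endswith("m"):  # milli-units, for example "3800m" CPU
        return Decimal(text[:-1]) / 1000
    for suffix, factor in _BINARY_SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


def unavailable_to_pods(capacity, allocatable):
    """Return capacity minus allocatable for each resource in both blocks."""
    return {
        name: parse_quantity(capacity[name]) - parse_quantity(allocatable[name])
        for name in capacity
        if name in allocatable
    }


# Hypothetical values, not taken from a real node.
capacity = {"cpu": "4", "memory": "8Gi", "pods": "110"}
allocatable = {"cpu": "3800m", "memory": "7Gi", "pods": "110"}
print(unavailable_to_pods(capacity, allocatable))
```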

Info

Describes general information about the node, such as kernel version, Kubernetes version (kubelet and kube-proxy version), container runtime details, and which operating system the node uses. The kubelet gathers this information from the node and publishes it into the Kubernetes API.

Declared features

FEATURE STATE: Kubernetes v1.36 [beta] (enabled by default)

This field lists specific Kubernetes features that are currently enabled on the node's kubelet via feature gates. The features are reported by the kubelet as a list of strings in the .status.declaredFeatures field of the Node object.

This field is intended for newer features under active development; features that have graduated and no longer require a feature gate are considered baseline and are not declared in this field. This reflects the enablement of Kubernetes features, not the underlying operating system or kernel capabilities of the node.

See Node Declared Features for more details.

Heartbeats

Heartbeats, sent by Kubernetes nodes, help your cluster determine the availability of each node and take action when failures are detected.

For nodes there are two forms of heartbeats:

updates to the .status of a Node
Lease objects within the kube-node-lease namespace. Each Node has an associated Lease object.

Compared to updates to .status of a Node, a Lease is a lightweight resource. Using Leases for heartbeats reduces the performance impact of these updates for large clusters.

The kubelet is responsible for creating and updating the .status of Nodes, and for updating their related Leases.

The kubelet updates the node's .status either when there is a change in status or if there has been no update for a configured interval. The default interval for .status updates to Nodes is 5 minutes, which is much longer than the 50 second default timeout for unreachable nodes.
The kubelet creates and then updates its Lease object every 10 seconds (the default update interval). Lease updates occur independently from updates to the Node's .status. If the Lease update fails, the kubelet retries, using exponential backoff that starts at 200 milliseconds and is capped at 7 seconds.
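
The retry schedule for a failed Lease update can be illustrated with a short Python sketch. The 200 millisecond start and 7 second cap come from the text above; the doubling factor and the function name are assumptions made for illustration, not a statement about the kubelet's implementation.

```python
def lease_retry_delays(attempts, initial=0.2, cap=7.0, factor=2.0):
    """Return the delay in seconds before each retry, capped at `cap`."""
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(min(delay, cap))
        delay *= factor  # assumed growth factor
    return delays


print(lease_retry_delays(8))
```

With these assumptions, the delay reaches the cap on the seventh retry and stays there.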

12 - Seccomp and Kubernetes

Seccomp stands for secure computing mode and has been a feature of the Linux kernel since version 2.6.12. It can be used to sandbox the privileges of a process, restricting the calls it is able to make from userspace into the kernel. Kubernetes lets you automatically apply seccomp profiles loaded onto a node to your Pods and containers.

Seccomp fields

FEATURE STATE: Kubernetes v1.19 [stable]

There are four ways to specify a seccomp profile for a pod:

for the whole Pod using spec.securityContext.seccompProfile
for a single container using spec.containers[*].securityContext.seccompProfile
for an init container (including a restartable sidecar container) using spec.initContainers[*].securityContext.seccompProfile
for an ephemeral container using spec.ephemeralContainers[*].securityContext.seccompProfile

pods/security/seccomp/fields.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  securityContext:
    seccompProfile:
      type: Unconfined
  # NOTE: ephemeralContainers cannot be specified when creating a Pod.
  # It can be specified only when updating a Pod.
  ephemeralContainers:
  - name: ephemeral-container
    image: debian
    securityContext:
      seccompProfile:
        type: RuntimeDefault
  initContainers:
  - name: init-container
    image: debian
    securityContext:
      seccompProfile:
        type: RuntimeDefault
  containers:
  - name: container
    image: docker.io/library/debian:stable
    securityContext:
      seccompProfile:
        type: Localhost
        localhostProfile: my-profile.json

The Pod in the example above runs as Unconfined, while the ephemeral-container and init-container each explicitly define RuntimeDefault. If the ephemeral or init container had not set the securityContext.seccompProfile field explicitly, the value would have been inherited from the Pod. The same applies to the container, which runs a Localhost profile my-profile.json.

Generally speaking, fields from (ephemeral) containers have a higher priority than the Pod level value, while containers which do not set the seccomp field inherit the profile from the Pod.
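
The precedence rule above can be sketched in Python as follows. The function and container names are hypothetical, and the sketch does not model privileged containers, which always run as Unconfined.

```python
def effective_seccomp_profile(pod_spec, container):
    """Return the container's own profile if set, else the Pod-level one."""
    own = container.get("securityContext", {}).get("seccompProfile")
    if own is not None:
        return own
    return pod_spec.get("securityContext", {}).get("seccompProfile")


POD_SPEC = {
    "securityContext": {"seccompProfile": {"type": "Unconfined"}},
    "containers": [
        {
            "name": "explicit",
            "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
        },
        {"name": "inherits"},
    ],
}

for container in POD_SPEC["containers"]:
    profile = effective_seccomp_profile(POD_SPEC, container)
    print(container["name"], profile["type"])
```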

Note:

It is not possible to apply a seccomp profile to a Pod or container running with privileged: true set in the container's securityContext. Privileged containers always run as Unconfined.

The following values are possible for the seccompProfile.type:

Unconfined: The workload runs without any seccomp restrictions.
RuntimeDefault: A default seccomp profile defined by the container runtime is applied. The default profiles aim to provide a strong set of security defaults while preserving the functionality of the workload. It is possible that the default profiles differ between container runtimes and their release versions, for example when comparing those from CRI-O and containerd.
Localhost: The localhostProfile will be applied, which has to be available on the node disk (on Linux it's /var/lib/kubelet/seccomp). The availability of the seccomp profile is verified by the container runtime on container creation. If the profile does not exist, then the container creation will fail with a CreateContainerError.
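
As an illustration of where a Localhost profile is expected on a Linux node, the Python sketch below joins the localhostProfile value onto the directory named above. Treating the value as a path relative to that directory, and rejecting absolute paths, are assumptions of this sketch; the function name is hypothetical.

```python
import posixpath

SECCOMP_ROOT = "/var/lib/kubelet/seccomp"


def localhost_profile_path(localhost_profile, root=SECCOMP_ROOT):
    """Return the on-disk path where the profile is expected to be."""
    # Assumption: the value is relative to the seccomp directory.
    if posixpath.isabs(localhost_profile):
        raise ValueError("localhostProfile is expected to be a relative path")
    return posixpath.join(root, localhost_profile)


print(localhost_profile_path("my-profile.json"))
```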

`Localhost` profiles

Seccomp profiles are JSON files following the scheme defined by the OCI runtime specification. A profile defines actions based on matched syscalls, and can also match on specific values passed as arguments to those syscalls. For example:

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 38,
  "syscalls": [
    {
      "names": [
        "adjtimex",
        "alarm",
        "bind",
        "waitid",
        "waitpid",
        "write",
        "writev"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

The defaultAction in the profile above is defined as SCMP_ACT_ERRNO and applies as a fallback to any syscall that is not matched by the actions defined in syscalls. The error is defined as code 38 via the defaultErrnoRet field.
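
The fallback behavior can be sketched in Python. This is a simplified model for reading profiles, not how the kernel or an OCI runtime evaluates them: it matches on syscall names only, ignores args, and returns the first matching entry. The function name is hypothetical and the profile is a shortened version of the example above.

```python
import json

PROFILE = json.loads("""
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 38,
  "syscalls": [
    {"names": ["bind", "write", "writev"], "action": "SCMP_ACT_ALLOW"}
  ]
}
""")


def action_for(profile, syscall_name):
    """Return (action, errno_or_None) for a syscall name."""
    for rule in profile.get("syscalls", []):
        if syscall_name in rule.get("names", []):
            action = rule["action"]
            errno = rule.get("errnoRet") if action == "SCMP_ACT_ERRNO" else None
            return action, errno
    # No entry matched, so fall back to the profile-wide default.
    action = profile["defaultAction"]
    errno = profile.get("defaultErrnoRet") if action == "SCMP_ACT_ERRNO" else None
    return action, errno


print(action_for(PROFILE, "write"))
print(action_for(PROFILE, "mount"))
```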

The following actions are generally possible:

SCMP_ACT_ERRNO: Return the specified error code.
SCMP_ACT_ALLOW: Allow the syscall to be executed.
SCMP_ACT_KILL_PROCESS: Kill the process.
SCMP_ACT_KILL_THREAD and SCMP_ACT_KILL: Kill only the thread.
SCMP_ACT_TRAP: Throw a SIGSYS signal.
SCMP_ACT_NOTIFY and SECCOMP_RET_USER_NOTIF: Notify the user space.
SCMP_ACT_TRACE: Notify a tracing process with the specified value.
SCMP_ACT_LOG: Allow the syscall to be executed after the action has been logged to syslog or auditd.

Some actions like SCMP_ACT_NOTIFY or SECCOMP_RET_USER_NOTIF may not be supported, depending on the container runtime, OCI runtime, or Linux kernel version being used. There may also be further limitations, for example that SCMP_ACT_NOTIFY cannot be used as defaultAction or for certain syscalls like write. All those limitations are defined by either the OCI runtime (runc, crun) or libseccomp.

The syscalls JSON array contains a list of objects referencing syscalls by their respective names. For example, the action SCMP_ACT_ALLOW can be used to create a whitelist of allowed syscalls as outlined in the example above. It would also be possible to define another list using the action SCMP_ACT_ERRNO but a different return (errnoRet) value.

It is also possible to specify the arguments (args) passed to certain syscalls. More information about those advanced use cases can be found in the OCI runtime spec and the Seccomp Linux kernel documentation.

13 - What Happens After A Node Restart

System components on a node sometimes restart, either because of an upgrade, a crash, or an explicit operator action. This page describes what happens to Pods and to the node when the kubelet, the container runtime, or the node as a whole restarts.

In a healthy cluster these restarts are usually safe and do not break running workloads. The sections below describe the effects to be aware of, which become more pronounced on large or heavily loaded nodes. The most disruptive case is a node reboot, which encompasses both a container runtime restart and a kubelet restart, but with more consequences because every container on the node stops first.

Impact of a kubelet restart

If only the kubelet restarts, the containers that are already running continue to run. The kubelet re-establishes its view of the Node, and reconciles the running containers against the desired state. During this period of time, the following happens:

The kubelet re-initializes and re-synchronizes its caches, which produces a burst of requests to the API server. On large nodes with many Pods this burst can be significant.
The node is temporarily reported as NotReady until the kubelet finishes initializing. While the node is NotReady, the scheduler does not place new Pods on it.
Node heartbeats pause while the kubelet is down and resume once it has restarted and finished initializing, when the kubelet renews its Lease object and posts node status again.
The kubelet preserves the readiness of running containers across a restart. Each Pod's readiness drives EndpointSlices, Endpoints, and downstream configuration (such as Gateways or Ingresses); this means that resetting container readiness on every restart would place a large load on the API server and on components that watch endpoint state, and could briefly remove healthy Pods from Service load balancing. This behavior is described in KEP-4781: Fix inconsistent container ready state after kubelet restart. Resetting container readiness to false on every restart was the default behavior for a long time. The ChangeContainerStatusOnKubeletRestart feature gate lets you revert to that behavior, but it is a deprecated legacy escape hatch that is slated for removal, so you should not rely on it. For more detail, see Pod behavior during kubelet restarts.
During the initial kubelet startup, garbage collection of unused images and containers, and Pod evictions driven by node-pressure, are paused. This pause continues for a short grace period after the kubelet has completed its main startup routines. This delay can slow the node's reaction to memory or disk pressure.
Ongoing image pulls are cancelled. Depending on the container runtime, a cancelled pull may have to start over from the beginning when it is retried.
Pod admission runs again for the Pods on the node as the kubelet replays them through its admission checks. If the node's labels or taints have changed while the kubelet was down, a Pod can fail admission and be rejected even though it was already running. This is an existing behavior, and whether it should be considered a bug is still debated; see kubernetes/kubernetes#123859 for the discussion and details.

Overall, in a healthy cluster a kubelet restart does not break running workloads. On large clusters with overcommitted nodes, however, the re-initialization load and the paused garbage collection and eviction can contribute to system instability.

Kubernetes does not define the behavior of your container runtime if you restart it. Depending on the container runtime you use, a restart may trigger a stop or restart for all local containers. However, most container runtimes used with Kubernetes use a configuration that allows you to restart the runtime and leave containers executing.

Impact of a container runtime restart

When the container runtime (such as containerd or CRI-O) restarts, the kubelet loses its connection to the runtime until it comes back. During this window:

exec probes fail for the duration of the restart, because the kubelet cannot run commands inside containers. With a short timeout and failure threshold, a failing liveness probe can cause a container to be restarted, and a failing readiness probe can cause the Pod to flap out of the Ready state.
The node is reported as NotReady by the kubelet, which blocks scheduling of new Pods onto the node.
Container operations such as restarts, initialization, and status updates are delayed until the runtime is available again.
If an init container was executing when the runtime restarted, its execution state can be lost, in which case the init container runs again.
In rare cases, interrupting an operation at a precise moment can leave state inconsistent:
- An interrupted image pull may leave inconsistent image layers, which can render the image unusable until it is pulled again.
- An interrupted sandbox creation, if it is terminated in the middle of a CNI or NRI call, may leave the sandbox in an inconsistent state, with CNI only partially initialized and the possibility of a resource leak.
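
To estimate whether a runtime restart of a given length could trip an exec liveness probe, you can count how many probe runs might fall inside the outage. The Python sketch below is a rough model: it assumes probes run at a fixed period and fail immediately while the runtime is down. The function names are hypothetical, and the defaults of 10 seconds and 3 failures are taken here as the usual probe defaults for periodSeconds and failureThreshold.

```python
import math


def worst_case_failed_probes(outage_seconds, period_seconds):
    """Upper bound on probe runs that can fall inside the outage window."""
    return math.floor(outage_seconds / period_seconds) + 1


def liveness_restart_possible(outage_seconds, period_seconds=10, failure_threshold=3):
    """True if the outage can contain enough consecutive failed probes."""
    failed = worst_case_failed_probes(outage_seconds, period_seconds)
    return failed >= failure_threshold


print(liveness_restart_possible(5))
print(liveness_restart_possible(25))
print(liveness_restart_possible(2, period_seconds=1, failure_threshold=1))
```

Under this model, a longer period or a higher failure threshold gives exec probes more room to ride out a short runtime restart.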

Interrupting an operation at a precise moment is a low-probability situation, so restarting a container runtime is generally a safe operation. On a heavily loaded node, where every operation is slower, the window for interrupting a critical operation is larger and the probability of hitting one of these edge cases increases.

Impact of a node reboot

A node reboot is the most disruptive of these events, because every container on the node stops. A reboot encompasses both a container runtime restart and a kubelet restart, but with more consequences: where a standalone kubelet or runtime restart leaves the already-running containers in place, a reboot stops every container first. After the node boots, the kubelet and container runtime start again with no containers actually running.

Before a planned reboot you can reduce the impact by cordoning the node, so the scheduler stops placing new Pods on it, and then draining it to evict the existing Pods gracefully. When graceful node shutdown is enabled, the kubelet also attempts to stop running Pods cleanly when it detects that the node is shutting down.

When the node comes back:

The reboot stops all containers, and the kubelet recreates them when the node comes back. If the node stays down longer than the configured toleration period described below, only Pods managed by a controller (such as a Deployment, StatefulSet, or DaemonSet) get a replacement Pod. The replacement Pod might schedule onto a different node. Standalone Pods (without another object or controller managing them) are not recreated after deletion.
The node renews the lease and reconciles its status. It is reported as NotReady until the kubelet, container runtime, and network are ready. While the node is NotReady, the node may be tainted with node.kubernetes.io/not-ready, and after the configured toleration period the control plane can evict Pods that do not tolerate it.
The kubelet re-runs admission for the Pods assigned to the node, so the label and taint considerations described under kubelet restart apply here as well.
For Pods that request devices, the kubelet calls the relevant device plugin again to confirm the device allocations for the Pods that are being restored on the node. The device plugin must re-register with the kubelet after the reboot so that these allocations can be reconciled.
Local storage tied to the lifetime of a container or Pod can be lost. A container's writable layer is discarded when the container is recreated, so data written there does not survive the reboot. An emptyDir volume lasts as long as the Pod stays on the node: a memory-backed emptyDir (medium: Memory) is always lost on reboot because it is held in RAM, while a disk-backed emptyDir survives a reboot as long as the Pod is not evicted or deleted, and is removed only when the Pod leaves the node.
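
The storage rules in the last item above can be summarized as a small Python sketch. The storage labels and the function name are hypothetical and exist only for this illustration.

```python
def survives_reboot(storage, pod_stays_on_node=True):
    """Return True if data in the given storage is kept across a node reboot."""
    if storage == "container-writable-layer":
        return False  # discarded when the container is recreated
    if storage == "emptyDir-memory":
        return False  # held in RAM, so always lost on reboot
    if storage == "emptyDir-disk":
        return pod_stays_on_node  # kept unless the Pod is evicted or deleted
    raise ValueError(f"unknown storage kind: {storage}")


for kind in ("container-writable-layer", "emptyDir-memory", "emptyDir-disk"):
    print(kind, survives_reboot(kind))
```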

For workloads that must tolerate node reboots, run Pods through a controller, use persistent volumes for data that must survive, and configure disruption budgets and probes so that traffic is only sent to Pods once they are ready.

What's next

Learn about the kubelet's sync loop.
Read about Pod lifecycle.
Read about node-pressure eviction.
Learn how to safely drain a node.

14 - Linux Node Swap Behaviors

To allow Kubernetes workloads to use swap on a Linux node, you must disable the kubelet's default behavior of failing when swap is detected, and specify the memory-swap behavior as LimitedSwap.

The available choices for swap behavior are:

NoSwap: (default) Workloads running as Pods on this node do not and cannot use swap. However, processes outside of Kubernetes' scope, such as system daemons (including the kubelet itself!) can utilize swap. This behavior is beneficial for protecting the node from system-level memory spikes, but it does not safeguard the workloads themselves from such spikes.
LimitedSwap: Kubernetes workloads can utilize swap memory. The amount of swap available to a Pod is determined automatically.
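
The two settings can be checked together with a short Python sketch that inspects a kubelet configuration loaded as a dictionary. The field names failSwapOn and memorySwap.swapBehavior are assumed here to be the kubelet configuration fields that correspond to the settings described above, and the function name is hypothetical.

```python
def swap_enabled_for_workloads(kubelet_config):
    """Return True if Pods on this node are allowed to use swap."""
    # Assumed defaults: the kubelet fails when swap is on, and NoSwap applies.
    fail_swap_on = kubelet_config.get("failSwapOn", True)
    behavior = kubelet_config.get("memorySwap", {}).get("swapBehavior", "NoSwap")
    return (not fail_swap_on) and behavior == "LimitedSwap"


print(swap_enabled_for_workloads({}))
print(swap_enabled_for_workloads(
    {"failSwapOn": False, "memorySwap": {"swapBehavior": "LimitedSwap"}}
))
```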

To learn more, read swap memory management.
