SELinux Volume Label Changes Go GA (and likely implications in v1.37)
If you run Kubernetes on Linux with SELinux in enforcing mode, plan ahead: a future release (anticipated to be v1.37) is
expected to turn the SELinuxMount feature gate on by default. This makes volume setup faster
for most workloads, but it can break applications that still depend on the older recursive relabeling
model in subtle ways (for example, sharing one volume between privileged and unprivileged Pods on the same node).
Kubernetes v1.36 is the right release to audit your cluster and fix or opt out of this change.
If your nodes do not use SELinux, nothing changes for you: the kubelet skips the whole SELinux logic when SELinux is unavailable or disabled in the Linux kernel. You can skip this article completely.
This blog builds on the earlier work described in the
Kubernetes 1.27: Efficient SELinux Relabeling (Beta)
post, where the SELinuxMountReadWriteOncePod feature gate was described. The problem being addressed remains
the same; this post extends that approach to all volumes.
The problem
Linux systems with Security Enhanced Linux (SELinux) enabled use labels attached to objects
(for example, files and network sockets) to make access control decisions.
Historically, the container runtime has applied SELinux labels to a Pod and all its volumes. Kubernetes only passes the SELinux label from a Pod's securityContext fields
to the container runtime.
The container runtime then recursively changes the SELinux label on all files that are visible to the Pod's containers. This can be time-consuming if there are many files on the volume, especially when the volume is on a remote filesystem.
Caution:
If a container uses a subPath of a volume, only that subPath of the whole
volume is relabeled. This allows two Pods that have two different SELinux labels
to use the same volume, as long as they use different subPaths of it.

If a Pod does not have any SELinux label assigned in the Kubernetes API, the container runtime assigns a unique random label, so a process that potentially escapes the container boundary cannot access data of any other container on the host. The container runtime still recursively relabels all Pod volumes with this random SELinux label.
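As an illustration, the subPath carve-out described above might look like the following sketch. All names (Pods, PVC, levels) are hypothetical, and the image is only a placeholder:

```yaml
# Two Pods with different SELinux levels share one PVC, but each mounts a
# different subPath, so the runtime relabels only that Pod's directory.
apiVersion: v1
kind: Pod
metadata:
  name: pod-a                    # hypothetical name
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c10,c11"        # hypothetical level
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9   # placeholder image
      volumeMounts:
        - name: shared
          mountPath: /data
          subPath: pod-a         # only this directory gets pod-a's label
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: shared-pvc    # hypothetical PVC
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-b
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c20,c21"        # a different level than pod-a
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
      volumeMounts:
        - name: shared
          mountPath: /data
          subPath: pod-b         # a different subPath than pod-a
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: shared-pvc
```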
What Kubernetes is improving
Where the stack supports it, the kubelet can mount the volume with -o context=<label> so the kernel
applies the correct label for all inodes on that mount without a recursive inode traversal. That path is
gated by feature flags and requires, among other things, that the Pod expose enough of an SELinux
label (for example spec.securityContext.seLinuxOptions.level) and that the volume driver opts in (for CSI,
CSIDriver field spec.seLinuxMount: true).
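For CSI volumes, the driver's opt-in is a single field on its CSIDriver object. A minimal sketch, with a hypothetical driver name:

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example.csi.vendor.com   # hypothetical driver name
spec:
  # Declares that the driver supports the -o context=<label> mount option,
  # allowing the kubelet to skip recursive relabeling for its volumes.
  seLinuxMount: true
```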
The project rolled this out in phases:
- ReadWriteOncePod volumes were handled under the SELinuxMountReadWriteOncePod feature gate, on by default since v1.28 and GA in v1.36.
- Broader coverage was handled under the SELinuxMount flag, paired with the spec.securityContext.seLinuxChangePolicy field on Pods.
If a Pod and its volume meet all of the following conditions, Kubernetes will mount the volume directly with the right SELinux label. Such a mount will happen in a constant time and the container runtime will not need to recursively relabel any files on it. For such a mount to happen:
- The operating system must support SELinux. Without SELinux support detected, the kubelet and the container runtime do not do anything with regard to SELinux.
- The feature gate SELinuxMountReadWriteOncePod must be enabled. If you're running Kubernetes v1.36, the feature is enabled unconditionally.
- The Pod must use a PersistentVolumeClaim with applicable accessModes:
  - Either the volume has accessModes: ["ReadWriteOncePod"],
  - or the volume can use any other access mode(s), provided that the feature gates SELinuxChangePolicy and SELinuxMount are both enabled and the Pod has spec.securityContext.seLinuxChangePolicy set to nil (default) or MountOption.

  The feature gate SELinuxMount is Beta and disabled by default in Kubernetes v1.36. All other SELinux-related feature gates are now General Availability (GA). With any of these feature gates disabled, SELinux labels will always be applied by the container runtime by recursively traversing the volume (or its subPaths).
- The Pod must have at least seLinuxOptions.level assigned in its security context, or all containers in that Pod must have it set in their container-level security contexts. Kubernetes will read the default user, role, and type from the operating system defaults (typically system_u, system_r, and container_t). Without Kubernetes knowing at least the SELinux level, the container runtime will assign a random level after the volumes are mounted, and will still relabel the volumes recursively in that case.
- The volume plugin or the CSI driver responsible for the volume must support mounting with SELinux mount options:
  - These in-tree volume plugins support mounting with SELinux mount options: fc and iscsi.
  - CSI drivers that support mounting with SELinux mount options must declare this capability in their CSIDriver instance by setting the seLinuxMount field.

  Volumes managed by other volume plugins or CSI drivers that do not set seLinuxMount: true will be recursively relabeled by the container runtime.
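Putting the conditions together, a Pod that qualifies for the mount-option path could look like this sketch. The Pod name, PVC name, and level are hypothetical, and the image is only a placeholder:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: selinux-mount-demo       # hypothetical name
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c123,c456"      # condition: at least the level is set
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9   # placeholder image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        # Hypothetical PVC; with accessModes: ["ReadWriteOncePod"] and a CSI
        # driver that sets seLinuxMount: true, the volume is mounted with
        # -o context=<label> instead of being recursively relabeled.
        claimName: demo-pvc
```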
The breaking change
The SELinuxMount feature gate changes what volumes can be shared among multiple Pods in a subtle way.
Both of these cases work with recursive relabeling:
- Two Pods with different SELinux labels share the same volume, but each of them uses a different subPath of the volume.
- A privileged Pod and an unprivileged Pod share the same volume.
Neither scenario works with the new mount-based labeling behavior when SELinux is active. Instead, one of these Pods will be stuck in ContainerCreating until the other Pod is terminated.
The first case is very niche and hasn't been seen in practice.
The second case, although still quite rare, has been observed in real applications.
Kubernetes v1.36 offers metrics and events to identify these Pods and allows cluster administrators to opt out of the
mount option through the Pod field spec.securityContext.seLinuxChangePolicy.
seLinuxChangePolicy
The new Pod field spec.securityContext.seLinuxChangePolicy specifies how the SELinux label is applied to all Pod volumes.
In Kubernetes v1.36, this field is part of the stable Pod API.
There are three choices available:
- Field not set (default): in Kubernetes v1.36, the behavior depends on whether the SELinuxMount feature gate is enabled. By default that feature gate is not enabled, and the SELinux label is applied recursively. If you enable that feature gate in your cluster, and all other conditions are met, the label will be applied using the mount option.
- Recursive: the SELinux label is applied recursively. This opts out from using the mount option.
- MountOption: the SELinux label is applied using the mount option, if all other conditions are met. This choice is available only when the SELinuxMount feature gate is enabled.
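The opt-out is a one-line change on the Pod. A minimal sketch, with a hypothetical Pod name and a placeholder image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: legacy-app               # hypothetical name
spec:
  securityContext:
    # Opt out of the SELinux mount option; the container runtime keeps
    # applying labels by recursive relabeling, as before.
    seLinuxChangePolicy: Recursive
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9   # placeholder image
```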
SELinux warning controller (optional)
Kubernetes v1.36 provides a new controller within the control plane, selinux-warning-controller.
This controller runs within kube-controller-manager.
To use it, pass --controllers=*,selinux-warning-controller on the kube-controller-manager command line;
you must also not have explicitly disabled the SELinuxChangePolicy feature gate.
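On clusters where kube-controller-manager runs as a static Pod, enabling the controller is a one-flag change. A sketch of the relevant fragment (other flags omitted; paths and image vary by distribution):

```yaml
# Fragment of a kube-controller-manager static Pod manifest (sketch only).
spec:
  containers:
    - name: kube-controller-manager
      command:
        - kube-controller-manager
        # Keep all default controllers and additionally enable the
        # opt-in selinux-warning-controller.
        - --controllers=*,selinux-warning-controller
        # ...your existing flags stay unchanged...
```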
The controller watches all Pods in the cluster and emits an Event when it finds two Pods that share the same
volume in a way that is not compatible with the SELinuxMount feature gate.
All such conflicting Pods will receive an event, such as:
SELinuxLabel "system_u:system_r:container_t:s0:c98,c99" conflicts with pod my-other-pod that uses the same volume as this pod with SELinuxLabel "system_u:system_r:container_t:s0:c0,c1". If both pods land on the same node, only one of them may access the volume.
The actual Pod name may be censored when the conflicting Pods run in different namespaces to prevent leaking information across namespace boundaries.
The controller reports such an event even when these Pods don't run on the same node, to make sure all Pods work regardless of the Kubernetes scheduler decision. They could run on the same node next time.
In addition, the controller emits the metric selinux_warning_controller_selinux_volume_conflict that lists all current conflicts among Pods.
The metric has labels that identify the conflicting Pods and their SELinux labels, such as:
selinux_warning_controller_selinux_volume_conflict{pod1_name="my-other-pod",pod1_namespace="default",pod1_value="system_u:object_r:container_file_t:s0:c0,c1",pod2_name="my-pod",pod2_namespace="default",pod2_value="system_u:object_r:container_file_t:s0:c0,c2",property="SELinuxLabel"} 1
There is a security consequence from enabling this opt-in controller: it may reveal namespace names, which are always present in the metric. The Kubernetes project assumes only cluster administrators can access kube-controller-manager metrics.
Suggested upgrade path
To ensure a smooth upgrade path from v1.36 to a release with SELinuxMount enabled (anticipated to be v1.37), we suggest you follow these steps:
- Enable selinux-warning-controller in the kube-controller-manager.
- Check the selinux_warning_controller_selinux_volume_conflict metric. It shows all potential conflicts between Pods. For each conflicting Pod (Deployment, StatefulSet, etc.), either apply the opt-out (set the Pod's spec.securityContext.seLinuxChangePolicy: Recursive) or re-architect the application to remove the conflict. For example, do your Pods really need to run as privileged?
- Check the volume_manager_selinux_volume_context_mismatch_warnings_total metric. The kubelet emits this metric when it starts a Pod that runs while SELinuxMount is disabled, but that would not start with SELinuxMount enabled. In other words, it counts the Pods that will experience a true conflict. Unfortunately, this metric does not expose the exact Pod name as a label; the full Pod name is available only in the selinux_warning_controller_selinux_volume_conflict metric.
- Once both metrics have been accounted for, upgrade to a Kubernetes version that has SELinuxMount enabled.
Consider using a MutatingAdmissionPolicy, a mutating webhook, or a policy engine like Kyverno or Gatekeeper to apply the opt-out to all Pods in a namespace or across the entire cluster.
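As one example of that approach, a Kyverno policy could apply the opt-out to every Pod in a chosen namespace. This is only a sketch; the policy name and namespace are hypothetical, and you should verify it against the Kyverno version you run:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: selinux-opt-out          # hypothetical policy name
spec:
  rules:
    - name: set-recursive-relabeling
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [legacy-apps]   # hypothetical namespace
      mutate:
        # Merge the opt-out into each matching Pod's securityContext.
        patchStrategicMerge:
          spec:
            securityContext:
              seLinuxChangePolicy: Recursive
```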
When SELinuxMount is enabled, the kubelet will emit the metric volume_manager_selinux_volume_context_mismatch_errors_total with the number of
Pods that could not start because their SELinux label conflicts with an existing Pod that uses the same volume.
The exact Pod names should still be available in the selinux_warning_controller_selinux_volume_conflict metric,
if the selinux-warning-controller is enabled.
Further reading
- KEP: Speed up SELinux volume relabeling using mounts
- SELinux Volume Relabeling Feature Gates
- Story 3: cluster upgrade
- Configure a security context for a Pod — Efficient SELinux volume relabeling and selinux-warning-controller
Acknowledgements
If you run into issues, have feedback, or want to contribute, find us
on the Kubernetes Slack in #sig-node and #sig-storage, or join the
SIG Node or SIG Storage meetings.