User Namespaces: Now Supports Running Stateful Pods in Alpha!
Authors: Rodrigo Campos Catelin (Microsoft), Giuseppe Scrivano (Red Hat), Sascha Grunert (Red Hat)
Kubernetes v1.25 introduced support for user namespaces for only stateless pods. Kubernetes 1.28 lifted that restriction, after some design changes were done in 1.27.
The beauty of this feature is that:
- it is trivial to adopt (you just need to set a bool in the pod spec)
- doesn't need any changes for most applications
- improves security by drastically enhancing the isolation of containers and mitigating CVEs rated HIGH and CRITICAL.
This post explains the basics of user namespaces and also shows:
- the changes that arrived in the recent Kubernetes v1.28 release
- a demo of a vulnerability rated as HIGH that is not exploitable with user namespaces
- the runtime requirements to use this feature
- what you can expect in future releases regarding user namespaces.
What is a user namespace?
A user namespace is a Linux feature that isolates the user and group identifiers (UIDs and GIDs) of the containers from the ones on the host. The indentifiers in the container can be mapped to indentifiers on the host in a way where the host UID/GIDs used for different containers never overlap. Even more, the identifiers can be mapped to unprivileged non-overlapping UIDs and GIDs on the host. This basically means two things:
As the UIDs and GIDs for different containers are mapped to different UIDs and GIDs on the host, containers have a harder time to attack each other even if they escape the container boundaries. For example, if container A is running with different UIDs and GIDs on the host than container B, the operations it can do on container B's files and process are limited: only read/write what a file allows to others, as it will never have permission for the owner or group (the UIDs/GIDs on the host are guaranteed to be different for different containers).
As the UIDs and GIDs are mapped to unprivileged users on the host, if a container escapes the container boundaries, even if it is running as root inside the container, it has no privileges on the host. This greatly protects what host files it can read/write, which process it can send signals to, etc.
Furthermore, capabilities granted are only valid inside the user namespace and not on the host.
Without using a user namespace a container running as root, in the case of a container breakout, has root privileges on the node. And if some capabilities were granted to the container, the capabilities are valid on the host too. None of this is true when using user namespaces (modulo bugs, of course 🙂).
Changes in 1.28
As already mentioned, starting from 1.28, Kubernetes supports user namespaces with stateful pods. This means that pods with user namespaces can use any type of volume, they are no longer limited to only some volume types as before.
The feature gate to activate this feature was renamed, it is no longer
UserNamespacesStatelessPodsSupport but from 1.28 onwards you should use
UserNamespacesSupport. There were many changes done and the requirements on
the node hosts changed. So with Kubernetes 1.28 the feature flag was renamed to
Rodrigo created a demo which exploits CVE 2022-0492 and shows how the exploit can occur without user namespaces. He also shows how it is not possible to use this exploit from a Pod where the containers are using this feature.
This vulnerability is rated HIGH and allows a container with no special privileges to read/write to any path on the host and launch processes as root on the host too.
Most applications in containers run as root today, or as a semi-predictable non-root user (user ID 65534 is a somewhat popular choice). When you run a Pod with containers using a userns, Kubernetes runs those containers as unprivileged users, with no changes needed in your app.
This means two containers running as user 65534 will effectively be mapped to different users on the host, limiting what they can do to each other in case of an escape, and if they are running as root, the privileges on the host are reduced to the one of an unprivileged user.
Node system requirements
There are requirements on the Linux kernel version as well as the container runtime to use this feature.
On Linux you need Linux 6.3 or greater. This is because the feature relies on a kernel feature named idmap mounts, and support to use idmap mounts with tmpfs was merged in Linux 6.3.
If you are using CRI-O with crun, this is supported in CRI-O 1.28.1 and crun 1.9 or greater. If you are using CRI-O with runc, this is still not supported.
containerd support is currently targeted for containerd 2.0; it is likely that it won't matter if you use it with crun or runc.
One limitation present in containerd 1.7 is that it needs to change the ownership of every file and directory inside the container image, during Pod startup. This means it has a storage overhead and can significantly impact the container startup latency. Containerd 2.0 will probably include a implementation that will eliminate the startup latency added and the storage overhead. Take this into account if you plan to use containerd 1.7 with user namespaces in production.
None of these containerd limitations apply to CRI-O 1.28.
Looking ahead to Kubernetes 1.29, the plan is to work with SIG Auth to integrate user
namespaces to Pod Security Standards (PSS) and the Pod Security Admission. For
the time being, the plan is to relax checks in PSS policies when user namespaces are
in use. This means that the fields
capabilities will not trigger a
violation if user namespaces are in use. The behavior will probably be controlled by
utilizing a API Server feature gate, like
How do I get involved?
You can reach SIG Node by several means:
You can also contact us directly:
- GitHub: @rata @giuseppe @saschagrunert
- Slack: @rata @giuseppe @sascha