This article is more than one year old. Older articles may contain outdated content. Check that the information in the page has not become incorrect since its publication.
Finding suspicious syscalls with the seccomp notifier
Debugging software in production is one of the biggest challenges we have to face in our containerized environments. Being able to understand the impact of the available security options, especially when it comes to configuring our deployments, is one of the key aspects to make the default security in Kubernetes stronger. We have all those logging, tracing and metrics data already at hand, but how do we assemble the information they provide into something human readable and actionable?
Seccomp is one of the standard mechanisms to protect a Linux based Kubernetes application from malicious actions by interfering with its system calls. This allows us to restrict the application to a defined set of actionable items, like modifying files or responding to HTTP requests. Linking the knowledge of which set of syscalls is required to, for example, modify a local file, to the actual source code is in the same way non-trivial. Seccomp profiles for Kubernetes have to be written in JSON and can be understood as an architecture specific allow-list with superpowers, for example:
{
"defaultAction": "SCMP_ACT_ERRNO",
"defaultErrnoRet": 38,
"defaultErrno": "ENOSYS",
"syscalls": [
{
"names": ["chmod", "chown", "open", "write"],
"action": "SCMP_ACT_ALLOW"
}
]
}
The above profile errors by default specifying the defaultAction
of
SCMP_ACT_ERRNO
. This means we have to allow a set of syscalls via
SCMP_ACT_ALLOW
, otherwise the application would not be able to do anything at
all. Okay cool, for being able to allow file operations, all we have to do is
adding a bunch of file specific syscalls like open
or write
, and probably
also being able to change the permissions via chmod
and chown
, right?
Basically yes, but there are issues with the simplicity of that approach:
Seccomp profiles need to include the minimum set of syscalls required to start the application. This also includes some syscalls from the lower level Open Container Initiative (OCI) container runtime, for example runc or crun. Beside that, we can only guarantee the required syscalls for a very specific version of the runtimes and our application, because the code parts can change between releases. The same applies to the termination of the application as well as the target architecture we're deploying on. Features like executing commands within containers also require another subset of syscalls. Not to mention that there are multiple versions for syscalls doing slightly different things and the seccomp profiles are able to modify their arguments. It's also not always clearly visible to the developers which syscalls are used by their own written code parts, because they rely on programming language abstractions or frameworks.
How can we know which syscalls are even required then? Who should create and maintain those profiles during its development life-cycle?
Well, recording and distributing seccomp profiles is one of the problem domains of the Security Profiles Operator, which is already solving that. The operator is able to record seccomp, SELinux and even AppArmor profiles into a Custom Resource Definition (CRD), reconciles them to each node and makes them available for usage.
The biggest challenge about creating security profiles is to catch all code paths which execute syscalls. We could achieve that by having 100% logical coverage of the application when running an end-to-end test suite. You get the problem with the previous statement: It's too idealistic to be ever fulfilled, even without taking all the moving parts during application development and deployment into account.
Missing a syscall in the seccomp profiles' allow list can have tremendously
negative impact on the application. It's not only that we can encounter crashes,
which are trivially detectable. It can also happen that they slightly change
logical paths, change the business logic, make parts of the application
unusable, slow down performance or even expose security vulnerabilities. We're
simply not able to see the whole impact of that, especially because blocked
syscalls via SCMP_ACT_ERRNO
do not provide any additional audit
logging on the system.
Does that mean we're lost? Is it just not realistic to dream about a Kubernetes where everyone uses the default seccomp profile? Should we stop striving towards maximum security in Kubernetes and accept that it's not meant to be secure by default?
Definitely not. Technology evolves over time and there are many folks working behind the scenes of Kubernetes to indirectly deliver features to address such problems. One of the mentioned features is the seccomp notifier, which can be used to find suspicious syscalls in Kubernetes.
The seccomp notify feature consists of a set of changes introduced in Linux 5.9. It makes the kernel capable of communicating seccomp related events to the user space. That allows applications to act based on the syscalls and opens for a wide range of possible use cases. We not only need the right kernel version, but also at least runc v1.1.0 (or crun v0.19) to be able to make the notifier work at all. The Kubernetes container runtime CRI-O gets support for the seccomp notifier in v1.26.0. The new feature allows us to identify possibly malicious syscalls in our application, and therefore makes it possible to verify profiles for consistency and completeness. Let's give that a try.
First of all we need to run the latest main
version of CRI-O, because v1.26.0
has not been released yet at time of writing. You can do that by either
compiling it from the source code or by using the pre-built binary
bundle via the get-script. The seccomp notifier feature of CRI-O is
guarded by an annotation, which has to be explicitly allowed, for example by
using a configuration drop-in like this:
> cat /etc/crio/crio.conf.d/02-runtimes.conf
[crio.runtime]
default_runtime = "runc"
[crio.runtime.runtimes.runc]
allowed_annotations = [ "io.kubernetes.cri-o.seccompNotifierAction" ]
If CRI-O is up and running, then it should indicate that the seccomp notifier is available as well:
> sudo ./bin/crio --enable-metrics
…
INFO[…] Starting seccomp notifier watcher
INFO[…] Serving metrics on :9090 via HTTP
…
We also enable the metrics, because they provide additional telemetry data about
the notifier. Now we need a running Kubernetes cluster for demonstration
purposes. For this demo, we mainly stick to the
hack/local-up-cluster.sh
approach to locally spawn a single node
Kubernetes cluster.
If everything is up and running, then we would have to define a seccomp profile
for testing purposes. But we do not have to create our own, we can just use the
RuntimeDefault
profile which gets shipped with each container runtime. For
example the RuntimeDefault
profile for CRI-O can be found in the
containers/common library.
Now we need a test container, which can be a simple nginx pod like this:
apiVersion: v1
kind: Pod
metadata:
name: nginx
annotations:
io.kubernetes.cri-o.seccompNotifierAction: "stop"
spec:
restartPolicy: Never
containers:
- name: nginx
image: nginx:1.23.2
securityContext:
seccompProfile:
type: RuntimeDefault
Please note the annotation io.kubernetes.cri-o.seccompNotifierAction
, which
enables the seccomp notifier for this workload. The value of the annotation can
be either stop
for stopping the workload or anything else for doing nothing
else than logging and throwing metrics. Because of the termination we also use
the restartPolicy: Never
to not automatically recreate the container on
failure.
Let's run the pod and check if it works:
> kubectl apply -f nginx.yaml
> kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 3m39s 10.85.0.3 127.0.0.1 <none> <none>
We can also test if the web server itself works as intended:
> curl 10.85.0.3
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
…
While everything is now up and running, CRI-O also indicates that it has started the seccomp notifier:
…
INFO[…] Injecting seccomp notifier into seccomp profile of container 662a3bb0fdc7dd1bf5a88a8aa8ef9eba6296b593146d988b4a9b85822422febb
…
If we would now run a forbidden syscall inside of the container, then we can
expect that the workload gets terminated. Let's give that a try by running
chroot
in the containers namespaces:
> kubectl exec -it nginx -- bash
root@nginx:/# chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
root@nginx:/# command terminated with exit code 137
The exec session got terminated, so it looks like the container is not running any more:
> kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx 0/1 seccomp killed 0 96s
Alright, the container got killed by seccomp, do we get any more information about what was going on?
> kubectl describe pod nginx
Name: nginx
…
Containers:
nginx:
…
State: Terminated
Reason: seccomp killed
Message: Used forbidden syscalls: chroot (1x)
Exit Code: 137
Started: Mon, 14 Nov 2022 12:19:46 +0100
Finished: Mon, 14 Nov 2022 12:20:26 +0100
…
The seccomp notifier feature of CRI-O correctly set the termination reason and
message, including which forbidden syscall has been used how often (1x
). How
often? Yes, the notifier gives the application up to 5 seconds after the last
seen syscall until it starts the termination. This means that it's possible to
catch multiple forbidden syscalls within one test by avoiding time-consuming
trial and errors.
> kubectl exec -it nginx -- chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
command terminated with exit code 125
> kubectl exec -it nginx -- chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
command terminated with exit code 125
> kubectl exec -it nginx -- swapoff -a
command terminated with exit code 32
> kubectl exec -it nginx -- swapoff -a
command terminated with exit code 32
> kubectl describe pod nginx | grep Message
Message: Used forbidden syscalls: chroot (2x), swapoff (2x)
The CRI-O metrics will also reflect that:
> curl -sf localhost:9090/metrics | grep seccomp_notifier
# HELP container_runtime_crio_containers_seccomp_notifier_count_total Amount of containers stopped because they used a forbidden syscalls by their name
# TYPE container_runtime_crio_containers_seccomp_notifier_count_total counter
container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (1x)"} 1
container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (2x), swapoff (2x)"} 1
How does it work in detail? CRI-O uses the chosen seccomp profile and injects
the action SCMP_ACT_NOTIFY
instead of SCMP_ACT_ERRNO
, SCMP_ACT_KILL
,
SCMP_ACT_KILL_PROCESS
or SCMP_ACT_KILL_THREAD
. It also sets a local listener
path which will be used by the lower level OCI runtime (runc or crun) to create
the seccomp notifier socket. If the connection between the socket and CRI-O has
been established, then CRI-O will receive notifications for each syscall being
interfered by seccomp. CRI-O stores the syscalls, allows a bit of timeout for
them to arrive and then terminates the container if the chosen
seccompNotifierAction=stop
. Unfortunately, the seccomp notifier is not able to
notify on the defaultAction
, which means that it's required to have
a list of syscalls to test for custom profiles. CRI-O does also state that
limitation in the logs:
INFO[…] The seccomp profile default action SCMP_ACT_ERRNO cannot be overridden to SCMP_ACT_NOTIFY,
which means that syscalls using that default action can't be traced by the notifier
As a conclusion, the seccomp notifier implementation in CRI-O can be used to
verify if your applications behave correctly when using RuntimeDefault
or any
other custom profile. Alerts can be created based on the metrics to create long
running test scenarios around that feature. Making seccomp understandable and
easier to use will increase adoption as well as help us to move towards a more
secure Kubernetes by default!
Thank you for reading this blog post. If you'd like to read more about the seccomp notifier, checkout the following resources:
- The Seccomp Notifier - New Frontiers in Unprivileged Container Development: https://brauner.io/2020/07/23/seccomp-notify.html
- Bringing Seccomp Notify to Runc and Kubernetes: https://kinvolk.io/blog/2022/03/bringing-seccomp-notify-to-runc-and-kubernetes
- Seccomp Agent reference implementation: https://github.com/opencontainers/runc/tree/6b16d00/contrib/cmd/seccompagent