Thursday, December 07, 2017
Kubernetes provides a high-level API and a set of components that hides almost all of the intricate and—to some of us—interesting details of what happens at the systems level. Application developers are not required to have knowledge of the machines’ IP tables, cgroups, namespaces, seccomp, or, nowadays, even the container runtime that their application runs on top of. But underneath, Kubernetes and the technologies upon which it relies (for example, the container runtime) heavily leverage core Linux functionalities.
This article focuses on a core Linux functionality increasingly used in networking, security and auditing, and tracing and monitoring tools. This functionality is called extended Berkeley Packet Filter (eBPF)
Note: In this article we use both acronyms: eBPF and BPF. The former is used for the extended BPF functionality, and the latter for “classic” BPF functionality.
What is BPF?
BPF is a mini-VM residing in the Linux kernel that runs BPF programs. Before running, BPF programs are loaded with the bpf() syscall and are validated for safety: checking for loops, code size, etc. BPF programs are attached to kernel objects and executed when events happen on those objects—for example, when a network interface emits a packet.
The ‘e’ in eBPF
Traditionally, BPF could only be attached to sockets for socket filtering. BPF’s first use case was in
tcpdump. When you run
tcpdump the filter is compiled into a BPF program and attached to a raw
AF_PACKET socket in order to print out filtered packets.
But over the years, eBPF added the ability to attach to other kernel objects. In addition to socket filtering, some supported attach points are:
- Kprobes (and userspace equivalents uprobes)
- Network schedulers or qdiscs for classification or action (tc)
XDP (eXpress Data Path) This and other, newer features like in-kernel helper functions and shared data-structures (maps) that can be used to communicate with user space, extend BPF’s capabilities.
Existing Use Cases of eBPF with Kubernetes
Several open-source Kubernetes tools already use eBPF and many use cases warrant a closer look, especially in areas such as networking, monitoring and security tools.
Dynamic Network Control and Visibility with Cilium
Cilium is a networking project that makes heavy use of eBPF superpowers to route and filter network traffic for container-based systems. By using eBPF, Cilium can dynamically generate and apply rules—even at the device level with XDP—without making changes to the Linux kernel itself.
The Cilium Agent runs on each host. Instead of managing IP tables, it translates network policy definitions to BPF programs that are loaded into the kernel and attached to a container’s virtual ethernet device. These programs are executed—rules applied—on each packet that is sent or received.
This diagram shows how the Cilium project works:
Depending on what network rules are applied, BPF programs may be attached with tc or XDP. By using XDP, Cilium can attach the BPF programs at the lowest possible point, which is also the most performant point in the networking software stack.
If you’d like to learn more about how Cilium uses eBPF, take a look at the project’s BPF and XDP reference guide.
Tracking TCP Connections in Weave Scope
Weave Scope is a tool for monitoring, visualizing and interacting with container-based systems. For our purposes, we’ll focus on how Weave Scope gets the TCP connections.
Weave Scope employs an agent that runs on each node of a cluster. The agent monitors the system, generates a report and sends it to the app server. The app server compiles the reports it receives and presents the results in the Weave Scope UI.
To accurately draw connections between containers, the agent attaches a BPF program to kprobes that track socket events: opening and closing connections. The BPF program, tcptracer-bpf, is compiled into an ELF object file and loaded using gopbf.
(As a side note, Weave Scope also has a plugin that make use of eBPF: HTTP statistics.)
Limiting syscalls with seccomp-bpf
Linux has more than 300 system calls (read, write, open, close, etc.) available for use—or misuse. Most applications only need a small subset of syscalls to function properly. seccomp is a Linux security facility used to limit the set of syscalls that an application can use, thereby limiting potential misuse.
The original implementation of seccomp was highly restrictive. Once applied, if an application attempted to do anything beyond reading and writing to files it had already opened, seccomp sent a
seccomp-bpf enables more complex filters and a wider range of actions. Seccomp-bpf, also known as seccomp mode 2, allows for applying custom filters in the form of BPF programs. When the BPF program is loaded, the filter is applied to each syscall and the appropriate action is taken (Allow, Kill, Trap, etc.).
seccomp-bpf is widely used in Kubernetes tools and exposed in Kubernetes itself. For example, seccomp-bpf is used in Docker to apply custom seccomp security profiles, in rkt to apply seccomp isolators, and in Kubernetes itself in its Security Context.
But in all of these cases the use of BPF is hidden behind libseccomp. Behind the scenes, libseccomp generates BPF code from rules provided to it. Once generated, the BPF program is loaded and the rules applied.
Potential Use Cases for eBPF with Kubernetes
eBPF is a relatively new Linux technology. As such, there are many uses that remain unexplored. eBPF itself is also evolving: new features are being added in eBPF that will enable new use cases that aren’t currently possible. In the following sections, we’re going to look at some of these that have only recently become possible and ones on the horizon. Our hope is that these features will be leveraged by open source tooling.
Pod and container level network statistics
BPF socket filtering is nothing new, but BPF socket filtering per cgroup is. Introduced in Linux 4.10, cgroup-bpf allows attaching eBPF programs to cgroups. Once attached, the program is executed for all packets entering or exiting any process in the cgroup.
A cgroup is, amongst other things, a hierarchical grouping of processes. In Kubernetes, this grouping is found at the container level. One idea for making use of cgroup-bpf, is to install BPF programs that collect detailed per-pod and/or per-container network statistics.
Generally, such statistics are collected by periodically checking the relevant file in Linux’s
/sys directory or using Netlink. By using BPF programs attached to cgroups for this, we can get much more detailed statistics: for example, how many packets/bytes on tcp port 443, or how many packets/bytes from IP 10.2.3.4. In general, because BPF programs have a kernel context, they can safely and efficiently deliver more detailed information to user space.
There are of course other interesting possibilities, like doing actual packet filtering. But the obstacle currently standing in the way of this is having cgroup v2 support—required by cgroup-bpf—in Docker and Kubernetes.
Linux Security Modules (LSM) implements a generic framework for security policies in the Linux kernel. SELinux and AppArmor are examples of these. Both of these implement rules at a system-global scope, placing the onus on the administrator to configure the security policies.
Landlock is another LSM under development that would co-exist with SELinux and AppArmor. An initial patchset has been submitted to the Linux kernel and is in an early stage of development. The main difference with other LSMs is that Landlock is designed to allow unprivileged applications to build their own sandbox, effectively restricting themselves instead of using a global configuration. With Landlock, an application can load a BPF program and have it executed when the process performs a specific action. For example, when the application opens a file with the open() system call, the kernel will execute the BPF program, and, depending on what the BPF program returns, the action will be accepted or denied.
In some ways, it is similar to seccomp-bpf: using a BPF program, seccomp-bpf allows unprivileged processes to restrict what system calls they can perform. Landlock will be more powerful and provide more flexibility. Consider the following system call:
C fd = open(“myfile.txt”, O\_RDWR);
The first argument is a “char *”, a pointer to a memory address, such as
With seccomp, a BPF program only has access to the parameters of the syscall but cannot dereference the pointers, making it impossible to make security decisions based on a file. seccomp also uses classic BPF, meaning it cannot make use of eBPF maps, the mechanism for interfacing with user space. This restriction means security policies cannot be changed in seccomp-bpf based on a configuration in an eBPF map.
BPF programs with Landlock don’t receive the arguments of the syscalls but a reference to a kernel object. In the example above, this means it will have a reference to the file, so it does not need to dereference a pointer, consider relative paths, or perform chroots.
Use Case: Landlock in Kubernetes-based serverless frameworks
In Kubernetes, the unit of deployment is a pod. Pods and containers are the main unit of isolation. In serverless frameworks, however, the main unit of deployment is a function. Ideally, the unit of deployment equals the unit of isolation. This puts serverless frameworks like Kubeless or OpenFaaS into a predicament: optimize for unit of isolation or deployment?
To achieve the best possible isolation, each function call would have to happen in its own container—ut what’s good for isolation is not always good for performance. Inversely, if we run function calls within the same container, we increase the likelihood of collisions.
By using Landlock, we could isolate function calls from each other within the same container, making a temporary file created by one function call inaccessible to the next function call, for example. Integration between Landlock and technologies like Kubernetes-based serverless frameworks would be a ripe area for further exploration.
Auditing kubectl-exec with eBPF
In Kubernetes 1.7 the audit proposal started making its way in. It’s currently pre-stable with plans to be stable in the 1.10 release. As the name implies, it allows administrators to log and audit events that take place in a Kubernetes cluster.
While these events log Kubernetes events, they don’t currently provide the level of visibility that some may require. For example, while we can see that someone has used
kubectl exec to enter a container, we are not able to see what commands were executed in that session. With eBPF one can attach a BPF program that would record any commands executed in the
kubectl exec session and pass those commands to a user-space program that logs those events. We could then play that session back and know the exact sequence of events that took place.
Learn more about eBPF
If you’re interested in learning more about eBPF, here are some resources: - A comprehensive reading list about eBPF for doing just that - BCC (BPF Compiler Collection) provides tools for working with eBPF as well as many example tools making use of BCC. - Some videos
- BPF: Tracing and More by Brendan Gregg
- Cilium - Container Security and Networking Using BPF and XDP by Thomas Graf
- Using BPF in Kubernetes by Alban Crequy
We are just starting to see the Linux superpowers of eBPF being put to use in Kubernetes tools and technologies. We will undoubtedly see increased use of eBPF. What we have highlighted here is just a taste of what you might expect in the future. What will be really exciting is seeing how these technologies will be used in ways that we have not yet thought about. Stay tuned!
The Kinvolk team will be hanging out at the Kinvolk booth at KubeCon in Austin. Come by to talk to us about all things, Kubernetes, Linux, container runtimes and yeah, eBPF.
- Gardener - The Kubernetes Botanist May 17
- Docs are Migrating from Jekyll to Hugo May 5
- Announcing Kubeflow 0.1 May 4
- Current State of Policy in Kubernetes May 2
- Developing on Kubernetes May 1
- Zero-downtime Deployment in Kubernetes with Jenkins Apr 30
- Kubernetes Community - Top of the Open Source Charts in 2017 Apr 25
- Local Persistent Volumes for Kubernetes Goes Beta Apr 13
- Container Storage Interface (CSI) for Kubernetes Goes Beta Apr 10
- Fixing the Subpath Volume Vulnerability in Kubernetes Apr 4
- Principles of Container-based Application Design Mar 15
- Expanding User Support with Office Hours Mar 14
- How to Integrate RollingUpdate Strategy for TPR in Kubernetes Mar 13
- Apache Spark 2.3 with Native Kubernetes Support Mar 6
- Kubernetes: First Beta Version of Kubernetes 1.10 is Here Mar 2
- Reporting Errors from Control Plane to Applications Using Kubernetes Events Jan 25
- Core Workloads API GA Jan 15
- Introducing client-go version 6 Jan 12
- Extensible Admission is Beta Jan 11
- Introducing Container Storage Interface (CSI) Alpha for Kubernetes Jan 10
- Kubernetes v1.9 releases beta support for Windows Server Containers Jan 9
- Five Days of Kubernetes 1.9 Jan 8
- Introducing Kubeflow - A Composable, Portable, Scalable ML Stack Built for Kubernetes Dec 21
- Kubernetes 1.9: Apps Workloads GA and Expanded Ecosystem Dec 15
- Using eBPF in Kubernetes Dec 7
- PaddlePaddle Fluid: Elastic Deep Learning on Kubernetes Dec 6
- Autoscaling in Kubernetes Nov 17
- Certified Kubernetes Conformance Program: Launch Celebration Round Up Nov 16
- Kubernetes is Still Hard (for Developers) Nov 15
- Securing Software Supply Chain with Grafeas Nov 3
- Containerd Brings More Container Runtime Options for Kubernetes Nov 2
- Kubernetes the Easy Way Nov 1
- Enforcing Network Policies in Kubernetes Oct 30
- Using RBAC, Generally Available in Kubernetes v1.8 Oct 28
- It Takes a Village to Raise a Kubernetes Oct 26
- kubeadm v1.8 Released: Introducing Easy Upgrades for Kubernetes Clusters Oct 25
- Five Days of Kubernetes 1.8 Oct 24
- Introducing Software Certification for Kubernetes Oct 19
- Request Routing and Policy Management with the Istio Service Mesh Oct 10
- Kubernetes Community Steering Committee Election Results Oct 5
- Kubernetes 1.8: Security, Workloads and Feature Depth Sep 29
- Kubernetes StatefulSets & DaemonSets Updates Sep 27
- Introducing the Resource Management Working Group Sep 21
- Windows Networking at Parity with Linux for Kubernetes Sep 8
- Kubernetes Meets High-Performance Computing Aug 22
- High Performance Networking with EC2 Virtual Private Clouds Aug 11
- Kompose Helps Developers Move Docker Compose Files to Kubernetes Aug 10
- Happy Second Birthday: A Kubernetes Retrospective Jul 28
- How Watson Health Cloud Deploys Applications with Kubernetes Jul 14
- Kubernetes 1.7: Security Hardening, Stateful Application Updates and Extensibility Jun 30
- Draft: Kubernetes container development made easy May 31
- Managing microservices with the Istio service mesh May 31
- Kubespray Ansible Playbooks foster Collaborative Kubernetes Ops May 19
- Kubernetes: a monitoring guide May 19
- Dancing at the Lip of a Volcano: The Kubernetes Security Process - Explained May 18
- How Bitmovin is Doing Multi-Stage Canary Deployments with Kubernetes in the Cloud and On-Prem Apr 21
- RBAC Support in Kubernetes Apr 6
- Configuring Private DNS Zones and Upstream Nameservers in Kubernetes Apr 4
- Advanced Scheduling in Kubernetes Mar 31
- Scalability updates in Kubernetes 1.6: 5,000 node and 150,000 pod clusters Mar 30
- Five Days of Kubernetes 1.6 Mar 29
- Dynamic Provisioning and Storage Classes in Kubernetes Mar 29
- Kubernetes 1.6: Multi-user, Multi-workloads at Scale Mar 28
- The K8sPort: Engaging Kubernetes Community One Activity at a Time Mar 24
- Deploying PostgreSQL Clusters using StatefulSets Feb 24
- Containers as a Service, the foundation for next generation PaaS Feb 21
- Inside JD.com's Shift to Kubernetes from OpenStack Feb 10
- Run Deep Learning with PaddlePaddle on Kubernetes Feb 8
- Highly Available Kubernetes Clusters Feb 2
- Running MongoDB on Kubernetes with StatefulSets Jan 30
- Fission: Serverless Functions as a Service for Kubernetes Jan 30
- How we run Kubernetes in Kubernetes aka Kubeception Jan 20
- Scaling Kubernetes deployments with Policy-Based Networking Jan 19
- A Stronger Foundation for Creating and Managing Kubernetes Clusters Jan 12
- Kubernetes UX Survey Infographic Jan 9
- Kubernetes supports OpenAPI Dec 23
- Cluster Federation in Kubernetes 1.5 Dec 22
- Windows Server Support Comes to Kubernetes Dec 21
- StatefulSet: Run and Scale Stateful Applications Easily in Kubernetes Dec 20
- Introducing Container Runtime Interface (CRI) in Kubernetes Dec 19
- Five Days of Kubernetes 1.5 Dec 19
- Kubernetes 1.5: Supporting Production Workloads Dec 13
- From Network Policies to Security Policies Dec 8
- Kompose: a tool to go from Docker-compose to Kubernetes Nov 22
- Kubernetes Containers Logging and Monitoring with Sematext Nov 18
- Visualize Kubelet Performance with Node Dashboard Nov 17
- CNCF Partners With The Linux Foundation To Launch New Kubernetes Certification, Training and Managed Service Provider Program Nov 8
- Modernizing the Skytap Cloud Micro-Service Architecture with Kubernetes Nov 7
- Bringing Kubernetes Support to Azure Container Service Nov 7
- Tail Kubernetes with Stern Oct 31
- Introducing Kubernetes Service Partners program and a redesigned Partners page Oct 31
- How We Architected and Run Kubernetes on OpenStack at Scale at Yahoo! JAPAN Oct 24
- Building Globally Distributed Services using Kubernetes Cluster Federation Oct 14
- Helm Charts: making it simple to package and deploy common applications on Kubernetes Oct 10
- Dynamic Provisioning and Storage Classes in Kubernetes Oct 7
- How we improved Kubernetes Dashboard UI in 1.4 for your production needs Oct 3
- How we made Kubernetes insanely easy to install Sep 28
- How Qbox Saved 50% per Month on AWS Bills Using Kubernetes and Supergiant Sep 27
- Kubernetes 1.4: Making it easy to run on Kubernetes anywhere Sep 26
- High performance network policies in Kubernetes clusters Sep 21
- Creating a PostgreSQL Cluster using Helm Sep 9
- Deploying to Multiple Kubernetes Clusters with kit Sep 6
- Cloud Native Application Interfaces Sep 1
- Security Best Practices for Kubernetes Deployment Aug 31
- Scaling Stateful Applications using Kubernetes Pet Sets and FlexVolumes with Datera Elastic Data Fabric Aug 29
- SIG Apps: build apps for and operate them in Kubernetes Aug 16
- Kubernetes Namespaces: use cases and insights Aug 16
- Create a Couchbase cluster using Kubernetes Aug 15
- Challenges of a Remotely Managed, On-Premises, Bare-Metal Kubernetes Cluster Aug 2
- Why OpenStack's embrace of Kubernetes is great for both communities Jul 26
- The Bet on Kubernetes, a Red Hat Perspective Jul 21
- Happy Birthday Kubernetes. Oh, the places you’ll go! Jul 21
- A Very Happy Birthday Kubernetes Jul 21
- Bringing End-to-End Kubernetes Testing to Azure (Part 2) Jul 18
- Steering an Automation Platform at Wercker with Kubernetes Jul 15
- Dashboard - Full Featured Web Interface for Kubernetes Jul 15
- Cross Cluster Services - Achieving Higher Availability for your Kubernetes Applications Jul 14
- Citrix + Kubernetes = A Home Run Jul 14
- Thousand Instances of Cassandra using Kubernetes Pet Set Jul 13
- Stateful Applications in Containers!? Kubernetes 1.3 Says “Yes!” Jul 13
- Kubernetes in Rancher: the further evolution Jul 12
- Autoscaling in Kubernetes Jul 12
- rktnetes brings rkt container engine to Kubernetes Jul 11
- Minikube: easily run Kubernetes locally Jul 11
- Five Days of Kubernetes 1.3 Jul 11
- Updates to Performance and Scalability in Kubernetes 1.3 -- 2,000 node 60,000 pod clusters Jul 7
- Kubernetes 1.3: Bridging Cloud Native and Enterprise Workloads Jul 6
- Container Design Patterns Jun 21
- The Illustrated Children's Guide to Kubernetes Jun 9
- Bringing End-to-End Kubernetes Testing to Azure (Part 1) Jun 6
- Hypernetes: Bringing Security and Multi-tenancy to Kubernetes May 24
- CoreOS Fest 2016: CoreOS and Kubernetes Community meet in Berlin (& San Francisco) May 3
- Introducing the Kubernetes OpenStack Special Interest Group Apr 22
- SIG-UI: the place for building awesome user interfaces for Kubernetes Apr 20
- SIG-ClusterOps: Promote operability and interoperability of Kubernetes clusters Apr 19
- SIG-Networking: Kubernetes Network Policy APIs Coming in 1.3 Apr 18
- How to deploy secure, auditable, and reproducible Kubernetes clusters on AWS Apr 15
- Container survey results - March 2016 Apr 8
- Adding Support for Kubernetes in Rancher Apr 8
- Configuration management with Containers Apr 4
- Using Deployment objects with Kubernetes 1.2 Apr 1
- Kubernetes 1.2 and simplifying advanced networking with Ingress Mar 31
- Using Spark and Zeppelin to process big data on Kubernetes 1.2 Mar 30
- Building highly available applications using Kubernetes new multi-zone clusters (a.k.a. 'Ubernetes Lite') Mar 29
- AppFormix: Helping Enterprises Operationalize Kubernetes Mar 29
- How container metadata changes your point of view Mar 28
- Five Days of Kubernetes 1.2 Mar 28
- 1000 nodes and beyond: updates to Kubernetes performance and scalability in 1.2 Mar 28
- Scaling neural network image classification using Kubernetes with TensorFlow Serving Mar 23
- Kubernetes 1.2: Even more performance upgrades, plus easier application deployment and management Mar 17
- Kubernetes in the Enterprise with Fujitsu’s Cloud Load Control Mar 11
- ElasticBox introduces ElasticKube to help manage Kubernetes within the enterprise Mar 11
- State of the Container World, February 2016 Mar 1
- Kubernetes Community Meeting Notes - 20160225 Mar 1
- KubeCon EU 2016: Kubernetes Community in London Feb 24
- Kubernetes Community Meeting Notes - 20160218 Feb 23
- Kubernetes Community Meeting Notes - 20160211 Feb 16
- ShareThis: Kubernetes In Production Feb 11
- Kubernetes Community Meeting Notes - 20160204 Feb 9
- Kubernetes Community Meeting Notes - 20160128 Feb 2
- State of the Container World, January 2016 Feb 1
- Kubernetes Community Meeting Notes - 20160121 Jan 28
- Kubernetes Community Meeting Notes - 20160114 Jan 28
- Why Kubernetes doesn’t use libnetwork Jan 14
- Simple leader election with Kubernetes and Docker Jan 11
- Creating a Raspberry Pi cluster running Kubernetes, the installation (Part 2) Dec 22
- Managing Kubernetes Pods, Services and Replication Controllers with Puppet Dec 17
- How Weave built a multi-deployment solution for Scope using Kubernetes Dec 12
- Creating a Raspberry Pi cluster running Kubernetes, the shopping list (Part 1) Nov 25
- Monitoring Kubernetes with Sysdig Nov 19
- One million requests per second: Dependable and dynamic distributed systems at scale Nov 11
- Kubernetes 1.1 Performance upgrades, improved tooling and a growing community Nov 9
- Kubernetes as Foundation for Cloud Native PaaS Nov 3
- Some things you didn’t know about kubectl Oct 28
- Kubernetes Performance Measurements and Roadmap Sep 10
- Using Kubernetes Namespaces to Manage Environments Aug 28
- Weekly Kubernetes Community Hangout Notes - July 31 2015 Aug 4
- The Growing Kubernetes Ecosystem Jul 24
- Weekly Kubernetes Community Hangout Notes - July 17 2015 Jul 23
- Strong, Simple SSL for Kubernetes Services Jul 14
- Weekly Kubernetes Community Hangout Notes - July 10 2015 Jul 13
- Announcing the First Kubernetes Enterprise Training Course Jul 8
- Kubernetes 1.0 Launch Event at OSCON Jul 2
- How did the Quake demo from DockerCon Work? Jul 2
- The Distributed System ToolKit: Patterns for Composite Containers Jun 29
- Slides: Cluster Management with Kubernetes, talk given at the University of Edinburgh Jun 26
- Cluster Level Logging with Kubernetes Jun 11
- Weekly Kubernetes Community Hangout Notes - May 22 2015 Jun 2
- Kubernetes on OpenStack May 19
- Weekly Kubernetes Community Hangout Notes - May 15 2015 May 18
- Docker and Kubernetes and AppC May 18
- Kubernetes Release: 0.17.0 May 15
- Resource Usage Monitoring in Kubernetes May 12
- Weekly Kubernetes Community Hangout Notes - May 1 2015 May 11
- Kubernetes Release: 0.16.0 May 11
- AppC Support for Kubernetes through RKT May 4
- Weekly Kubernetes Community Hangout Notes - April 24 2015 Apr 30
- Borg: The Predecessor to Kubernetes Apr 23
- Kubernetes and the Mesosphere DCOS Apr 22
- Weekly Kubernetes Community Hangout Notes - April 17 2015 Apr 17
- Kubernetes Release: 0.15.0 Apr 16
- Introducing Kubernetes API Version v1beta3 Apr 16
- Weekly Kubernetes Community Hangout Notes - April 10 2015 Apr 11
- Faster than a speeding Latte Apr 6
- Weekly Kubernetes Community Hangout Notes - April 3 2015 Apr 4
- Paricipate in a Kubernetes User Experience Study Mar 31
- Weekly Kubernetes Community Hangout Notes - March 27 2015 Mar 28
- Kubernetes Gathering Videos Mar 23
- Welcome to the Kubernetes Blog! Mar 20