Kubernetes Blog

Kubernetes v1.9 releases beta support for Windows Server Containers

January 09 2018

With the release of Kubernetes v1.9, our mission of ensuring Kubernetes works well everywhere and for everyone takes a great step forward. We’ve advanced support for Windows Server to beta along with continued feature and functional advancements on both the Kubernetes and Windows platforms. SIG-Windows has been working since March of 2016 to open the door for many Windows-specific applications and workloads to run on Kubernetes, significantly expanding the implementation scenarios and the enterprise reach of Kubernetes.

Enterprises of all sizes have made significant investments in .NET and Windows based applications. Many enterprise portfolios today contain .NET and Windows, with Gartner claiming that 80% of enterprise apps run on Windows. According to StackOverflow Insights, 40% of professional developers use the .NET programming languages (including .NET Core).

But why is all this information important? It means that enterprises have both legacy and new born-in-the-cloud (microservice) applications that utilize a wide array of programming frameworks. There is a big push in the industry to modernize existing/legacy applications to containers, using an approach similar to “lift and shift”. Modernizing existing applications into containers also provides added flexibility for new functionality to be introduced in additional Windows or Linux containers. Containers are becoming the de facto standard for packaging, deploying, and managing both existing and microservice applications. IT organizations are looking for an easier and homogenous way to orchestrate and manage containers across their Linux and Windows environments. Kubernetes v1.9 now offers beta support for Windows Server containers, making it the clear choice for orchestrating containers of any kind.


Alpha support for Windows Server containers in Kubernetes was great for proof-of-concept projects and visualizing the road map for support of Windows in Kubernetes. The alpha release had significant drawbacks, however, and lacked many features, especially in networking. SIG-Windows, Microsoft, Cloudbase Solutions, Apprenda, and other community members banded together to create a comprehensive beta release, enabling Kubernetes users to start evaluating and using Windows.

Some key feature improvements for Windows Server containers on Kubernetes include:

  • Improved support for pods! Multiple Windows Server containers in a pod can now share the network namespace using network compartments in Windows Server. This feature brings the concept of a pod to parity with Linux-based containers
  • Reduced network complexity by using a single network endpoint per pod
  • Kernel-Based load-balancing using the Virtual Filtering Platform (VFP) Hyper-v Switch Extension (analogous to Linux iptables)
  • Container Runtime Interface (CRI) pod and node level statistics. Windows Server containers can now be profiled for Horizontal Pod Autoscaling using performance metrics gathered from the pod and the node
  • Support for kubeadm commands to add Windows Server nodes to a Kubernetes environment. Kubeadm simplifies the provisioning of a Kubernetes cluster, and with the support for Windows Server, you can use a single tool to deploy Kubernetes in your infrastructure
  • Support for ConfigMaps, Secrets, and Volumes. These are key features that allow you to separate, and in some cases secure, the configuration of the containers from the implementation The crown jewels of Kubernetes 1.9 Windows support, however, are the networking enhancements. With the release of Windows Server 1709, Microsoft has enabled key networking capabilities in the operating system and the Windows Host Networking Service (HNS) that paved the way to produce a number of CNI plugins that work with Windows Server containers in Kubernetes. The Layer-3 routed and network overlay plugins that are supported with Kubernetes 1.9 are listed below:
  1. Upstream L3 Routing - IP routes configured in upstream ToR
  2. Host-Gateway - IP routes configured on each host
  3. Open vSwitch (OVS) & Open Virtual Network (OVN) with Overlay - Supports STT and Geneve tunneling types You can read more about each of their configuration, setup, and runtime capabilities to make an informed selection for your networking stack in Kubernetes.

Even though you have to continue running the Kubernetes Control Plane and Master Components in Linux, you are now able to introduce Windows Server as a Node in Kubernetes. As a community, this is a huge milestone and achievement. We will now start seeing .NET, .NET Core, ASP.NET, IIS, Windows Services, Windows executables and many more windows-based applications in Kubernetes.

What’s coming next

A lot of work went into this beta release, but the community realizes there are more areas of investment needed before we can release Windows support as GA (General Availability) for production workloads. Some keys areas of focus for the first two quarters of 2018 include:

  1. Continue to make progress in the area of networking. Additional CNI plugins are under development and nearing completion
    • Overlay - win-overlay (vxlan or IP-in-IP encapsulation using Flannel) 
    • Win-l2bridge (host-gateway) 
    • OVN using cloud networking - without overlays
    • Support for Kubernetes network policies in ovn-kubernetes
    • Support for Hyper-V Isolation
    • Support for StatefulSet functionality for stateful applications
    • Produce installation artifacts and documentation that work on any infrastructure and across many public cloud providers like Microsoft Azure, Google Cloud, and Amazon AWS
    • Continuous Integration/Continuous Delivery (CI/CD) infrastructure for SIG-Windows
    • Scalability and Performance testing Even though we have not committed to a timeline for GA, SIG-Windows estimates a GA release in the first half of 2018.

Get Involved

As we continue to make progress towards General Availability of this feature in Kubernetes, we welcome you to get involved, contribute code, provide feedback, deploy Windows Server containers to your Kubernetes cluster, or simply join our community.

Thank you,

Michael Michael (@michmike77)
SIG-Windows Lead
Senior Director of Product Management, Apprenda

Five Days of Kubernetes 1.9

January 08 2018

Kubernetes 1.9 is live, made possible by hundreds of contributors pushing thousands of commits in this latest releases.

The community has tallied around 32,300 commits in the main repo and continues rapid growth outside of the main repo, which signals growing maturity and stability for the project. The community has logged more than 90,700 commits across all repos and 7,800 commits across all repos for v1.8.0 to v1.9.0 alone.

With the help of our growing community of 1,400 plus contributors, we issued more than 4,490 PRs and pushed more than 7,800 commits to deliver Kubernetes 1.9 with many notable updates, including enhancements for the workloads and stateful application support areas. This all points to increased extensibility and standards-based Kubernetes ecosystem.

While many improvements have been contributed, we highlight key features in this series of in-depth posts listed below. Follow along and see what’s new and improved with workloads, Windows support and more.

Day 1: 5 Days of Kubernetes 1.9
Day 2: Windows and Docker support for Kubernetes (beta)
Day 3: Storage, CSI framework (alpha)
Day 4:  Web Hook and Mission Critical, Dynamic Admission Control
Day 5: Introducing client-go version 6
Day 6: Workloads API


  • Post questions (or answer questions) on Stack Overflow
  • Join the community portal for advocates on K8sPort
  • Follow us on Twitter @Kubernetesio for latest updates 
  • Connect with the community on Slack
  • Get involved with the Kubernetes project on GitHub

Introducing Kubeflow - A Composable, Portable, Scalable ML Stack Built for Kubernetes

December 21 2017

Today’s post is by David Aronchick and Jeremy Lewi, a PM and Engineer on the Kubeflow project, a new open source Github repo dedicated to making using machine learning (ML) stacks on Kubernetes easy, fast and extensible.

Kubernetes and Machine Learning

Kubernetes has quickly become the hybrid solution for deploying complicated workloads anywhere. While it started with just stateless services, customers have begun to move complex workloads to the platform, taking advantage of rich APIs, reliability and performance provided by Kubernetes. One of the fastest growing use cases is to use Kubernetes as the deployment platform of choice for machine learning.

Building any production-ready machine learning system involves various components, often mixing vendors and hand-rolled solutions. Connecting and managing these services for even moderately sophisticated setups introduces huge barriers of complexity in adopting machine learning. Infrastructure engineers will often spend a significant amount of time manually tweaking deployments and hand rolling solutions before a single model can be tested.

Worse, these deployments are so tied to the clusters they have been deployed to that these stacks are immobile, meaning that moving a model from a laptop to a highly scalable cloud cluster is effectively impossible without significant re-architecture. All these differences add up to wasted effort and create opportunities to introduce bugs at each transition.

Introducing Kubeflow

To address these concerns, we’re announcing the creation of the Kubeflow project, a new open source Github repo dedicated to making using ML stacks on Kubernetes easy, fast and extensible. This repository contains:

  • JupyterHub to create & manage interactive Jupyter notebooks
  • A Tensorflow Custom Resource (CRD) that can be configured to use CPUs or GPUs, and adjusted to the size of a cluster with a single setting
  • A TF Serving container Because this solution relies on Kubernetes, it runs wherever Kubernetes runs. Just spin up a cluster and go!

Using Kubeflow

Let’s suppose you are working with two different Kubernetes clusters: a local minikube cluster; and a GKE cluster with GPUs; and that you have two kubectl contexts defined named minikube and gke.

First we need to initialize our ksonnet application and install the Kubeflow packages. (To use ksonnet, you must first install it on your operating system - the instructions for doing so are here)

     ks init my-kubeflow  
     cd my-kubeflow  
     ks registry add kubeflow \  
     ks pkg install kubeflow/core  
     ks pkg install kubeflow/tf-serving  
     ks pkg install kubeflow/tf-job  
     ks generate core kubeflow-core --name=kubeflow-core

We can now define environments corresponding to our two clusters.

     kubectl config use-context minikube  
     ks env add minikube  

     kubectl config use-context gke  
     ks env add gke  

And we’re done! Now just create the environments on your cluster. First, on minikube:

     ks apply minikube -c kubeflow-core  

And to create it on our multi-node GKE cluster for quicker training:

     ks apply gke -c kubeflow-core  

By making it easy to deploy the same rich ML stack everywhere, the drift and rewriting between these environments is kept to a minimum.

To access either deployments, you can execute the following command:

     kubectl port-forward tf-hub-0 8100:8000  

and then open up to access JupyterHub. To change the environment used by kubectl, use either of these commands:

     # To access minikube  
     kubectl config use-context minikube  

     # To access GKE  
     kubectl config use-context gke  

When you execute apply you are launching on K8s

  • JupyterHub for launching and managing Jupyter notebooks on K8s
  • A TF CRD

Let’s suppose you want to submit a training job. Kubeflow provides ksonnet prototypes that make it easy to define components. The tf-job prototype makes it easy to create a job for your code but for this example, we’ll use the tf-cnn prototype which runs TensorFlow’s CNN benchmark.

To submit a training job, you first generate a new job from a prototype:

     ks generate tf-cnn cnn --name=cnn  

By default the tf-cnn prototype uses 1 worker and no GPUs which is perfect for our minikube cluster so we can just submit it.

     ks apply minikube -c cnn

On GKE, we’ll want to tweak the prototype to take advantage of the multiple nodes and GPUs. First, let’s list all the parameters available:

     # To see a list of parameters  
     ks prototype list tf-job  

Now let’s adjust the parameters to take advantage of GPUs and access to multiple nodes.

     ks param set --env=gke cnn num\_gpus 1  
     ks param set --env=gke cnn num\_workers 1  

     ks apply gke -c cnn  

Note how we set those parameters so they are used only when you deploy to GKE. Your minikube parameters are unchanged!

After training, you export your model to a serving location.

Kubeflow also includes a serving package as well. In a separate example, we trained a standard Inception model, and stored the trained model in a bucket we’ve created called ‘gs://kubeflow-models’ with the path ‘/inception’.

To deploy a the trained model for serving, execute the following:

     ks generate tf-serving inception --name=inception  
     ---namespace=default --model\_path=gs://kubeflow-models/inception  
     ks apply gke -c inception  

This highlights one more option in Kubeflow - the ability to pass in inputs based on your deployment. This command creates a tf-serving service on the GKE cluster, and makes it available to your application.

For more information about of deploying and monitoring TensorFlow training jobs and TensorFlow models please refer to the user guide.

Kubeflow + ksonnet

One choice we want to call out is the use of the ksonnet project. We think working with multiple environments (dev, test, prod) will be the norm for most Kubeflow users. By making environments a first class concept, ksonnet makes it easy for Kubeflow users to easily move their workloads between their different environments.

Particularly now that Helm is integrating ksonnet with the next version of their platform, we felt like it was the perfect choice for us. More information about ksonnet can be found in the ksonnet docs.

We also want to thank the team at Heptio for expediting features critical to Kubeflow’s use of ksonnet.

What’s Next?

We are in the midst of building out a community effort right now, and we would love your help! We’ve already been collaborating with many teams - CaiCloud, Red Hat & OpenShift, Canonical, Weaveworks, Container Solutions and many others. CoreOS, for example, is already seeing the promise of Kubeflow:

“The Kubeflow project was a needed advancement to make it significantly easier to set up and productionize machine learning workloads on Kubernetes, and we anticipate that it will greatly expand the opportunity for even more enterprises to embrace the platform. We look forward to working with the project members in providing tight integration of Kubeflow with Tectonic, the enterprise Kubernetes platform.” – Reza Shafii, VP of product, CoreOS

If you’d like to try out Kubeflow right now right in your browser, we’ve partnered with Katacoda to make it super easy. You can try it here!

And we’re just getting started! We would love for you to help. How you might ask? Well…

  • Please join theslack channel
  • Please join thekubeflow-discuss email list
  • Please subscribe to theKubeflow twitter account
  • Please download and run kubeflow, and submit bugs! Thank you for your support so far, we could not be more excited!

Jeremy Lewi & David Aronchick Google

Kubernetes 1.9: Apps Workloads GA and Expanded Ecosystem

December 15 2017

We’re pleased to announce the delivery of Kubernetes 1.9, our fourth and final release this year.

Today’s release continues the evolution of an increasingly rich feature set, more robust stability, and even greater community contributions. As the fourth release of the year, it gives us an opportunity to look back at the progress made in key areas. Particularly notable is the advancement of the Apps Workloads API to stable. This removes any reservations potential adopters might have had about the functional stability required to run mission-critical workloads. Another big milestone is the beta release of Windows support, which opens the door for many Windows-specific applications and workloads to run in Kubernetes, significantly expanding the implementation scenarios and enterprise readiness of Kubernetes.

Workloads API GA

We’re excited to announce General Availability (GA) of the apps/v1 Workloads API, which is now enabled by default. The Apps Workloads API groups the DaemonSet, Deployment, ReplicaSet, and StatefulSet APIs together to form the foundation for long-running stateless and stateful workloads in Kubernetes. Note that the Batch Workloads API (Job and CronJob) is not part of this effort and will have a separate path to GA stability.

Deployment and ReplicaSet, two of the most commonly used objects in Kubernetes, are now stabilized after more than a year of real-world use and feedback. SIG Apps has applied the lessons from this process to all four resource kinds over the last several release cycles, enabling DaemonSet and StatefulSet to join this graduation. The v1 (GA) designation indicates production hardening and readiness, and comes with the guarantee of long-term backwards compatibility.

Windows Support (beta)

Kubernetes was originally developed for Linux systems, but as our users are realizing the benefits of container orchestration at scale, we are seeing demand for Kubernetes to run Windows workloads. Work to support Windows Server in Kubernetes began in earnest about 12 months ago. SIG-Windowshas now promoted this feature to beta status, which means that we can evaluate it for usage.

Storage Enhancements

From the first release, Kubernetes has supported multiple options for persistent data storage, including commonly-used NFS or iSCSI, along with native support for storage solutions from the major public and private cloud providers. As the project and ecosystem grow, more and more storage options have become available for Kubernetes. Adding volume plugins for new storage systems, however, has been a challenge.

Container Storage Interface (CSI) is a cross-industry standards initiative that aims to lower the barrier for cloud native storage development and ensure compatibility. SIG-Storage and the CSI Community are collaborating to deliver a single interface for provisioning, attaching, and mounting storage compatible with Kubernetes.

Kubernetes 1.9 introduces an alpha implementation of the Container Storage Interface (CSI), which will make installing new volume plugins as easy as deploying a pod, and enable third-party storage providers to develop their solutions without the need to add to the core Kubernetes codebase.

Because the feature is alpha in 1.9, it must be explicitly enabled and is not recommended for production usage, but it indicates the roadmap working toward a more extensible and standards-based Kubernetes storage ecosystem.

Additional Features

Custom Resource Definition (CRD) Validation, now graduating to beta and enabled by default, helps CRD authors give clear and immediate feedback for invalid objects

SIG Node hardware accelerator moves to alpha, enabling GPUs and consequently machine learning and other high performance workloads

CoreDNS alpha makes it possible to install CoreDNS with standard tools

IPVS mode for kube-proxy goes beta, providing better scalability and performance for large clusters

Each Special Interest Group (SIG) in the community continues to deliver the most requested user features for their area. For a complete list, please visit the release notes.


Kubernetes 1.9 is available for download on GitHub. To get started with Kubernetes, check out these interactive tutorials

Release team

This release is made possible through the effort of hundreds of individuals who contributed both technical and non-technical content. Special thanks to the release team led by Anthony Yeh, Software Engineer at Google. The 14 individuals on the release team coordinate many aspects of the release, from documentation to testing, validation, and feature completeness.

As the Kubernetes community has grown, our release process has become an amazing demonstration of collaboration in open source software development. Kubernetes continues to gain new users at a rapid clip. This growth creates a positive feedback cycle where more contributors commit code creating a more vibrant ecosystem. 

Project Velocity

The CNCF has embarked on an ambitious project to visualize the myriad contributions that go into the project. K8s DevStats illustrates the breakdown of contributions from major company contributors. Open issues remained relatively stable over the course of the release, while forks rose approximately 20%, as did individuals starring the various project repositories. Approver volume has risen slightly since the last release, but a lull is commonplace during the last quarter of the year. With 75,000+ comments, Kubernetes remains one of the most actively discussed projects on GitHub.

User highlights

According to the latest survey conducted by CNCF, 61 percent of organizations are evaluating and 83 percent are using Kubernetes in production. Example of user stories from the community include:

BlaBlaCar, the world’s largest long distance carpooling community connects 40 million members across 22 countries. The company has about 3,000 pods, with 1,200 of them running on Kubernetes, leading to improved website availability for customers.

Pokémon GO, the popular free-to-play, location-based augmented reality game developed by Niantic for iOS and Android devices, has its application logic running on Google Container Engine powered by Kubernetes. This was the largest Kubernetes deployment ever on Google Container Engine.

Is Kubernetes helping your team? Share your story with the community. 

Ecosystem updates

Announced on November 13, the Certified Kubernetes Conformance Program ensures that Certified Kubernetes™ products deliver consistency and portability. Thirty-two Certified Kubernetes Distributions and Platforms are now available. Development of the certification program involved close collaboration between CNCF and the rest of the Kubernetes community, especially the Testing and Architecture Special Interest Groups (SIGs). The Kubernetes Architecture SIG is the final arbiter of the definition of API conformance for the program. The program also includes strong guarantees that commercial providers of Kubernetes will continue to release new versions to ensure that customers can take advantage of the rapid pace of ongoing development.

CNCF also offers online training that teaches the skills needed to create and configure a real-world Kubernetes cluster.


For recorded sessions from the largest Kubernetes gathering, KubeCon + CloudNativeCon in Austin from December 6-8, 2017, visit YouTube/CNCF. The premiere Kubernetes event will be back May 2-4, 2018 in Copenhagen and will feature technical sessions, case studies, developer deep dives, salons and more! CFP closes January 12, 2018. 


Join members of the Kubernetes 1.9 release team on January 9th from 10am-11am PT to learn about the major features in this release as they demo some of the highlights in the areas of Windows and Docker support, storage, admission control, and the workloads API. Register here.

Get involved:

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below.

Thank you for your continued feedback and support.

  • Post questions (or answer questions) on Stack Overflow
  • Join the community portal for advocates on K8sPort
  • Follow us on Twitter @Kubernetesio for latest updates
  • Chat with the community on Slack
  • Share your Kubernetes story.

Using eBPF in Kubernetes

December 07 2017


Kubernetes provides a high-level API and a set of components that hides almost all of the intricate and—to some of us—interesting details of what happens at the systems level. Application developers are not required to have knowledge of the machines’ IP tables, cgroups, namespaces, seccomp, or, nowadays, even the container runtime that their application runs on top of. But underneath, Kubernetes and the technologies upon which it relies (for example, the container runtime) heavily leverage core Linux functionalities.

This article focuses on a core Linux functionality increasingly used in networking, security and auditing, and tracing and monitoring tools. This functionality is called extended Berkeley Packet Filter (eBPF)

Note: In this article we use both acronyms: eBPF and BPF. The former is used for the extended BPF functionality, and the latter for “classic” BPF functionality.

What is BPF?

BPF is a mini-VM residing in the Linux kernel that runs BPF programs. Before running, BPF programs are loaded with the bpf() syscall and are validated for safety: checking for loops, code size, etc. BPF programs are attached to kernel objects and executed when events happen on those objects—for example, when a network interface emits a packet.

BPF Superpowers

BPF programs are event-driven by definition, an incredibly powerful concept, and executes code in the kernel when an event occurs. Netflix’s Brendan Gregg refers to BPF as a Linux superpower.

The ‘e’ in eBPF

Traditionally, BPF could only be attached to sockets for socket filtering. BPF’s first use case was in tcpdump. When you run tcpdump the filter is compiled into a BPF program and attached to a raw AF_PACKET socket in order to print out filtered packets.

But over the years, eBPF added the ability to attach to other kernel objects. In addition to socket filtering, some supported attach points are:

  • Kprobes (and userspace equivalents uprobes)
  • Tracepoints
  • Network schedulers or qdiscs for classification or action (tc)
  • XDP (eXpress Data Path) This and other, newer features like in-kernel helper functions and shared data-structures (maps) that can be used to communicate with user space, extend BPF’s capabilities.

    Existing Use Cases of eBPF with Kubernetes

    Several open-source Kubernetes tools already use eBPF and many use cases warrant a closer look, especially in areas such as networking, monitoring and security tools.

Dynamic Network Control and Visibility with Cilium

Cilium is a networking project that makes heavy use of eBPF superpowers to route and filter network traffic for container-based systems. By using eBPF, Cilium can dynamically generate and apply rules—even at the device level with XDP—without making changes to the Linux kernel itself.

The Cilium Agent runs on each host. Instead of managing IP tables, it translates network policy definitions to BPF programs that are loaded into the kernel and attached to a container’s virtual ethernet device. These programs are executed—rules applied—on each packet that is sent or received.

This diagram shows how the Cilium project works:

Depending on what network rules are applied, BPF programs may be attached with tc or XDP. By using XDP, Cilium can attach the BPF programs at the lowest possible point, which is also the most performant point in the networking software stack.

If you’d like to learn more about how Cilium uses eBPF, take a look at the project’s BPF and XDP reference guide.

Tracking TCP Connections in Weave Scope

Weave Scope is a tool for monitoring, visualizing and interacting with container-based systems. For our purposes, we’ll focus on how Weave Scope gets the TCP connections.

Weave Scope employs an agent that runs on each node of a cluster. The agent monitors the system, generates a report and sends it to the app server. The app server compiles the reports it receives and presents the results in the Weave Scope UI.

To accurately draw connections between containers, the agent attaches a BPF program to kprobes that track socket events: opening and closing connections. The BPF program, tcptracer-bpf, is compiled into an ELF object file and loaded using gopbf.

(As a side note, Weave Scope also has a plugin that make use of eBPF: HTTP statistics.)

To learn more about how this works and why it’s done this way, read this extensive post that the Kinvolk team wrote for the Weaveworks Blog. You can also watch a recent talk about the topic.

Limiting syscalls with seccomp-bpf

Linux has more than 300 system calls (read, write, open, close, etc.) available for use—or misuse. Most applications only need a small subset of syscalls to function properly. seccomp is a Linux security facility used to limit the set of syscalls that an application can use, thereby limiting potential misuse.

The original implementation of seccomp was highly restrictive. Once applied, if an application attempted to do anything beyond reading and writing to files it had already opened, seccomp sent a SIGKILL signal.

seccomp-bpf enables more complex filters and a wider range of actions. Seccomp-bpf, also known as seccomp mode 2, allows for applying custom filters in the form of BPF programs. When the BPF program is loaded, the filter is applied to each syscall and the appropriate action is taken (Allow, Kill, Trap, etc.).

seccomp-bpf is widely used in Kubernetes tools and exposed in Kubernetes itself. For example, seccomp-bpf is used in Docker to apply custom seccomp security profiles, in rkt to apply seccomp isolators, and in Kubernetes itself in its Security Context.

But in all of these cases the use of BPF is hidden behind libseccomp. Behind the scenes, libseccomp generates BPF code from rules provided to it. Once generated, the BPF program is loaded and the rules applied.

Potential Use Cases for eBPF with Kubernetes

eBPF is a relatively new Linux technology. As such, there are many uses that remain unexplored. eBPF itself is also evolving: new features are being added in eBPF that will enable new use cases that aren’t currently possible. In the following sections, we’re going to look at some of these that have only recently become possible and ones on the horizon. Our hope is that these features will be leveraged by open source tooling.

Pod and container level network statistics

BPF socket filtering is nothing new, but BPF socket filtering per cgroup is. Introduced in Linux 4.10, cgroup-bpf allows attaching eBPF programs to cgroups. Once attached, the program is executed for all packets entering or exiting any process in the cgroup.

A cgroup is, amongst other things, a hierarchical grouping of processes. In Kubernetes, this grouping is found at the container level. One idea for making use of cgroup-bpf, is to install BPF programs that collect detailed per-pod and/or per-container network statistics.

Generally, such statistics are collected by periodically checking the relevant file in Linux’s /sys directory or using Netlink. By using BPF programs attached to cgroups for this, we can get much more detailed statistics: for example, how many packets/bytes on tcp port 443, or how many packets/bytes from IP In general, because BPF programs have a kernel context, they can safely and efficiently deliver more detailed information to user space.

To explore the idea, the Kinvolk team implemented a proof-of-concept: https://github.com/kinvolk/cgnet. This project attaches a BPF program to each cgroup and exports the information to Prometheus.

There are of course other interesting possibilities, like doing actual packet filtering. But the obstacle currently standing in the way of this is having cgroup v2 support—required by cgroup-bpf—in Docker and Kubernetes.

Application-applied LSM

Linux Security Modules (LSM) implements a generic framework for security policies in the Linux kernel. SELinux and AppArmor are examples of these. Both of these implement rules at a system-global scope, placing the onus on the administrator to configure the security policies.

Landlock is another LSM under development that would co-exist with SELinux and AppArmor. An initial patchset has been submitted to the Linux kernel and is in an early stage of development. The main difference with other LSMs is that Landlock is designed to allow unprivileged applications to build their own sandbox, effectively restricting themselves instead of using a global configuration. With Landlock, an application can load a BPF program and have it executed when the process performs a specific action. For example, when the application opens a file with the open() system call, the kernel will execute the BPF program, and, depending on what the BPF program returns, the action will be accepted or denied.

In some ways, it is similar to seccomp-bpf: using a BPF program, seccomp-bpf allows unprivileged processes to restrict what system calls they can perform. Landlock will be more powerful and provide more flexibility. Consider the following system call:

fd = open(“myfile.txt”, O\_RDWR);

The first argument is a “char *”, a pointer to a memory address, such as 0xab004718.

With seccomp, a BPF program only has access to the parameters of the syscall but cannot dereference the pointers, making it impossible to make security decisions based on a file. seccomp also uses classic BPF, meaning it cannot make use of eBPF maps, the mechanism for interfacing with user space. This restriction means security policies cannot be changed in seccomp-bpf based on a configuration in an eBPF map.

BPF programs with Landlock don’t receive the arguments of the syscalls but a reference to a kernel object. In the example above, this means it will have a reference to the file, so it does not need to dereference a pointer, consider relative paths, or perform chroots.

Use Case: Landlock in Kubernetes-based serverless frameworks

In Kubernetes, the unit of deployment is a pod. Pods and containers are the main unit of isolation. In serverless frameworks, however, the main unit of deployment is a function. Ideally, the unit of deployment equals the unit of isolation. This puts serverless frameworks like Kubeless or OpenFaaS into a predicament: optimize for unit of isolation or deployment?

To achieve the best possible isolation, each function call would have to happen in its own container—ut what’s good for isolation is not always good for performance. Inversely, if we run function calls within the same container, we increase the likelihood of collisions.

By using Landlock, we could isolate function calls from each other within the same container, making a temporary file created by one function call inaccessible to the next function call, for example. Integration between Landlock and technologies like Kubernetes-based serverless frameworks would be a ripe area for further exploration.

Auditing kubectl-exec with eBPF

In Kubernetes 1.7 the audit proposal started making its way in. It’s currently pre-stable with plans to be stable in the 1.10 release. As the name implies, it allows administrators to log and audit events that take place in a Kubernetes cluster.

While these events log Kubernetes events, they don’t currently provide the level of visibility that some may require. For example, while we can see that someone has used kubectl exec to enter a container, we are not able to see what commands were executed in that session. With eBPF one can attach a BPF program that would record any commands executed in the kubectl exec session and pass those commands to a user-space program that logs those events. We could then play that session back and know the exact sequence of events that took place.

Learn more about eBPF

If you’re interested in learning more about eBPF, here are some resources:


We are just starting to see the Linux superpowers of eBPF being put to use in Kubernetes tools and technologies. We will undoubtedly see increased use of eBPF. What we have highlighted here is just a taste of what you might expect in the future. What will be really exciting is seeing how these technologies will be used in ways that we have not yet thought about. Stay tuned!

The Kinvolk team will be hanging out at the Kinvolk booth at KubeCon in Austin. Come by to talk to us about all things, Kubernetes, Linux, container runtimes and yeah, eBPF.

PaddlePaddle Fluid: Elastic Deep Learning on Kubernetes

December 06 2017

Editor’s note: Today’s post is a joint post from the deep learning team at Baidu and the etcd team at CoreOS.

PaddlePaddle Fluid: Elastic Deep Learning on Kubernetes

Two open source communities—PaddlePaddle, the deep learning framework originated in Baidu, and Kubernetes®, the most famous containerized application scheduler—are announcing the Elastic Deep Learning (EDL) feature in PaddlePaddle’s new release codenamed Fluid.

Fluid EDL includes a Kubernetes controller, PaddlePaddle auto-scaler, which changes the number of processes of distributed jobs according to the idle hardware resource in the cluster, and a new fault-tolerable architecture as described in the PaddlePaddle design doc.

Industrial deep learning requires significant computation power. Research labs and companies often build GPU clusters managed by SLURM, MPI, or SGE. These clusters either run a submitted job if it requires less than the idle resource, or pend the job for an unpredictably long time. This approach has its drawbacks: in an example with 99 available nodes and a submitted job that requires 100, the job has to wait without using any of the available nodes. Fluid works with Kubernetes to power elastic deep learning jobs, which often lack optimal resources, by helping to expose potential algorithmic problems as early as possible.

Another challenge is that industrial users tend to run deep learning jobs as a subset stage of the complete data pipeline, including the web server and log collector. Such general-purpose clusters require priority-based elastic scheduling. This makes it possible to run more processes in the web server job and less in deep learning during periods of high web traffic, then prioritize deep learning when web traffic is low. Fluid talks to Kubernetes’ API server to understand the global picture and orchestrate the number of processes affiliated with various jobs.

In both scenarios, PaddlePaddle jobs are tolerant to a process spikes and decreases. We achieved this by implementing the new design, which introduces a master process in addition to the old PaddlePaddle architecture as described in a previous blog post. In the new design, as long as there are three processes left in a job, it continues. In extreme cases where all processes are killed, the job can be restored and resume.

We tested Fluid EDL for two use cases: 1) the Kubernetes cluster runs only PaddlePaddle jobs; and 2) the cluster runs PaddlePaddle and Nginx jobs.

In the first test, we started up to 20 PaddlePaddle jobs one by one with a 10-second interval. Each job has 60 trainers and 10 parameter server processes, and will last for hours. We repeated the experiment 20 times: 10 with FluidEDL turned off and 10 with FluidEDL turned on. In Figure one, solid lines correspond to the first 10 experiments and dotted lines the rest. In the upper part of the figure, we see that the number of pending jobs increments monotonically without EDL. However, when EDL is turned on, resources are evenly distributed to all jobs. Fluid EDL kills some existing processes to make room for new jobs and jobs coming in at a later point in time. In both cases, the cluster is equally utilized (see lower part of figure).

Figure 1. Fluid EDL evenly distributes resource among jobs.

In the second test, each experiment ran 400 Nginx pods, which has higher priority than the six PaddlePaddle jobs. Initially, each PaddlePaddle job had 15 trainers and 10 parameter servers. We killed 100 Nginx pods every 90 seconds until 100 left, and then we started to increase the number of Nginx jobs by 100 every 90 seconds. The upper part of Figure 2 shows this process. The middle of the diagram shows that Fluid EDL automatically started some PaddlePaddle processes by decreasing Nginx pods, and killed PaddlePaddle processes by increasing Nginx pods later on. As a result, the cluster maintains around 90% utilization as shown in the bottom of the figure. When Fluid EDL was turned off, there were no PaddlePaddle processes autoincrement, and the utilization fluctuated with the varying number of Nginx pods.

Figure 2. Fluid changes PaddlePaddle processes with the change of Nginx processes.

We continue to work on FluidEDL and welcome comments and contributions. Visit the PaddlePaddle repo, where you can find the design doc, a simple tutorial, and experiment details.

  • Xu Yan (Baidu Research)
  • Helin Wang (Baidu Research)
  • Yi Wu (Baidu Research)
  • Xi Chen (Baidu Research)
  • Weibao Gong (Baidu Research)
  • Xiang Li (CoreOS)

  • Yi Wang (Baidu Research)

Autoscaling in Kubernetes

November 17 2017

Kubernetes allows developers to automatically adjust cluster sizes and the number of pod replicas based on current traffic and load. These adjustments reduce the amount of unused nodes, saving money and resources. In this talk, Marcin Wielgus of Google walks you through the current state of pod and node autoscaling in Kubernetes: .how it works, and how to use it, including best practices for deployments in production applications.

Enjoyed this talk? Join us for more exciting sessions on scaling and automating your Kubernetes clusters at KubeCon in Austin on December 6-8. Register Now

Be sure to check out Automating and Testing Production Ready Kubernetes Clusters in the Public Cloud by Ron Lipke, Senior Developer, Platform as a Service, Gannet/USA Today Network.

@Kubernetesio View on Github #kubernetes-users Stack Overflow Download Kubernetes