Kubernetes Blog

Cluster Federation in Kubernetes 1.5

December 22 2016

Editor’s note: this post is part of a series of in-depth articles on what’s new in Kubernetes 1.5

In the latest Kubernetes 1.5 release, you’ll notice that support for Cluster Federation is maturing. That functionality was introduced in Kubernetes 1.3, and the 1.5 release includes a number of new features, including an easier setup experience and a step closer to supporting all Kubernetes API objects.

A new command line tool called ‘kubefed’ was introduced to make getting started with Cluster Federation much simpler. Also, alpha level support was added for Federated DaemonSets, Deployments and ConfigMaps. In summary:

  • DaemonSets are Kubernetes deployment rules that guarantee that a given pod is always present at every node, as new nodes are added to the cluster (more info).
  • Deployments describe the desired state of Replica Sets (more info).
  • ConfigMaps are variables applied to Replica Sets (which greatly improves image reusability as their parameters can be externalized - more info). Federated DaemonSets , Federated Deployments , Federated ConfigMaps take the qualities of the base concepts to the next level. For instance, Federated DaemonSets guarantee that a pod is deployed on every node of the newly added cluster.

But what actually is “federation”? Let’s explain it by what needs it satisfies. Imagine a service that operates globally. Naturally, all its users expect to get the same quality of service, whether they are located in Asia, Europe, or the US. What this means is that the service must respond equally fast to requests at each location. This sounds simple, but there’s lots of logic involved behind the scenes. This is what Kubernetes Cluster Federation aims to do.

How does it work? One of the Kubernetes clusters must become a master by running a Federation Control Plane. In practice, this is a controller that monitors the health of other clusters, and provides a single entry point for administration. The entry point behaves like a typical Kubernetes cluster. It allows creating Replica Sets, Deployments, Services, but the federated control plane passes the resources to underlying clusters. This means that if we request the federation control plane to create a Replica Set with 1,000 replicas, it will spread the request across all underlying clusters. If we have 5 clusters, then by default each will get its share of 200 replicas.

This on its own is a powerful mechanism. But there’s more. It’s also possible to create a Federated Ingress. Effectively, this is a global application-layer load balancer. Thanks to an understanding of the application layer, it allows load balancing to be “smarter” – for instance, by taking into account the geographical location of clients and servers, and routing the traffic between them in an optimal way.

In summary, with Kubernetes Cluster Federation, we can facilitate administration of all the clusters (single access point), but also optimize global content delivery around the globe. In the following sections, we will show how it works.

Creating a Federation Plane

In this exercise, we will federate a few clusters. For convenience, all commands have been grouped into 6 scripts available here:

  • 0-settings.sh
  • 1-create.sh
  • 2-getcredentials.sh
  • 3-initfed.sh
  • 4-joinfed.sh
  • 5-destroy.sh First we need to define several variables (0-settings.sh)
$ cat 0-settings.sh && . 0-settings.sh

# this project create 3 clusters in 3 zones. FED\_HOST\_CLUSTER points to the one, which will be used to deploy federation control plane

export FED\_HOST\_CLUSTER=us-east1-b

# Google Cloud project name

export FED\_PROJECT=\<YOUR PROJECT e.g. company-project\>

# DNS suffix for this federation. Federated Service DNS names are published with this suffix. This must be a real domain name that you control and is programmable by one of the DNS providers (Google Cloud DNS or AWS Route53)

export FED\_DNS\_ZONE=\<YOUR DNS SUFFIX e.g. example.com\>

And get kubectl and kubefed binaries. (for installation instructions refer to guides here and here).
Now the setup is ready to create a few Google Container Engine (GKE) clusters with gcloud container clusters create (1-create.sh). In this case one is in US, one in Europe and one in Asia.

$ cat 1-create.sh && . 1-create.sh

gcloud container clusters create gce-us-east1-b --project=${FED\_PROJECT} --zone=us-east1-b --scopes cloud-platform,storage-ro,logging-write,monitoring-write,service-control,service-management,https://www.googleapis.com/auth/ndev.clouddns.readwrite

gcloud container clusters create gce-europe-west1-b --project=${FED\_PROJECT} --zone=europe-west1-b --scopes cloud-platform,storage-ro,logging-write,monitoring-write,service-control,service-management,https://www.googleapis.com/auth/ndev.clouddns.readwrite

gcloud container clusters create gce-asia-east1-a --project=${FED\_PROJECT} --zone=asia-east1-a --scopes cloud-platform,storage-ro,logging-write,monitoring-write,service-control,service-management,https://www.googleapis.com/auth/ndev.clouddns.readwrite

The next step is fetching kubectl configuration with gcloud -q container clusters get-credentials (2-getcredentials.sh). The configurations will be used to indicate the current context for kubectl commands.

$ cat 2-getcredentials.sh && . 2-getcredentials.sh

gcloud -q container clusters get-credentials gce-us-east1-b --zone=us-east1-b --project=${FED\_PROJECT}

gcloud -q container clusters get-credentials gce-europe-west1-b --zone=europe-west1-b --project=${FED\_PROJECT}

gcloud -q container clusters get-credentials gce-asia-east1-a --zone=asia-east1-a --project=${FED\_PROJECT}

Let’s verify the setup:

$ kubectl config get-contexts












We have 3 clusters. One, indicated by the FED_HOST_CLUSTER environment variable, will be used to run the federation plane. For this, we will use the kubefed init federation command (3-initfed.sh).

$ cat 3-initfed.sh && . 3-initfed.sh

kubefed init federation --host-cluster-context=gke\_${FED\_PROJECT}\_${FED\_HOST\_CLUSTER}\_gce-${FED\_HOST\_CLUSTER} --dns-zone-name=${FED\_DNS\_ZONE}

You will notice that after executing the above command, a new kubectl context has appeared:

$ kubectl config get-contexts





The federation context will become our administration entry point. Now it’s time to join clusters (4-joinfed.sh):

$ cat 4-joinfed.sh && . 4-joinfed.sh

kubefed --context=federation join cluster-europe-west1-b --cluster-context=gke\_${FED\_PROJECT}\_europe-west1-b\_gce-europe-west1-b --host-cluster-context=gke\_${FED\_PROJECT}\_${FED\_HOST\_CLUSTER}\_gce-${FED\_HOST\_CLUSTER}

kubefed --context=federation join cluster-asia-east1-a --cluster-context=gke\_${FED\_PROJECT}\_asia-east1-a\_gce-asia-east1-a --host-cluster-context=gke\_${FED\_PROJECT}\_${FED\_HOST\_CLUSTER}\_gce-${FED\_HOST\_CLUSTER}

kubefed --context=federation join cluster-us-east1-b --cluster-context=gke\_${FED\_PROJECT}\_us-east1-b\_gce-us-east1-b --host-cluster-context=gke\_${FED\_PROJECT}\_${FED\_HOST\_CLUSTER}\_gce-${FED\_HOST\_CLUSTER}

Note that cluster gce-us-east1-b is used here to run the federation control plane and also to work as a worker cluster. This circular dependency helps to use resources more efficiently and it can be verified by using the kubectl –context=federation get clusters command:

$ kubectl --context=federation get clusters

NAME                        STATUS    AGE

cluster-asia-east1-a        Ready     7s

cluster-europe-west1-b      Ready     10s

cluster-us-east1-b          Ready     10s

We are good to go.

Using Federation To Run An Application

In our repository you will find instructions on how to build a docker image with a web service that displays the container’s hostname and the Google Cloud Platform (GCP) zone.

An example output might look like this:


Now we will deploy the Replica Set (k8shserver.yaml):

$ kubectl --context=federation create -f rs/k8shserver

And a Federated Service (k8shserver.yaml):

$ kubectl --context=federation create -f service/k8shserver

As you can see, the two commands refer to the “federation” context, i.e. to the federation control plane. After a few minutes, you will realize that underlying clusters run the Replica Set and the Service.

Creating The Ingress

After the Service is ready, we can create Ingress - the global load balancer. The command is like this:

kubectl --context=federation create -f ingress/k8shserver.yaml

The contents of the file point to the service we created in the previous step:

apiVersion: extensions/v1beta1

kind: Ingress


  name: k8shserver



    serviceName: k8shserver

    servicePort: 80

After a few minutes, we should get a global IP address:

$ kubectl --context=federation get ingress

NAME         HOSTS     ADDRESS          PORTS     AGE

k8shserver   \*   80        20m

Effectively, the response of:

$ curl

depends on the location of client. Something like this would be expected in the US:


Whereas in Europe, we might have:


Please refer to this issue for additional details on how everything we’ve described works.



Cluster Federation is actively being worked on and is still not fully General Available. Some APIs are in beta and others are in alpha. Some features are missing, for instance cross-cloud load balancing is not supported (federated ingress currently only works on Google Cloud Platform as it depends on GCP HTTP(S) Load Balancing).

Nevertheless, as the functionality matures, it will become an enabler for all companies that aim at global markets, but currently cannot afford sophisticated administration techniques as used by the likes of Netflix or Amazon. That’s why we closely watch the technology, hoping that it soon fulfills its promise.

PS. When done, remember to destroy your clusters:

$ . 5-destroy.sh

-__-Lukasz Guminski, Software Engineer at Container Solutions. Allan Naim, Product Manager, Google

Windows Server Support Comes to Kubernetes

December 21 2016

Editor’s note: this post is part of a series of in-depth articles on what’s new in Kubernetes 1.5

Extending on the theme of giving users choice, Kubernetes 1.5 release includes the support for Windows Servers. WIth more than 80% of enterprise apps running Java on Linux or .Net on Windows, Kubernetes is previewing capabilities that extends its reach to the mass majority of enterprise workloads. 

The new Kubernetes Windows Server 2016 and Windows Container support includes public preview with the following features:

  • Containerized Multiplatform Applications - Applications developed in operating system neutral languages like Go and .NET Core were previously impossible to orchestrate between Linux and Windows. Now, with support for Windows Server 2016 in Kubernetes, such applications can be deployed on both Windows Server as well as Linux, giving the developer choice of the operating system runtime. This capability has been desired by customers for almost two decades. 

  • Support for Both Windows Server Containers and Hyper-V Containers - There are two types of containers in Windows Server 2016. Windows Containers is similar to Docker containers on Linux, and uses kernel sharing. The other, called Hyper-V Containers, is more lightweight than a virtual machine while at the same time offering greater isolation, its own copy of the kernel, and direct memory assignment. Kubernetes can orchestrate both these types of containers. 

  • Expanded Ecosystem of Applications - One of the key drivers of introducing Windows Server support in Kubernetes is to expand the ecosystem of applications supported by Kubernetes: IIS, .NET, Windows Services, ASP.NET, .NET Core, are some of the application types that can now be orchestrated by Kubernetes, running inside a container on Windows Server.

  • Coverage for Heterogeneous Data Centers - Organizations already use Kubernetes to host tens of thousands of application instances across Global 2000 and Fortune 500. This will allow them to expand Kubernetes to the large footprint of Windows Server. 

The process to bring Windows Server to Kubernetes has been a truly multi-vendor effort and championed by the Windows Special Interest Group (SIG) - Apprenda, Google, Red Hat and Microsoft were all involved in bringing Kubernetes to Windows Server. On the community effort to bring Kubernetes to Windows Server, Taylor Brown, Principal Program Manager at Microsoft stated that “This new Kubernetes community work furthers Windows Server container support options for popular orchestrators, reinforcing Microsoft’s commitment to choice and flexibility for both Windows and Linux ecosystems.”

Guidance for Current Usage

| Where to use Windows Server support? | Right now organizations should start testing Kubernetes on Windows Server and provide feedback. Most organizations take months to set up hardened production environments and general availability should be available in next few releases of Kubernetes. | | What works? | Most of the Kubernetes constructs, such as Pods, Services, Labels, etc. work with Windows Containers. | | What doesn’t work yet? | - Pod abstraction is not same due to networking namespaces. Net result is that Windows containers in a single POD cannot communicate over localhost. Linux containers can share networking stack by placing them in the same network namespace. - DNS capabilities are not fully implemented - UDP is not supported inside a container | | When will it be ready for all production workloads (general availability)? | The goal is to refine the networking and other areas that need work to get Kubernetes users a production version of Windows Server 2016 - including with Windows Nano Server and Windows Server Core installation options - support in the next couple releases. |

Technical Demo


Support for Windows Server-based containers is in alpha release mode for Kubernetes 1.5, but the community is not stopping there. Customers want enterprise hardened container scheduling and management for their entire tech portfolio. That has to include full parity of features among Linux and Windows Server in production. The Windows Server SIG will deliver that parity within the next one or two releases of Kubernetes through a few key areas of investment:

  • Networking - the SIG will continue working side by side with Microsoft to enhance the networking backbone of Windows Server Containers, specifically around lighting up container mode networking and native network overlay support for container endpoints. 
  • OOBE - Improving the setup, deployment, and diagnostics for a Windows Server node, including the ability to deploy to any cloud (Azure, AWS, GCP)
  • Runtime Operations - the SIG will play a key part in defining the monitoring interface of the Container Runtime Interface (CRI), leveraging it to provide deep insight and monitoring for Windows Server-based containers Get Started

To get started with Kubernetes on Windows Server 2016, please visit the GitHub guide for more details.
If you want to help with Windows Server support, then please connect with the Windows Server SIG or connect directly with Michael Michael, the SIG lead, on GitHub

Michael Michael, Senior Director of Product Management, Apprenda 

Kubernetes on Windows Server 2016 Architecture

StatefulSet: Run and Scale Stateful Applications Easily in Kubernetes

December 20 2016

Editor’s note: this post is part of a series of in-depth articles on what’s new in Kubernetes 1.5

In the latest release, Kubernetes 1.5, we’ve moved the feature formerly known as PetSet into beta as StatefulSet. There were no major changes to the API Object, other than the community selected name, but we added the semantics of “at most one pod per index” for deployment of the Pods in the set. Along with ordered deployment, ordered termination, unique network names, and persistent stable storage, we think we have the right primitives to support many containerized stateful workloads. We don’t claim that the feature is 100% complete (it is software after all), but we believe that it is useful in its current form, and that we can extend the API in a backwards-compatible way as we progress toward an eventual GA release.

When is StatefulSet the Right Choice for my Storage Application?

Deployments and ReplicaSets are a great way to run stateless replicas of an application on Kubernetes, but their semantics aren’t really right for deploying stateful applications. The purpose of StatefulSet is to provide a controller with the correct semantics for deploying a wide range of stateful workloads. However, moving your storage application onto Kubernetes isn’t always the correct choice. Before you go all in on converging your storage tier and your orchestration framework, you should ask yourself a few questions.

Can your application run using remote storage or does it require local storage media?

Currently, we recommend using StatefulSets with remote storage. Therefore, you must be ready to tolerate the performance implications of network attached storage. Even with storage optimized instances, you won’t likely realize the same performance as locally attached, solid state storage media. Does the performance of network attached storage, on your cloud, allow your storage application to meet its SLAs? If so, running your application in a StatefulSet provides compelling benefits from the perspective of automation. If the node on which your storage application is running fails, the Pod containing the application can be rescheduled onto another node, and, as it’s using network attached storage media, its data are still available after it’s rescheduled.

Do you need to scale your storage application?

What is the benefit you hope to gain by running your application in a StatefulSet? Do you have a single instance of your storage application for your entire organization? Is scaling your storage application a problem that you actually have? If you have a few instances of your storage application, and they are successfully meeting the demands of your organization, and those demands are not rapidly increasing, you’re already at a local optimum.

If, however, you have an ecosystem of microservices, or if you frequently stamp out new service footprints that include storage applications, then you might benefit from automation and consolidation. If you’re already using Kubernetes to manage the stateless tiers of your ecosystem, you should consider using the same infrastructure to manage your storage applications.

How important is predictable performance?

Kubernetes doesn’t yet support isolation for network or storage I/O across containers. Colocating your storage application with a noisy neighbor can reduce the QPS that your application can handle. You can mitigate this by scheduling the Pod containing your storage application as the only tenant on a node (thus providing it a dedicated machine) or by using Pod anti-affinity rules to segregate Pods that contend for network or disk, but this means that you have to actively identify and mitigate hot spots.

If squeezing the absolute maximum QPS out of your storage application isn’t your primary concern, if you’re willing and able to mitigate hotspots to ensure your storage applications meet their SLAs, and if the ease of turning up new “footprints” (services or collections of services), scaling them, and flexibly re-allocating resources is your primary concern, Kubernetes and StatefulSet might be the right solution to address it.

Does your application require specialized hardware or instance types?

If you run your storage application on high-end hardware or extra-large instance sizes, and your other workloads on commodity hardware or smaller, less expensive images, you may not want to deploy a heterogenous cluster. If you can standardize on a single instance size for all types of apps, then you may benefit from the flexible resource reallocation and consolidation, that you get from Kubernetes.

A Practical Example - ZooKeeper

ZooKeeper is an interesting use case for StatefulSet for two reasons. First, it demonstrates that StatefulSet can be used to run a distributed, strongly consistent storage application on Kubernetes. Second, it’s a prerequisite for running workloads like Apache Hadoop and Apache Kakfa on Kubernetes. An in-depth tutorial on deploying a ZooKeeper ensemble on Kubernetes is available in the Kubernetes documentation, and we’ll outline a few of the key features below.

Creating a ZooKeeper Ensemble
Creating an ensemble is as simple as using kubectl create to generate the objects stored in the manifest.

$ kubectl create -f [http://k8s.io/docs/tutorials/stateful-application/zookeeper.yaml](https://raw.githubusercontent.com/kubernetes/kubernetes.github.io/master/docs/tutorials/stateful-application/zookeeper.yaml)

service "zk-headless" created

configmap "zk-config" created

poddisruptionbudget "zk-budget" created

statefulset "zk" created

When you create the manifest, the StatefulSet controller creates each Pod, with respect to its ordinal, and waits for each to be Running and Ready prior to creating its successor.

$ kubectl get -w -l app=zk


zk-0      0/1       Pending   0          0s

zk-0      0/1       Pending   0         0s

zk-0      0/1       Pending   0         7s

zk-0      0/1       ContainerCreating   0         7s

zk-0      0/1       Running   0         38s

zk-0      1/1       Running   0         58s

zk-1      0/1       Pending   0         1s

zk-1      0/1       Pending   0         1s

zk-1      0/1       ContainerCreating   0         1s

zk-1      0/1       Running   0         33s

zk-1      1/1       Running   0         51s

zk-2      0/1       Pending   0         0s

zk-2      0/1       Pending   0         0s

zk-2      0/1       ContainerCreating   0         0s

zk-2      0/1       Running   0         25s

zk-2      1/1       Running   0         40s

Examining the hostnames of each Pod in the StatefulSet, you can see that the Pods’ hostnames also contain the Pods’ ordinals.

$ for i in 0 1 2; do kubectl exec zk-$i -- hostname; done




ZooKeeper stores the unique identifier of each server in a file called “myid”. The identifiers used for ZooKeeper servers are just natural numbers. For the servers in the ensemble, the “myid” files are populated by adding one to the ordinal extracted from the Pods’ hostnames.

$ for i in 0 1 2; do echo "myid zk-$i";kubectl exec zk-$i -- cat /var/lib/zookeeper/data/myid; done

myid zk-0


myid zk-1


myid zk-2


Each Pod has a unique network address based on its hostname and the network domain controlled by the zk-headless Headless Service.

$  for i in 0 1 2; do kubectl exec zk-$i -- hostname -f; done




The combination of a unique Pod ordinal and a unique network address allows you to populate the ZooKeeper servers’ configuration files with a consistent ensemble membership.

$  kubectl exec zk-0 -- cat /opt/zookeeper/conf/zoo.cfg








minSessionTimeout= 4000

maxSessionTimeout= 40000






StatefulSet lets you deploy ZooKeeper in a consistent and reproducible way. You won’t create more than one server with the same id, the servers can find each other via a stable network addresses, and they can perform leader election and replicate writes because the ensemble has consistent membership.

The simplest way to verify that the ensemble works is to write a value to one server and to read it from another. You can use the “zkCli.sh” script that ships with the ZooKeeper distribution, to create a ZNode containing some data.

$  kubectl exec zk-0 zkCli.sh create /hello world



WatchedEvent state:SyncConnected type:None path:null

Created /hello

You can use the same script to read the data from another server in the ensemble.

$  kubectl exec zk-1 zkCli.sh get /hello



WatchedEvent state:SyncConnected type:None path:null



You can take the ensemble down by deleting the zk StatefulSet.

$  kubectl delete statefulset zk

statefulset "zk" deleted

The cascading delete destroys each Pod in the StatefulSet, with respect to the reverse order of the Pods’ ordinals, and it waits for each to terminate completely before terminating its predecessor.

$  kubectl get pods -w -l app=zk


zk-0      1/1       Running   0          14m

zk-1      1/1       Running   0          13m

zk-2      1/1       Running   0          12m


zk-2      1/1       Terminating   0          12m

zk-1      1/1       Terminating   0         13m

zk-0      1/1       Terminating   0         14m

zk-2      0/1       Terminating   0         13m

zk-2      0/1       Terminating   0         13m

zk-2      0/1       Terminating   0         13m

zk-1      0/1       Terminating   0         14m

zk-1      0/1       Terminating   0         14m

zk-1      0/1       Terminating   0         14m

zk-0      0/1       Terminating   0         15m

zk-0      0/1       Terminating   0         15m

zk-0      0/1       Terminating   0         15m

You can use kubectl apply to recreate the zk StatefulSet and redeploy the ensemble.

$  kubectl apply -f [http://k8s.io/docs/tutorials/stateful-application/zookeeper.yaml](https://raw.githubusercontent.com/kubernetes/kubernetes.github.io/master/docs/tutorials/stateful-application/zookeeper.yaml)

service "zk-headless" configured

configmap "zk-config" configured

statefulset "zk" created

If you use the “zkCli.sh” script to get the value entered prior to deleting the StatefulSet, you will find that the ensemble still serves the data.

$  kubectl exec zk-2 zkCli.sh get /hello



WatchedEvent state:SyncConnected type:None path:null



StatefulSet ensures that, even if all Pods in the StatefulSet are destroyed, when they are rescheduled, the ZooKeeper ensemble can elect a new leader and continue to serve requests.

Tolerating Node Failures

ZooKeeper replicates its state machine to different servers in the ensemble for the explicit purpose of tolerating node failure. By default, the Kubernetes Scheduler could deploy more than one Pod in the zk StatefulSet to the same node. If the zk-0 and zk-1 Pods were deployed on the same node, and that node failed, the ZooKeeper ensemble couldn’t form a quorum to commit writes, and the ZooKeeper service would experience an outage until one of the Pods could be rescheduled.

You should always provision headroom capacity for critical processes in your cluster, and if you do, in this instance, the Kubernetes Scheduler will reschedule the Pods on another node and the outage will be brief.

If the SLAs for your service preclude even brief outages due to a single node failure, you should use a PodAntiAffinity annotation. The manifest used to create the ensemble contains such an annotation, and it tells the Kubernetes Scheduler to not place more than one Pod from the zk StatefulSet on the same node.

Tolerating Planned Maintenance

The manifest used to create the ZooKeeper ensemble also creates a PodDistruptionBudget, zk-budget. The zk-budget informs Kubernetes about the upper limit of disruptions (unhealthy Pods) that the service can tolerate.


              "podAntiAffinity": {

                "requiredDuringSchedulingRequiredDuringExecution": [{

                  "labelSelector": {

                    "matchExpressions": [{

                      "key": "app",

                      "operator": "In",

                      "values": ["zk-headless"]



                  "topologyKey": "kubernetes.io/hostname"




$ kubectl get poddisruptionbudget zk-budget


zk-budget   2               1                     2h

zk-budget indicates that at least two members of the ensemble must be available at all times for the ensemble to be healthy. If you attempt to drain a node prior taking it offline, and if draining it would terminate a Pod that violates the budget, the drain operation will fail. If you use kubectl drain, in conjunction with PodDisruptionBudgets, to cordon your nodes and to evict all Pods prior to maintenance or decommissioning, you can ensure that the procedure won’t be disruptive to your stateful applications.

Looking Forward

As the Kubernetes development looks towards GA, we are looking at a long list of suggestions from users. If you want to dive into our backlog, checkout the GitHub issues with the stateful label. However, as the resulting API would be hard to comprehend, we don’t expect to implement all of these feature requests. Some feature requests, like support for rolling updates, better integration with node upgrades, and using fast local storage, would benefit most types of stateful applications, and we expect to prioritize these. The intention of StatefulSet is to be able to run a large number of applications well, and not to be able to run all applications perfectly. With this in mind, we avoided implementing StatefulSets in a way that relied on hidden mechanisms or inaccessible features. Anyone can write a controller that works similarly to StatefulSets. We call this “making it forkable.”

Over the next year, we expect many popular storage applications to each have their own community-supported, dedicated controllers or “operators”. We’ve already heard of work on custom controllers for etcd, Redis, and ZooKeeper. We expect to write some more ourselves and to support the community in developing others.

The Operators for etcd and Prometheus from CoreOS, demonstrate an approach to running stateful applications on Kubernetes that provides a level of automation and integration beyond that which is possible with StatefulSet alone. On the other hand, using a generic controller like StatefulSet or Deployment means that a wide range of applications can be managed by understanding a single config object. We think Kubernetes users will appreciate having the choice of these two approaches.

–Kenneth Owens & Eric Tune, Software Engineers, Google

Five Days of Kubernetes 1.5

December 19 2016

With the help of our growing community of 1,000 contributors, we pushed some 5,000 commits to extend support for production workloads and deliver Kubernetes 1.5. While many improvements and new features have been added, we selected few to highlight in a series of in-depths posts listed below. 

This progress is our commitment in continuing to make Kubernetes best way to manage your production workloads at scale.

  Five Days of Kubernetes 1.5
Day 1 Introducing Container Runtime Interface (CRI) in Kubernetes
Day 2 StatefulSet: Run and Scale Stateful Applications Easily in Kubernetes
Day 3 Windows Server Support Comes to Kubernetes
Day 4 Cluster Federation in Kubernetes 1.5
Day 5 Kubernetes supports OpenAPI


- Download Kubernetes - Get involved with the Kubernetes project on GitHub - Post questions (or answer questions) on Stack Overflow - Connect with the community on Slack - Follow us on Twitter @Kubernetesio for latest updates

Introducing Container Runtime Interface (CRI) in Kubernetes

December 19 2016

Editor’s note: this post is part of a series of in-depth articles on what’s new in Kubernetes 1.5

At the lowest layers of a Kubernetes node is the software that, among other things, starts and stops containers. We call this the “Container Runtime”. The most widely known container runtime is Docker, but it is not alone in this space. In fact, the container runtime space has been rapidly evolving. As part of the effort to make Kubernetes more extensible, we’ve been working on a new plugin API for container runtimes in Kubernetes, called “CRI”.

What is the CRI and why does Kubernetes need it?

Each container runtime has it own strengths, and many users have asked for Kubernetes to support more runtimes. In the Kubernetes 1.5 release, we are proud to introduce the Container Runtime Interface (CRI) – a plugin interface which enables kubelet to use a wide variety of container runtimes, without the need to recompile. CRI consists of a protocol buffers and gRPC API, and libraries, with additional specifications and tools under active development. CRI is being released as Alpha in Kubernetes 1.5.

Supporting interchangeable container runtimes is not a new concept in Kubernetes. In the 1.3 release, we announced the rktnetes project to enable rkt container engine as an alternative to the Docker container runtime. However, both Docker and rkt were integrated directly and deeply into the kubelet source code through an internal and volatile interface. Such an integration process requires a deep understanding of Kubelet internals and incurs significant maintenance overhead to the Kubernetes community. These factors form high barriers to entry for nascent container runtimes. By providing a clearly-defined abstraction layer, we eliminate the barriers and allow developers to focus on building their container runtimes. This is a small, yet important step towards truly enabling pluggable container runtimes and building a healthier ecosystem.

Overview of CRI
Kubelet communicates with the container runtime (or a CRI shim for the runtime) over Unix sockets using the gRPC framework, where kubelet acts as a client and the CRI shim as the server.

The protocol buffers API includes two gRPC services, ImageService, and RuntimeService. The ImageService provides RPCs to pull an image from a repository, inspect, and remove an image. The RuntimeService contains RPCs to manage the lifecycle of the pods and containers, as well as calls to interact with containers (exec/attach/port-forward). A monolithic container runtime that manages both images and containers (e.g., Docker and rkt) can provide both services simultaneously with a single socket. The sockets can be set in Kubelet by –container-runtime-endpoint and –image-service-endpoint flags.
Pod and container lifecycle management

service RuntimeService {

    // Sandbox operations.

    rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse) {}  
    rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {}  
    rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse) {}  
    rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse) {}  
    rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse) {}  

    // Container operations.  
    rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {}  
    rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {}  
    rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {}  
    rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse) {}  
    rpc ListContainers(ListContainersRequest) returns (ListContainersResponse) {}  
    rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse) {}


A Pod is composed of a group of application containers in an isolated environment with resource constraints. In CRI, this environment is called PodSandbox. We intentionally leave some room for the container runtimes to interpret the PodSandbox differently based on how they operate internally. For hypervisor-based runtimes, PodSandbox might represent a virtual machine. For others, such as Docker, it might be Linux namespaces. The PodSandbox must respect the pod resources specifications. In the v1alpha1 API, this is achieved by launching all the processes within the pod-level cgroup that kubelet creates and passes to the runtime.

Before starting a pod, kubelet calls RuntimeService.RunPodSandbox to create the environment. This includes setting up networking for a pod (e.g., allocating an IP). Once the PodSandbox is active, individual containers can be created/started/stopped/removed independently. To delete the pod, kubelet will stop and remove containers before stopping and removing the PodSandbox.

Kubelet is responsible for managing the lifecycles of the containers through the RPCs, exercising the container lifecycles hooks and liveness/readiness checks, while adhering to the restart policy of the pod.

Why an imperative container-centric interface?

Kubernetes has a declarative API with a Pod resource. One possible design we considered was for CRI to reuse the declarative Pod object in its abstraction, giving the container runtime freedom to implement and exercise its own control logic to achieve the desired state. This would have greatly simplified the API and allowed CRI to work with a wider spectrum of runtimes. We discussed this approach early in the design phase and decided against it for several reasons. First, there are many Pod-level features and specific mechanisms (e.g., the crash-loop backoff logic) in kubelet that would be a significant burden for all runtimes to reimplement. Second, and more importantly, the Pod specification was (and is) still evolving rapidly. Many of the new features (e.g., init containers) would not require any changes to the underlying container runtimes, as long as the kubelet manages containers directly. CRI adopts an imperative container-level interface so that runtimes can share these common features for better development velocity. This doesn’t mean we’re deviating from the “level triggered” philosophy - kubelet is responsible for ensuring that the actual state is driven towards the declared state.

Exec/attach/port-forward requests

service RuntimeService {


    // ExecSync runs a command in a container synchronously.  
    rpc ExecSync(ExecSyncRequest) returns (ExecSyncResponse) {}  
    // Exec prepares a streaming endpoint to execute a command in the container.  
    rpc Exec(ExecRequest) returns (ExecResponse) {}  
    // Attach prepares a streaming endpoint to attach to a running container.  
    rpc Attach(AttachRequest) returns (AttachResponse) {}  
    // PortForward prepares a streaming endpoint to forward ports from a PodSandbox.  
    rpc PortForward(PortForwardRequest) returns (PortForwardResponse) {}


Kubernetes provides features (e.g. kubectl exec/attach/port-forward) for users to interact with a pod and the containers in it. Kubelet today supports these features either by invoking the container runtime’s native method calls or by using the tools available on the node (e.g., nsenter and socat). Using tools on the node is not a portable solution because most tools assume the pod is isolated using Linux namespaces. In CRI, we explicitly define these calls in the API to allow runtime-specific implementations.

Another potential issue with the kubelet implementation today is that kubelet handles the connection of all streaming requests, so it can become a bottleneck for the network traffic on the node. When designing CRI, we incorporated this feedback to allow runtimes to eliminate the middleman. The container runtime can start a separate streaming server upon request (and can potentially account the resource usage to the pod!), and return the location of the server to kubelet. Kubelet then returns this information to the Kubernetes API server, which opens a streaming connection directly to the runtime-provided server and connects it to the client.

There are many other aspects of CRI that are not covered in this blog post. Please see the list of design docs and proposals for all the details.

Current status

Although CRI is still in its early stages, there are already several projects under development to integrate container runtimes using CRI. Below are a few examples:

If you are interested in trying these alternative runtimes, you can follow the individual repositories for the latest progress and instructions.

For developers interested in integrating a new container runtime, please see the developer guide for the known limitations and issues of the API. We are actively incorporating feedback from early developers to improve the API. Developers should expect occasional API breaking changes (it is Alpha, after all).

Try the new CRI-Docker integration

Kubelet does not yet use CRI by default, but we are actively working on making this happen. The first step is to re-integrate Docker with kubelet using CRI. In the 1.5 release, we extended kubelet to support CRI, and also added a built-in CRI shim for Docker. This allows kubelet to start the gRPC server on Docker’s behalf. To try out the new kubelet-CRI-Docker integration, you simply have to start the Kubernetes API server with –feature-gates=StreamingProxyRedirects=true to enable the new streaming redirect feature, and then start the kubelet with –experimental-cri=true.

Besides a few missing features, the new integration has consistently passed the main end-to-end tests. We plan to expand the test coverage soon and would like to encourage the community to report any issues to help with the transition.

CRI with Minikube

If you want to try out the new integration, but don’t have the time to spin up a new test cluster in the cloud yet, minikube is a great tool to quickly spin up a local cluster. Before you start, follow the instructions to download and install minikube.

  1. Check the available Kubernetes versions and pick the latest 1.5.x version available. We will use v1.5.0-beta.1 as an example.
$ minikube get-k8s-versions
  1. Start a minikube cluster with the built-in docker CRI integration.
$ minikube start --kubernetes-version=v1.5.0-beta.1 --extra-config=kubelet.EnableCRI=true --network-plugin=kubenet --extra-config=kubelet.PodCIDR= --iso-url=http://storage.googleapis.com/minikube/iso/buildroot/minikube-v0.0.6.iso

–extra-config=kubelet.EnableCRI=true` turns on the CRI implementation in kubelet. –network-plugin=kubenet and –extra-config=kubelet.PodCIDR= sets the network plugin to kubenet and ensures a PodCIDR is assigned to the node. Alternatively, you can use the cni plugin which does not rely on the PodCIDR. –iso-url sets an iso image for minikube to launch the node with. The image used in the example

  1. Check the minikube log to check that CRI is enabled.
$ minikube logs | grep EnableCRI

I1209 01:48:51.150789    3226 localkube.go:116] Setting EnableCRI to true on kubelet.
  1. Create a pod and check its status. You should see a “SandboxReceived” event as a proof that Kubelet is using CRI!
$ kubectl run foo --image=gcr.io/google\_containers/pause-amd64:3.0

deployment "foo" created

$ kubectl describe pod foo


... From                Type   Reason          Message  
... -----------------   -----  --------------- -----------------------------

...{default-scheduler } Normal Scheduled       Successfully assigned foo-141968229-v1op9 to minikube  
...{kubelet minikube}   Normal SandboxReceived Pod sandbox received, it will be created.


_Note that kubectl attach/exec/port-forward does not work with CRI enabled in minikube yet, but this will be addressed in the newer version of minikube. _


CRI is being actively developed and maintained by the Kubernetes SIG-Node community. We’d love to hear feedback from you. To join the community:

- Post issues or feature requests on GitHub - Join the #sig-node channel on Slack - Subscribe to the SIG-Node mailing list - Follow us on Twitter @Kubernetesio for latest updates

–Yu-Ju Hong, Software Engineer, Google

Kubernetes 1.5: Supporting Production Workloads

December 13 2016

Today we’re announcing the release of Kubernetes 1.5. This release follows close on the heels of KubeCon/CloundNativeCon, where users gathered to share how they’re running their applications on Kubernetes. Many of you expressed interest in running stateful applications in containers with the eventual goal of running all applications on Kubernetes. If you have been waiting to try running a distributed database on Kubernetes, or for ways to guarantee application disruption SLOs for stateful and stateless apps, this release has solutions for you. 

StatefulSet and PodDisruptionBudget are moving to beta. Together these features provide an easier way to deploy and scale stateful applications, and make it possible to perform cluster operations like node upgrade without violating application disruption SLOs.

You will also find usability improvements throughout the release, starting with the kubectl command line interface you use so often. For those who have found it hard to set up a multi-cluster federation, a new command line tool called ‘kubefed’ is here to help. And a much requested multi-zone Highly Available (HA) master setup script has been added to kube-up. 

Did you know the Kubernetes community is working to support Windows containers? If you have .NET developers, take a look at the work on Windows containers in this release. This work is in early stage alpha and we would love your feedback.

Lastly, for those interested in the internals of Kubernetes, 1.5 introduces Container Runtime Interface or CRI, which provides an internal API abstracting the container runtime from kubelet. This decoupling of the runtime gives users choice in selecting a runtime that best suits their needs. This release also introduces containerized node conformance tests that verify that the node software meets the minimum requirements to join a Kubernetes cluster.

What’s New

StatefulSet beta (formerly known as PetSet) allows workloads that require persistent identity or per-instance storage to be created, scaled, deleted and repaired on Kubernetes. You can use StatefulSets to ease the deployment of any stateful service, and tutorial examples are available in the repository. In order to ensure that there are never two pods with the same identity, the Kubernetes node controller no longer force deletes pods on unresponsive nodes. Instead, it waits until the old pod is confirmed dead in one of several ways: automatically when the kubelet reports back and confirms the old pod is terminated; automatically when a cluster-admin deletes the node; or when a database admin confirms it is safe to proceed by force deleting the old pod. Users are now warned if they try to force delete pods via the CLI. For users who will be migrating from PetSets to StatefulSets, please follow the upgrade guide.

PodDisruptionBudget beta is an API object that specifies the minimum number or minimum percentage of replicas of a collection of pods that must be up at any time. With PodDisruptionBudget, an application deployer can ensure that cluster operations that voluntarily evict pods will never take down so many simultaneously as to cause data loss, an outage, or an unacceptable service degradation. In Kubernetes 1.5 the “kubectl drain” command supports PodDisruptionBudget, allowing safe draining of nodes for maintenance activities, and it will soon also be used by node upgrade and cluster autoscaler (when removing nodes). This can be useful for a quorum based application to ensure the number of replicas running is never below the number needed for quorum, or for a web front end to ensure the number of replicas serving load never falls below a certain percentage.

Kubefed alpha is a new command line tool to help you manage federated clusters, making it easy to deploy new federation control planes and add or remove clusters from existing federations. Also new in cluster federation is the addition of ConfigMaps alpha and DaemonSets alpha and deployments alpha to the federation API allowing you to create, update and delete these objects across multiple clusters from a single endpoint.

HA Masters alpha provides the ability to create and delete clusters with highly available (replicated) masters on GCE using the kube-up/kube-down scripts. Allows setup of zone distributed HA masters, with at least one etcd replica per zone, at least one API server per zone, and master-elected components like scheduler and controller-manager distributed across zones.

Windows server containers alpha provides initial support for Windows Server 2016 nodes and scheduling Windows Server Containers. 

Container Runtime Interface (CRI) alpha introduces the v1 CRI API to allow pluggable container runtimes; an experimental docker-CRI integration is ready for testing and feedback.

Node conformance test beta is a containerized test framework that provides a system verification and functionality test for nodes. The test validates whether the node meets the minimum requirements for Kubernetes; a node that passes the tests is qualified to join a Kubernetes. Node conformance test is available at: gcr.io/google_containers/node-test:0.2 for users to verify node setup.

These are just some of the highlights in our last release for the year. For a complete list please visit the release notes

Kubernetes 1.5 is available for download here on GitHub and via get.k8s.io. To get started with Kubernetes, try one of the new interactive tutorials. Don’t forget to take 1.5 for a spin before the holidays! 

User Adoption
It’s been a year-and-a-half since GA, and the rate of Kubernetes user adoption continues to surpass estimates. Organizations running production workloads on Kubernetes include the world’s largest companies, young startups, and everything in between. Since Kubernetes is open and runs anywhere, we’ve seen adoption on a diverse set of platforms; Pokémon Go (Google Cloud), Ticketmaster (AWS), SAP (OpenStack), Box (bare-metal), and hybrid environments that mix-and-match the above. Here are a few user highlights:

  • Yahoo! JAPAN – built an automated tool chain making it easy to go from code push to deployment, all while running OpenStack on Kubernetes. 
  • Walmart – will use Kubernetes with OneOps to manage its incredible distribution centers, helping its team with speed of delivery, systems uptime and asset utilization.  
  • Monzo – a European startup building a mobile first bank, is using Kubernetes to power its core platform that can handle extreme performance and consistency requirements.

Kubernetes Ecosystem
The Kubernetes ecosystem is growing rapidly, including Microsoft’s support for Kubernetes in Azure Container Service, VMware’s integration of Kubernetes in its Photon Platform, and Canonical’s commercial support for Kubernetes. This is in addition to the thirty plus Technology & Service Partners that already provide commercial services for Kubernetes users. 

The CNCF recently announced the Kubernetes Managed Service Provider (KMSP) program, a pre-qualified tier of service providers with experience helping enterprises successfully adopt Kubernetes. Furthering the knowledge and awareness of Kubernetes, The Linux Foundation, in partnership with CNCF, will develop and operate the Kubernetes training and certification program – the first course designed is Kubernetes Fundamentals.

Community Velocity
In the past three months we’ve seen more than a hundred new contributors join the project with some 5,000 commits pushed, reaching new milestones by bringing the total for the core project to 1,000+ contributors and 40,000+ commits. This incredible momentum is only possible by having an open design, being open to new ideas, and empowering an open community to be welcoming to new and senior contributors alike. A big thanks goes out to the release team for 1.5 – Saad Ali of Google, Davanum Srinivas of Mirantis, and Caleb Miles of CoreOS for their work bringing the 1.5 release to light.

Offline, the community can be found at one of the many Kubernetes related meetups around the world. The strength and scale of the community was visible in the crowded halls of CloudNativeCon/KubeCon Seattle (the recorded user talks are here). The next CloudNativeCon + KubeCon is in Berlin March 29-30, 2017, be sure to get your ticket and submit your talk before the CFP deadline of Dec 16th.

Ready to start contributing? Share your voice at our weekly community meeting

Thank you for your contributions and support!

– Aparna Sinha, Senior Product Manager, Google

From Network Policies to Security Policies

December 08 2016

Editor’s note: Today’s post is by Bernard Van De Walle, Kubernetes Lead Engineer, at Aporeto, showing how they took a new approach to the Kubernetes network policy enforcement.

Kubernetes Network Policies 

Kubernetes supports a new API for network policies that provides a sophisticated model for isolating applications and reducing their attack surface. This feature, which came out of the SIG-Network group, makes it very easy and elegant to define network policies by using the built-in labels and selectors Kubernetes constructs.

Kubernetes has left it up to third parties to implement these network policies and does not provide a default implementation.

We want to introduce a new way to think about “Security” and “Network Policies”. We want to show that security and reachability are two different problems, and that security policies defined using endpoints (pods labels for example) do not specifically need to be implemented using network primitives.

Most of us at Aporeto come from a Network/SDN background, and we knew how to implement those policies by using traditional networking and firewalling: Translating the pods identity and policy definitions to network constraints, such as IP addresses, subnets, and so forth.

However, we also knew from past experiences that using an external control plane also introduces a whole new set of challenges: This distribution of ACLs requires very tight synchronization between Kubernetes workers; and every time a new pod is instantiated, ACLs need to be updated on all other pods that have some policy related to the new pod. Very tight synchronization is fundamentally a quadratic state problem and, while shared state mechanisms can work at a smaller scale, they often have convergence, security, and eventual consistency issues in large scale clusters. 

From Network Policies to Security Policies

At Aporeto, we took a different approach to the network policy enforcement, by actually decoupling the network from the policy. We open sourced our solution as Trireme, which translates the network policy to an authorization policy, and it implements a transparent authentication and authorization function for any communication between pods. Instead of using IP addresses to identify pods, it defines a cryptographically signed identity for each pod as the set of its associated labels. Instead of using ACLs or packet filters to enforce policy, it uses an authorization function where a container can only receive traffic from containers with an identity that matches the policy requirements. 

The authentication and authorization function in Trireme is overlaid on the TCP negotiation sequence. Identity (i.e. set of labels) is captured as a JSON Web Token (JWT), signed by local keys, and exchanged during the Syn/SynAck negotiation. The receiving worker validates that the JWTs are signed by a trusted authority (authentication step) and validates against a cached copy of the policy that the connection can be accepted. Once the connection is accepted, the rest of traffic flows through the Linux kernel and all of the protections that it can potentially offer (including conntrack capabilities if needed). The current implementation uses a simple user space process that captures the initial negotiation packets and attaches the authorization information as payload. The JWTs include nonces that are validated during the Ack packet and can defend against man-in-the-middle or replay attacks.

The Trireme implementation talks directly to the Kubernetes master without an external controller and receives notifications on policy updates and pod instantiations so that it can maintain a local cache of the policy and update the authorization rules as needed. There is no requirement for any shared state between Trireme components that needs to be synchronized. Trireme can be deployed either as a standalone process in every worker or by using Daemon Sets. In the latter case, Kubernetes takes ownership of the lifecycle of the Trireme pods. 

Trireme’s simplicity is derived from the separation of security policy from network transport. Policy enforcement is linked directly to the labels present on the connection, irrespective of the networking scheme used to make the pods communicate. This identity linkage enables tremendous flexibility to operators to use any networking scheme they like without tying security policy enforcement to network implementation details. Also, the implementation of security policy across the federated clusters becomes simple and viable.

Kubernetes and Trireme deployment

Kubernetes is unique in its ability to scale and provide an extensible security support for the deployment of containers and microservices. Trireme provides a simple, secure, and scalable mechanism for enforcing these policies. 

You can deploy and try Trireme on top of Kubernetes by using a provided Daemon Set. You’ll need to modify some of the YAML parameters based on your cluster architecture. All the steps are described in detail in the deployment GitHub folder. The same folder contains an example 3-tier policy that you can use to test the traffic pattern.

To learn more, download the code, and engage with the project, visit:

  • Trireme on GitHub
  • Trireme for Kubernetes by Aporeto on GitHub

–Bernard Van De Walle, Kubernetes lead engineer, Aporeto

@Kubernetesio View on Github #kubernetes-users Stack Overflow Download Kubernetes