Kubernetes Blog

How Bitmovin is Doing Multi-Stage Canary Deployments with Kubernetes in the Cloud and On-Prem

April 21 2017

Editor’s Note: Today’s post is by Daniel Hoelbling-Inzko, Infrastructure Architect at Bitmovin, a company that provides services that transcode digital video and audio to streaming formats, sharing insights about their use of Kubernetes.

Running a large scale video encoding infrastructure on multiple public clouds is tough. At Bitmovin, we have been doing it successfully for the last few years, but from an engineering perspective, it’s neither been enjoyable nor particularly fun.

So obviously, one of the main things that really sold us on using Kubernetes, was it’s common abstraction from the different supported cloud providers and the well thought out programming interface it provides. More importantly, the Kubernetes project did not settle for the lowest common denominator approach. Instead, they added the necessary abstract concepts that are required and useful to run containerized workloads in a cloud and then did all the hard work to map these concepts to the different cloud providers and their offerings.

The great stability, speed and operational reliability we saw in our early tests in mid-2016 made the migration to Kubernetes a no-brainer.

And, it didn’t hurt that the vision for scale the Kubernetes project has been pursuing is closely aligned with our own goals as a company. Aiming for >1,000 node clusters might be a lofty goal, but for a fast growing video company like ours, having your infrastructure aim to support future growth is essential. Also, after initial brainstorming for our new infrastructure, we immediately knew that we would be running a huge number of containers and having a system, with the expressed goal of working at global scale, was the perfect fit for us. Now with the recent Kubernetes 1.6 release and its support for 5,000 node clusters, we feel even more validated in our choice of a container orchestration system.

During the testing and migration phase of getting our infrastructure running on Kubernetes, we got quite familiar with the Kubernetes API and the whole ecosystem around it. So when we were looking at expanding our cloud video encoding offering for customers to use in their own datacenters or cloud environments, we quickly decided to leverage Kubernetes as our ubiquitous cloud operating system to base the solution on.

Just a few months later this effort has become our newest service offering: Bitmovin Managed On-Premise encoding. Since all Kubernetes clusters share the same API, adapting our cloud encoding service to also run on Kubernetes enabled us to deploy into our customer’s datacenter, regardless of the hardware infrastructure running underneath. With great tools from the community, like kube-up and turnkey solutions, like Google Container Engine, anyone can easily provision a new Kubernetes cluster, either within their own infrastructure or in their own cloud accounts.

To give us the maximum flexibility for customers that deploy to bare metal and might not have any custom cloud integrations for Kubernetes yet, we decided to base our solution solely on facilities that are available in any Kubernetes install and don’t require any integration into the surrounding infrastructure (it will even run inside Minikube!). We don’t rely on Services of type LoadBalancer, primarily because enterprise IT is usually reluctant to open up ports to the open internet - and not every bare metal Kubernetes install supports externally provisioned load balancers out of the box. To avoid these issues, we deploy a BitmovinAgent that runs inside the Cluster and polls our API for new encoding jobs without requiring any network setup. This agent then uses the locally available Kubernetes credentials to start up new deployments that run the encoders on the available hardware through the Kubernetes API.

Even without having a full cloud integration available, the consistent scheduling, health checking and monitoring we get from using the Kubernetes API really enabled us to focus on making the encoder work inside a container rather than spending precious engineering resources on integrating a bunch of different hypervisors, machine provisioners and monitoring systems.

Multi-Stage Canary Deployments

Our first encounters with the Kubernetes API were not for the On-Premise encoding product. Building our containerized encoding workflow on Kubernetes was rather a decision we made after seeing how incredibly easy and powerful the Kubernetes platform proved during development and rollout of our Bitmovin API infrastructure. We migrated to Kubernetes around four months ago and it has enabled us to provide rapid development iterations to our service while meeting our requirements of downtime-free deployments and a stable development to production pipeline. To achieve this we came up with an architecture that runs almost a thousand containers and meets the following requirements we had laid out on day one:

  1. 1.Zero downtime deployments for our customers
  2. 2.Continuous deployment to production on each git mainline push
  3. 3.High stability of deployed services for customers

Obviously #2 and #3 are at odds with each other, if each merged feature gets deployed to production right away - how can we ensure these releases are bug-free and don’t have adverse side effects for our customers?

To overcome this oxymoron, we came up with a four-stage canary pipeline for each microservice where we simultaneously deploy to production and keep changes away from customers until the new build has proven to work reliably and correctly in the production environment.

Once a new build is pushed, we deploy it to an internal stage that’s only accessible for our internal tests and the integration test suite. Once the internal test suite passes, QA reports no issues, and we don’t detect any abnormal behavior, we push the new build to our free stage. This means that 5% of our free users would get randomly assigned to this new build. After some time in this stage the build gets promoted to the next stage that gets 5% of our paid users routed to it. Only once the build has successfully passed all 3 of these hurdles, does it get deployed to the production tier, where it will receive all traffic from our remaining users as well as our enterprise customers, which are not part of the paid bucket and never see their traffic routed to a canary track.

This setup makes us a pretty big Kubernetes installation by default, since all of our canary tiers are available at a minimum replication of 2. Since we are currently deploying around 30 microservices (and growing) to our clusters, it adds up to a minimum of 10 pods per service (8 application pods + minimum 2 HAProxy pods that do the canary routing). Although, in reality our preferred standard configuration is usually running 2 internal, 4 free, 4 others and 10 production pods alongside 4 HAProxy pods - totalling around 700 pods in total. This also means that we are running at least 150 services that provide a static ClusterIP to their underlying microservice canary tier.

A typical deployment looks like this:

Services (ClusterIP) Deployments #Pods
account-service account-service-haproxy 4
account-service-internal account-service-internal-v1.18.0 2
account-service-canary account-service-canary-v1.17.0 4
account-service-paid account-service-paid-v1.15.0 4
account-service-production account-service-production-v1.15.0 10

An example service definition the production track will have the following label selectors:

apiVersion: v1

kind: Service

metadata:

 name: account-service-production

 labels:

 app: account-service-production

 tier: service

 lb: private

spec:

 ports:

 - port: 8080

 name: http

 targetPort: 8080

 protocol: TCP

 selector:

 app: account-service

 tier: service

 track: production

In front of the Kubernetes services, load balancing the different canary versions of the service, lives a small cluster of HAProxy pods that get their haproxy.conf from the Kubernetes ConfigMaps that looks something like this:

frontend http-in

 bind \*:80

 log 127.0.0.1 local2 debug


 acl traffic\_internal hdr(X-Traffic-Group) -m str -i INTERNAL

 acl traffic\_free  hdr(X-Traffic-Group) -m str -i FREE

 acl traffic\_enterprise hdr(X-Traffic-Group) -m str -i ENTERPRISE


 use\_backend internal if traffic\_internal

 use\_backend canary if traffic\_free

 use\_backend enterprise if traffic\_enterprise


 default\_backend paid


backend internal

 balance roundrobin

 server internal-lb  user-resource-service-internal:8080 resolvers dns check inter 2000

backend canary

 balance roundrobin

 server canary-lb    user-resource-service-canary:8080 resolvers dns check inter 2000 weight 5

 server production-lb user-resource-service-production:8080 resolvers dns check inter 2000 weight 95

backend paid

 balance roundrobin

 server canary-paid-lb user-resource-service-paid:8080 resolvers dns check inter 2000 weight 5

 server production-lb user-resource-service-production:8080 resolvers dns check inter 2000 weight 95

backend enterprise

 balance roundrobin

 server production-lb user-resource-service-production:8080 resolvers dns check inter 2000 weight 100

Each HAProxy will inspect a header that gets assigned by our API-Gateway called X-Traffic-Group that determines which bucket of customers this request belongs to. Based on that, a decision is made to hit either a canary deployment or the production deployment.

Obviously, at this scale, kubectl (while still our main day-to-day tool to work on the cluster) doesn’t really give us a good overview of whether everything is actually running as it’s supposed to and what is maybe over or under replicated.

Since we do blue/green deployments, we sometimes forget to shut down the old version after the new one comes up, so some services might be running over replicated and finding these issues in a soup of 25 deployments listed in kubectl is not trivial, to say the least.

So, having a container orchestrator like Kubernetes, that’s very API driven, was really a godsend for us, as it allowed us to write tools that take care of that.

We built tools that either run directly off kubectl (eg bash-scripts) or interact directly with the API and understand our special architecture to give us a quick overview of the system. These tools were mostly built in Go using the client-go library.

One of these tools is worth highlighting, as it’s basically our only way to really see service health at a glance. It goes through all our Kubernetes services that have the tier: service selector and checks if the accompanying HAProxy deployment is available and all pods are running with 4 replicas. It also checks if the 4 services behind the HAProxys (internal, free, others and production) have at least 2 endpoints running. If any of these conditions are not met, we immediately get a notification in Slack and by email.

Managing this many pods with our previous orchestrator proved very unreliable and the overlay network frequently caused issues. Not so with Kubernetes - even doubling our current workload for test purposes worked flawlessly and in general, the cluster has been working like clockwork ever since we installed it.

Another advantage of switching over to Kubernetes was the availability of the kubernetes resource specifications, in addition to the API (which we used to write some internal tools for deployment). This enabled us to have a Git repo with all our Kubernetes specifications, where each track is generated off a common template and only contains placeholders for variable things like the canary track and the names.

All changes to the cluster have to go through tools that modify these resource specifications and get checked into git automatically so, whenever we see issues, we can debug what changes the infrastructure went through over time!

To summarize this post - by migrating our infrastructure to Kubernetes, Bitmovin is able to have:

  • Zero downtime deployments, allowing our customers to encode 24/7 without interruption
  • Fast development to production cycles, enabling us to ship new features faster
  • Multiple levels of quality assurance and high confidence in production deployments
  • Ubiquitous abstractions across cloud architectures and on-premise deployments
  • Stable and reliable health-checking and scheduling of services
  • Custom tooling around our infrastructure to check and validate the system
  • History of deployments (resource specifications in git + custom tooling)

We want to thank the Kubernetes community for the incredible job they have done with the project. The velocity at which the project moves is just breathtaking! Maintaining such a high level of quality and robustness in such a diverse environment is really astonishing.

–Daniel Hoelbling-Inzko, Infrastructure Architect, Bitmovin

  • Post questions (or answer questions) on Stack Overflow
  • Join the community portal for advocates on K8sPort
  • Get involved with the Kubernetes project on GitHub
  • Follow us on Twitter @Kubernetesio for latest updates
  • Connect with the community on Slack
  • Download Kubernetes

RBAC Support in Kubernetes

April 06 2017

Editor’s note: this post is part of a series of in-depth articles on what’s new in Kubernetes 1.6

One of the highlights of the Kubernetes 1.6 release is the RBAC authorizer feature moving to beta. RBAC, Role-based access control, is an an authorization mechanism for managing permissions around Kubernetes resources. RBAC allows configuration of flexible authorization policies that can be updated without cluster restarts.

The focus of this post is to highlight some of the interesting new capabilities and best practices.

RBAC vs ABAC

Currently there are several authorization mechanisms available for use with Kubernetes. Authorizers are the mechanisms that decide who is permitted to make what changes to the cluster using the Kubernetes API. This affects things like kubectl, system components, and also certain applications that run in the cluster and manipulate the state of the cluster, like Jenkins with the Kubernetes plugin, or Helm that runs in the cluster and uses the Kubernetes API to install applications in the cluster. Out of the available authorization mechanisms, ABAC and RBAC are the mechanisms local to a Kubernetes cluster that allow configurable permissions policies.

ABAC, Attribute Based Access Control, is a powerful concept. However, as implemented in Kubernetes, ABAC is difficult to manage and understand. It requires ssh and root filesystem access on the master VM of the cluster to make authorization policy changes. For permission changes to take effect the cluster API server must be restarted.

RBAC permission policies are configured using kubectl or the Kubernetes API directly. Users can be authorized to make authorization policy changes using RBAC itself, making it possible to delegate resource management without giving away ssh access to the cluster master. RBAC policies map easily to the resources and operations used in the Kubernetes API.

Based on where the Kubernetes community is focusing their development efforts, going forward RBAC should be preferred over ABAC.

Basic Concepts

The are a few basic ideas behind RBAC that are foundational in understanding it. At its core, RBAC is a way of granting users granular access to Kubernetes API resources.

The connection between user and resources is defined in RBAC using two objects.

Roles
A Role is a collection of permissions. For example, a role could be defined to include read permission on pods and list permission for pods. A ClusterRole is just like a Role, but can be used anywhere in the cluster.

Role Bindings
A RoleBinding maps a Role to a user or set of users, granting that Role’s permissions to those users for resources in that namespace. A ClusterRoleBinding allows users to be granted a ClusterRole for authorization across the entire cluster.

Additionally there are cluster roles and cluster role bindings to consider. Cluster roles and cluster role bindings function like roles and role bindings except they have wider scope. The exact differences and how cluster roles and cluster role bindings interact with roles and role bindings are covered in the Kubernetes documentation.

RBAC in Kubernetes

RBAC is now deeply integrated into Kubernetes and used by the system components to grant the permissions necessary for them to function. System roles are typically prefixed with system: so they can be easily recognized.

➜  kubectl get clusterroles --namespace=kube-system

NAME                    KIND

admin ClusterRole.v1beta1.rbac.authorization.k8s.io

cluster-admin ClusterRole.v1beta1.rbac.authorization.k8s.io

edit ClusterRole.v1beta1.rbac.authorization.k8s.io

kubelet-api-admin ClusterRole.v1beta1.rbac.authorization.k8s.io

system:auth-delegator ClusterRole.v1beta1.rbac.authorization.k8s.io

system:basic-user ClusterRole.v1beta1.rbac.authorization.k8s.io

system:controller:attachdetach-controller ClusterRole.v1beta1.rbac.authorization.k8s.io

system:controller:certificate-controller ClusterRole.v1beta1.rbac.authorization.k8s.io

...

The RBAC system roles have been expanded to cover the necessary permissions for running a Kubernetes cluster with RBAC only.

During the permission translation from ABAC to RBAC, some of the permissions that were enabled by default in many deployments of ABAC authorized clusters were identified as unnecessarily broad and were scoped down in RBAC. The area most likely to impact workloads on a cluster is the permissions available to service accounts. With the permissive ABAC configuration, requests from a pod using the pod mounted token to authenticate to the API server have broad authorization. As a concrete example, the curl command at the end of this sequence will return a JSON formatted result when ABAC is enabled and an error when only RBAC is enabled.

➜  kubectl run nginx --image=nginx:latest

➜  kubectl exec -it $(kubectl get pods -o jsonpath='{.items[0].metadata.name}') bash

➜  apt-get update && apt-get install -y curl

➜  curl -ik \

  -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \

  https://kubernetes/api/v1/namespaces/default/pods

Any applications you run in your Kubernetes cluster that interact with the Kubernetes API have the potential to be affected by the permissions changes when transitioning from ABAC to RBAC.

To smooth the transition from ABAC to RBAC, you can create Kubernetes 1.6 clusters with both ABAC and RBAC authorizers enabled. When both ABAC and RBAC are enabled, authorization for a resource is granted if either authorization policy grants access. However, under that configuration the most permissive authorizer is used and it will not be possible to use RBAC to fully control permissions.

At this point, RBAC is complete enough that ABAC support should be considered deprecated going forward. It will still remain in Kubernetes for the foreseeable future but development attention is focused on RBAC.

Two different talks at the at the Google Cloud Next conference touched on RBAC related changes in Kubernetes 1.6, jump to the relevant parts here and here. For more detailed information about using RBAC in Kubernetes 1.6 read the full RBAC documentation.

Get Involved

If you’d like to contribute or simply help provide feedback and drive the roadmap, join our community. Specifically interested in security and RBAC related conversation, participate through one of these channels:

Thanks for your support and contributions. Read more in-depth posts on what’s new in Kubernetes 1.6 here.

– Jacob Simpson, Greg Castle & CJ Cullen, Software Engineers at Google

  • Post questions (or answer questions) on Stack Overflow
  • Join the community portal for advocates on K8sPort
  • Get involved with the Kubernetes project on GitHub
  • Follow us on Twitter @Kubernetesio for latest updates
  • Connect with the community on Slack
  • Download Kubernetes

Configuring Private DNS Zones and Upstream Nameservers in Kubernetes

April 04 2017

Editor’s note: this post is part of a series of in-depth articles on what’s new in Kubernetes 1.6

Many users have existing domain name zones that they would like to integrate into their Kubernetes DNS namespace. For example, hybrid-cloud users may want to resolve their internal “.corp” domain addresses within the cluster. Other users may have a zone populated by a non-Kubernetes service discovery system (like Consul). We’re pleased to announce that, in Kubernetes 1.6, kube-dns adds support for configurable private DNS zones (often called “stub domains”) and external upstream DNS nameservers. In this blog post, we describe how to configure and use this feature.

Default lookup flow

Kubernetes currently supports two DNS policies specified on a per-pod basis using the dnsPolicy flag: “Default” and “ClusterFirst”. If dnsPolicy is not explicitly specified, then “ClusterFirst” is used:

  • If dnsPolicy is set to “Default”, then the name resolution configuration is inherited from the node the pods run on. Note: this feature cannot be used in conjunction with dnsPolicy: “Default”.
  • If dnsPolicy is set to “ClusterFirst”, then DNS queries will be sent to the kube-dns service. Queries for domains rooted in the configured cluster domain suffix (any address ending in “.cluster.local” in the example above) will be answered by the kube-dns service. All other queries (for example, www.kubernetes.io) will be forwarded to the upstream nameserver inherited from the node. Before this feature, it was common to introduce stub domains by replacing the upstream DNS with a custom resolver. However, this caused the custom resolver itself to become a critical path for DNS resolution, where issues with scalability and availability may cause the cluster to lose DNS functionality. This feature allows the user to introduce custom resolution without taking over the entire resolution path.

Customizing the DNS Flow

Beginning in Kubernetes 1.6, cluster administrators can specify custom stub domains and upstream nameservers by providing a ConfigMap for kube-dns. For example, the configuration below inserts a single stub domain and two upstream nameservers. As specified, DNS requests with the “.acme.local” suffix will be forwarded to a DNS listening at 1.2.3.4. Additionally, Google Public DNS will serve upstream queries. See ConfigMap Configuration Notes at the end of this section for a few notes about the data format.

apiVersion: v1

kind: ConfigMap

metadata:

  name: kube-dns

  namespace: kube-system

data:

  stubDomains: |

    {“acme.local”: [“1.2.3.4”]}

  upstreamNameservers: |

    [“8.8.8.8”, “8.8.4.4”]

The diagram below shows the flow of DNS queries specified in the configuration above. With the dnsPolicy set to “ClusterFirst” a DNS query is first sent to the DNS caching layer in kube-dns. From here, the suffix of the request is examined and then forwarded to the appropriate DNS. In this case, names with the cluster suffix (e.g.; “.cluster.local”) are sent to kube-dns. Names with the stub domain suffix (e.g.; “.acme.local”) will be sent to the configured custom resolver. Finally, requests that do not match any of those suffixes will be forwarded to the upstream DNS.

Below is a table of example domain names and the destination of the queries for those domain names:

   
Domain name Server answering the query
kubernetes.default.svc.cluster.local kube-dns
foo.acme.local custom DNS (1.2.3.4)
widget.com upstream DNS (one of 8.8.8.8, 8.8.4.4)

ConfigMap Configuration Notes

  • stubDomains (optional)

    • Format: a JSON map using a DNS suffix key (e.g.; “acme.local”) and a value consisting of a JSON array of DNS IPs.
    • Note: The target nameserver may itself be a Kubernetes service. For instance, you can run your own copy of dnsmasq to export custom DNS names into the ClusterDNS namespace.
  • upstreamNameservers (optional)

    • Format: a JSON array of DNS IPs.
    • Note: If specified, then the values specified replace the nameservers taken by default from the node’s /etc/resolv.conf
    • Limits: a maximum of three upstream nameservers can be specified

Example #1: Adding a Consul DNS Stub Domain

In this example, the user has Consul DNS service discovery system they wish to integrate with kube-dns. The consul domain server is located at 10.150.0.1, and all consul names have the suffix “.consul.local”. To configure Kubernetes, the cluster administrator simply creates a ConfigMap object as shown below. Note: in this example, the cluster administrator did not wish to override the node’s upstream nameservers, so they didn’t need to specify the optional upstreamNameservers field.

apiVersion: v1

kind: ConfigMap

metadata:

  name: kube-dns

  namespace: kube-system

data:

  stubDomains: |

    {“consul.local”: [“10.150.0.1”]}

Example #2: Replacing the Upstream Nameservers

In this example the cluster administrator wants to explicitly force all non-cluster DNS lookups to go through their own nameserver at 172.16.0.1. Again, this is easy to accomplish; they just need to create a ConfigMap with the upstreamNameservers field specifying the desired nameserver.

``` apiVersion: v1

kind: ConfigMap

metadata:

name: kube-dns

namespace: kube-system

data:

upstreamNameservers:
[“172.16.0.1”]

Get involved

If you’d like to contribute or simply help provide feedback and drive the roadmap, join our community. Specifically for network related conversations participate though one of these channels:

  • Chat with us on the Kubernetes Slack network channel
  • Join our Special Interest Group, SIG-Network, which meets on Tuesdays at 14:00 PT Thanks for your support and contributions. Read more in-depth posts on what’s new in Kubernetes 1.6 here.

–Bowei Du, Software Engineer and Matthew DeLio, Product Manager, Google

  • Post questions (or answer questions) on Stack Overflow
  • Join the community portal for advocates on K8sPort
  • Get involved with the Kubernetes project on GitHub
  • Follow us on Twitter @Kubernetesio for latest updates
  • Connect with the community on Slack
  • Download Kubernetes

Advanced Scheduling in Kubernetes

March 31 2017

Editor’s note: this post is part of a series of in-depth articles on what’s new in Kubernetes 1.6

The Kubernetes scheduler’s default behavior works well for most cases – for example, it ensures that pods are only placed on nodes that have sufficient free resources, it ties to spread pods from the same set (ReplicaSet, StatefulSet, etc.) across nodes, it tries to balance out the resource utilization of nodes, etc.

But sometimes you want to control how your pods are scheduled. For example, perhaps you want to ensure that certain pods only schedule on nodes with specialized hardware, or you want to co-locate services that communicate frequently, or you want to dedicate a set of nodes to a particular set of users. Ultimately, you know much more about how your applications should be scheduled and deployed than Kubernetes ever will. So Kubernetes 1.6 offers four advanced scheduling features: node affinity/anti-affinity, taints and tolerations, pod affinity/anti-affinity, and custom schedulers. Each of these features are now in beta in Kubernetes 1.6.

Node Affinity/Anti-Affinity

Node Affinity/Anti-Affinity is one way to set rules on which nodes are selected by the scheduler. This feature is a generalization of the nodeSelector feature which has been in Kubernetes since version 1.0. The rules are defined using the familiar concepts of custom labels on nodes and selectors specified in pods, and they can be either required or preferred, depending on how strictly you want the scheduler to enforce them.

Required rules must be met for a pod to schedule on a particular node. If no node matches the criteria (plus all of the other normal criteria, such as having enough free resources for the pod’s resource request), then the pod won’t be scheduled. Required rules are specified in the requiredDuringSchedulingIgnoredDuringExecution field of nodeAffinity.

For example, if we want to require scheduling on a node that is in the us-central1-a GCE zone of a multi-zone Kubernetes cluster, we can specify the following affinity rule as part of the Pod spec:

affinity:

  nodeAffinity:

    requiredDuringSchedulingIgnoredDuringExecution:

      nodeSelectorTerms:

        - matchExpressions:

          - key: "failure-domain.beta.kubernetes.io/zone"

            operator: In

            values: ["us-central1-a"]

“IgnoredDuringExecution” means that the pod will still run if labels on a node change and affinity rules are no longer met. There are future plans to offer requiredDuringSchedulingRequiredDuringExecution which will evict pods from nodes as soon as they don’t satisfy the node affinity rule(s).

Preferred rules mean that if nodes match the rules, they will be chosen first, and only if no preferred nodes are available will non-preferred nodes be chosen. You can prefer instead of require that pods are deployed to us-central1-a by slightly changing the pod spec to use preferredDuringSchedulingIgnoredDuringExecution:

affinity:

  nodeAffinity:

    preferredDuringSchedulingIgnoredDuringExecution:

      nodeSelectorTerms:

        - matchExpressions:

          - key: "failure-domain.beta.kubernetes.io/zone"

            operator: In

            values: ["us-central1-a"]

Node anti-affinity can be achieved by using negative operators. So for instance if we want our pods to avoid us-central1-a we can do this:

affinity:

  nodeAffinity:

    requiredDuringSchedulingIgnoredDuringExecution:

      nodeSelectorTerms:

        - matchExpressions:

          - key: "failure-domain.beta.kubernetes.io/zone"

            operator: NotIn

            values: ["us-central1-a"]

Valid operators you can use are In, NotIn, Exists, DoesNotExist. Gt, and Lt.

Additional use cases for this feature are to restrict scheduling based on nodes’ hardware architecture, operating system version, or specialized hardware. Node affinity/anti-affinity is beta in Kubernetes 1.6.

Taints and Tolerations

A related feature is “taints and tolerations,” which allows you to mark (“taint”) a node so that no pods can schedule onto it unless a pod explicitly “tolerates” the taint. Marking nodes instead of pods (as in node affinity/anti-affinity) is particularly useful for situations where most pods in the cluster should avoid scheduling onto the node. For example, you might want to mark your master node as schedulable only by Kubernetes system components, or dedicate a set of nodes to a particular group of users, or keep regular pods away from nodes that have special hardware so as to leave room for pods that need the special hardware.

The kubectl command allows you to set taints on nodes, for example:

kubectl taint nodes node1 key=value:NoSchedule

creates a taint that marks the node as unschedulable by any pods that do not have a toleration for taint with key key, value value, and effect NoSchedule. (The other taint effects are PreferNoSchedule, which is the preferred version of NoSchedule, and NoExecute, which means any pods that are running on the node when the taint is applied will be evicted unless they tolerate the taint.) The toleration you would add to a PodSpec to have the corresponding pod tolerate this taint would look like this

tolerations:

- key: "key"

  operator: "Equal"

  value: "value"

  effect: "NoSchedule"

In addition to moving taints and tolerations to beta in Kubernetes 1.6, we have introduced an alpha feature that uses taints and tolerations to allow you to customize how long a pod stays bound to a node when the node experiences a problem like a network partition instead of using the default five minutes. See this section of the documentation for more details.

Pod Affinity/Anti-Affinity

Node affinity/anti-affinity allows you to constrain which nodes a pod can run on based on the nodes’ labels. But what if you want to specify rules about how pods should be placed relative to one another, for example to spread or pack pods within a service or relative to pods in other services? For that you can use pod affinity/anti-affinity, which is also beta in Kubernetes 1.6.

Let’s look at an example. Say you have front-ends in service S1, and they communicate frequently with back-ends that are in service S2 (a “north-south” communication pattern). So you want these two services to be co-located in the same cloud provider zone, but you don’t want to have to choose the zone manually–if the zone fails, you want the pods to be rescheduled to another (single) zone. You can specify this with a pod affinity rule that looks like this (assuming you give the pods of this service a label “service=S2” and the pods of the other service a label “service=S1”):

affinity:

    podAffinity:

      requiredDuringSchedulingIgnoredDuringExecution:

      - labelSelector:

          matchExpressions:

          - key: service

            operator: In

            values: [“S1”]

        topologyKey: failure-domain.beta.kubernetes.io/zone

As with node affinity/anti-affinity, there is also a preferredDuringSchedulingIgnoredDuringExecution variant.

Pod affinity/anti-affinity is very flexible. Imagine you have profiled the performance of your services and found that containers from service S1 interfere with containers from service S2 when they share the same node, perhaps due to cache interference effects or saturating the network link. Or maybe due to security concerns you never want containers of S1 and S2 to share a node. To implement these rules, just make two changes to the snippet above – change podAffinity to podAntiAffinity and change topologyKey to kubernetes.io/hostname.

Custom Schedulers

If the Kubernetes scheduler’s various features don’t give you enough control over the scheduling of your workloads, you can delegate responsibility for scheduling arbitrary subsets of pods to your own custom scheduler(s) that run(s) alongside, or instead of, the default Kubernetes scheduler. Multiple schedulers is beta in Kubernetes 1.6.

Each new pod is normally scheduled by the default scheduler. But if you provide the name of your own custom scheduler, the default scheduler will ignore that Pod and allow your scheduler to schedule the Pod to a node. Let’s look at an example.

Here we have a Pod where we specify the schedulerName field:

apiVersion: v1

kind: Pod

metadata:

  name: nginx

  labels:

    app: nginx

spec:

  schedulerName: my-scheduler

  containers:

  - name: nginx

    image: nginx:1.10

If we create this Pod without deploying a custom scheduler, the default scheduler will ignore it and it will remain in a Pending state. So we need a custom scheduler that looks for, and schedules, pods whose schedulerName field is my-scheduler.

A custom scheduler can be written in any language and can be as simple or complex as you need. Here is a very simple example of a custom scheduler written in Bash that assigns a node randomly. Note that you need to run this along with kubectl proxy for it to work.

#!/bin/bash

SERVER='localhost:8001'

while true;

do

    for PODNAME in $(kubectl --server $SERVER get pods -o json | jq '.items[] | select(.spec.schedulerName == "my-scheduler") | select(.spec.nodeName == null) | .metadata.name' | tr -d '"')

;

    do

        NODES=($(kubectl --server $SERVER get nodes -o json | jq '.items[].metadata.name' | tr -d '"'))


        NUMNODES=${#NODES[@]}

        CHOSEN=${NODES[$[$RANDOM % $NUMNODES]]}

        curl --header "Content-Type:application/json" --request POST --data '{"apiVersion":"v1", "kind": "Binding", "metadata": {"name": "'$PODNAME'"}, "target": {"apiVersion": "v1", "kind"

: "Node", "name": "'$CHOSEN'"}}' http://$SERVER/api/v1/namespaces/default/pods/$PODNAME/binding/

        echo "Assigned $PODNAME to $CHOSEN"

    done

    sleep 1

done

Learn more

The Kubernetes 1.6 release notes have more information about these features, including details about how to change your configurations if you are already using the alpha version of one or more of these features (this is required, as the move from alpha to beta is a breaking change for these features).

Acknowledgements

The features described here, both in their alpha and beta forms, were a true community effort, involving engineers from Google, Huawei, IBM, Red Hat and more.

Get Involved

Share your voice at our weekly community meeting:

  • Post questions (or answer questions) on Stack Overflow
  • Follow us on Twitter @Kubernetesio for latest updates
  • Connect with the community on Slack (room #sig-scheduling)

Many thanks for your contributions.

–Ian Lewis, Developer Advocate, and David Oppenheimer, Software Engineer, Google

Scalability updates in Kubernetes 1.6: 5,000 node and 150,000 pod clusters

March 30 2017

Editor’s note: this post is part of a series of in-depth articles on what’s new in Kubernetes 1.6

Last summer we shared updates on Kubernetes scalability, since then we’ve been working hard and are proud to announce that Kubernetes 1.6 can handle 5,000-node clusters with up to 150,000 pods. Moreover, those cluster have even better end-to-end pod startup time than the previous 2,000-node clusters in the 1.3 release; and latency of the API calls are within the one-second SLO.

In this blog post we review what metrics we monitor in our tests and describe our performance results from Kubernetes 1.6. We also discuss what changes we made to achieve the improvements, and our plans for upcoming releases in the area of system scalability.

X-node clusters - what does it mean?

Now that Kubernetes 1.6 is released, it is a good time to review what it means when we say we “support” X-node clusters. As described in detail in a previous blog post, we currently have two performance-related Service Level Objectives (SLO):

  • API-responsiveness : 99% of all API calls return in less than 1s
  • Pod startup time : 99% of pods and their containers (with pre-pulled images) start within 5s. As before, it is possible to run larger deployments than the stated supported 5,000-node cluster (and users have), but performance may be degraded and it may not meet our strict SLO defined above.

We are aware of the limited scope of these SLOs. There are many aspects of the system that they do not exercise. For example, we do not measure how soon a new pod that is part of a service will be reachable through the service IP address after the pod is started. If you are considering using large Kubernetes clusters and have performance requirements not covered by our SLOs, please contact the Kubernetes Scalability SIG so we can help you understand whether Kubernetes is ready to handle your workload now.

The top scalability-related priority for upcoming Kubernetes releases is to enhance our definition of what it means to support X-node clusters by:

  • refining currently existing SLOs
  • adding more SLOs (that will cover various areas of Kubernetes, including networking)

Kubernetes 1.6 performance metrics at scale

So how does performance in large clusters look in Kubernetes 1.6? The following graph shows the end-to-end pod startup latency with 2000- and 5000-node clusters. For comparison, we also show the same metric from Kubernetes 1.3, which we published in our previous scalability blog post that described support for 2000-node clusters. As you can see, Kubernetes 1.6 has better pod startup latency with both 2000 and 5000 nodes compared to Kubernetes 1.3 with 2000 nodes [1].

The next graph shows API response latency for a 5000-node Kubernetes 1.6 cluster. The latencies at all percentiles are less than 500ms, and even 90th percentile is less than about 100ms.

How did we get here?

Over the past nine months (since the last scalability blog post), there have been a huge number of performance and scalability related changes in Kubernetes. In this post we will focus on the two biggest ones and will briefly enumerate a few others.

etcd v3
In Kubernetes 1.6 we switched the default storage backend (key-value store where the whole cluster state is stored) from etcd v2 to etcd v3. The initial works towards this transition has been started during the 1.3 release cycle. You might wonder why it took us so long, given that:

  • the first stable version of etcd supporting the v3 API was announced on June 30, 2016
  • the new API was designed together with the Kubernetes team to support our needs (from both a feature and scalability perspective)
  • the integration of etcd v3 with Kubernetes had already mostly been finished when etcd v3 was announced (indeed CoreOS used Kubernetes as a proof-of-concept for the new etcd v3 API) As it turns out, there were a lot of reasons. We will describe the most important ones below.

  • Changing storage in a backward incompatible way, as is in the case for the etcd v2 to v3 migration, is a big change, and thus one for which we needed a strong justification. We found this justification in September when we determined that we would not be able to scale to 5000-node clusters if we continued to use etcd v2 (kubernetes/32361 contains some discussion about it). In particular, what didn’t scale was the watch implementation in etcd v2. In a 5000-node cluster, we need to be able to send at least 500 watch events per second to a single watcher, which wasn’t possible in etcd v2.
  • Once we had the strong incentive to actually update to etcd v3, we started thoroughly testing it. As you might expect, we found some issues. There were some minor bugs in Kubernetes, and in addition we requested a performance improvement in etcd v3’s watch implementation (watch was the main bottleneck in etcd v2 for us). This led to the 3.0.10 etcd patch release.
  • Once those changes had been made, we were convinced that new Kubernetes clusters would work with etcd v3. But the large challenge of migrating existing clusters remained. For this we needed to automate the migration process, thoroughly test the underlying CoreOS etcd upgrade tool, and figure out a contingency plan for rolling back from v3 to v2. But finally, we are confident that it should work.

Switching storage data format to protobuf
In the Kubernetes 1.3 release, we enabled protobufs as the data format for Kubernetes components to communicate with the API server (in addition to maintaining support for JSON). This gave us a huge performance improvement.

However, we were still using JSON as a format in which data was stored in etcd, even though technically we were ready to change that. The reason for delaying this migration was related to our plans to migrate to etcd v3. Now you are probably wondering how this change was depending on migration to etcd v3. The reason for it was that with etcd v2 we couldn’t really store data in binary format (to workaround it we were additionally base64-encoding the data), whereas with etcd v3 it just worked. So to simplify the transition to etcd v3 and avoid some non-trivial transformation of data stored in etcd during it, we decided to wait with switching storage data format to protobufs until migration to etcd v3 storage backend is done.

Other optimizations
We made tens of optimizations throughout the Kubernetes codebase during the last three releases, including:

  • optimizing the scheduler (which resulted in 5-10x higher scheduling throughput)
  • switching all controllers to a new recommended design using shared informers, which reduced resource consumption of controller-manager - for reference see this document
  • optimizing individual operations in the API server (conversions, deep-copies, patch)
  • reducing memory allocation in the API server (which significantly impacts the latency of API calls) We want to emphasize that the optimization work we have done during the last few releases, and indeed throughout the history of the project, is a joint effort by many different companies and individuals from the whole Kubernetes community.

What’s next?

People frequently ask how far we are going to go in improving Kubernetes scalability. Currently we do not have plans to increase scalability beyond 5000-node clusters (within our SLOs) in the next few releases. If you need clusters larger than 5000 nodes, we recommend to use federation to aggregate multiple Kubernetes clusters.

However, that doesn’t mean we are going to stop working on scalability and performance. As we mentioned at the beginning of this post, our top priority is to refine our two existing SLOs and introduce new ones that will cover more parts of the system, e.g. networking. This effort has already started within the Scalability SIG. We have made significant progress on how we would like to define performance SLOs, and this work should be finished in the coming month.

Join the effort
If you are interested in scalability and performance, please join our community and help us shape Kubernetes. There are many ways to participate, including:

  • Chat with us in the Kubernetes Slack scalability channel
  • Join our Special Interest Group, SIG-Scalability, which meets every Thursday at 9:00 AM PST Thanks for the support and contributions! Read more in-depth posts on what’s new in Kubernetes 1.6 here.

– Wojciech Tyczynski, Software Engineer, Google

[1] We are investigating why 5000-node clusters have better startup time than 2000-node clusters. The current theory is that it is related to running 5000-node experiments using 64-core master and 2000-node experiments using 32-core master.

Five Days of Kubernetes 1.6

March 29 2017

With the help of our growing community of 1,110 plus contributors, we pushed around 5,000 commits to deliver Kubernetes 1.6, bringing focus on multi-user, multi-workloads at scale. While many improvements have been contributed, we selected few features to highlight in a series of in-depths posts listed below. 

Follow along and read what’s new:

  Five Days of Kubernetes
Day 1 Dynamic Provisioning and Storage Classes in Kubernetes Stable in 1.6
Day 2 Scalability updates in Kubernetes 1.6
Day 3 Advanced Scheduling in Kubernetes 1.6
Day 4 Configuring Private DNS Zones and Upstream Nameservers in Kubernetes
Day 5 RBAC support in Kubernetes

Connect

  • Follow us on Twitter @Kubernetesio for latest updates
  • Post questions (or answer questions) on Stack Overflow
  • Join the community portal for advocates on K8sPort
  • Get involved with the Kubernetes project on GitHub
  • Connect with the community on Slack
  • Download Kubernetes

Dynamic Provisioning and Storage Classes in Kubernetes

March 29 2017

Editor’s note: this post is part of a series of in-depth articles on what’s new in Kubernetes 1.6

Storage is a critical part of running stateful containers, and Kubernetes offers powerful primitives for managing it. Dynamic volume provisioning, a feature unique to Kubernetes, allows storage volumes to be created on-demand. Before dynamic provisioning, cluster administrators had to manually make calls to their cloud or storage provider to provision new storage volumes, and then create PersistentVolume objects to represent them in Kubernetes. With dynamic provisioning, these two steps are automated, eliminating the need for cluster administrators to pre-provision storage. Instead, the storage resources can be dynamically provisioned using the provisioner specified by the StorageClass object (see user-guide). StorageClasses are essentially blueprints that abstract away the underlying storage provider, as well as other parameters, like disk-type (e.g.; solid-state vs standard disks).

StorageClasses use provisioners that are specific to the storage platform or cloud provider to give Kubernetes access to the physical media being used. Several storage provisioners are provided in-tree (see user-guide), but additionally out-of-tree provisioners are now supported (see kubernetes-incubator).

In the Kubernetes 1.6 release, dynamic provisioning has been promoted to stable (having entered beta in 1.4). This is a big step forward in completing the Kubernetes storage automation vision, allowing cluster administrators to control how resources are provisioned and giving users the ability to focus more on their application. With all of these benefits, there are a few important user-facing changes (discussed below) that are important to understand before using Kubernetes 1.6.

Storage Classes and How to Use them

StorageClasses are the foundation of dynamic provisioning, allowing cluster administrators to define abstractions for the underlying storage platform. Users simply refer to a StorageClass by name in the PersistentVolumeClaim (PVC) using the “storageClassName” parameter.

In the following example, a PVC refers to a specific storage class named “gold”.

apiVersion: v1

kind: PersistentVolumeClaim

metadata:

  name: mypvc

  namespace: testns

spec:

  accessModes:

  - ReadWriteOnce

  resources:

    requests:

      storage: 100Gi

  storageClassName: gold

In order to promote the usage of dynamic provisioning this feature permits the cluster administrator to specify a default StorageClass. When present, the user can create a PVC without having specifying a storageClassName, further reducing the user’s responsibility to be aware of the underlying storage provider. When using default StorageClasses, there are some operational subtleties to be aware of when creating PersistentVolumeClaims (PVCs). This is particularly important if you already have existing PersistentVolumes (PVs) that you want to re-use:

  • PVs that are already “Bound” to PVCs will remain bound with the move to 1.6

    • They will not have a StorageClass associated with them unless the user manually adds it
    • If PVs become “Available” (i.e.; if you delete a PVC and the corresponding PV is recycled), then they are subject to the following
  • If storageClassName is not specified in the PVC, the default storage class will be used for provisioning.

    • Existing, “Available”, PVs that do not have the default storage class label will not be considered for binding to the PVC
  • If storageClassName is set to an empty string (‘’) in the PVC, no storage class will be used (i.e.; dynamic provisioning is disabled for this PVC)

    • Existing, “Available”, PVs (that do not have a specified storageClassName) will be considered for binding to the PVC
  • If storageClassName is set to a specific value, then the matching storage class will be used

    • Existing, “Available”, PVs that have a matching storageClassName will be considered for binding to the PVC
    • If no corresponding storage class exists, the PVC will fail. To reduce the burden of setting up default StorageClasses in a cluster, beginning with 1.6, Kubernetes installs (via the add-on manager) default storage classes for several cloud providers. To use these default StorageClasses, users do not need refer to them by name – that is, storageClassName need not be specified in the PVC.

The following table provides more detail on default storage classes pre-installed by cloud provider as well as the specific parameters used by these defaults.

Cloud Provider Default StorageClass Name Default Provisioner
Amazon Web Services gp2 aws-ebs
Microsoft Azure standard azure-disk
Google Cloud Platform standard gce-pd
OpenStack standard cinder
VMware vSphere thin vsphere-volume

While these pre-installed default storage classes are chosen to be “reasonable” for most storage users, this guide provides instructions on how to specify your own default.

Dynamically Provisioned Volumes and the Reclaim Policy

All PVs have a reclaim policy associated with them that dictates what happens to a PV once it becomes released from a claim (see user-guide). Since the goal of dynamic provisioning is to completely automate the lifecycle of storage resources, the default reclaim policy for dynamically provisioned volumes is “delete”. This means that when a PersistentVolumeClaim (PVC) is released, the dynamically provisioned volume is de-provisioned (deleted) on the storage provider and the data is likely irretrievable. If this is not the desired behavior, the user must change the reclaim policy on the corresponding PersistentVolume (PV) object after the volume is provisioned.

How do I change the reclaim policy on a dynamically provisioned volume?

You can change the reclaim policy by editing the PV object and changing the “persistentVolumeReclaimPolicy” field to the desired value. For more information on various reclaim policies see user-guide.

FAQs

How do I use a default StorageClass?

If your cluster has a default StorageClass that meets your needs, then all you need to do is create a PersistentVolumeClaim (PVC) and the default provisioner will take care of the rest – there is no need to specify the storageClassName:

apiVersion: v1

kind: PersistentVolumeClaim

metadata:

  name: mypvc

  namespace: testns

spec:

  accessModes:

  - ReadWriteOnce

  resources:

    requests:

      storage: 100Gi

Can I add my own storage classes?
Yes. To add your own storage class, first determine which provisioners will work in your cluster. Then, create a StorageClass object with parameters customized to meet your needs (see user-guide for more detail). For many users, the easiest way to create the object is to write a yaml file and apply it with “kubectl create -f”. The following is an example of a StorageClass for Google Cloud Platform named “gold” that creates a “pd-ssd”. Since multiple classes can exist within a cluster, the administrator may leave the default enabled for most workloads (since it uses a “pd-standard”), with the “gold” class reserved for workloads that need extra performance.

kind: StorageClass

apiVersion: storage.k8s.io/v1

metadata:

  name: gold

provisioner: kubernetes.io/gce-pd

parameters:

  type: pd-ssd

How do I check if I have a default StorageClass Installed?

You can use kubectl to check for StorageClass objects. In the example below there are two storage classes: “gold” and “standard”. The “gold” class is user-defined, and the “standard” class is installed by Kubernetes and is the default.

$ kubectl get sc

NAME                 TYPE

gold                 kubernetes.io/gce-pd   

standard (default)   kubernetes.io/gce-pd
$ kubectl describe storageclass standard

Name:     standard

IsDefaultClass: Yes

Annotations: storageclass.beta.kubernetes.io/is-default-class=true

Provisioner: kubernetes.io/gce-pd

Parameters: type=pd-standard

Events:         \<none\>

Can I delete/turn off the default StorageClasses?
You cannot delete the default storage class objects provided. Since they are installed as cluster addons, they will be recreated if they are deleted.

You can, however, disable the defaulting behavior by removing (or setting to false) the following annotation: storageclass.beta.kubernetes.io/is-default-class.

If there are no StorageClass objects marked with the default annotation, then PersistentVolumeClaim objects (without a StorageClass specified) will not trigger dynamic provisioning. They will, instead, fall back to the legacy behavior of binding to an available PersistentVolume object.

Can I assign my existing PVs to a particular StorageClass?
Yes, you can assign a StorageClass to an existing PV by editing the appropriate PV object and adding (or setting) the desired storageClassName field to it.

What happens if I delete a PersistentVolumeClaim (PVC)?
If the volume was dynamically provisioned, then the default reclaim policy is set to “delete”. This means that, by default, when the PVC is deleted, the underlying PV and storage asset will also be deleted. If you want to retain the data stored on the volume, then you must change the reclaim policy from “delete” to “retain” after the PV is provisioned.

–Saad Ali & Michelle Au, Software Engineers, and Matthew De Lio, Product Manager, Google

  • Post questions (or answer questions) on Stack Overflow
  • Join the community portal for advocates on K8sPort
  • Get involved with the Kubernetes project on GitHub
  • Follow us on Twitter @Kubernetesio for latest updates
  • Connect with the community on Slack
  • Download Kubernetes
@Kubernetesio View on Github #kubernetes-users Stack Overflow Download Kubernetes