Kubernetes Blog

Kubernetes 1.6: Multi-user, Multi-workloads at Scale

March 28 2017

Today we’re announcing the release of Kubernetes 1.6.

In this release the community’s focus is on scale and automation, to help you deploy multiple workloads to multiple users on a cluster. We are announcing that 5,000 node clusters are supported. We moved dynamic storage provisioning to stable. Role-based access control (RBAC), kubefed, kubeadm, and several scheduling features are moving to beta. We have also added intelligent defaults throughout to enable greater automation out of the box.

What’s New

Scale and Federation : Large enterprise users looking for proof of at-scale performance will be pleased to know that Kubernetes’ stringent scalability SLO now supports 5,000 node (150,000 pod) clusters. This 150% increase in total cluster size, powered by a new version of etcd v3 by CoreOS, is great news if you are deploying applications such as search or games which can grow to consume larger clusters.

For users who want to scale beyond 5,000 nodes or spread across multiple regions or clouds, federation lets you combine multiple Kubernetes clusters and address them through a single API endpoint. In this release, the kubefed command line utility graduated to beta - with improved support for on-premise clusters. kubefed now automatically configures kube-dns on joining clusters and can pass arguments to federated components.

Security and Setup : Users concerned with security will find that RBAC, now beta adds a significant security benefit through more tightly scoped default roles for system components. The default RBAC policies in 1.6 grant scoped permissions to control-plane components, nodes, and controllers. RBAC allows cluster administrators to selectively grant particular users or service accounts fine-grained access to specific resources on a per-namespace basis. RBAC users upgrading from 1.5 to 1.6 should view the guidance here

Users looking for an easy way to provision a secure cluster on physical or cloud servers can use kubeadm, which is now beta. kubeadm has been enhanced with a set of command line flags and a base feature set that includes RBAC setup, use of the Bootstrap Token system and an enhanced Certificates API.

Advanced Scheduling : This release adds a set of powerful and versatile scheduling constructs to give you greater control over how pods are scheduled, including rules to restrict pods to particular nodes in heterogeneous clusters, and rules to spread or pack pods across failure domains such as nodes, racks, and zones.

Node affinity/anti-affinity, now in beta, allows you to restrict pods to schedule only on certain nodes based on node labels. Use built-in or custom node labels to select specific zones, hostnames, hardware architecture, operating system version, specialized hardware, etc. The scheduling rules can be required or preferred, depending on how strictly you want the scheduler to enforce them.

A related feature, called taints and tolerations, makes it possible to compactly represent rules for excluding pods from particular nodes. The feature, also now in beta, makes it easy, for example, to dedicate sets of nodes to particular sets of users, or to keep nodes that have special hardware available for pods that need the special hardware by excluding pods that don’t need it.

Sometimes you want to co-schedule services, or pods within a service, near each other topologically, for example to optimize North-South or East-West communication. Or you want to spread pods of a service for failure tolerance, or keep antagonistic pods separated, or ensure sole tenancy of nodes. Pod affinity and anti-affinity, now in beta, enables such use cases by letting you set hard or soft requirements for spreading and packing pods relative to one another within arbitrary topologies (node, zone, etc.).

Lastly, for the ultimate in scheduling flexibility, you can run your own custom scheduler(s) alongside, or instead of, the default Kubernetes scheduler. Each scheduler is responsible for different sets of pods. Multiple schedulers is beta in this release. 

Dynamic Storage Provisioning : Users deploying stateful applications will benefit from the extensive storage automation capabilities in this release of Kubernetes.

Since its early days, Kubernetes has been able to automatically attach and detach storage, format disk, mount and unmount volumes per the pod spec, and do so seamlessly as pods move between nodes. In addition, the PersistentVolumeClaim (PVC) and PersistentVolume (PV) objects decouple the request for storage from the specific storage implementation, making the pod spec portable across a range of cloud and on-premise environments. In this release StorageClass and dynamic volume provisioning are promoted to stable, completing the automation story by creating and deleting storage on demand, eliminating the need to pre-provision.

The design allows cluster administrators to define and expose multiple flavors of storage within a cluster, each with a custom set of parameters. End users can stop worrying about the complexity and nuances of how storage is provisioned, while still selecting from multiple storage options.

In 1.6 Kubernetes comes with a set of built-in defaults to completely automate the storage provisioning lifecycle, freeing you to work on your applications. Specifically, Kubernetes now pre-installs system-defined StorageClass objects for AWS, Azure, GCP, OpenStack and VMware vSphere by default. This gives Kubernetes users on these providers the benefits of dynamic storage provisioning without having to manually setup StorageClass objects. This is a change in the default behavior of PVC objects on these clouds. Note that default behavior is that dynamically provisioned volumes are created with the “delete” reclaim policy. That means once the PVC is deleted, the dynamically provisioned volume is automatically deleted so users do not have the extra step of ‘cleaning up’.

In addition, we have expanded the range of storage supported overall including:

  • ScaleIO Kubernetes Volume Plugin enabling pods to seamlessly access and use data stored on ScaleIO volumes.
  • Portworx Kubernetes Volume Plugin adding the capability to use Portworx as a storage provider for Kubernetes clusters. Portworx pools your server capacity and turns your servers or cloud instances into converged, highly available compute and storage nodes.
  • Support for NFSv3, NFSv4, and GlusterFS on clusters using the COS node image 
  • Support for user-written/run dynamic PV provisioners. A golang library and examples can be found here.
  • Beta support for mount options in persistent volumes

Container Runtime Interface, etcd v3 and Daemon set updates : while users may not directly interact with the container runtime or the API server datastore, they are foundational components for user facing functionality in Kubernetes’. As such the community invests in expanding the capabilities of these and other system components.

  • The Docker-CRI implementation is beta and is enabled by default in kubelet. Alpha support for other runtimes, cri-o, frakti, rkt, has also been implemented.
  • The default backend storage for the API server has been upgraded to use etcd v3 by default for new clusters. If you are upgrading from a 1.5 cluster, care should be taken to ensure continuity by planning a data migration window. 
  • Node reliability is improved as Kubelet exposes an admin configurable Node Allocatable feature to reserve compute resources for system daemons.
  • Daemon set updates lets you perform rolling updates on a daemon set

Alpha features : this release was mostly focused on maturing functionality, however, a few alpha features were added to support the roadmap

  • Out-of-tree cloud provider support adds a new cloud-controller-manager binary that may be used for testing the new out-of-core cloud provider flow
  • Per-pod-eviction in case of node problems combined with tolerationSeconds, lets users tune the duration a pod stays bound to a node that is experiencing problems
  • Pod Injection Policy adds a new API resource PodPreset to inject information such as secrets, volumes, volume mounts, and environment variables into pods at creation time.
  • Custom metrics support in the Horizontal Pod Autoscaler changed to use 
  • Multiple Nvidia GPU support is introduced with the Docker runtime only

These are just some of the highlights in our first release for the year. For a complete list please visit the release notes.

This release is possible thanks to our vast and open community. Together, we’ve pushed nearly 5,000 commits by some 275 authors. To bring our many advocates together, the community has launched a new program called K8sPort, an online hub where the community can participate in gamified challenges and get credit for their contributions. Read more about the program here.

Release Process

A big thanks goes out to the release team for 1.6 (lead by Dan Gillespie of CoreOS) for their work bringing the 1.6 release to light. This release team is an exemplar of the Kubernetes community’s commitment to community governance. Dan is the first non-Google release manager and he, along with the rest of the team, worked throughout the release (building on the 1.5 release manager, Saad Ali’s, great work) to uncover and document tribal knowledge, shine light on tools and processes that still require special permissions, and prioritize work to improve the Kubernetes release process. Many thanks to the team.

User Adoption

We’re continuing to see rapid adoption of Kubernetes in all sectors and sizes of businesses. Furthermore, adoption is coming from across the globe, from a startup in Tennessee, USA to a Fortune 500 company in China. 

  • JD.com, one of China’s largest internet companies, uses Kubernetes in conjunction with their OpenStack deployment. They’ve move 20% of their applications thus far on Kubernetes and are already running 20,000 pods daily. Read more about their setup here
  • Spire, a startup based in Tennessee, witnessed their public cloud provider experience an outage, but suffered zero downtime because Kubernetes was able to move their workloads to different zones. Read their full experience here.

“With Kubernetes, there was never a moment of panic, just a sense of awe watching the automatic mitigation as it happened.”

  • Share your Kubernetes use case story with the community here.

Kubernetes 1.6 is available for download here on GitHub and via get.k8s.io. To get started with Kubernetes, try one of the these interactive tutorials

Get Involved
CloudNativeCon + KubeCon in Berlin is this week March 29-30, 2017. We hope to get together with much of the community and share more there!

Share your voice at our weekly community meeting

Many thanks for your contributions and advocacy!

– Aparna Sinha, Senior Product Manager, Kubernetes, Google

PS: read this series of in-depth articles on what’s new in Kubernetes 1.6

The K8sPort: Engaging Kubernetes Community One Activity at a Time

March 24 2017

Editor’s note: Today’s post is by Ryan Quackenbush, Advocacy Programs Manager at Apprenda, showing a new community portal for Kubernetes advocates: the K8sPort.

The K8sPort is a hub designed to help you, the Kubernetes community, earn credit for the hard work you’re putting forth in making this one of the most successful open source projects ever. Back at KubeCon Seattle in November, I presented a lightning talk of a preview of K8sPort.

This hub, and our intentions in helping to drive this initiative in the community, grew out of a desire to help cultivate an engaged community of Kubernetes advocates. This is done through gamification in a community hub full of different activities called “challenges,” which are activities meant to help direct members of the community to attend various events and meetings, share and provide feedback on important content, answer questions posed on sites like Stack Overflow, and more.

By completing these challenges, you collect points and can redeem them for different types of rewards and experiences, examples of which include charitable donations, gift certificates, conference tickets and more. As advocates complete challenges and gain points, they’ll earn performance-related badges, move up in community tiers and participate in a fun community leaderboard.

My presentation at KubeCon, simply put, was a call for early signups. Those who’ve been piloting the program have, for the most part, had positive things to say about their experiences.

I know I’m the only one playing with @K8sPort but it may be the most important thing the #Kubernetes community has.

— Justin Garrison (@rothgar) November 22, 2016

  • “Great way of improving the community and documentation. The gamification of Kubernetes gave me more insight into the stack as well.”
    • Jonas Kint, Devops Engineer at Showpad
  • _“A great way to engage with the kubernetes project and also help the community. Fun stuff.” _
    • Kevin Duane, Systems Engineer at The Walt Disney Company
  • “K8sPort seems like an awesome idea for incentivising giving back to the community in a way that will hopefully cause more valuable help from more people than might usually be helping.”
    • William Stewart, Site Reliability Engineer at Superbalist

Today I am pleased to announce that the Cloud Native Computing Foundation (CNCF) is making the K8sPort generally available to the entire contributing community! We’ve simplified the signup process by allowing would-be advocates to authenticate and register through the use of their existing GitHub accounts.

Screen Shot 2017-03-22 at 10.59.04 AM.png

If you’re a contributing member of the Kubernetes community and you have an active GitHub account tied to the Kubernetes repository at GitHub, you can authenticate using your GitHub credentials and gain access to the K8sPort.

Screen Shot 2017-03-22 at 12.48.08 PM.png Beyond the challenges that get posted regularly, community members will be recognized and compile points for things they’re already doing today. This will be accomplished through the K8sPort’s full integration with GitHub and the core Kubernetes repository. Once you authenticate, you’ll automatically begin earning points and recognition for various contributions – including logging issues, making pull requests, code commits & more.

If you’re interested in joining the advocacy hub, please join us at k8sport.org! We hope you’re as excited about what you see as we are to continue to build it and present it to you.

For a quick walkthrough on K8sPort authentication and the hub itself, see this quick demo, below.

–Ryan Quackenbush, Advocacy Programs Manager, Apprenda

Deploying PostgreSQL Clusters using StatefulSets

February 24 2017

Editor’s note: Today’s guest post is by Jeff McCormick, a developer at Crunchy Data, showing how to build a PostgreSQL cluster using the new Kubernetes StatefulSet feature.

In an earlier post, I described how to deploy a PostgreSQL cluster using Helm, a Kubernetes package manager. The following example provides the steps for building a PostgreSQL cluster using the new Kubernetes StatefulSets feature.

StatefulSets Example

Step 1 - Create Kubernetes Environment

StatefulSets is a new feature implemented in Kubernetes 1.5 (prior versions it was known as PetSets). As a result, running this example will require an environment based on Kubernetes 1.5.0 or above.

The example in this blog deploys on Centos7 using kubeadm. Some instructions on what kubeadm provides and how to deploy a Kubernetes cluster is located here.

Step 2 - Install NFS

The example in this blog uses NFS for the Persistent Volumes, but any shared file system would also work (ex: ceph, gluster).

The example script assumes your NFS server is running locally and your hostname resolves to a known IP address.

In summary, the steps used to get NFS working on a Centos 7 host are as follows:

sudo setsebool -P virt\_use\_nfs 1

sudo yum -y install nfs-utils libnfsidmap

sudo systemctl enable rpcbind nfs-server

sudo systemctl start rpcbind nfs-server rpc-statd nfs-idmapd

sudo mkdir /nfsfileshare

sudo chmod 777 /nfsfileshare/

sudo vi /etc/exports

sudo exportfs -r

The /etc/exports file should contain a line similar to this one except with the applicable IP address specified:


After these steps NFS should be running in the test environment.

Step 3 - Clone the Crunchy PostgreSQL Container Suite

The example used in this blog is found at in the Crunchy Containers GitHub repo here. Clone the Crunchy Containers repository to your test Kubernertes host and go to the example:

cd $HOME

git clone https://github.com/CrunchyData/crunchy-containers.git

cd crunchy-containers/examples/kube/statefulset

Next, pull down the Crunchy PostgreSQL container image:

docker pull crunchydata/crunchy-postgres:centos7-9.5-1.2.6

Step 4 - Run the Example

To begin, it is necessary to set a few of the environment variables used in the example:

export BUILDBASE=$HOME/crunchy-containers

export CCP\_IMAGE\_TAG=centos7-9.5-1.2.6

BUILDBASE is where you cloned the repository and CCP_IMAGE_TAG is the container image version we want to use.

Next, run the example:


That script will create several Kubernetes objects including:

  • Persistent Volumes (pv1, pv2, pv3)
  • Persistent Volume Claim (pgset-pvc)
  • Service Account (pgset-sa)
  • Services (pgset, pgset-master, pgset-replica)
  • StatefulSet (pgset)
  • Pods (pgset-0, pgset-1)

At this point, two pods will be running in the Kubernetes environment:

$ kubectl get pod


pgset-0   1/1       Running   0          2m

pgset-1   1/1       Running   1          2m

Immediately after the pods are created, the deployment will be as depicted below:

Step 5 - What Just Happened?

This example will deploy a StatefulSet, which in turn creates two pods.

The containers in those two pods run the PostgreSQL database. For a PostgreSQL cluster, we need one of the containers to assume the master role and the other containers to assume the replica role.

So, how do the containers determine who will be the master, and who will be the replica?

This is where the new StateSet mechanics come into play. The StateSet mechanics assign a unique ordinal value to each pod in the set.

The StatefulSets provided unique ordinal value always start with 0. During the initialization of the container, each container examines its assigned ordinal value. An ordinal value of 0 causes the container to assume the master role within the PostgreSQL cluster. For all other ordinal values, the container assumes a replica role. This is a very simple form of discovery made possible by the StatefulSet mechanics.

PostgreSQL replicas are configured to connect to the master database via a Service dedicated to the master database. In order to support this replication, the example creates a separate Service for each of the master role and the replica role. Once the replica has connected, the replica will begin replicating state from the master.

During the container initialization, a master container will use a Service Account (pgset-sa) to change it’s container label value to match the master Service selector. Changing the label is important to enable traffic destined to the master database to reach the correct container within the Stateful Set. All other pods in the set assume the replica Service label by default.

Step 6 - Deployment Diagram

The example results in a deployment depicted below:

In this deployment, there is a Service for the master and a separate Service for the replica. The replica is connected to the master and replication of state has started.

The Crunchy PostgreSQL container supports other forms of cluster deployment, the style of deployment is dictated by setting the PG_MODE environment variable for the container. In the case of a StatefulSet deployment, that value is set to: PG_MODE=set

This environment variable is a hint to the container initialization logic as to the style of deployment we intend.

Step 7 - Testing the Example

The tests below assume that the psql client has been installed on the test system. If if not, the psql client has been previously installed, it can be installed as follows:

sudo yum -y install postgresql

In addition, the tests below assume that the tested environment DNS resolves to the Kube DNS and that the tested environment DNS search path is specified to match the applicable Kube namespace and domain. The master service is named pgset-master and the replica service is named pgset-replica.

Test the master as follows (the password is password):

psql -h pgset-master -U postgres postgres -c 'table pg\_stat\_replication'

If things are working, the command above will return output indicating that a single replica is connecting to the master.

Next, test the replica as follows:

psql -h pgset-replica -U postgres postgres  -c 'create table foo (id int)'

The command above should fail as the replica is read-only within a PostgreSQL cluster.

Next, scale up the set as follows:

kubectl scale statefulset pgset --replicas=3

The command above should successfully create a new replica pod called pgset-2 as depicted below:

Step 8 - Persistence Explained

Take a look at the persisted PostgreSQL data files on the resulting NFS mount path:

$ ls -l /nfsfileshare/

total 12

drwx------ 20   26   26 4096 Jan 17 16:35 pgset-0

drwx------ 20   26   26 4096 Jan 17 16:35 pgset-1

drwx------ 20   26   26 4096 Jan 17 16:48 pgset-2

Each container in the stateful set binds to the single NFS Persistent Volume Claim (pgset-pvc) created in the example script.

Since NFS and the PVC can be shared, each pod can write to this NFS path.

The container is designed to create a subdirectory on that path using the pod host name for uniqueness.


StatefulSets is an exciting feature added to Kubernetes for container builders that are implementing clustering. The ordinal values assigned to the set provide a very simple mechanism to make clustering decisions when deploying a PostgreSQL cluster.

–Jeff McCormick, Developer, Crunchy Data

Containers as a Service, the foundation for next generation PaaS

February 21 2017

Today’s post is by Brendan Burns, Partner Architect, at Microsoft & Kubernetes co-founder.

Containers are revolutionizing the way that people build, package and deploy software. But what is often overlooked is how they are revolutionizing the way that people build the software that builds, packages and deploys software. (it’s ok if you have to read that sentence twice…) Today, and in a talk at Container World tomorrow, I’m taking a look at how container orchestrators like Kubernetes form the foundation for next generation platform as a service (PaaS). In particular, I’m interested in how cloud container as a service (CaaS) platforms like Azure Container Service, Google Container Engine and others are becoming the new infrastructure layer that PaaS is built upon.

To see this, it’s important to consider the set of services that have traditionally been provided by PaaS platforms:

  • Source code and executable packaging and distribution
  • Reliable, zero-downtime rollout of software versions
  • Healing, auto-scaling, load balancing

When you look at this list, it’s clear that most of these traditional “PaaS” roles have now been taken over by containers. The container image and container image build tooling has become the way to package up your application. Container registries have become the way to distribute your application across the world. Reliable software rollout is achieved using orchestrator concepts like Deployment in Kubernetes, and service healing, auto-scaling and load-balancing are all properties of an application deployed in Kubernetes using ReplicaSets and Services.

What then is left for PaaS? Is PaaS going to be replaced by container as a service? I think the answer is “no.” The piece that is left for PaaS is the part that was always the most important part of PaaS in the first place, and that’s the opinionated developer experience. In addition to all of the generic parts of PaaS that I listed above, the most important part of a PaaS has always been the way in which the developer experience and application framework made developers more productive within the boundaries of the platform. PaaS enables developers to go from source code on their laptop to a world-wide scalable service in less than an hour. That’s hugely powerful. 

However, in the world of traditional PaaS, the skills needed to build PaaS infrastructure itself, the software on which the user’s software ran, required very strong skills and experience with distributed systems. Consequently, PaaS tended to be built by distributed system engineers rather than experts in a particular vertical developer experience. This means that PaaS platforms tended towards general purpose infrastructure rather than targeting specific verticals. Recently, we have seen this start to change, first with PaaS targeted at mobile API backends, and later with PaaS targeting “function as a service”. However, these products were still built from the ground up on top of raw infrastructure.

More recently, we are starting to see these platforms build on top of container infrastructure. Taking for example “function as a service” there are at least two (and likely more) open source implementations of functions as a service that run on top of Kubernetes (fission and funktion). This trend will only continue. Building a platform as a service, on top of container as a service is easy enough that you could imagine giving it out as an undergraduate computer science assignment. This ease of development means that individual developers with specific expertise in a vertical (say software for running three-dimensional simulations) can and will build PaaS platforms targeted at that specific vertical experience. In turn, by targeting such a narrow experience, they will build an experience that fits that narrow vertical perfectly, making their solution a compelling one in that target market.

This then points to the other benefit of next generation PaaS being built on top of container as a service. It frees the developer from having to make an “all-in” choice on a particular PaaS platform. When layered on top of container as a service, the basic functionality (naming, discovery, packaging, etc) are all provided by the CaaS and thus common across multiple PaaS that happened to be deployed on top of that CaaS. This means that developers can mix and match, deploying multiple PaaS to the same container infrastructure, and choosing for each application the PaaS platform that best suits that particular platform. Also, importantly, they can choose to “drop down” to raw CaaS infrastructure if that is a better fit for their application. Freeing PaaS from providing the infrastructure layer, enables PaaS to diversify and target specific experiences without fear of being too narrow. The experiences become more targeted, more powerful, and yet by building on top of container as a service, more flexible as well.

Kubernetes is infrastructure for next generation applications, PaaS and more. Given this, I’m really excited by our announcement today that Kubernetes on Azure Container Service has reached general availability. When you deploy your next generation application to Azure, whether on a PaaS or deployed directly onto Kubernetes itself (or both) you can deploy it onto a managed, supported Kubernetes cluster.

Furthermore, because we know that the world of PaaS and software development in general is a hybrid one, we’re excited to announce the preview availability of Windows clusters in Azure Container Service. We’re also working on hybrid clusters in ACS-Engine and expect to roll those out to general availability in the coming months.

I’m thrilled to see how containers and container as a service is changing the world of compute, I’m confident that we’re only scratching the surface of the transformation we’ll see in the coming months and years.

Brendan Burns, Partner Architect, at Microsoft and co-founder of Kubernetes

Inside JD.com's Shift to Kubernetes from OpenStack

February 10 2017

Editor’s note: Today’s post is by the Infrastructure Platform Department team at JD.com about their transition from OpenStack to Kubernetes. JD.com is one of China’s largest companies and the first Chinese Internet company to make the Global Fortune 500 list.

History of cluster building

The era of physical machines (2004-2014)

Before 2014, our company’s applications were all deployed on the physical machine. In the age of physical machines, we needed to wait an average of one week for the allocation to application coming on-line. Due to the lack of isolation, applications would affected each other, resulting in a lot of potential risks. At that time, the average number of tomcat instances on each physical machine was no more than nine. The resource of physical machine was seriously wasted and the scheduling was inflexible. The time of application migration took hours due to the breakdown of physical machines. And the auto-scaling cannot be achieved. To enhance the efficiency of application deployment, we developed compilation-packaging, automatic deployment, log collection, resource monitoring and some other systems.

Containerized era (2014-2016)

The Infrastructure Platform Department (IPD) led by Liu Haifeng–Chief Architect of JD.COM, sought a new resolution in the fall of 2014. Docker ran into our horizon. At that time, docker had been rising, but was slightly weak and lacked of experience in production environment. We had repeatedly tested docker. In addition, docker was customized to fix a couple of issues, such as system crash caused by device mapper and some Linux kernel bugs. We also added plenty of new features into docker, including disk speed limit, capacity management, and layer merging in image building and so on.

To manage the container cluster properly, we chose the architecture of OpenStack + Novadocker driver. Containers are managed as virtual machines. It is known as the first generation of JD container engine platform–JDOS1.0 (JD Datacenter Operating System). The main purpose of JDOS 1.0 is to containerize the infrastructure. All applications run in containers rather than physical machines since then. As for the operation and maintenance of applications, we took full advantage of existing tools. The time for developers to request computing resources in production environment reduced to several minutes rather than a week. After the pooling of computing resources, even the scaling of 1,000 containers would be finished in seconds. Application instances had been isolated from each other. Both the average deployment density of applications and the physical machine utilization had increased by three times, which brought great economic benefits.

We deployed clusters in each IDC and provided unified global APIs to support deployment across the IDC. There are 10,000 compute nodes at most and 4,000 at least in a single OpenStack distributed container cluster in our production environment. The first generation of container engine platform (JDOS 1.0) successfully supported the “6.18” and “11.11” promotional activities in both 2015 and 2016. There are already 150,000 running containers online by November 2016.

“6.18” and “11.11” are known as the two most popular online promotion of JD.COM, similar to the black Friday promotions. Fulfilled orders in November 11, 2016 reached 30 million. 

In the practice of developing and promoting JDOS 1.0, applications were migrated directly from physical machines to containers. Essentially, JDOS 1.0 was an implementation of IaaS. Therefore, deployment of applications was still heavily dependent on compilation-packaging and automatic deployment tools. However, the practice of JDOS1.0 is very meaningful. Firstly, we successfully moved business into containers. Secondly, we have a deep understanding of container network and storage, and know how to polish them to the best. Finally, all the experiences lay a solid foundation for us to develop a brand new application container platform.

New container engine platform (JDOS 2.0)

Platform architecture

When JDOS 1.0 grew from 2,000 containers to 100,000, we launched a new container engine platform (JDOS 2.0). The goal of JDOS 2.0 is not just an infrastructure management platform, but also a container engine platform faced to applications. On the basic of JDOS 1.0 and Kubernetes, JDOS 2.0 integrates the storage and network of JDOS 1.0, gets through the process of CI/CD from the source to the image, and finally to the deployment. Also, JDOS 2.0 provides one-stop service such as log, monitor, troubleshooting, terminal and orchestration. The platform architecture of JDOS 2.0 is shown below.


Function Product
Source Code Management Gitlab
Container Tool Docker
Container Networking Cane
Container Engine Kubernetes
Image Registry Harbor
CI Tool Jenkins
Log Management Logstash + Elastic Search
Monitor Prometheus

In JDOS 2.0, we define two levels, system and application. A system consists of several applications and an application consists of several Pods which provide the same service. In general, a department can apply for one or more systems which directly corresponds to the namespace of Kubernetes. This means that the Pods of the same system will be in the same namespace.

Most of the JDOS 2.0 components (GitLab / Jenkins / Harbor / Logstash / Elastic Search / Prometheus) are also containerized and deployed on the Kubernetes platform.

One Stop Solution


  1. 1.JDOS 2.0 takes docker image as the core to implement continuous integration and continuous deployment.
  2. 2.Developer pushes code to git.
  3. 3.Git triggers the jenkins master to generate build job.
  4. 4.Jenkins master invokes Kubernetes to create jenkins slave Pod.
  5. 5.Jenkins slave pulls the source code, compiles and packs.
  6. 6.Jenkins slave sends the package and the Dockerfile to the image build node with docker.
  7. 7.The image build node builds the image.
  8. 8.The image build node pushes the image to the image registry Harbor.
  9. 9.User creates or updates app Pods in different zone.

The docker image in JDOS 1.0 consisted primarily of the operating system and the runtime software stack of the application. So, the deployment of applications was still dependent on the auto-deployment and some other tools. While in JDOS 2.0, the deployment of the application is done during the image building. And the image contains the complete software stack, including App. With the image, we can achieve the goal of running applications as designed in any environment.


Networking and External Service Load Balancing

JDOS 2.0 takes the network solution of JDOS 1.0, which is implemented with the VLAN model of OpenStack Neutron. This solution enables highly efficient communication between containers, making it ideal for a cluster environment within a company. Each Pod occupies a port in Neutron, with a separate IP. Based on the Container Network Interface standard (CNI) standard, we have developed a new project Cane for integrating kubelet and Neutron.


At the same time, Cane is also responsible for the management of LoadBalancer in Kubernetes service. When a LoadBalancer is created / deleted / modified, Cane will call the creating / removing / modifying interface of the lbaas service in Neutron. In addition, the Hades component in the Cane project provides an internal DNS resolution service for the Pods.

The source code of the Cane project is currently being finished and will be released on GitHub soon.

Flexible Scheduling

D:\百度云同步盘\徐新坤-新人培训计划\docker\MAE\分享\schedule.pngJDOS 2.0 accesses applications, including big data, web applications, deep learning and some other types, and takes more diverse and flexible scheduling approaches. In some IDCs, we experimentally mixed deployment of online tasks and offline tasks. Compared to JDOS 1.0, overall resource utilization increased by about 30%.


The rich functionality of Kubernetes allows us to pay more attention to the entire ecosystem of the platform, such as network performance, rather than the platform itself. In particular, the SREs highly appreciated the functionality of replication controller. With it, the scaling of the applications is achieved in several seconds. JDOS 2.0 now has accessed about 20% of the applications, and deployed 2 clusters with about 20,000 Pods running daily. We plan to access more applications of our company, to replace the current JDOS 1.0. And we are also glad to share our experience in this process with the community.

Thank you to all the contributors of Kubernetes and the other open source projects.

–Infrastructure Platform Department team at JD.com

Run Deep Learning with PaddlePaddle on Kubernetes

February 08 2017

Editor’s note: Today’s post is a joint post from the deep learning team at Baidu and the etcd team at CoreOS.

What is PaddlePaddle

PaddlePaddle is an easy-to-use, efficient, flexible and scalable deep learning platform originally developed at Baidu for applying deep learning to Baidu products since 2014.

There have been more than 50 innovations created using PaddlePaddle supporting 15 Baidu products ranging from the search engine, online advertising, to Q&A and system security.

In September 2016, Baidu open sourced PaddlePaddle, and it soon attracted many contributors from outside of Baidu.

Why Run PaddlePaddle on Kubernetes

PaddlePaddle is designed to be slim and independent of computing infrastructure. Users can run it on top of Hadoop, Spark, Mesos, Kubernetes and others.. We have a strong interest with Kubernetes because of its flexibility, efficiency and rich features.

While we are applying PaddlePaddle in various Baidu products, we noticed two main kinds of PaddlePaddle usage – research and product. Research data does not change often, and the focus is fast experiments to reach the expected scientific measurement. Products data changes often. It usually comes from log messages generated from the Web services.

A successful deep learning project includes both the research and the data processing pipeline. There are many parameters to be tuned. A lot of engineers work on the different parts of the project simultaneously.

To ensure the project is easy to manage and utilize hardware resource efficiently, we want to run all parts of the project on the same infrastructure platform.

The platform should provide:

  • fault-tolerance. It should abstract each stage of the pipeline as a service, which consists of many processes that provide high throughput and robustness through redundancy.

  • auto-scaling. In the daytime, there are usually many active users, the platform should scale out online services. While during nights, the platform should free some resources for deep learning experiments.

  • job packing and isolation. It should be able to assign a PaddlePaddle trainer process requiring the GPU, a web backend service requiring large memory, and a CephFS process requiring disk IOs to the same node to fully utilize its hardware.

What we want is a platform which runs the deep learning system, the Web server (e.g., Nginx), the log collector (e.g., fluentd), the distributed queue service (e.g., Kafka), the log joiner and other data processors written using Storm, Spark, and Hadoop MapReduce on the same cluster. We want to run all jobs – online and offline, production and experiments – on the same cluster, so we could make full utilization of the cluster, as different kinds of jobs require different hardware resource.

We chose container based solutions since the overhead introduced by VMs is contradictory to our goal of efficiency and utilization.

Based on our research of different container based solutions, Kubernetes fits our requirement the best.

Distributed Training on Kubernetes

PaddlePaddle supports distributed training natively. There are two roles in a PaddlePaddle cluster: parameter server and trainer. Each parameter server process maintains a shard of the global model. Each trainer has its local copy of the model, and uses its local data to update the model. During the training process, trainers send model updates to parameter servers, parameter servers are responsible for aggregating these updates, so that trainers can synchronize their local copy with the global model.

Figure 1: Model is partitioned into two shards. Managed by two parameter servers respectively.

Some other approaches use a set of parameter servers to collectively hold a very large model in the CPU memory space on multiple hosts. But in practice, it is not often that we have such big models, because it would be very inefficient to handle very large model due to the limitation of GPU memory. In our configuration, multiple parameter servers are mostly for fast communications. Suppose there is only one parameter server process working with all trainers, the parameter server would have to aggregate gradients from all trainers and becomes a bottleneck. In our experience, an experimentally efficient configuration includes the same number of trainers and parameter servers. And we usually run a pair of trainer and parameter server on the same node. In the following Kubernetes job configuration, we start a job that runs N Pods, and in each Pod there are a parameter server and a trainer process.


apiVersion: batch/v1

kind: Job


  name: PaddlePaddle-cluster-job


  parallelism: 3

  completions: 3



      name: PaddlePaddle-cluster-job



      - name: jobpath


          path: /home/admin/efs


      - name: trainer

        image: your\_repo/paddle:mypaddle

        command: ["bin/bash",  "-c", "/root/start.sh"]


        - name: JOB\_NAME

          value: paddle-cluster-job

        - name: JOB\_PATH

          value: /home/jobpath

        - name: JOB\_NAMESPACE

          value: default


        - name: jobpath

          mountPath: /home/jobpath

      restartPolicy: Never

We can see from the config that parallelism, completions are both set to 3. So this job will simultaneously start up 3 PaddlePaddle pods, and this job will be finished when all 3 pods finishes.


Figure 2: Job A of three pods and Job B of one pod running on two nodes. |

The entrypoint of each pod is start.sh. It downloads data from a storage service, so that trainers can read quickly from the pod-local disk space. After downloading completes, it runs a Python script, start_paddle.py, which starts a parameter server, waits until parameter servers of all pods are ready to serve, and then starts the trainer process in the pod.

This waiting is necessary because each trainer needs to talk to all parameter servers, as shown in Figure. 1. Kubernetes API enables trainers to check the status of pods, so the Python script could wait until all parameter servers’ status change to “running” before it triggers the training process.

Currently, the mapping from data shards to pods/trainers is static. If we are going to run N trainers, we would need to partition the data into N shards, and statically assign each shard to a trainer. Again we rely on the Kubernetes API to enlist pods in a job so could we index pods / trainers from 1 to N. The i-th trainer would read the i-th data shard.

Training data is usually served on a distributed filesystem. In practice we use CephFS on our on-premise clusters and Amazon Elastic File System on AWS. If you are interested in building a Kubernetes cluster to run distributed PaddlePaddle training jobs, please follow this tutorial.

What’s Next

We are working on running PaddlePaddle with Kubernetes more smoothly.

As you might notice the current trainer scheduling fully relies on Kubernetes based on a static partition map. This approach is simple to start, but might cause a few efficiency problems.

First, slow or dead trainers block the entire job. There is no controlled preemption or rescheduling after the initial deployment. Second, the resource allocation is static. So if Kubernetes has more available resources than we anticipated, we have to manually change the resource requirements. This is tedious work, and is not aligned with our efficiency and utilization goal.

To solve the problems mentioned above, we will add a PaddlePaddle master that understands Kubernetes API, can dynamically add/remove resource capacity, and dispatches shards to trainers in a more dynamic manner. The PaddlePaddle master uses etcd as a fault-tolerant storage of the dynamic mapping from shards to trainers. Thus, even if the master crashes, the mapping is not lost. Kubernetes can restart the master and the job will keep running.

Another potential improvement is better PaddlePaddle job configuration. Our experience of having the same number of trainers and parameter servers was mostly collected from using special-purpose clusters. That strategy was observed performant on our clients’ clusters that run only PaddlePaddle jobs. However, this strategy might not be optimal on general-purpose clusters that run many kinds of jobs.

PaddlePaddle trainers can utilize multiple GPUs to accelerate computations. GPU is not a first class resource in Kubernetes yet. We have to manage GPUs semi-manually. We would love to work with Kubernetes community to improve GPU support to ensure PaddlePaddle runs the best on Kubernetes.

–Yi Wang, Baidu Research and Xiang Li, CoreOS

Highly Available Kubernetes Clusters

February 02 2017

Today’s post shows how to set-up a reliable, highly available distributed Kubernetes cluster. The support for running such clusters on Google Compute Engine (GCE) was added as an alpha feature in Kubernetes 1.5 release.


We will create a Highly Available Kubernetes cluster, with master replicas and worker nodes distributed among three zones of a region. Such setup will ensure that the cluster will continue operating during a zone failure.

Setting Up HA cluster

The following instructions apply to GCE. First, we will setup a cluster that will span over one zone (europe-west1-b), will contain one master and three worker nodes and will be HA-compatible (will allow adding more master replicas and more worker nodes in multiple zones in future). To implement this, we’ll export the following environment variables:


$ export NUM\_NODES=3

$ export MULTIZONE=true

$ export ENABLE\_ETCD\_QUORUM\_READ=true

and run kube-up script (note that the entire cluster will be initially placed in zone europe-west1-b):

$ KUBE\_GCE\_ZONE=europe-west1-b ./cluster/kube-up.sh

Now, we will add two additional pools of worker nodes, each of three nodes, in zones europe-west1-c and europe-west1-d (more details on adding pools of worker nodes can be find here):

$ KUBE\_USE\_EXISTING\_MASTER=true KUBE\_GCE\_ZONE=europe-west1-c ./cluster/kube-up.sh

$ KUBE\_USE\_EXISTING\_MASTER=true KUBE\_GCE\_ZONE=europe-west1-d ./cluster/kube-up.sh

To complete setup of HA cluster, we will add two master replicase, one in zone europe-west1-c, the other in europe-west1-d:

$ KUBE\_GCE\_ZONE=europe-west1-c KUBE\_REPLICATE\_EXISTING\_MASTER=true ./cluster/kube-up.sh

$ KUBE\_GCE\_ZONE=europe-west1-d KUBE\_REPLICATE\_EXISTING\_MASTER=true ./cluster/kube-up.sh

Note that adding the first replica will take longer (~15 minutes), as we need to reassign the IP of the master to the load balancer in front of replicas and wait for it to propagate (see design doc for more details).

Verifying in HA cluster works as intended

We may now list all nodes present in the cluster:

$ kubectl get nodes

NAME                      STATUS                AGE

kubernetes-master         Ready,SchedulingDisabled 48m

kubernetes-master-2d4     Ready,SchedulingDisabled 5m

kubernetes-master-85f     Ready,SchedulingDisabled 32s

kubernetes-minion-group-6s52 Ready                 39m

kubernetes-minion-group-cw8e Ready                 48m

kubernetes-minion-group-fw91 Ready                 48m

kubernetes-minion-group-h2kn Ready                 31m

kubernetes-minion-group-ietm Ready                 39m

kubernetes-minion-group-j6lf Ready                 31m

kubernetes-minion-group-soj7 Ready                 31m

kubernetes-minion-group-tj82 Ready                 39m

kubernetes-minion-group-vd96 Ready                 48m

As we can see, we have 3 master replicas (with disabled scheduling) and 9 worker nodes.

We will deploy a sample application (nginx server) to verify that our cluster is working correctly:

$ kubectl run nginx --image=nginx --expose --port=80

After waiting for a while, we can verify that both the deployment and the service were correctly created and are running:

$ kubectl get pods



nginx-3449338310-m7fjm 1/1 Running 0     4s


$ kubectl run -i --tty test-a --image=busybox /bin/sh

If you don't see a command prompt, try pressing enter.

# wget -q -O- http://nginx.default.svc.cluster.local


\<title\>Welcome to nginx!\</title\>


Now, let’s simulate failure of one of master’s replicas by executing halt command on it (kubernetes-master-137, zone europe-west1-c):

$ gcloud compute ssh kubernetes-master-2d4 --zone=europe-west1-c


$ sudo halt

After a while the master replica will be marked as NotReady:

$ kubectl get nodes

NAME                      STATUS                   AGE

kubernetes-master         Ready,SchedulingDisabled 51m

kubernetes-master-2d4     NotReady,SchedulingDisabled 8m

kubernetes-master-85f     Ready,SchedulingDisabled 4m


However, the cluster is still operational. We may verify it by checking if our nginx server works correctly:

$ kubectl run -i --tty test-b --image=busybox /bin/sh

If you don't see a command prompt, try pressing enter.

# wget -q -O- http://nginx.default.svc.cluster.local


\<title\>Welcome to nginx!\</title\>


We may also run another nginx server:

$ kubectl run nginx-next --image=nginx --expose --port=80

The new server should be also working correctly:

$ kubectl run -i --tty test-c --image=busybox /bin/sh

If you don't see a command prompt, try pressing enter.

# wget -q -O- http://nginx-next.default.svc.cluster.local


\<title\>Welcome to nginx!\</title\>


Let’s now reset the broken replica:

$ gcloud compute instances start kubernetes-master-2d4 --zone=europe-west1-c

After a while, the replica should be re-attached to the cluster:

$ kubectl get nodes

NAME                      STATUS                AGE

kubernetes-master         Ready,SchedulingDisabled 57m

kubernetes-master-2d4     Ready,SchedulingDisabled 13m

kubernetes-master-85f     Ready,SchedulingDisabled 9m


Shutting down HA cluster

To shutdown the cluster, we will first shut down master replicas in zones D and E:

$ KUBE\_DELETE\_NODES=false KUBE\_GCE\_ZONE=europe-west1-c ./cluster/kube-down.sh

$ KUBE\_DELETE\_NODES=false KUBE\_GCE\_ZONE=europe-west1-d ./cluster/kube-down.sh

Note that the second removal of replica will take longer (~15 minutes), as we need to reassign the IP of the load balancer in front of replicas to the remaining master and wait for it to propagate (see design doc for more details).

Then, we will remove the additional worker nodes from zones europe-west1-c and europe-west1-d:

$ KUBE\_USE\_EXISTING\_MASTER=true KUBE\_GCE\_ZONE=europe-west1-c ./cluster/kube-down.sh

$ KUBE\_USE\_EXISTING\_MASTER=true KUBE\_GCE\_ZONE=europe-west1-d ./cluster/kube-down.sh

And finally, we will shutdown the remaining master with the last group of nodes (zone europe-west1-b):

$ KUBE\_GCE\_ZONE=europe-west1-b ./cluster/kube-down.sh


We have shown how, by adding worker node pools and master replicas, a Highly Available Kubernetes cluster can be created. As of Kubernetes version 1.5.2, it is supported in kube-up/kube-down scripts for GCE (as alpha). Additionally, there is a support for HA cluster on AWS in kops scripts (see this article for more details).

–Jerzy Szczepkowski, Software Engineer, Google

@Kubernetesio View on Github #kubernetes-users Stack Overflow Download Kubernetes