Kubernetes Blog

Migrating the Kubernetes blog

April 11 2018

We recently migrated the Kubernetes blog from the Blogger platform to GitHub. With the change in platform comes a change in URL: formerly at http://blog.kubernetes.io, the blog now resides at https://kubernetes.io/blog.

All existing posts redirect from their former URLs with rel=canonical link tags, preserving SEO value.

Why and how we migrated the blog

Our primary reasons for migrating were to streamline blog submissions and reviews, and to make the overall blog process faster and more transparent. Blogger’s web interface made it difficult to provide drafts to multiple reviewers without also granting unnecessary access permissions and compromising security. GitHub’s review process offered clear improvements.

We learned from Jim Brikman’s experience during his own site migration away from Blogger.

Our migration was broken into several pull requests, but you can see the work that went into the primary migration PR.

We hope that making blog submissions more accessible will encourage greater community involvement in creating and reviewing blog content.

How to Submit a Blog Post

You can submit a blog post for consideration one of two ways:

If you have a post that you want to remain confidential until your publish date, please submit your post via the Google form. Otherwise, you can open a pull request against the website repository; choose whichever process matches your comfort level and preferred workflow.

Note: Our workflow hasn’t changed for confidential advance drafts. Additionally, we’ll coordinate publishing for time sensitive posts to ensure that information isn’t released prematurely through an open pull request.

Call for reviewers

The Kubernetes blog needs more reviewers! If you’re interested in contributing to the Kubernetes project and can participate on a regular, weekly basis, send an introductory email to k8sblog@linuxfoundation.org.

Container Storage Interface (CSI) for Kubernetes Goes Beta

April 10 2018


The Kubernetes implementation of the Container Storage Interface (CSI) is now beta in Kubernetes v1.10. CSI was introduced as alpha in Kubernetes v1.9.

Kubernetes features are generally introduced as alpha and moved to beta (and eventually to stable/GA) over subsequent Kubernetes releases. This process allows Kubernetes developers to get feedback, discover and fix issues, iterate on the designs, and deliver high quality, production grade features.

Why introduce Container Storage Interface in Kubernetes?

Although Kubernetes already provides a powerful volume plugin system that makes it easy to consume different types of block and file storage, adding support for new volume plugins has been challenging. Because volume plugins are currently “in-tree”—volume plugins are part of the core Kubernetes code and shipped with the core Kubernetes binaries—vendors wanting to add support for their storage system to Kubernetes (or even fix a bug in an existing volume plugin) must align themselves with the Kubernetes release process.

With the adoption of the Container Storage Interface, the Kubernetes volume layer becomes truly extensible. Third party storage developers can now write and deploy volume plugins exposing new storage systems in Kubernetes without ever having to touch the core Kubernetes code. This will result in even more options for the storage that backs Kubernetes users’ stateful containerized workloads.

What’s new in Beta?

With the promotion to beta, CSI is now enabled by default on standard Kubernetes deployments instead of being opt-in.

The move of the Kubernetes implementation of CSI to beta also means:

  • Kubernetes is compatible with v0.2 of the CSI spec (instead of v0.1)
    • There were breaking changes between the CSI spec v0.1 and v0.2, so existing CSI drivers must be updated to be 0.2 compatible before use with Kubernetes 1.10.0+.
  • Mount propagation, a feature that allows bidirectional mounts between containers and host (a requirement for containerized CSI drivers), has also moved to beta.
  • The Kubernetes VolumeAttachment object, introduced in v1.9 in the storage v1alpha1 group, has been added to the storage v1beta1 group.
  • The Kubernetes CSIPersistentVolumeSource object has been promoted to beta. A VolumeAttributes field was added to the Kubernetes CSIPersistentVolumeSource object (in alpha this information was passed via annotations).
  • Node authorizer has been updated to limit access to VolumeAttachment objects from kubelet.
  • The Kubernetes CSIPersistentVolumeSource object and the CSI external-provisioner have been modified to allow passing of secrets to the CSI volume plugin.
  • The Kubernetes CSIPersistentVolumeSource has been modified to allow passing in a filesystem type (previously always assumed to be ext4).
  • A new optional call, NodeStageVolume, has been added to the CSI spec, and the Kubernetes CSI volume plugin has been modified to call NodeStageVolume during MountDevice (in alpha this step was a no-op).

How do I deploy a CSI driver on a Kubernetes Cluster?

CSI plugin authors must provide their own instructions for deploying their plugin on Kubernetes.

The Kubernetes-CSI implementation team created a sample hostpath CSI driver. The sample provides a rough idea of what the deployment process for a CSI driver looks like. Production drivers, however, would deploy node components via a DaemonSet and controller components via a StatefulSet rather than a single pod (for example, see the deployment files for the GCE PD driver).

How do I use a CSI Volume in my Kubernetes pod?

Assuming a CSI storage plugin is already deployed on your cluster, you can use it through the familiar Kubernetes storage primitives: PersistentVolumeClaims, PersistentVolumes, and StorageClasses.

CSI is a beta feature in Kubernetes v1.10. Although it is enabled by default, it may require the following flag (a minimal sketch of setting it appears after this list):

  • API server binary and kubelet binaries:
    • --allow-privileged=true
      • Most CSI plugins will require bidirectional mount propagation, which can only be enabled for privileged pods. Privileged pods are only permitted on clusters where this flag has been set to true (this is the default in some environments like GCE, GKE, and kubeadm).
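
As a rough illustration (not an official configuration), the flag might be set on a static-pod API server as shown below; the image tag, file location, and remaining flags are placeholders, and kubelet flags are typically set in its service configuration instead:

apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    image: k8s.gcr.io/kube-apiserver-amd64:v1.10.0   # illustrative image tag
    command:
    - kube-apiserver
    - --allow-privileged=true   # allows the privileged pods most CSI plugins need
    # ...all other API server flags omitted for brevity...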

Dynamic Provisioning

You can enable automatic creation/deletion of volumes for CSI Storage plugins that support dynamic provisioning by creating a StorageClass pointing to the CSI plugin.

The following StorageClass, for example, enables dynamic creation of “fast-storage” volumes by a CSI volume plugin called “com.example.csi-driver”.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fast-storage
provisioner: com.example.csi-driver
parameters:
  type: pd-ssd
  csiProvisionerSecretName: mysecret
  csiProvisionerSecretNamespace: mynamespace

New for beta, the default CSI external-provisioner reserves the parameter keys csiProvisionerSecretName and csiProvisionerSecretNamespace. If specified, it fetches the secret and passes it to the CSI driver during provisioning.

Dynamic provisioning is triggered by the creation of a PersistentVolumeClaim object. The following PersistentVolumeClaim, for example, triggers dynamic provisioning using the StorageClass above.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-request-for-storage
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: fast-storage

When volume provisioning is invoked, the parameter type: pd-ssd and any referenced secrets are passed to the CSI plugin com.example.csi-driver via a CreateVolume call. In response, the external volume plugin provisions a new volume and then automatically creates a PersistentVolume object to represent the new volume. Kubernetes then binds the new PersistentVolume object to the PersistentVolumeClaim, making it ready to use.

If the fast-storage StorageClass is marked as “default”, you can omit storageClassName from the PersistentVolumeClaim; the default StorageClass will be used automatically.
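
For reference, a StorageClass is typically marked as the cluster default with the storageclass.kubernetes.io/is-default-class annotation. A minimal sketch, reusing the example class above (only the annotation is new):

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fast-storage
  annotations:
    # Makes this class the default for PersistentVolumeClaims that omit storageClassName
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: com.example.csi-driver
parameters:
  type: pd-ssd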

Pre-Provisioned Volumes

You can always expose a pre-existing volume in Kubernetes by manually creating a PersistentVolume object to represent the existing volume. The following PersistentVolume, for example, exposes a volume with the name “existingVolumeName” belonging to a CSI storage plugin called “com.example.csi-driver”.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-manually-created-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: com.example.csi-driver
    volumeHandle: existingVolumeName
    readOnly: false
    fsType: ext4
    volumeAttributes:
      foo: bar
    controllerPublishSecretRef:
      name: mysecret1
      namespace: mynamespace
    nodeStageSecretRef:
      name: mysecret2
      namespace: mynamespace
    nodePublishSecretRef:
      name: mysecret3
      namespace: mynamespace

Attaching and Mounting

You can reference a PersistentVolumeClaim that is bound to a CSI volume in any pod or pod template.

kind: Pod
apiVersion: v1
metadata:
  name: my-pod
spec:
  containers:
    - name: my-frontend
      image: nginx
      volumeMounts:
      - mountPath: "/var/www/html"
        name: my-csi-volume
  volumes:
    - name: my-csi-volume
      persistentVolumeClaim:
        claimName: my-request-for-storage

When the pod referencing a CSI volume is scheduled, Kubernetes will trigger the appropriate operations against the external CSI plugin (ControllerPublishVolume, NodeStageVolume, NodePublishVolume, etc.) to ensure the specified volume is attached, mounted, and ready to use by the containers in the pod.

For more details please see the CSI implementation design doc and documentation.

How do I write a CSI driver?

CSI Volume Driver deployments on Kubernetes must meet some minimum requirements.

The minimum requirements document also outlines the suggested mechanism for deploying an arbitrary containerized CSI driver on Kubernetes. This mechanism can be used by a Storage Provider to simplify deployment of containerized CSI compatible volume drivers on Kubernetes.

As part of the suggested deployment process, the Kubernetes team provides the following sidecar (helper) containers:

  • external-attacher
    • watches Kubernetes VolumeAttachment objects and triggers ControllerPublish and ControllerUnpublish operations against a CSI endpoint
  • external-provisioner
    • watches Kubernetes PersistentVolumeClaim objects and triggers CreateVolume and DeleteVolume operations against a CSI endpoint
  • driver-registrar
    • registers the CSI driver with kubelet (in the future) and adds the driver’s custom NodeId (retrieved via a GetNodeID call against the CSI endpoint) to an annotation on the Kubernetes Node API object
  • livenessprobe
    • monitors the health of the CSI driver and exposes it to Kubernetes as a liveness probe for the driver pod

Storage vendors can build Kubernetes deployments for their plugins using these components, while leaving their CSI driver completely unaware of Kubernetes.
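
To make the pattern concrete, here is a rough, hypothetical sketch of a controller-side StatefulSet that pairs the external-provisioner and external-attacher sidecars with a vendor’s CSI driver container over a shared UNIX domain socket. The image names, flags, socket path, and ServiceAccount are illustrative placeholders, not official manifests:

kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: example-csi-controller
spec:
  serviceName: example-csi-controller
  replicas: 1
  selector:
    matchLabels:
      app: example-csi-controller
  template:
    metadata:
      labels:
        app: example-csi-controller
    spec:
      serviceAccountName: example-csi-sa        # needs RBAC access to PVs, PVCs, and VolumeAttachments
      containers:
      - name: external-provisioner
        image: example.com/csi-external-provisioner:latest   # placeholder image
        args:
        - "--provisioner=com.example.csi-driver"
        - "--csi-address=/csi/csi.sock"
        volumeMounts:
        - name: socket-dir
          mountPath: /csi
      - name: external-attacher
        image: example.com/csi-external-attacher:latest      # placeholder image
        args:
        - "--csi-address=/csi/csi.sock"
        volumeMounts:
        - name: socket-dir
          mountPath: /csi
      - name: example-csi-driver
        image: example.com/example-csi-driver:latest          # the vendor's driver, unaware of Kubernetes
        args:
        - "--endpoint=unix:///csi/csi.sock"
        volumeMounts:
        - name: socket-dir
          mountPath: /csi
      volumes:
      - name: socket-dir
        emptyDir: {}

Node components (the driver plus driver-registrar) would be deployed analogously via a DaemonSet on every node, as noted for the hostpath sample above.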

Where can I find CSI drivers?

CSI drivers are developed and maintained by third parties. You can find a non-definitive list of some sample and production CSI drivers.

What about FlexVolumes?

As mentioned in the alpha release blog post, the FlexVolume plugin was an earlier attempt to make the Kubernetes volume plugin system extensible. Although it enables third-party storage vendors to write drivers “out-of-tree”, because it is an exec-based API, FlexVolume requires third-party driver binaries (or scripts) to be copied to a special plugin directory on the root filesystem of every node (and, in some cases, master) machine. This requires a cluster admin to have write access to the host filesystem for each node and some external mechanism to ensure that the driver file is recreated if deleted, just to deploy a volume plugin.

In addition to being difficult to deploy, Flex did not address the pain of plugin dependencies: Volume plugins tend to have many external requirements (on mount and filesystem tools, for example). These dependencies are assumed to be available on the underlying host OS, which is often not the case.

CSI addresses these issues by not only enabling storage plugins to be developed out-of-tree, but also containerized and deployed via standard Kubernetes primitives.

If you still have questions about in-tree volumes vs CSI vs Flex, please see the Volume Plugin FAQ.

What will happen to the in-tree volume plugins?

Once CSI reaches stability, we plan to migrate most of the in-tree volume plugins to CSI. Stay tuned for more details as the Kubernetes CSI implementation approaches stable.

What are the limitations of beta?

The beta implementation of CSI has the following limitations:

  • Block volumes are not supported; only file.
  • CSI drivers must be deployed with the provided external-attacher sidecar plugin, even if they don’t implement ControllerPublishVolume.
  • Topology awareness is not supported for CSI volumes, including the ability to share information about where a volume is provisioned (zone, regions, etc.) with the Kubernetes scheduler to allow it to make smarter scheduling decisions, and the ability for the Kubernetes scheduler or a cluster administrator or an application developer to specify where a volume should be provisioned.
  • driver-registrar requires permissions to modify all Kubernetes node API objects, which could result in a compromised node gaining the ability to do the same.

What’s next?

Depending on feedback and adoption, the Kubernetes team plans to push the CSI implementation to GA in 1.12.

The team would like to encourage storage vendors to start developing CSI drivers, deploying them on Kubernetes, and sharing feedback with the team via the Kubernetes Slack channel wg-csi, the Google group kubernetes-sig-storage-wg-csi, or any of the standard SIG storage communication channels.

How do I get involved?

This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together.

In addition to the contributors who have been working on the Kubernetes implementation of CSI since alpha:

We offer a huge thank you to the new contributors who stepped up this quarter to help the project reach beta:

If you’re interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We’re rapidly growing and always welcome new contributors.

Fixing the Subpath Volume Vulnerability in Kubernetes

April 04 2018

On March 12, 2018, the Kubernetes Product Security team disclosed CVE-2017-1002101, which allowed containers using subpath volume mounts to access files outside of the volume. This means that a container could access any file available on the host, including volumes for other containers that it should not have access to.

The vulnerability has been fixed and released in the latest Kubernetes patch releases. We recommend that all users upgrade to get the fix. For more details on the impact and how to get the fix, please see the announcement. (Note, some functional regressions were found after the initial fix and are being tracked in issue #61563).

This post presents a technical deep dive on the vulnerability and the solution.

Kubernetes Background

To understand the vulnerability, one must first understand how volume and subpath mounting works in Kubernetes.

Before a container is started on a node, the kubelet volume manager locally mounts all the volumes specified in the PodSpec under a directory for that Pod on the host system. Once all the volumes are successfully mounted, it constructs the list of volume mounts to pass to the container runtime. Each volume mount contains information that the container runtime needs, the most relevant being:

  • Path of the volume in the container
  • Path of the volume on the host (/var/lib/kubelet/pods/<pod uid>/volumes/<volume type>/<volume name>)

When starting the container, the container runtime creates the path in the container root filesystem, if necessary, and then bind mounts it to the provided host path.

Subpath mounts are passed to the container runtime just like any other volume. The container runtime does not distinguish between a base volume and a subpath volume, and handles them the same way. Instead of passing the host path to the root of the volume, Kubernetes constructs the host path by appending the Pod-specified subpath (a relative path) to the base volume’s host path.

For example, here is a spec for a subpath volume mount:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    <snip>
    volumeMounts:
    - mountPath: /mnt/data
      name: my-volume
      subPath: dataset1
  volumes:
  - name: my-volume
    emptyDir: {}

In this example, when the Pod gets scheduled to a node, the system will:

  • Set up an EmptyDir volume at /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume
  • Construct the host path for the subpath mount: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/ + dataset1
  • Pass the following mount information to the container runtime:
    • Container path: /mnt/data
    • Host path: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/dataset1
  • The container runtime bind mounts /mnt/data in the container root filesystem to /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/dataset1 on the host.
  • The container runtime starts the container.

The Vulnerability

The vulnerability with subpath volumes was discovered by Maxim Ivanov, by making a few observations:

  • Subpath references files or directories that are controlled by the user, not the system.
  • Volumes can be shared by containers that are brought up at different times in the Pod lifecycle, including by different Pods.
  • Kubernetes passes host paths to the container runtime to bind mount into the container.

The basic example below demonstrates the vulnerability. It takes advantage of the observations outlined above by:

  • Using an init container to set up the volume with a symlink.
  • Using a regular container to mount that symlink as a subpath later.
  • Causing kubelet to evaluate the symlink on the host before passing it into the container runtime.
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  initContainers:
  - name: prep-symlink
    image: "busybox"
    command: ["bin/sh", "-ec", "ln -s / /mnt/data/symlink-door"]
    volumeMounts:
    - name: my-volume
      mountPath: /mnt/data
  containers:
  - name: my-container
    image: "busybox"
    command: ["/bin/sh", "-ec", "ls /mnt/data; sleep 999999"]
    volumeMounts:
    - mountPath: /mnt/data
      name: my-volume
      subPath: symlink-door
  volumes:
  - name: my-volume
    emptyDir: {}

For this example, the system will:

  • Set up an EmptyDir volume at /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume
  • Pass the following mount information for the init container to the container runtime:
    • Container path: /mnt/data
    • Host path: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume
  • The container runtime bind mounts /mnt/data in the container root filesystem to /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume on the host.
  • The container runtime starts the init container.
  • The init container creates a symlink inside the container: /mnt/data/symlink-door -> /, and then exits.
  • Kubelet starts to prepare the volume mounts for the normal containers.
  • It constructs the host path for the subpath volume mount: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/ + symlink-door.
  • And passes the following mount information to the container runtime:
    • Container path: /mnt/data
    • Host path: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/symlink-door
  • The container runtime bind mounts /mnt/data in the container root filesystem to /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/symlink-door.
  • However, the bind mount resolves symlinks, which in this case, resolves to / on the host! Now the container can see all of the host’s filesystem through its mount point /mnt/data.

This is a manifestation of a symlink race, where a malicious user program can gain access to sensitive data by causing a privileged program (in this case, kubelet) to follow a user-created symlink.

It should be noted that init containers are not always required for this exploit; it depends on the volume type. An init container is used in the EmptyDir example because EmptyDir volumes cannot be shared with other Pods: they are created only when a Pod is created and destroyed when the Pod is destroyed. For persistent volume types, this exploit can also be carried out across two different Pods sharing the same volume, as sketched below.
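
For illustration, here is a rough sketch of that cross-Pod variant, assuming a hypothetical pre-existing PersistentVolumeClaim named shared-data that both Pods can mount (all names are placeholders). The first Pod plants the symlink; the second mounts it as a subPath:

apiVersion: v1
kind: Pod
metadata:
  name: symlink-writer
spec:
  containers:
  - name: writer
    image: "busybox"
    command: ["/bin/sh", "-ec", "ln -s / /mnt/data/symlink-door; sleep 999999"]
    volumeMounts:
    - name: shared-volume
      mountPath: /mnt/data
  volumes:
  - name: shared-volume
    persistentVolumeClaim:
      claimName: shared-data   # hypothetical shared PVC
---
apiVersion: v1
kind: Pod
metadata:
  name: symlink-victim
spec:
  containers:
  - name: reader
    image: "busybox"
    command: ["/bin/sh", "-ec", "ls /mnt/data; sleep 999999"]
    volumeMounts:
    - mountPath: /mnt/data
      name: shared-volume
      subPath: symlink-door    # resolves to / on the host on vulnerable versions
  volumes:
  - name: shared-volume
    persistentVolumeClaim:
      claimName: shared-data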

The Fix

The underlying issue is that the host path for a subpath is untrusted and can point anywhere in the system. The fix needs to ensure that this host path is both:

  • Resolved and validated to point inside the base volume.
  • Not changeable by the user in between the time of validation and when the container runtime bind mounts it.

The Kubernetes product security team went through many iterations of possible solutions before finally agreeing on a design.

Idea 1

Our first design was relatively simple. For each subpath mount in each container:

  • Resolve all the symlinks for the subpath.
  • Validate that the resolved path is within the volume.
  • Pass the resolved path to the container runtime.

However, this design is prone to the classic time-of-check-to-time-of-use (TOCTTOU) problem. In between steps 2) and 3), the user could change the path back to a symlink. The proper solution needs some way to “lock” the path so that it cannot be changed in between validation and bind mounting by the container runtime. All the subsequent ideas use an intermediate bind mount by kubelet to achieve this “lock” step before handing it off to the container runtime. Once a bind mount is performed, the mount source is fixed and cannot be changed.

Idea 2

We went a bit wild with this idea:

  • Create a working directory under the kubelet’s pod directory. Let’s call it dir1.
  • Bind mount the base volume to under the working directory, dir1/volume.
  • Chroot to the working directory dir1.
  • Inside the chroot, bind mount volume/subpath to subpath. This ensures that any symlinks get resolved to inside the chroot environment.
  • Exit the chroot.
  • On the host again, pass the bind mounted dir1/subpath to the container runtime.

While this design does ensure that the symlinks cannot point outside of the volume, it was ultimately rejected due to difficulties of implementing the chroot mechanism in 4) across all the various distros and environments that Kubernetes has to support, including containerized kubelets.

Idea 3

Coming back to earth a little bit, our next idea was to:

  • Bind mount the subpath to a working directory under the kubelet’s pod directory.
  • Get the source of the bind mount, and validate that it is within the base volume.
  • Pass the bind mount to the container runtime.

In theory, this sounded pretty simple, but in reality, 2) was quite difficult to implement correctly. Many scenarios had to be handled where volumes (like EmptyDir) could be on a shared filesystem, on a separate filesystem, on the root filesystem, or not on the root filesystem. NFS volumes ended up handling all bind mounts as a separate mount, instead of as a child to the base volume. There was additional uncertainty about how out-of-tree volume types (that we couldn’t test) would behave.

The Solution

Given the amount of scenarios and corner cases that had to be handled with the previous design, we really wanted to find a solution that was more generic across all volume types. The final design that we ultimately went with was to:

  • Resolve all the symlinks in the subpath.
  • Starting with the base volume, open each path segment one by one, using the openat() syscall, and disallow symlinks. With each path segment, validate that the current path is within the base volume.
  • Bind mount /proc/<kubelet pid>/fd/<final fd> to a working directory under the kubelet’s pod directory. The proc file is a link to the opened file. If that file gets replaced while kubelet still has it open, then the link will still point to the original file.
  • Close the fd and pass the bind mount to the container runtime.

Note that this solution is different for Windows hosts, where the mounting semantics are different than Linux. In Windows, the design is to:

  • Resolve all the symlinks in the subpath.
  • Starting with the base volume, open each path segment one by one with a file lock, and disallow symlinks. With each path segment, validate that the current path is within the base volume.
  • Pass the resolved subpath to the container runtime, and start the container.
  • After the container has started, unlock and close all the files.

Both solutions are able to address all the requirements of:

  • Resolving the subpath and validating that it points to a path inside the base volume.
  • Ensuring that the subpath host path cannot be changed in between the time of validation and when the container runtime bind mounts it.
  • Being generic enough to support all volume types.

Acknowledgements

Special thanks to many folks involved with handling this vulnerability:

  • Maxim Ivanov, who responsibly disclosed the vulnerability to the Kubernetes Product Security team.
  • Kubernetes storage and security engineers from Google, Microsoft, and RedHat, who developed, tested, and reviewed the fixes.
  • Kubernetes test-infra team, for setting up the private build infrastructure.
  • Kubernetes patch release managers, for coordinating and handling all the releases.
  • All the production release teams that worked to deploy the fix quickly after release.

If you find a vulnerability in Kubernetes, please follow our responsible disclosure process and let us know; we want to do our best to make Kubernetes secure for all users.

– Michelle Au, Software Engineer, Google; and Jan Šafránek, Software Engineer, Red Hat

Kubernetes 1.10: Stabilizing Storage, Security, and Networking

March 27 2018

Editor’s note: today’s post is by the 1.10 Release Team

We’re pleased to announce the delivery of Kubernetes 1.10, our first release of 2018!

Today’s release continues to advance maturity, extensibility, and pluggability of Kubernetes. This newest version stabilizes features in 3 key areas, including storage, security, and networking. Notable additions in this release include the introduction of external kubectl credential providers (alpha), the ability to switch DNS service to CoreDNS at install time (beta), and the move of Container Storage Interface (CSI) and persistent local volumes to beta.

Let’s dive into the key features of this release:

Storage - CSI and Local Storage move to beta

This is an impactful release for the Storage Special Interest Group (SIG), marking the culmination of their work on multiple features. The Kubernetes implementation of the Container Storage Interface (CSI) moves to beta in this release: installing new volume plugins is now as easy as deploying a pod. This in turn enables third-party storage providers to develop their solutions independently outside of the core Kubernetes codebase. This continues the thread of extensibility within the Kubernetes ecosystem.

Durable (non-shared) local storage management progressed to beta in this release, making locally attached (non-network attached) storage available as a persistent volume source. This means higher performance and lower cost for distributed file systems and databases.
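
As a rough sketch of what this looks like (names and paths are placeholders), a local PersistentVolume binds a path on a specific node to the PersistentVolume API, with node affinity telling the scheduler where the data lives:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-local-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage        # hypothetical StorageClass for local volumes
  local:
    path: /mnt/disks/ssd1                # locally attached disk on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - example-node-1               # the node where the disk is attached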

This release also includes many updates to Persistent Volumes. Kubernetes can automatically prevent deletion of Persistent Volume Claims that are in use by a pod (beta) and prevent deletion of a Persistent Volume that is bound to a Persistent Volume Claim (beta). This helps ensure that storage API objects are deleted in the correct order.

Security - External credential providers (alpha)

Kubernetes, which is already highly extensible, gains another extension point in 1.10 with external kubectl credential providers (alpha). Cloud providers, vendors, and other platform developers can now release binary plugins to handle authentication for specific cloud-provider IAM services, or that integrate with in-house authentication systems that aren’t supported in-tree, such as Active Directory. This complements the Cloud Controller Manager feature added in 1.9.

Networking - CoreDNS as a DNS provider (beta)

The ability to switch the DNS service to CoreDNS at install time is now in beta. CoreDNS has fewer moving parts: it’s a single executable and a single process, and supports additional use cases.

Each Special Interest Group (SIG) within the community continues to deliver the most-requested enhancements, fixes, and functionality for their respective specialty areas. For a complete list of inclusions by SIG, please visit the release notes.

Availability

Kubernetes 1.10 is available for download on GitHub. To get started with Kubernetes, check out these interactive tutorials.

2 Day Features Blog Series

If you’re interested in exploring these features more in depth, check back next week for our 2 Days of Kubernetes series where we’ll highlight detailed walkthroughs of the following features:

Day 1 - Container Storage Interface (CSI) for Kubernetes going Beta
Day 2 - Local Persistent Volumes for Kubernetes going Beta

Release team

This release is made possible through the effort of hundreds of individuals who contributed both technical and non-technical content. Special thanks to the release team led by Jaice Singer DuMars, Kubernetes Ambassador for Microsoft. The 10 individuals on the release team coordinate many aspects of the release, from documentation to testing, validation, and feature completeness.

As the Kubernetes community has grown, our release process represents an amazing demonstration of collaboration in open source software development. Kubernetes continues to gain new users at a rapid clip. This growth creates a positive feedback cycle where more contributors commit code creating a more vibrant ecosystem.

Project Velocity

The CNCF has continued refining an ambitious project to visualize the myriad contributions that go into the project. K8s DevStats illustrates the breakdown of contributions from major company contributors, as well as an impressive set of preconfigured reports on everything from individual contributors to pull request lifecycle times. Thanks to increased automation, issue count at the end of the release was only slightly higher than it was at the beginning. This marks a major shift toward issue manageability. With 75,000+ comments, Kubernetes remains one of the most actively discussed projects on GitHub.

User Highlights

According to a recent CNCF survey, more than 49% of Asia-based respondents use Kubernetes in production, with another 49% evaluating it for use in production. Established, global organizations are using Kubernetes in production at massive scale. Recently published user stories from the community include:

  1. Huawei, the largest telecommunications equipment manufacturer in the world, moved its internal IT department’s applications to run on Kubernetes. This decreased its global deployment cycles from a week to minutes and improved the efficiency of application delivery tenfold.
  2. Jinjiang Travel International, one of the top 5 largest OTA and hotel companies, uses Kubernetes to speed up their software release velocity from hours to just minutes. Additionally, they leverage Kubernetes to increase the scalability and availability of their online workloads.
  3. Haufe Group, the Germany-based media and software company, utilized Kubernetes to deliver a new release in half an hour instead of days. The company is also able to scale down to around half the capacity at night, saving 30 percent on hardware costs.
  4. BlackRock, the world’s largest asset manager, was able to move quickly using Kubernetes and built an investor research web app from inception to delivery in under 100 days.

Is Kubernetes helping your team? Share your story with the community.

Ecosystem Updates

  1. The CNCF is expanding its certification offerings to include a Certified Kubernetes Application Developer exam. The CKAD exam certifies an individual’s ability to design, build, configure, and expose cloud native applications for Kubernetes. The CNCF is looking for beta testers for this new program. More information can be found here.
  2. Kubernetes documentation now features user journeys: specific pathways for learning based on who readers are and what readers want to do. Learning Kubernetes is easier than ever for beginners, and more experienced users can find task journeys specific to cluster admins and application developers.
  3. CNCF also offers online training that teaches the skills needed to create and configure a real-world Kubernetes cluster.

KubeCon

The world’s largest Kubernetes gathering, KubeCon + CloudNativeCon is coming to Copenhagen from May 2-4, 2018 and will feature technical sessions, case studies, developer deep dives, salons and more! Check out the schedule of speakers and register today!

Webinar

Join members of the Kubernetes 1.10 release team on April 10th at 10am PDT to learn about the major features in this release including Local Persistent Volumes and the Container Storage Interface (CSI). Register here.

Get Involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below.

Thank you for your continued feedback and support.

  1. Post questions (or answer questions) on Stack Overflow
  2. Join the community portal for advocates on K8sPort
  3. Follow us on Twitter @Kubernetesio for latest updates
  4. Chat with the community on Slack
  5. Share your Kubernetes story.

Principles of Container-based Application Design

March 15 2018

It’s possible nowadays to put almost any application in a container and run it. Creating cloud-native applications, however—containerized applications that are automated and orchestrated effectively by a cloud-native platform such as Kubernetes—requires additional effort. Cloud-native applications anticipate failure; they run and scale reliably even when their infrastructure experiences outages. To offer such capabilities, cloud-native platforms like Kubernetes impose a set of contracts and constraints on applications. These contracts ensure that the applications running on the platform conform to certain constraints, which in turn allows the platform to automate application management.

I’ve outlined seven principles for containerized applications to follow in order to be fully cloud-native.

Container Design Principles

These seven principles cover both build time and runtime concerns.

Build time

  • Single Concern: Each container addresses a single concern and does it well.
  • Self-Containment: A container relies only on the presence of the Linux kernel. Additional libraries are added when the container is built.
  • Image Immutability: Containerized applications are meant to be immutable, and once built are not expected to change between different environments.

Runtime

  • High Observability: Every container must implement all necessary APIs to help the platform observe and manage the application in the best way possible.
  • Lifecycle Conformance: A container must have a way to read events coming from the platform and conform by reacting to those events.
  • Process Disposability: Containerized applications must be as ephemeral as possible and ready to be replaced by another container instance at any point in time.
  • Runtime Confinement: Every container must declare its resource requirements and restrict resource use to the requirements indicated.

The build time principles ensure that containers have the right granularity, consistency, and structure in place. The runtime principles dictate what functionalities must be implemented in order for containerized applications to possess cloud-native function. Adhering to these principles helps ensure that your applications are suitable for automation in Kubernetes.

The white paper is freely available for download:

To read more about designing cloud-native applications for Kubernetes, check out my Kubernetes Patterns book.

Bilgin Ibryam, Principal Architect, Red Hat

Blog: http://www.ofbizian.com

Bilgin Ibryam (@bibryam) is a principal architect at Red Hat, open source committer at ASF, blogger, author, and speaker. He is the author of Camel Design Patterns and Kubernetes Patterns books. In his day-to-day job, Bilgin enjoys mentoring, training and leading teams to be successful with distributed systems, microservices, containers, and cloud-native applications in general.

Expanding User Support with Office Hours

March 14 2018

Today’s post is by Jorge Castro and Ilya Dmitrichenko on Kubernetes office hours.

Today’s developer has an almost overwhelming amount of resources available for learning. Kubernetes development teams use StackOverflow, user documentation, Slack, and the mailing lists. Additionally, the community itself continues to amass an awesome list of resources.

One of the challenges of large projects is keeping user resources relevant and useful. While documentation can be useful, great learning also happens in Q&A sessions at conferences, or by learning with someone whose explanation matches your learning style. Consider that learning Kung Fu from Morpheus would be a lot more fun than reading a book about Kung Fu!

We as Kubernetes developers want to create an interactive experience: where Kubernetes users can get their questions answered by experts in real time, or at least referred to the best known documentation or code example.

Having discussed a few broad ideas, we eventually decided to make Kubernetes Office Hours a live stream where we take user questions from the audience and present them to our panel of contributors and expert users. We run two sessions: one for European time zones, and one for the Americas. These streaming setup guidelines make office hours extensible—for example, if someone wants to run office hours for Asia/Pacific timezones, or for another CNCF project.

To give you an idea of what Kubernetes office hours are like, here’s Josh Berkus answering a question on running databases on Kubernetes. Despite the popularity of this topic, it’s still difficult for a new user to get a constructive answer. Here’s an excellent response from Josh:

It’s often easier to field this kind of question in office hours than it is to ask a developer to write a full-length blog post. [Editor’s note: That’s legit!] Because we don’t have infinite developers with infinite time, this kind of focused communication creates high-bandwidth help while limiting developer commitments to 1 hour per month. This allows a rotating set of experts to share the load without overwhelming any one person.

We hold office hours the third Wednesday of every month on the Kubernetes YouTube Channel. You can post questions on the #office-hours channel on Slack, or you can submit your question to Stack Overflow and post a link on Slack. If you post a question in advance, you might get better answers, as volunteers have more time to research and prepare. If a question can’t be fully solved during the call, the team will try their best to point you in the right direction and/or ping other people in the community to take a look. Check out this page for more details on what’s off- and on topic as well as meeting information for your time zone. We hope to hear your questions soon!

Special thanks to Amazon, Bitnami, Giant Swarm, Heptio, Liquidweb, Northwestern Mutual, Packet.net, Pivotal, Red Hat, Weaveworks, and VMWare for donating engineering time to office hours.

And thanks to Alan Pope, Joe Beda, and Charles Butler for technical support in making our livestream better.

How to Integrate RollingUpdate Strategy for TPR in Kubernetes

March 13 2018

With Kubernetes, it’s easy to manage and scale stateless applications like web apps and API services right out of the box. To date, almost all of the talk about Kubernetes has been about microservices and stateless applications.

With the popularity of container-based microservice architectures, there is a strong need to deploy and manage relational database management systems (RDBMSs). Managing an RDBMS requires deep, database-specific knowledge to correctly scale, upgrade, and reconfigure it while protecting against data loss or unavailability.

For example, MySQL (the most popular open source RDBMS) needs to store data in files that are persistent and exclusive to each MySQL instance. Each instance must be individually distinct, and in a cluster it gets more complex still: each instance must be distinguished by its role, such as master, slave, or shard. High availability and zero data loss are also hard to accomplish when replacing database nodes on failed machines.

Using powerful Kubernetes API extension mechanisms, we can encode RDBMS domain knowledge into software, named WQ-RDS, that runs atop Kubernetes and behaves like a built-in resource.

WQ-RDS leverages Kubernetes primitive resources and controllers to deliver a number of enterprise-grade features and a significantly more reliable way to automate time-consuming operational tasks like database setup, patching, backups, and setting up high-availability clusters. WQ-RDS supports mainstream versions of Oracle and MySQL (both compatible with MariaDB).

Let’s demonstrate how to manage a MySQL sharding cluster.

MySQL Sharding Cluster

MySQL Sharding Cluster is a scale-out database architecture. Based on a hash algorithm, the architecture distributes data across all the shards of the cluster. Sharding is entirely transparent to clients: the proxy can connect to any shard in the cluster and issue queries directly to the correct shards.

Note: Each shard corresponds to a single MySQL instance. Currently, WQ-RDS supports a maximum of 64 shards.

All of the shards are built from Kubernetes StatefulSets, Services, StorageClasses, ConfigMaps, Secrets, and MySQL; a rough sketch of a single shard follows the list below. WQ-RDS manages the entire lifecycle of the sharding cluster. The advantages of the sharding cluster are obvious:

  • Scale out queries per second (QPS) and transactions per second (TPS)
  • Scale out storage capacity: gain more storage by distributing data to multiple nodes
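
WQ-RDS’s own manifests are not shown here, but a single shard built from these primitives might look roughly like the following sketch (names, image, and sizes are hypothetical):

apiVersion: v1
kind: Service
metadata:
  name: clustershard-c0
spec:
  clusterIP: None            # headless Service gives the shard a stable DNS name
  selector:
    app: clustershard-c0
  ports:
  - name: mysql
    port: 3306
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: clustershard-c0
spec:
  serviceName: clustershard-c0
  replicas: 1
  selector:
    matchLabels:
      app: clustershard-c0
  template:
    metadata:
      labels:
        app: clustershard-c0
    spec:
      containers:
      - name: mysql
        image: mysql:5.7                       # illustrative image
        envFrom:
        - configMapRef:
            name: clustershard-c0-config       # hypothetical ConfigMap with MySQL settings
        - secretRef:
            name: clustershard-c0-secret       # hypothetical Secret with credentials (e.g. MYSQL_ROOT_PASSWORD)
        ports:
        - containerPort: 3306
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: fast-storage           # hypothetical StorageClass
      resources:
        requests:
          storage: 50Gi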

Create a MySQL Sharding Cluster

Let’s create a MySQL sharding cluster with 8 shards.

 kubectl create -f mysqlshardingcluster.yaml

This creates a MySQL Sharding Cluster of 8 shards, represented by two TPRs: MysqlCluster and MysqlDatabase.
[root@k8s-master ~]# kubectl get mysqlcluster
NAME             KIND
clustershard-c   MysqlCluster.v1.mysql.orain.com

The MysqlDatabase objects clustershard-c0 through clustershard-c7 belong to the MysqlCluster clustershard-c.

[root@k8s-master ~]# kubectl get mysqldatabase
NAME              KIND
clustershard-c0   MysqlDatabase.v1.mysql.orain.com
clustershard-c1   MysqlDatabase.v1.mysql.orain.com
clustershard-c2   MysqlDatabase.v1.mysql.orain.com
clustershard-c3   MysqlDatabase.v1.mysql.orain.com
clustershard-c4   MysqlDatabase.v1.mysql.orain.com
clustershard-c5   MysqlDatabase.v1.mysql.orain.com
clustershard-c6   MysqlDatabase.v1.mysql.orain.com
clustershard-c7   MysqlDatabase.v1.mysql.orain.com

Next, let’s look at two main features: high availability and RollingUpdate strategy.

To demonstrate, we’ll start by running sysbench to generate some load on the cluster. In this example, QPS metrics are generated by the MySQL exporter, collected by Prometheus, and visualized in Grafana.

Feature: high availability

WQ-RDS handles MySQL instance crashes while protecting against data loss.

When clustershard-c0 is killed, WQ-RDS detects that it is unavailable and replaces it on the failed machine, taking about 35 seconds on average, with zero data loss.

Feature: RollingUpdate Strategy

MySQL Sharding Cluster brings us not only strong scalability but also some level of maintenance complexity. For example, when updating a MySQL configuration like innodb_buffer_pool_size, a DBA has to perform a number of steps:

1. Schedule a time to apply the change.
2. Disable client access to the database proxies.
3. Start a rolling upgrade.

Rolling upgrades need to proceed in order and are the most demanding step of the process: the upgrade cannot continue until the previously updated MySQL instances are running and ready.

4. Verify the cluster.
5. Re-enable client access to the database proxies.

Possible problems with a rolling upgrade include:

  • node reboots
  • MySQL instance restarts
  • human error

Instead, WQ-RDS enables a DBA to perform rolling upgrades automatically.

StatefulSet RollingUpdate in Kubernetes

Kubernetes 1.7 includes a major feature that adds automated updates to StatefulSets and supports a range of update strategies, including rolling updates; a minimal sketch of the relevant StatefulSet fields follows the note below.

Note: For more information about StatefulSet RollingUpdate, see the Kubernetes docs.
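
For reference, this is roughly what the StatefulSet update strategy looks like (all names are placeholders): the RollingUpdate strategy with an optional partition that holds back Pods whose ordinal is below it.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-statefulset
spec:
  serviceName: example
  replicas: 8
  selector:
    matchLabels:
      app: example
  updateStrategy:
    type: RollingUpdate          # Pods are updated one at a time, in reverse ordinal order
    rollingUpdate:
      partition: 0               # only Pods with ordinal >= partition are updated
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: app
        image: example.com/app:v2   # bumping the image triggers the rolling update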

Because TPR (now CRD) does not support the rolling upgrade strategy, we needed to integrate the RollingUpdate strategy into WQ-RDS. Fortunately, the Kubernetes repository is a treasure trove for learning, and in the process of implementation there are some points worth sharing:

  • Detecting that the MySQL Sharding Cluster has changed: Each StatefulSet has a corresponding ControllerRevision, which records all the revision data and ordering (like git). Whenever a StatefulSet is syncing, the StatefulSet controller first compares its spec to the latest corresponding ControllerRevision data (similar to git diff). If it has changed, a new ControllerRevision is generated, and the revision number is incremented by 1. WQ-RDS borrows this process: the MySQL Sharding Cluster object records all the revisions and their order in ControllerRevisions.
  • How to initialize the MySQL Sharding Cluster to the requested number of replicas: StatefulSet supports two Pod management policies: Parallel and OrderedReady. Because the MySQL Sharding Cluster doesn’t require ordered creation for its initial processes, we use the Parallel policy to accelerate the initialization of the cluster.
  • How to perform a rolling upgrade: StatefulSet recreates Pods in strictly decreasing order. The difference is that WQ-RDS updates shards instead of recreating them, as shown below:

  • When the RollingUpdate ends: Kubernetes signals completion clearly. A rolling update completes when all of a set’s Pods have been updated to the updateRevision. The status’s currentRevision is set to updateRevision and its updateRevision is set to the empty string. The status’s currentReplicas is set to updateReplicas and its updateReplicas is set to 0.

Controller revision in WQ-RDS

Revision information is stored in MysqlCluster.Status and is no different from StatefulSet.Status.


[root@k8s-master ~]# kubectl get mysqlcluster -o yaml clustershard-c
apiVersion: v1
items:
- apiVersion: mysql.orain.com/v1
  kind: MysqlCluster
  metadata:
    creationTimestamp: 2017-10-20T08:19:41Z
    labels:
      AppName: clustershard-crm
      Createdby: orain.com
      DBType: MySQL
    name: clustershard-c
    namespace: default
    resourceVersion: "415852"
    selfLink: /apis/mysql.orain.com/v1/namespaces/default/mysqlclusters/clustershard-c
    uid: 6bb089bb-b56f-11e7-ae02-525400e717a6
  spec:

    dbresourcespec:
      limitedcpu: 1200m
      limitedmemory: 400Mi
      requestcpu: 1000m
      requestmemory: 400Mi

  status:
    currentReplicas: 8
    currentRevision: clustershard-c-648d878965
    replicas: 8
    updateRevision: clustershard-c-648d878965
kind: List

Example: Perform a rolling upgrade

Finally, we can update “clustershard-c” to change the configuration “innodb_buffer_pool_size” from 6GB to 7GB and reboot.

The process takes 480 seconds.

The upgrade proceeds in a monotonically decreasing manner.

Conclusion

RollingUpdate is meaningful to database administrators: it provides a more effective way to operate databases.

--Orain Xiong, co-founder, Woqutech
