Kubernetes 1.14: Local Persistent Volumes GA

Thursday, April 04, 2019

Kubernetes 1.14: Local Persistent Volumes GA

Authors: Michelle Au (Google), Matt Schallert (Uber), Celina Ward (Uber)

The Local Persistent Volumes feature has been promoted to GA in Kubernetes 1.14. It was first introduced as alpha in Kubernetes 1.7, and then beta in Kubernetes 1.10. The GA milestone indicates that Kubernetes users may depend on the feature and its API for production use. GA features are protected by the Kubernetes deprecation policy.

What is a Local Persistent Volume?

A local persistent volume represents a local disk directly-attached to a single Kubernetes Node.

Kubernetes provides a powerful volume plugin system that enables Kubernetes workloads to use a wide variety of block and file storage to persist data. Most of these plugins enable remote storage – these remote storage systems persist data independent of the Kubernetes node where the data originated. Remote storage usually can not offer the consistent high performance guarantees of local directly-attached storage. With the Local Persistent Volume plugin, Kubernetes workloads can now consume high performance local storage using the same volume APIs that app developers have become accustomed to.

How is it different from a HostPath Volume?

To better understand the benefits of a Local Persistent Volume, it is useful to compare it to a HostPath volume. HostPath volumes mount a file or directory from the host node’s filesystem into a Pod. Similarly a Local Persistent Volume mounts a local disk or partition into a Pod.

The biggest difference is that the Kubernetes scheduler understands which node a Local Persistent Volume belongs to. With HostPath volumes, a pod referencing a HostPath volume may be moved by the scheduler to a different node resulting in data loss. But with Local Persistent Volumes, the Kubernetes scheduler ensures that a pod using a Local Persistent Volume is always scheduled to the same node.

While HostPath volumes may be referenced via a Persistent Volume Claim (PVC) or directly inline in a pod definition, Local Persistent Volumes can only be referenced via a PVC. This provides additional security benefits since Persistent Volume objects are managed by the administrator, preventing Pods from being able to access any path on the host.

Additional benefits include support for formatting of block devices during mount, and volume ownership using fsGroup.

What’s New With GA?

Since 1.10, we have mainly focused on improving stability and scalability of the feature so that it is production ready.

The only major feature addition is the ability to specify a raw block device and have Kubernetes automatically format and mount the filesystem. This reduces the previous burden of having to format and mount devices before giving it to Kubernetes.

Limitations of GA

At GA, Local Persistent Volumes do not support dynamic volume provisioning. However there is an external controller available to help manage the local PersistentVolume lifecycle for individual disks on your nodes. This includes creating the PersistentVolume objects, cleaning up and reusing disks once they have been released by the application.

How to Use a Local Persistent Volume?

Workloads can request a local persistent volume using the same PersistentVolumeClaim interface as remote storage backends. This makes it easy to swap out the storage backend across clusters, clouds, and on-prem environments.

First, a StorageClass should be created that sets volumeBindingMode: WaitForFirstConsumer to enable volume topology-aware scheduling. This mode instructs Kubernetes to wait to bind a PVC until a Pod using it is scheduled.

kind: StorageClass
apiVersion: storage.k8s.io/v1
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer

Then, the external static provisioner can be configured and run to create PVs for all the local disks on your nodes.

$ kubectl get pv
local-pv-27c0f084   368Gi      RWO            Delete           Available          local-storage              8s
local-pv-3796b049   368Gi      RWO            Delete           Available          local-storage              7s
local-pv-3ddecaea   368Gi      RWO            Delete           Available          local-storage              7s

Afterwards, workloads can start using the PVs by creating a PVC and Pod or a StatefulSet with volumeClaimTemplates.

apiVersion: apps/v1
kind: StatefulSet
  name: local-test
  serviceName: "local-service"
  replicas: 3
      app: local-test
        app: local-test
      - name: test-container
        image: k8s.gcr.io/busybox
        - "/bin/sh"
        - "-c"
        - "sleep 100000"
        - name: local-vol
          mountPath: /usr/test-pod
  - metadata:
      name: local-vol
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "local-storage"
          storage: 368Gi

Once the StatefulSet is up and running, the PVCs are all bound:

$ kubectl get pvc
NAME                     STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS      AGE
local-vol-local-test-0   Bound    local-pv-27c0f084   368Gi      RWO            local-storage     3m45s
local-vol-local-test-1   Bound    local-pv-3ddecaea   368Gi      RWO            local-storage     3m40s
local-vol-local-test-2   Bound    local-pv-3796b049   368Gi      RWO            local-storage     3m36s

When the disk is no longer needed, the PVC can be deleted. The external static provisioner will clean up the disk and make the PV available for use again.

$ kubectl patch sts local-test -p '{"spec":{"replicas":2}}'
statefulset.apps/local-test patched

$ kubectl delete pvc local-vol-local-test-2
persistentvolumeclaim "local-vol-local-test-2" deleted

$ kubectl get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                            STORAGECLASS   REASON      AGE
local-pv-27c0f084   368Gi      RWO            Delete           Bound       default/local-vol-local-test-0   local-storage              11m
local-pv-3796b049   368Gi      RWO            Delete           Available                                    local-storage              7s
local-pv-3ddecaea   368Gi      RWO            Delete           Bound       default/local-vol-local-test-1   local-storage              19m

You can find full documentation for the feature on the Kubernetes website.

What Are Suitable Use Cases?

The primary benefit of Local Persistent Volumes over remote persistent storage is performance: local disks usually offer higher IOPS and throughput and lower latency compared to remote storage systems.

However, there are important limitations and caveats to consider when using Local Persistent Volumes:

  • Using local storage ties your application to a specific node, making your application harder to schedule. Applications which use local storage should specify a high priority so that lower priority pods, that don’t require local storage, can be preempted if necessary.
  • If that node or local volume encounters a failure and becomes inaccessible, then that pod also becomes inaccessible. Manual intervention, external controllers, or operators may be needed to recover from these situations.
  • While most remote storage systems implement synchronous replication, most local disk offerings do not provide data durability guarantees. Meaning loss of the disk or node may result in loss of all the data on that disk

For these reasons, local persistent storage should only be considered for workloads that handle data replication and backup at the application layer, thus making the applications resilient to node or data failures and unavailability despite the lack of such guarantees at the individual disk level.

Examples of good workloads include software defined storage systems and replicated databases. Other types of applications should continue to use highly available, remotely accessible, durable storage.

How Uber Uses Local Storage

M3, Uber’s in-house metrics platform, piloted Local Persistent Volumes at scale in an effort to evaluate M3DB — an open-source, distributed timeseries database created by Uber. One of M3DB’s notable features is its ability to shard its metrics into partitions, replicate them by a factor of three, and then evenly disperse the replicas across separate failure domains.

Prior to the pilot with local persistent volumes, M3DB ran exclusively in Uber-managed environments. Over time, internal use cases arose that required the ability to run M3DB in environments with fewer dependencies. So the team began to explore options. As an open-source project, we wanted to provide the community with a way to run M3DB as easily as possible, with an open-source stack, while meeting M3DB’s requirements for high throughput, low-latency storage, and the ability to scale itself out.

The Kubernetes Local Persistent Volume interface, with its high-performance, low-latency guarantees, quickly emerged as the perfect abstraction to build on top of. With Local Persistent Volumes, individual M3DB instances can comfortably handle up to 600k writes per-second. This leaves plenty of headroom for spikes on clusters that typically process a few million metrics per-second.

Because M3DB also gracefully handles losing a single node or volume, the limited data durability guarantees of Local Persistent Volumes are not an issue. If a node fails, M3DB finds a suitable replacement and the new node begins streaming data from its two peers.

Thanks to the Kubernetes scheduler’s intelligent handling of volume topology, M3DB is able to programmatically evenly disperse its replicas across multiple local persistent volumes in all available cloud zones, or, in the case of on-prem clusters, across all available server racks.

Uber’s Operational Experience

As mentioned above, while Local Persistent Volumes provide many benefits, they also require careful planning and careful consideration of constraints before committing to them in production. When thinking about our local volume strategy for M3DB, there were a few things Uber had to consider.

For one, we had to take into account the hardware profiles of the nodes in our Kubernetes cluster. For example, how many local disks would each node cluster have? How would they be partitioned?

The local static provisioner README provides guidance to help answer these questions. It’s best to be able to dedicate a full disk to each local volume (for IO isolation) and a full partition per-volume (for capacity isolation). This was easier in our cloud environments where we could mix and match local disks. However, if using local volumes on-prem, hardware constraints may be a limiting factor depending on the number of disks available and their characteristics.

When first testing local volumes, we wanted to have a thorough understanding of the effect disruptions (voluntary and involuntary) would have on pods using local storage, and so we began testing some failure scenarios. We found that when a local volume becomes unavailable while the node remains available (such as when performing maintenance on the disk), a pod using the local volume will be stuck in a ContainerCreating state until it can mount the volume. If a node becomes unavailable, for example if it is removed from the cluster or is drained, then pods using local volumes on that node are stuck in an Unknown or Pending state depending on whether or not the node was removed gracefully.

Recovering pods from these interim states means having to delete the PVC binding the pod to its local volume and then delete the pod in order for it to be rescheduled (or wait until the node and disk are available again). We took this into account when building our operator for M3DB, which makes changes to the cluster topology when a pod is rescheduled such that the new one gracefully streams data from the remaining two peers. Eventually we plan to automate the deletion and rescheduling process entirely.

Alerts on pod states can help call attention to stuck local volumes, and workload-specific controllers or operators can remediate them automatically. Because of these constraints, it’s best to exclude nodes with local volumes from automatic upgrades or repairs, and in fact some cloud providers explicitly mention this as a best practice.

Portability Between On-Prem and Cloud

Local Volumes played a big role in Uber’s decision to build orchestration for M3DB using Kubernetes, in part because it is a storage abstraction that works the same across on-prem and cloud environments. Remote storage solutions have different characteristics across cloud providers, and some users may prefer not to use networked storage at all in their own data centers. On the other hand, local disks are relatively ubiquitous and provide more predictable performance characteristics.

By orchestrating M3DB using local disks in the cloud, where it was easier to get up and running with Kubernetes, we gained confidence that we could still use our operator to run M3DB in our on-prem environment without any modifications. As we continue to work on how we’d run Kubernetes on-prem, having solved such an important pending question is a big relief.

What’s Next for Local Persistent Volumes?

As we’ve seen with Uber’s M3DB, local persistent volumes have successfully been used in production environments. As adoption of local persistent volumes continues to increase, SIG Storage continues to seek feedback for ways to improve the feature.

One of the most frequent asks has been for a controller that can help with recovery from failed nodes or disks, which is currently a manual process (or something that has to be built into an operator). SIG Storage is investigating creating a common controller that can be used by workloads with simple and similar recovery processes.

Another popular ask has been to support dynamic provisioning using lvm. This can simplify disk management, and improve disk utilization. SIG Storage is evaluating the performance tradeoffs for the viability of this feature.

Getting Invovled

If you have feedback for this feature or are interested in getting involved with the design and development, join the Kubernetes Storage Special-Interest-Group (SIG). We’re rapidly growing and always welcome new contributors.

Special thanks to all the contributors that helped bring this feature to GA, including Chuqiang Li (lichuqiang), Dhiraj Hedge (dhirajh), Ian Chakeres (ianchakeres), Jan Šafránek (jsafrane), Michelle Au (msau42), Saad Ali (saad-ali), Yecheng Fu (cofyc) and Yuquan Ren (nickrenren).