Local Storage: Storage Capacity Tracking, Distributed Provisioning and Generic Ephemeral Volumes hit Beta

Authors: Patrick Ohly (Intel)

The "generic ephemeral volumes" and "storage capacity tracking" features in Kubernetes are getting promoted to beta in Kubernetes 1.21. Together with the distributed provisioning support in the CSI external-provisioner, development and deployment of Container Storage Interface (CSI) drivers which manage storage locally on a node become a lot easier.

This blog post explains how such drivers worked before and how these features can be used to make drivers simpler.

Problems we are solving

There are drivers for local storage, like TopoLVM for traditional disks and PMEM-CSI for persistent memory. They work today, including on older Kubernetes releases, but making that possible was not trivial.

Central component required

The first problem is volume provisioning: it is handled through the Kubernetes control plane. Some component must react to PersistentVolumeClaims (PVCs) and create volumes. Usually, that is handled by a central deployment of the CSI external-provisioner and a CSI driver component that then connects to the storage backplane. But for local storage, there is no such backplane.

TopoLVM solved this by having its different components communicate with each other through the Kubernetes API server by creating and reacting to custom resources. So although TopoLVM is based on CSI, a standard that is independent of a particular container orchestrator, TopoLVM only works on Kubernetes.

PMEM-CSI created its own storage backplane with communication through gRPC calls. Securing that communication depends on TLS certificates, which made driver deployment more complicated.

Informing Pod scheduler about capacity

The next problem is scheduling. When volumes get created independently of pods ("immediate binding"), the CSI driver must pick a node without knowing anything about the pod(s) that are going to use it. Topology information then forces those pods to run on the node where the volume was created. If other resources like RAM or CPU are exhausted there, the pod cannot start. This can be avoided by configuring in the StorageClass that volume creation is meant to wait for the first pod that uses a volume (volumeBindingMode: WaitForFirstConsumer). In that mode, the Kubernetes scheduler tentatively picks a node based on other constraints and then the external-provisioner is asked to create a volume such that it is usable there. If local storage is exhausted, the provisioner can ask for another scheduling round. But without information about available capacity, the scheduler might always pick the same unsuitable node.
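As a minimal sketch of the delayed-binding setup described above, the following Python snippet builds a StorageClass manifest with volumeBindingMode set to WaitForFirstConsumer and prints it as JSON, which kubectl accepts as well as YAML. The class name `local-scratch` and the provisioner name `local.csi.example.com` are placeholders for illustration, not names used by TopoLVM or PMEM-CSI.

```python
import json


def make_storage_class(name: str, provisioner: str) -> dict:
    """Build a StorageClass manifest that delays volume creation
    until a pod using the volume is being scheduled."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,
        # Without this field, the default is immediate binding.
        "volumeBindingMode": "WaitForFirstConsumer",
    }


# Hypothetical names; substitute the ones from the driver's documentation.
storage_class = make_storage_class("local-scratch", "local.csi.example.com")
print(json.dumps(storage_class, indent=2))
```

The printed manifest could be saved to a file and applied to a cluster; the provisioner value must match the name under which the CSI driver registers itself.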

Both TopoLVM and PMEM-CSI solved this with scheduler extenders. This works, but it is hard to configure when deploying the driver because communication between kube-scheduler and the driver depends heavily on how the cluster was set up.

Rescheduling

A common use case for local storage is scratch space. Ephemeral volumes, which get created for a pod and destroyed together with it, fit that use case better than persistent volumes. The initial API for supporting ephemeral volumes with CSI drivers (referred to below as "CSI ephemeral volumes") was designed for light-weight volumes where volume creation is unlikely to fail. Volume creation happens after pods have been permanently scheduled onto a node, in contrast to traditional provisioning, where volume creation is tried before scheduling a pod onto a node. CSI drivers must be modified to support "CSI ephemeral volumes", which was done for TopoLVM and PMEM-CSI. But due to the design of the feature in Kubernetes, pods can get stuck permanently if storage capacity runs out on a node. The scheduler extenders try to avoid that, but cannot be 100% reliable.

Enhancements in Kubernetes 1.21

Distributed provisioning

Starting with external-provisioner v2.1.0, released for Kubernetes 1.20, provisioning can be handled by external-provisioner instances that get deployed together with the CSI driver on each node and then cooperate to provision volumes ("distributed provisioning"). A central component is no longer needed, and thus neither is communication between nodes, at least not for provisioning.

Storage capacity tracking

A scheduler extender still needs some way to find out about capacity on each node. When PMEM-CSI switched to distributed provisioning in v0.9.0, this was done by querying the metrics data exposed by the local driver containers. But eliminating the need for a scheduler extender completely is better for users as well, because the driver deployment becomes simpler. Storage capacity tracking, introduced in 1.19 and promoted to beta in Kubernetes 1.21, achieves that. It works by publishing information about capacity in CSIStorageCapacity objects. The scheduler itself then uses that information to filter out unsuitable nodes. Because the information might not be quite up to date, pods may still get assigned to nodes with insufficient storage. That is just less likely, and the next scheduling attempt for a pod should work better once the information has been refreshed.
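The filtering step can be illustrated with a small Python sketch. It is not the kube-scheduler implementation: the `StorageCapacity` class below is a simplified stand-in for a CSIStorageCapacity object, with the node topology selector reduced to exact-match labels and the capacity expressed as plain bytes. All node names and sizes are made up, and the sketch assumes that a node without a matching capacity entry is treated as unsuitable.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class StorageCapacity:
    """Simplified stand-in for a CSIStorageCapacity object."""
    storage_class_name: str
    node_topology: dict   # label selector reduced to exact-match labels
    capacity_bytes: int


def selector_matches(selector: dict, node_labels: dict) -> bool:
    """True if every label in the selector is present on the node."""
    return all(node_labels.get(key) == value for key, value in selector.items())


def filter_nodes(nodes: dict, capacities: list, storage_class: str,
                 requested_bytes: int) -> list:
    """Return the nodes for which some published capacity entry
    covers the requested volume size."""
    suitable = []
    for name, labels in nodes.items():
        for cap in capacities:
            if (cap.storage_class_name == storage_class
                    and selector_matches(cap.node_topology, labels)
                    and cap.capacity_bytes >= requested_bytes):
                suitable.append(name)
                break
    return suitable


# Hypothetical cluster state.
nodes = {
    "node-a": {"kubernetes.io/hostname": "node-a"},
    "node-b": {"kubernetes.io/hostname": "node-b"},
}
capacities = [
    StorageCapacity("local-scratch", {"kubernetes.io/hostname": "node-a"}, 5 * GIB),
    StorageCapacity("local-scratch", {"kubernetes.io/hostname": "node-b"}, 50 * GIB),
]
print(filter_nodes(nodes, capacities, "local-scratch", 10 * GIB))
```

With these made-up numbers, only node-b has enough published capacity for a 10 GiB volume, so node-a is filtered out before a node is picked.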

Generic ephemeral volumes

So CSI drivers still need the ability to recover from a bad scheduling decision, something that turned out to be impossible to implement for "CSI ephemeral volumes". "Generic ephemeral volumes", another feature that got promoted to beta in 1.21, don't have that limitation. This feature adds a controller that will create and manage PVCs with the lifetime of the Pod and therefore the normal recovery mechanism also works for them. Existing storage drivers will be able to process these PVCs without any new logic to handle this new scenario.
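To make this more concrete, here is a hedged Python sketch that builds a Pod manifest with one generic ephemeral volume. The pod name, volume name, image, storage class and size are placeholders. The helper `ephemeral_pvc_name` reflects the documented naming rule that the PVC created for such a volume is named after the pod and the volume, joined by a hyphen.

```python
import json


def ephemeral_pvc_name(pod_name: str, volume_name: str) -> str:
    """Name of the PVC that gets created for a generic ephemeral volume."""
    return f"{pod_name}-{volume_name}"


def make_pod(pod_name: str, volume_name: str, storage_class: str, size: str) -> dict:
    """Build a Pod manifest whose volume is defined inline through a
    PVC template instead of referencing an existing PVC."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "containers": [{
                "name": "app",
                "image": "busybox",
                "command": ["sleep", "3600"],
                "volumeMounts": [{"name": volume_name, "mountPath": "/scratch"}],
            }],
            "volumes": [{
                "name": volume_name,
                "ephemeral": {
                    "volumeClaimTemplate": {
                        "spec": {
                            "accessModes": ["ReadWriteOnce"],
                            "storageClassName": storage_class,
                            "resources": {"requests": {"storage": size}},
                        }
                    }
                },
            }],
        },
    }


# Hypothetical names and sizes.
pod = make_pod("my-app", "scratch", "local-scratch", "1Gi")
print(json.dumps(pod, indent=2))
print("PVC name:", ephemeral_pvc_name("my-app", "scratch"))
```

Because the volume ends up as an ordinary PVC owned by the pod, the storage driver sees a normal provisioning request and needs no special handling.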

Known limitations

Both generic ephemeral volumes and storage capacity tracking increase the load on the API server. Whether that is a problem depends a lot on the kind of workload, in particular how many pods have volumes and how often those need to be created and destroyed.

No attempt was made to model how scheduling decisions affect storage capacity. That's because the effect can vary considerably depending on how the storage system handles storage. The effect is that multiple pods with unbound volumes might get assigned to the same node even though there is only sufficient capacity for one pod. Scheduling should recover, but it would be more efficient if the scheduler knew more about storage.
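The following toy simulation in Python illustrates that effect. It is a deliberately simplified model, not scheduler code: published capacity is a plain dictionary that is not updated while pods are being placed, and all pod names, node names and sizes are invented.

```python
GIB = 1024 ** 3


def schedule(pending: dict, published: dict) -> dict:
    """Assign each pod to the first node whose *published* capacity fits.
    The published values are never reduced, mirroring the fact that
    scheduling decisions are not modeled."""
    assignments = {}
    for pod, requested in pending.items():
        for node, capacity in published.items():
            if capacity >= requested:
                assignments[pod] = node
                break
    return assignments


def provision(assignments: dict, pending: dict, actual: dict):
    """Try to create the volumes; pods whose volume does not fit
    have to go through another scheduling round."""
    remaining = dict(actual)
    provisioned, retry = [], []
    for pod, node in assignments.items():
        if remaining[node] >= pending[pod]:
            remaining[node] -= pending[pod]
            provisioned.append(pod)
        else:
            retry.append(pod)
    return provisioned, retry


# Two pods that each need 60 GiB; each node really has 100 GiB.
pending = {"pod-1": 60 * GIB, "pod-2": 60 * GIB}
published = {"node-a": 100 * GIB, "node-b": 100 * GIB}

assignments = schedule(pending, published)
provisioned, retry = provision(assignments, pending, published)
print(assignments)
print("provisioned:", provisioned, "needs rescheduling:", retry)
```

Both pods pass the capacity check for node-a, but only the first volume fits there. The second pod has to be rescheduled once the published capacity for node-a has been refreshed.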

Because storage capacity gets published by a running CSI driver and the cluster autoscaler needs information about a node that hasn't been created yet, the autoscaler will currently not scale up a cluster for pods that need volumes. There is an idea for how to provide that information, but more work is needed in that area.

Distributed snapshotting and resizing are not currently supported. It should be doable to adapt the respective sidecars, and tracking issues for external-snapshotter and external-resizer are already open; they just need a volunteer.

The recovery from a bad scheduling decision can fail for pods with multiple volumes, in particular when those volumes are local to nodes: if one volume can be created and then storage is insufficient for another volume, the first volume continues to exist and forces the scheduler to put the pod onto the node of that volume. There is an idea for how to deal with this, namely rolling back the provisioning of the volume, but this is only in the very early stages of brainstorming and not even a merged KEP yet. For now it is better to avoid creating pods with more than one persistent volume.

Enabling the new features and next steps

With generic ephemeral volumes entering beta in the 1.21 release, no additional actions are needed to enable the feature. Generic ephemeral volumes also work without changes in CSI drivers. For more information, see the documentation and the previous blog post about it. The API has not changed at all between alpha and beta.

For the other two features, the external-provisioner documentation explains how CSI driver developers must change how their driver gets deployed to support storage capacity tracking and distributed provisioning. These two features are independent, so it is okay to enable only one of them.

SIG Storage would like to hear from you if you are using these new features. We can be reached through email, Slack (channel #sig-storage) and in the regular SIG meeting. A description of your workload would be very useful to validate design decisions, set up performance tests and eventually promote these features to GA.

Acknowledgements

Thanks a lot to the members of the community who have contributed to these features or given feedback, including members of SIG Scheduling, SIG Auth, and of course SIG Storage!