Kubernetes 1.24: Maximum Unavailable Replicas for StatefulSet
Author: Mayank Kumar (Salesforce)
Kubernetes StatefulSets, introduced in 1.5 and stable since 1.9, have been widely used to run stateful applications. They provide stable pod identity, persistent per-pod storage, and ordered graceful deployment, scaling, and rolling updates. You can think of a StatefulSet as the atomic building block for running complex stateful applications. As the use of Kubernetes has grown, so has the number of scenarios requiring StatefulSets. Many of these scenarios require faster rolling updates than the currently supported one-pod-at-a-time updates, in the case where you're using the OrderedReady Pod management policy for a StatefulSet.
Here are some examples:
- I am using a StatefulSet to orchestrate a multi-instance, cache-based application where the size of the cache is large. The cache starts cold and requires a significant amount of time before the container can start. There could be more initial startup tasks that are required. A RollingUpdate on this StatefulSet would take a lot of time before the application is fully updated. If the StatefulSet supported updating more than one pod at a time, it would result in a much faster update.
- My stateful application is composed of leaders and followers, or one writer and multiple readers. I have multiple readers or followers, and my application can tolerate multiple pods going down at the same time. I want to update this application more than one pod at a time so that I get the new updates rolled out quickly, especially if the number of instances of my application is large. Note that my application still requires unique identity per pod.
To support such scenarios, Kubernetes 1.24 includes a new alpha feature. Before you can use the new feature you must enable the MaxUnavailableStatefulSet feature flag. Once you enable that, you can specify a new field called maxUnavailable, part of the spec for a StatefulSet. For example:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
  namespace: default
spec:
  podManagementPolicy: OrderedReady  # you must set OrderedReady
  replicas: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      # image changed since publication (previously used registry "k8s.gcr.io")
      - image: registry.k8s.io/nginx-slim:0.8
        imagePullPolicy: IfNotPresent
        name: nginx
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 2 # this is the new alpha field, whose default value is 1
      partition: 0
    type: RollingUpdate
If you enable the new feature and you don't specify a value for
maxUnavailable in a StatefulSet, Kubernetes applies a default
maxUnavailable: 1. This matches the behavior you would see if you don't enable the new feature.
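To make the value resolution concrete, here is a minimal Python sketch. It assumes only the rules stated in this article: an unset field means 1, and the field may be an absolute number or a percentage of desired Pods that is rounded up. The function name resolve_max_unavailable is hypothetical and not part of any Kubernetes client; the real controller may treat edge cases differently, so check the behavior on your cluster version.

```python
from typing import Optional, Union


def resolve_max_unavailable(value: Optional[Union[int, str]], replicas: int) -> int:
    """Resolve a maxUnavailable setting to an absolute pod count.

    Hypothetical helper for illustration; it models the rules as the
    article states them, not the controller's actual source code.
    """
    if value is None:
        # Assumption from the article: an unset field behaves like 1.
        return 1
    if isinstance(value, int):
        return value
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        # Round up to the nearest integer using integer arithmetic only.
        return (percent * replicas + 99) // 100
    raise ValueError(f"unsupported maxUnavailable value: {value!r}")


print(resolve_max_unavailable(None, 5))   # unset field
print(resolve_max_unavailable(2, 5))      # absolute number
print(resolve_max_unavailable("10%", 5))  # 10% of 5 pods, rounded up
```

With 5 replicas, 10% is 0.5 of a pod, which rounds up to 1, so a small percentage on a small StatefulSet gives the same result as the default.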
I'll run through a scenario based on that example manifest to demonstrate how this feature works. I will deploy a StatefulSet that
has 5 replicas, with
maxUnavailable set to 2 and
partition set to 0.
I can trigger a rolling update by changing the image to registry.k8s.io/nginx-slim:0.9. Once I initiate the rolling update, I can watch the pods update 2 at a time, because the current value of maxUnavailable is 2. The value of maxUnavailable can be an absolute number (for example, 2) or a percentage of desired Pods (for example, 10%). The absolute number is calculated from the percentage by rounding up to the nearest integer. The output below covers only part of the rollout and is not complete.
kubectl get pods --watch
NAME    READY   STATUS              RESTARTS   AGE
web-0   1/1     Running             0          85s
web-1   1/1     Running             0          2m6s
web-2   1/1     Running             0          106s
web-3   1/1     Running             0          2m47s
web-4   1/1     Running             0          2m27s
web-4   1/1     Terminating         0          5m43s ----> start terminating 4
web-3   1/1     Terminating         0          6m3s  ----> start terminating 3
web-3   0/1     Terminating         0          6m7s
web-3   0/1     Pending             0          0s
web-3   0/1     Pending             0          0s
web-4   0/1     Terminating         0          5m48s
web-4   0/1     Terminating         0          5m48s
web-3   0/1     ContainerCreating   0          2s
web-3   1/1     Running             0          2s
web-4   0/1     Pending             0          0s
web-4   0/1     Pending             0          0s
web-4   0/1     ContainerCreating   0          0s
web-4   1/1     Running             0          1s
web-2   1/1     Terminating         0          5m46s ----> start terminating 2 (only after both 4 and 3 are running)
web-1   1/1     Terminating         0          6m6s  ----> start terminating 1
web-2   0/1     Terminating         0          5m47s
web-1   0/1     Terminating         0          6m7s
web-1   0/1     Pending             0          0s
web-1   0/1     Pending             0          0s
web-1   0/1     ContainerCreating   0          1s
web-1   1/1     Running             0          2s
web-2   0/1     Pending             0          0s
web-2   0/1     Pending             0          0s
web-2   0/1     ContainerCreating   0          0s
web-2   1/1     Running             0          1s
web-0   1/1     Terminating         0          6m6s  ----> start terminating 0 (only after 2 and 1 are running)
web-0   0/1     Terminating         0          6m7s
web-0   0/1     Pending             0          0s
web-0   0/1     Pending             0          0s
web-0   0/1     ContainerCreating   0          0s
web-0   1/1     Running             0          1s
Note that as soon as the rolling update starts, both 4 and 3 (the two highest ordinal pods) start terminating at the same time. Pods with ordinal 4 and 3 may become ready at their own pace. As soon as both pods 4 and 3 are ready, pods 2 and 1 start terminating at the same time. When pods 2 and 1 are both running and ready, pod 0 starts terminating.
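If you want to check a watch log like this one yourself, the following Python sketch may help. It replays a list of (name, READY, STATUS) rows and reports the largest number of pods that were unavailable at the same moment. The function name peak_unavailable and the rule "available means Running with all containers ready" are assumptions made for this sketch, and the sample rows are a shortened, hand-copied subset of the output above.

```python
from typing import Dict, Iterable, Tuple

# (pod name, READY column, STATUS column), as printed by kubectl get pods
Event = Tuple[str, str, str]


def peak_unavailable(events: Iterable[Event]) -> int:
    """Return the largest number of pods that were unavailable at once."""
    available: Dict[str, bool] = {}
    peak = 0
    for name, ready, status in events:
        ready_count, total = ready.split("/")
        # Simplification for this sketch: a pod counts as available only
        # while it is Running and every container reports ready.
        available[name] = status == "Running" and ready_count == total
        unavailable = sum(1 for ok in available.values() if not ok)
        peak = max(peak, unavailable)
    return peak


# Abridged rows from the rollout shown above.
events = [
    ("web-2", "1/1", "Running"),
    ("web-3", "1/1", "Running"),
    ("web-4", "1/1", "Running"),
    ("web-4", "1/1", "Terminating"),
    ("web-3", "1/1", "Terminating"),
    ("web-3", "0/1", "Pending"),
    ("web-3", "1/1", "Running"),
    ("web-4", "0/1", "Pending"),
    ("web-4", "1/1", "Running"),
    ("web-2", "1/1", "Terminating"),
]
print(peak_unavailable(events))  # prints 2
```

A peak equal to maxUnavailable is what you would expect during the rollout; a higher value would suggest something other than the rolling update is taking pods down.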
In Kubernetes, updates to StatefulSets follow a strict ordering when updating Pods. In this example, a one-pod-at-a-time update would start at replica 4, then replica 3, then replica 2, and so on. When going one pod at a time, it's not possible for 3 to be running and ready before 4. When maxUnavailable is more than 1 (in the example scenario I set maxUnavailable to 2), it is possible that replica 3 becomes ready and running before replica 4 is ready, and that is OK. If you're a developer and you set maxUnavailable to more than 1, you should know that this outcome is possible, and you must ensure that your application can handle any ordering issues that occur. When you set maxUnavailable greater than 1, the ordering is guaranteed between batches of pods being updated. That guarantee means that pods in the second update batch (replicas 2 and 1) cannot start updating until the pods from the first batch (replicas 4 and 3) are ready.
Although Kubernetes refers to these as replicas, your stateful application may have a different view and each pod of the StatefulSet may be holding completely different data than other pods. The important thing here is that updates to StatefulSets happen in batches, and you can now have a batch size larger than 1 (as an alpha feature).
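The batch ordering described above can be sketched in a few lines of Python. This is only a model of the behavior as this article describes it, with partition set to 0: the function name update_batches is hypothetical, and the real controller reacts to pod readiness at runtime rather than precomputing batches.

```python
from typing import List


def update_batches(replicas: int, max_unavailable: int) -> List[List[int]]:
    """Group pod ordinals into update batches, highest ordinal first.

    Hypothetical model for illustration only. Ordering is guaranteed
    between batches, not between the pods inside one batch.
    """
    if replicas < 0 or max_unavailable < 1:
        raise ValueError("replicas must be >= 0 and max_unavailable >= 1")
    # Rolling updates walk the ordinals from the highest down to 0.
    ordinals = list(range(replicas - 1, -1, -1))
    return [
        ordinals[start:start + max_unavailable]
        for start in range(0, len(ordinals), max_unavailable)
    ]


for number, batch in enumerate(update_batches(5, 2), start=1):
    names = ", ".join(f"web-{ordinal}" for ordinal in batch)
    print(f"batch {number}: {names}")
# batch 1: web-4, web-3
# batch 2: web-2, web-1
# batch 3: web-0
```

Calling update_batches(5, 1) yields five single-pod batches, which corresponds to the one-pod-at-a-time behavior you get without the feature.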
Also note that the above behavior is with podManagementPolicy: OrderedReady. If you define a StatefulSet with podManagementPolicy: Parallel, not only are maxUnavailable replicas terminated at the same time; maxUnavailable replicas also start in the ContainerCreating phase at the same time. This is called bursting.
So, now you may have a lot of questions, such as:
- What is the behavior when you set podManagementPolicy: Parallel?
- What is the behavior when you set partition to a value other than 0?
It might be better to try it and see for yourself. This is an alpha feature, and the Kubernetes contributors are looking for feedback on it. Did this help you achieve your stateful scenarios? Did you find a bug, or do you think the behavior as implemented is not intuitive, can break applications, or can catch them by surprise? Please open an issue to let us know.
Further reading and next steps
- Maximum unavailable Pods
- KEP for MaxUnavailable for StatefulSet
- Enhancement Tracking Issue