Kubernetes rollouts are not seamless by default¶
Most Kubernetes users assume that adding a readiness probe and setting
spec.strategy.type: RollingUpdate is enough to achieve seamless pod
rollouts. This is not true. This blog post describes why this happens and how
to avoid dropping calls during rollouts.
Anatomy of a rollout¶
Let’s say we have a simple Deployment/Service setup, looking like:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  strategy:
    # We make sure that the new pod is started and ready before
    # terminating the old one.
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: nginx
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.28.0
          ports:
            - containerPort: 80
          # Minimal readiness probe making sure that the server is running
          readinessProbe:
            httpGet:
              path: /
              port: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: nginx
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
Everything is working nicely: the pod is ready and requests are served.
We decide to update the nginx image to nginx:1.29.1.
The replication controller wakes up and creates a new ReplicaSet with the
nginx:1.29.1 image. A new pod is created, and soon scheduled to a node.
The kubelet creates everything the pod needs and starts the container.
After a few seconds, the kubelet starts running the readiness probe.
As soon as a check succeeds, the kubelet marks the pod ready.
The pod turning ready wakes up 2 different controllers:
The replication controller sees that the new pod is ready and marks the old pod as “to be terminated”. This will wake up the kubelet, which will kick off the pod termination sequence.
The endpoint controller creates a new endpoint for the service containing the pod’s IP address. This will wake up loadbalancer controllers and reconfigure the LB to add the new pod as a destination.
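For illustration, here is roughly what the resulting EndpointSlice looks like once the new pod is ready (the name, IP and port below are made up for this example):
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  # generated name, linked back to the Service through this label
  name: nginx-abc12
  labels:
    kubernetes.io/service-name: nginx
addressType: IPv4
endpoints:
  - addresses:
      - 10.244.1.23      # the new pod's IP
    conditions:
      ready: true        # flips together with the pod's readiness
ports:
  - protocol: TCP
    port: 80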
Wait, is this a race!?¶
Yes it is. The rollout will happily progress while the LB controller is still updating the loadbalancer backend config.
If we’re not careful, we will start deleting the old pod while the traffic is not yet sent to the new pod.
To add insult to injury, we’ll also have a similar race when terminating the previous pod. As soon as the pod is marked as “to-be-deleted”:
Its endpoint is marked unready and its IP is removed from the LB backends.
The kubelet starts terminating the pod, sending the STOPSIGNAL (SIGTERM by default) to its containers.
So during the termination, there’s nothing making sure that the pod is not receiving new requests before stopping it.
Both creation and termination issues affect internal traffic (e.g.
ClusterIP/NodePort services) and external traffic (LoadBalancer
services, or Ingress/HTTPRoute).
Internal traffic¶
For internal traffic, either the CNI, kube-proxy or both are responsible
for updating the rules and sending the traffic to the right node/pod.
Before 1.28, kube-proxy always refreshed all rules, which caused most
kube distros to increase the period between syncs to mitigate performance
issues. This often caused 10 to 30 second delays between the endpoint being
updated and the traffic being sent to the right place.
Modern kube versions are way more efficient at refreshing rules, and it’s recommended to bring the minimum sync period back down to 1s.
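If you manage kube-proxy yourself, that sync period lives in the KubeProxyConfiguration; a minimal sketch for the iptables backend (how this config is shipped to kube-proxy depends on your distro):
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: iptables
iptables:
  # minimum delay between two rule refreshes; 1s is fine on modern versions
  minSyncPeriod: 1s
  # rules are still fully re-synced at least once per syncPeriod
  syncPeriod: 30s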
iptables-free setups (e.g. Cilium), although not instant, usually offer way faster propagation.
Regardless of the network setup and the propagation speed, we must assume that endpoint propagation is not instant and take this into account before shutting down a pod or progressing a rollout.
The most common mitigations are:
Delaying the container shutdown to make sure no one will try to send traffic to it. Since 1.29, this is supported natively by Kubernetes with the sleep field of the preStop hook. On earlier versions, this was done either via a preStop.exec hook or a time.Sleep() before the application handles the STOPSIGNAL and stops taking new requests; see the sketch below.
Slowing down the Deployment or StatefulSet rollout by adding a non-null minReadySeconds.
In most cases, a 5-second delay should be enough for internal traffic, but if you don’t control where your application will be deployed, a more conservative 30-second delay might be more appropriate.
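As a sketch, here is what both mitigations look like on a pre-1.29 cluster, where the native sleep action is not available (the names and image are illustrative, and the exec form assumes the image ships a sleep binary):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                  # illustrative name
spec:
  # hold the rollout for 5s after a pod turns ready, giving
  # kube-proxy/CNI time to propagate the new endpoint
  minReadySeconds: 5
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: my-app
    spec:
      containers:
        - name: app
          image: my-app:1.0.0   # illustrative image
          lifecycle:
            preStop:
              # keep serving for 5s after the endpoint turns unready,
              # before the STOPSIGNAL is sent to the container
              exec:
                command: ["sleep", "5"]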
External services¶
For external services, we face the same problems, but the propagation delay can
be even longer. The Ingress or LoadBalancer controllers have to sync with
external systems, which are often distributed and need some time to commit
new rules/confirm the new targets are healthy (e.g. an AWS ALB in ip mode).
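As a reminder, “ip mode” means the loadbalancer targets the pod IPs directly instead of going through node ports; with the AWS Load Balancer Controller this is selected per Ingress via an annotation, roughly like this (names are illustrative):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nginx
  annotations:
    # register pod IPs (not nodes) as ALB targets
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/scheme: internet-facing
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: nginx
                port:
                  number: 80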
Kubernetes has a feature to solve this issue: the readiness gates. This
is an optional way for the loadbalancer controller to say “until I set this
status.condition to true, don’t mark the pod ready”. Keeping the pod
unready will block the rollout from progressing until the loadbalancer is
aware of the pod.
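On the pod side, a readiness gate is just an extra condition listed in the spec that must be True before the pod can turn ready; the conditionType below is made up, and in practice controllers like the AWS one shown at the end of this post inject it for you:
apiVersion: v1
kind: Pod
metadata:
  name: nginx            # illustrative; normally part of a Deployment template
spec:
  readinessGates:
    # the pod stays unready until some controller sets a status
    # condition of this type to "True"
    - conditionType: example.com/lb-target-registered
  containers:
    - name: nginx
      image: nginx:1.28.0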
This is a very elegant solution but is not always practical as the feature is often off by default, and every loadbalancer controller has its own custom way of enabling it. This also heavily depends on the CNI and networking mode (does the LB reach out to pods directly?) and is not available everywhere.
If you are deploying your own application, it’s likely not an issue, but if you are writing manifests that will be deployed by someone else, you cannot assume they will have readiness gates nor will configure them properly.
The other unsolved issue is the pod termination. Very surprisingly, there’s
no readiness gate equivalent for delaying a pod termination until it’s
unregistered from the load-balancer. Currently, the only workaround is to
wait a reasonable amount of time before stopping the container. This is usually
done with the preStop.sleep hook trick.
tl;dr¶
If you want your rollouts to stop dropping calls, you should:
Turn on readiness gates and configure the LB to send traffic directly to the pods (as opposed to the underlying instance) when possible.
Set a preStop hook sleeping 30 seconds, or bake this delay into your application.
Increase terminationGracePeriodSeconds to be the sum of the preStop duration and how long it takes for your application to stop.
Set minReadySeconds: 30 in your Deployment.
On AWS with the AWS Load Balancer controller (the out-of-tree one), the final resources would look like:
apiVersion: v1
kind: Namespace
metadata:
  name: nginx
  labels:
    # turn on readiness gate injection, this is LB-controller-specific.
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  # must live in the labeled namespace for the gate injection to apply
  namespace: nginx
spec:
  # wait for new pods to get in the LB pool before continuing the rollout
  minReadySeconds: 30
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: nginx
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.28.0
          ports:
            - containerPort: 80
          readinessProbe:
            httpGet:
              path: /
              port: 80
          # wait for pods to be out of LB pool before stopping
          lifecycle:
            preStop:
              sleep:
                seconds: 30
      # compensate for the preStop causing slower terminations
      terminationGracePeriodSeconds: 60