Kubernetes rollouts are not seamless by default

Most Kubernetes users assume that adding a readiness probe and setting spec.strategy.type: RollingUpdate is enough to achieve seamless pod rollouts. This is not true. This post describes why, and how to avoid dropping calls during rollouts.

Anatomy of a rollout

Let’s say we have a simple Deployment/Service setup that looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  strategy:
    # We make sure that the new pod is started and ready before
    # terminating the old one.
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.28.0
        ports:
        - containerPort: 80
        # Minimal readiness probe making sure that the server is running
        readinessProbe:
          httpGet:
            path: /
            port: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: nginx
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80

Everything is working nicely: the pod is ready and requests are served. We decide to update the nginx image to nginx:1.29.1.

  1. The Deployment controller wakes up and creates a new ReplicaSet with the nginx:1.29.1 image.

  2. A new pod is created, and soon scheduled to a node.

  3. The kubelet creates everything the pod needs and starts the container.

  4. After a few seconds, the kubelet starts running readiness checks.

  5. As soon as a check succeeds, the kubelet marks the pod ready.

The pod turning ready wakes up two different controllers:

  • The Deployment controller sees that the new pod is ready and scales down the old ReplicaSet, marking the old pod as “to be terminated”. This will wake up the kubelet, which will kick off the pod termination sequence.

  • The endpoints (EndpointSlice) controller adds the pod’s IP address to the service’s endpoints (a sketch of the resulting object is shown just below). This will wake up loadbalancer controllers, which reconfigure the LB to add the new pod as a destination.
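
For reference, the service’s endpoints are tracked in an EndpointSlice, which ends up looking roughly like this once the new pod is ready (names and IPs below are illustrative):

apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: nginx-abc12                # generated name, illustrative
  labels:
    kubernetes.io/service-name: nginx
addressType: IPv4
ports:
- protocol: TCP
  port: 80
endpoints:
- addresses:
  - 10.0.1.23                      # the new pod’s IP, illustrative
  conditions:
    ready: true                    # flips to false once the pod starts terminating
  targetRef:
    kind: Pod
    name: nginx-6d8f4c7b9-xyz12    # illustrative pod name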

Wait, is this a race!?

Yes it is. The rollout will happily progress while the LB controller is still updating the loadbalancer backend config.

If we’re not careful, we will start deleting the old pod before traffic is actually being sent to the new one.

To add insult to injury, we’ll also have a similar race when terminating the previous pod. As soon as the pod is marked as “to-be-deleted”:

  • Its endpoint is marked unready and its IP is removed from the LB backends.

  • The kubelet starts terminating the pod, sending the stop signal (SIGTERM by default) to its containers.

So during termination, nothing guarantees that the pod has stopped receiving new requests before it is shut down.

Both the creation and termination issues affect internal traffic (e.g. ClusterIP/NodePort services) and external traffic (LoadBalancer services, or Ingress/HTTPRoute).

Internal traffic

For internal traffic, either the CNI, kube-proxy or both are responsible for updating the rules and sending the traffic to the right node/pod.

Before Kubernetes 1.28, kube-proxy refreshed all rules on every sync, which led most distributions to increase the period between syncs to mitigate the performance impact. This often caused 10 to 30 second delays between the endpoint being updated and traffic actually being sent to the right place.

Modern Kubernetes versions are much more efficient at refreshing rules, and it’s recommended to bring the minimum sync period back down to 1s.
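
How you set this depends on your distribution, but if you manage kube-proxy yourself, the knob lives in the KubeProxyConfiguration. A minimal sketch for iptables mode (the values shown are the ones suggested above, not necessarily your distro’s defaults):

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: iptables
iptables:
  # lower bound on how quickly kube-proxy re-syncs rules after an
  # endpoint change; 1s keeps propagation fast on modern versions
  minSyncPeriod: 1s
  # periodic full resync, independent of endpoint changes
  syncPeriod: 30s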

iptables-free setups (e.g. Cilium), although not instant, usually offer way faster propagation.

Regardless of the network setup and the propagation speed, we must assume that endpoint propagation is not instant and take this into account before shutting down a pod or progressing a rollout.

The most common mitigations are:

  • Delaying the container shutdown to make sure no one will try to send traffic to it. Since 1.29, this is supported natively by Kubernetes with the sleep field of the preStop hook. On earlier versions, this was done either via a preStop.exec hook (sketched below) or a time.Sleep() before the application handles the stop signal and stops taking new requests.

  • Slowing down the Deployment or StatefulSet rollout by adding a non-zero minReadySeconds.

In most cases, a 5-second delay should be enough for internal traffic, but if you don’t control where your application will be deployed, a more conservative 30-second delay might be more appropriate.
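
On clusters that predate the native sleep action, the exec-based variant mentioned in the list above looks roughly like this. It assumes the image ships a sleep binary, which the stock nginx image does:

      containers:
      - name: nginx
        image: nginx:1.28.0
        lifecycle:
          preStop:
            exec:
              # keep the container serving for a few seconds so routing
              # changes can propagate before the stop signal is sent
              command: ["sleep", "5"]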

External services

For external services, we face the same problems, but the propagation delay can be even longer. The Ingress or LoadBalancer controllers have to sync with external systems, which are often distributed and need some time to commit new rules and confirm that the new targets are healthy (e.g. an AWS ALB in ip mode).

Kubernetes has a feature to solve this issue: readiness gates. They are an optional way for the loadbalancer controller to say “until I set this status condition to true, don’t mark the pod ready”. Keeping the pod unready blocks the rollout from progressing until the loadbalancer is aware of the new pod.
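
Mechanically, a readiness gate is just an extra condition type listed in the pod spec, which must be set to True on the pod’s status before the pod counts as Ready. The condition type below is purely illustrative; with the AWS Load Balancer controller, for instance, the gate is injected automatically when the namespace is labeled, as shown in the final manifest at the end of this post:

    spec:
      readinessGates:
      # the pod only becomes Ready once the controller owning this
      # condition sets it to True in the pod’s status.conditions
      - conditionType: example.com/lb-target-registered   # illustrative
      containers:
      - name: nginx
        image: nginx:1.28.0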

This is a very elegant solution but is not always practical as the feature is often off by default, and every loadbalancer controller has its own custom way of enabling it. This also heavily depends on the CNI and networking mode (does the LB reach out to pods directly?) and is not available everywhere.

If you are deploying your own application, this is likely not an issue, but if you are writing manifests that will be deployed by someone else, you cannot assume they will have readiness gates available, nor that they will configure them properly.

The other unsolved issue is pod termination. Surprisingly, there’s no readiness-gate equivalent for delaying a pod’s termination until it has been unregistered from the load-balancer. Currently, the only workaround is to wait a reasonable amount of time before stopping the container, which is usually done with the preStop.sleep hook trick.

tl;dr

If you want your rollouts to stop dropping calls, you should:

  • Turn on readiness gates and configure the LB to send traffic directly to the pods (as opposed to the underlying instances) when possible.

  • Set a preStop hook sleeping 30 seconds, or bake this delay into your application.

  • Increase terminationGracePeriodSeconds to be the sum of the preStop duration and how long it takes for your application to stop.

  • Set minReadySeconds: 30 in your Deployment.

On AWS with the AWS Load Balancer controller (the out-of-tree one), the final resources would look like:

apiVersion: v1
kind: Namespace
metadata:
  name: nginx
  labels:
    # turn on readiness gate injection, this is LB-controller-specific.
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  # wait for new pods to get in the LB pool before continuing the rollout
  minReadySeconds: 30
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.28.0
        ports:
        - containerPort: 80
        readinessProbe:
          httpGet:
            path: /
            port: 80
        # wait for pods to be out of LB pool before stopping
        lifecycle:
          preStop:
            sleep:
              seconds: 30
      # compensate for the preStop hook causing slower terminations
      # (this is a pod-level field, not a container field)
      terminationGracePeriodSeconds: 60
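
The Deployment above only covers the pod side: for the readiness gates to be injected, the controller must also be sending traffic to the pods directly (ip target type). As a sketch, a matching Service could look like the following; the annotations are the AWS Load Balancer controller’s documented ones, but double-check them against your controller version:

apiVersion: v1
kind: Service
metadata:
  name: nginx
  annotations:
    # let the out-of-tree AWS Load Balancer controller manage this LB
    service.beta.kubernetes.io/aws-load-balancer-type: external
    # register pod IPs directly instead of the underlying instances
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: nginx
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80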