Kubernetes rollouts are not seamless by default
===============================================

Most Kubernetes users assume that adding a readiness probe and setting
``spec.strategy.type: RollingUpdate`` is enough to achieve seamless pod
rollouts. This is not true. This post explains why, and how to avoid
dropping calls during rollouts.

:ref:`"I don't want to read the details, show me the conclusion" <tl;dr>`

Anatomy of a rollout
--------------------

Let's say we have a simple ``Deployment``/``Service`` setup, looking like:

.. code:: yaml

   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: nginx
   spec:
     replicas: 1
     strategy:
       # We make sure that the new pod is started and ready before
       # terminating the old one.
       type: RollingUpdate
       rollingUpdate:
         maxUnavailable: 0
         maxSurge: 1
     template:
       metadata:
         labels:
           app.kubernetes.io/name: nginx
       spec:
         containers:
           - name: nginx
             image: nginx:1.28.0
             ports:
               - containerPort: 80
             # Minimal readiness probe making sure that the server is running
             readinessProbe:
               httpGet:
                 path: /
                 port: 80
   ---
   apiVersion: v1
   kind: Service
   metadata:
     name: nginx
   spec:
     type: LoadBalancer
     selector:
       app.kubernetes.io/name: nginx
     ports:
       - protocol: TCP
         port: 80
         targetPort: 80

Everything is working nicely: the pod is ready and requests are served.
We decide to update the nginx image to ``nginx:1.29.1``.

1. The Deployment controller wakes up and creates a new ReplicaSet using
   the ``nginx:1.29.1`` image.
2. A new pod is created, and soon scheduled to a node.
3. The kubelet creates everything the pod needs and starts the container.
4. After a few seconds, the kubelet starts running the readiness checks.
5. As soon as a check succeeds, the kubelet marks the pod ready.

The pod turning ready wakes up two different controllers:

- The Deployment controller sees that the new pod is ready and marks the
  old pod as "to be terminated". This wakes up the kubelet, which kicks
  off the pod termination sequence.
- The endpoint controller creates a new endpoint for the service,
  containing the pod's IP address. This wakes up the loadbalancer
  controllers, which reconfigure the LB to add the new pod as a
  destination.

Wait, is this a race!?
----------------------

Yes it is. The rollout will happily progress while the LB controller is
still updating the loadbalancer backend configuration. If we're not
careful, we will start deleting the old pod while traffic is not yet
being sent to the new one.

To add insult to injury, we also have a similar race when terminating
the previous pod. As soon as the pod is marked as "to-be-deleted":

- its endpoint is marked unready and its IP is removed from the LB
  backends;
- the kubelet starts terminating the pod, sending the ``STOPSIGNAL`` to
  its containers.

So during termination, nothing makes sure that the pod stops receiving
new requests before it is shut down.

Both the creation and termination issues affect internal traffic (e.g.
``ClusterIP``/``NodePort`` services) as well as external traffic
(``LoadBalancer`` services, or ``Ingress``/``HTTPRoute``).

Internal traffic
----------------

For internal traffic, the CNI, ``kube-proxy``, or both are responsible
for updating the rules that send traffic to the right node/pod. Before
Kubernetes 1.28, ``kube-proxy`` refreshed all rules on every sync, which
led most kube distributions to increase the period between syncs to
mitigate the performance impact. This often caused 10 to 30 second
delays between the moment an endpoint was updated and the moment traffic
was actually sent to the right place. Modern kube versions are much more
efficient at refreshing rules, and it's recommended to bring the minimum
sync period back down to 1 second.
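What this looks like in practice depends on how your distribution ships
``kube-proxy``; here is a minimal sketch, assuming an ``iptables``-mode
``kube-proxy`` configured through a ``KubeProxyConfiguration`` file (on
kubeadm-based clusters this typically lives in the ``kube-proxy``
ConfigMap in ``kube-system``):

.. code:: yaml

   # Excerpt of a KubeProxyConfiguration; only the relevant fields shown.
   apiVersion: kubeproxy.config.k8s.io/v1alpha1
   kind: KubeProxyConfiguration
   mode: iptables
   iptables:
     # Lower bound between two rule syncs. On modern versions (1.28+),
     # partial rule restores make a 1s value cheap enough.
     minSyncPeriod: 1s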
iptables-free setups (e.g. Cilium_), although not instant, usually offer
much faster propagation.

.. _Cilium: https://docs.cilium.io/en/stable/overview/intro/

Regardless of the network setup and the propagation speed, we must
assume that endpoint propagation is not instant and take this into
account before shutting down a pod or progressing a rollout. The most
common mitigations are:

- Delaying the container shutdown to make sure no one will still try to
  send traffic to it. Since 1.29, this is supported natively by
  Kubernetes with the ``sleep`` field of the ``preStop`` hook. On
  earlier versions, this was done either via a ``preStop.exec`` hook
  (sketched below, after the tl;dr) or a ``time.Sleep()`` before the
  application handles the ``STOPSIGNAL`` and stops taking new requests.
- Slowing down the Deployment or StatefulSet rollout by adding a
  non-zero ``minReadySeconds``. In most cases, a 5-second delay should
  be enough for internal traffic, but if you don't control where your
  application will be deployed, a more conservative 30-second delay
  might be more appropriate.

External services
-----------------

For external services, we face the same problems, but the propagation
delay can be even longer. The Ingress or LoadBalancer controllers have
to sync with external systems, which are often distributed and need some
time to commit new rules and confirm that the new targets are healthy
(e.g. an AWS ALB in ``ip`` mode).

Kubernetes has a feature to solve this issue: `readiness gates`_. They
are an optional way for the loadbalancer controller to say "until I set
this ``status.condition`` to true, don't mark the pod ready". Keeping
the pod unready blocks the rollout from progressing until the
loadbalancer is aware of the pod.

.. _readiness gates: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate

This is a very elegant solution, but it is not always practical: the
feature is often off by default, and every loadbalancer controller has
its own custom way of enabling it. It also heavily depends on the CNI
and the networking mode (does the LB reach the pods directly?) and is
not available everywhere. If you are deploying your own application,
this is likely not an issue, but if you are writing manifests that will
be deployed by someone else, you cannot assume they will have readiness
gates available, nor that they will configure them properly.

The other unsolved issue is pod termination. Very surprisingly, `there's
no readiness gate equivalent for delaying a pod termination`__ until it
is unregistered from the load-balancer. Currently, the only workaround
is to wait a reasonable amount of time before stopping the container.
This is usually done with the ``preStop.sleep`` hook trick.

__ https://github.com/kubernetes/kubernetes/issues/89263

.. _tl;dr:

tl;dr
-----

If you want your rollouts to stop dropping calls, you should:

- turn on `readiness gates`_ and configure the LB to send traffic
  directly to the pods (as opposed to the underlying instances) when
  possible;
- set a ``preStop`` hook sleeping 30 seconds, or bake this delay into
  your application;
- increase ``terminationGracePeriodSeconds`` to the sum of the
  ``preStop`` duration and how long your application takes to stop;
- set ``minReadySeconds: 30`` in your ``Deployment``.

.. warning::

   These mitigations will make your pod rollouts slower. Make sure that
   your CI/CD timeouts are set accordingly if you wait for the rollout
   to succeed.
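On clusters older than 1.29, where the native ``sleep`` action is not
available, the same delay is usually implemented with an ``exec`` hook.
A minimal sketch, assuming the container image ships a ``sleep`` binary
(true for ``nginx``, not for distroless or ``scratch`` images):

.. code:: yaml

   # Container spec fragment: pre-1.29 equivalent of preStop.sleep.
   lifecycle:
     preStop:
       exec:
         # Requires a `sleep` binary inside the image.
         command: ["sleep", "30"]

If the image has no ``sleep`` binary at all, the delay has to be baked
into the application's ``STOPSIGNAL`` handler instead.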
On AWS with the AWS Load Balancer Controller (the out-of-tree one), the
final resources would look like:

.. code:: yaml

   apiVersion: v1
   kind: Namespace
   metadata:
     name: nginx
     labels:
       # Turn on readiness gate injection; this is LB-controller-specific.
       elbv2.k8s.aws/pod-readiness-gate-inject: enabled
   ---
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: nginx
   spec:
     # Wait for new pods to be in the LB pool before continuing the rollout.
     minReadySeconds: 30
     replicas: 1
     strategy:
       type: RollingUpdate
       rollingUpdate:
         maxUnavailable: 0
         maxSurge: 1
     template:
       metadata:
         labels:
           app.kubernetes.io/name: nginx
       spec:
         containers:
           - name: nginx
             image: nginx:1.28.0
             ports:
               - containerPort: 80
             readinessProbe:
               httpGet:
                 path: /
                 port: 80
             # Wait for the pod to be out of the LB pool before stopping it.
             lifecycle:
               preStop:
                 sleep:
                   seconds: 30
         # Compensate for the preStop hook making terminations slower.
         terminationGracePeriodSeconds: 60
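The ``Service`` also needs the LB to target the pods directly for the
readiness gates to be useful. A sketch of the matching ``Service``,
assuming the AWS Load Balancer Controller provisions an NLB in ``ip``
target mode (double-check the annotation names against your controller
version):

.. code:: yaml

   apiVersion: v1
   kind: Service
   metadata:
     name: nginx
     annotations:
       # Hand the Service over to the AWS Load Balancer Controller
       # instead of the legacy in-tree controller...
       service.beta.kubernetes.io/aws-load-balancer-type: external
       # ...and register pod IPs as targets (instead of instances), so
       # that target health reflects individual pods.
       service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
   spec:
     type: LoadBalancer
     selector:
       app.kubernetes.io/name: nginx
     ports:
       - protocol: TCP
         port: 80
         targetPort: 80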