Kubernetes rollouts are not seamless by default
===============================================

Most Kubernetes users assume that adding a readiness probe and setting
``spec.strategy.type: RollingUpdate`` is enough to achieve seamless pod
rollouts. This is not true. This post explains why, and how to avoid
dropping calls during rollouts.

:ref:`"I don't want to read the details, show me the conclusion" <tl;dr>`

Anatomy of a rollout
--------------------

Let's say we have a simple ``Deployment``/``Service`` setup, looking like:

.. code:: yaml

   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: nginx
   spec:
     replicas: 1
     strategy:
       # We make sure that the new pod is started and ready before
       # terminating the old one.
       type: RollingUpdate
       rollingUpdate:
         maxUnavailable: 0
         maxSurge: 1
     template:
       metadata:
         labels:
           app.kubernetes.io/name: nginx
       spec:
         containers:
           - name: nginx
             image: nginx:1.28.0
             ports:
               - containerPort: 80
             # Minimal readiness probe making sure that the server is running
             readinessProbe:
               httpGet:
                 path: /
                 port: 80
   ---
   apiVersion: v1
   kind: Service
   metadata:
     name: nginx
   spec:
     type: LoadBalancer
     selector:
       app.kubernetes.io/name: nginx
     ports:
       - protocol: TCP
         port: 80
         targetPort: 80

Everything is working nicely: the pod is ready and requests are served.
We decide to update the nginx image to ``nginx:1.29.1``.

1. The Deployment controller wakes up and creates a new ReplicaSet using
   the ``nginx:1.29.1`` image.
2. A new pod is created, and soon scheduled to a node.
3. The kubelet creates everything the pod needs and starts the container.
4. After a few seconds, the kubelet starts running the readiness checks.
5. As soon as a check succeeds, the kubelet marks the pod ready.

The pod turning ready wakes up two different controllers:

- The Deployment controller sees that the new pod is ready and marks the
  old pod as "to be terminated". This wakes up the kubelet, which kicks
  off the pod termination sequence.
- The endpoint controller creates a new endpoint for the service,
  containing the pod's IP address. This wakes up the loadbalancer
  controllers, which reconfigure the LB to add the new pod as a
  destination.

Wait, is this a race!?
----------------------

Yes it is. The rollout will happily progress while the LB controller is
still updating the loadbalancer backend configuration. If we're not
careful, we will start deleting the old pod while traffic is not yet
being sent to the new one.

To add insult to injury, we also have a similar race when terminating
the previous pod. As soon as the pod is marked as "to-be-deleted":

- its endpoint is marked unready and its IP is removed from the LB
  backends;
- the kubelet starts terminating the pod, sending the ``STOPSIGNAL`` to
  its containers.

So during termination, nothing makes sure that the pod stops receiving
new requests before it is shut down.

Both the creation and termination issues affect internal traffic (e.g.
``ClusterIP``/``NodePort`` services) as well as external traffic
(``LoadBalancer`` services, or ``Ingress``/``HTTPRoute``).

Internal traffic
----------------

For internal traffic, the CNI, ``kube-proxy``, or both are responsible
for updating the rules that send traffic to the right node/pod. Before
Kubernetes 1.28, ``kube-proxy`` refreshed all rules on every sync, which
led most kube distributions to increase the period between syncs to
mitigate the performance impact. This often caused 10 to 30 second
delays between the moment an endpoint was updated and the moment traffic
was actually sent to the right place. Modern kube versions are much more
efficient at refreshing rules, and it's recommended to bring the minimum
sync period back down to 1 second.
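What this looks like in practice depends on how your distribution ships
``kube-proxy``; here is a minimal sketch, assuming an ``iptables``-mode
``kube-proxy`` configured through a ``KubeProxyConfiguration`` file (on
kubeadm-based clusters this typically lives in the ``kube-proxy``
ConfigMap in ``kube-system``):

.. code:: yaml

   # Excerpt of a KubeProxyConfiguration; only the relevant fields shown.
   apiVersion: kubeproxy.config.k8s.io/v1alpha1
   kind: KubeProxyConfiguration
   mode: iptables
   iptables:
     # Lower bound between two rule syncs. On modern versions (1.28+),
     # partial rule restores make a 1s value cheap enough.
     minSyncPeriod: 1s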
iptables-free setups (e.g. Cilium_), although not instant, usually offer
much faster propagation.

.. _Cilium: https://docs.cilium.io/en/stable/overview/intro/

Regardless of the network setup and the propagation speed, we must
assume that endpoint propagation is not instant and take this into
account before shutting down a pod or progressing a rollout. The most
common mitigations are:

- Delaying the container shutdown to make sure no one will still try to
  send traffic to it. Since 1.29, this is supported natively by
  Kubernetes with the ``sleep`` field of the ``preStop`` hook. On
  earlier versions, this was done either via a ``preStop.exec`` hook
  (sketched below, after the tl;dr) or a ``time.Sleep()`` before the
  application handles the ``STOPSIGNAL`` and stops taking new requests.
- Slowing down the Deployment or StatefulSet rollout by adding a
  non-zero ``minReadySeconds``. In most cases, a 5-second delay should
  be enough for internal traffic, but if you don't control where your
  application will be deployed, a more conservative 30-second delay
  might be more appropriate.

External services
-----------------

For external services, we face the same problems, but the propagation
delay can be even longer. The Ingress or LoadBalancer controllers have
to sync with external systems, which are often distributed and need some
time to commit new rules and confirm that the new targets are healthy
(e.g. an AWS ALB in ``ip`` mode).

Kubernetes has a feature to solve this issue: `readiness gates`_. They
are an optional way for the loadbalancer controller to say "until I set
this ``status.condition`` to true, don't mark the pod ready". Keeping
the pod unready blocks the rollout from progressing until the
loadbalancer is aware of the pod.

.. _readiness gates: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate

This is a very elegant solution, but it is not always practical: the
feature is often off by default, and every loadbalancer controller has
its own custom way of enabling it. It also heavily depends on the CNI
and the networking mode (does the LB reach the pods directly?) and is
not available everywhere. If you are deploying your own application,
this is likely not an issue, but if you are writing manifests that will
be deployed by someone else, you cannot assume they will have readiness
gates available, nor that they will configure them properly.

The other unsolved issue is pod termination. Very surprisingly, `there's
no readiness gate equivalent for delaying a pod termination`__ until it
is unregistered from the load-balancer. Currently, the only workaround
is to wait a reasonable amount of time before stopping the container.
This is usually done with the ``preStop.sleep`` hook trick.

__ https://github.com/kubernetes/kubernetes/issues/89263

.. _tl;dr:

tl;dr
-----

If you want your rollouts to stop dropping calls, you should:

- turn on `readiness gates`_ and configure the LB to send traffic
  directly to the pods (as opposed to the underlying instances) when
  possible;
- set a ``preStop`` hook sleeping 30 seconds, or bake this delay into
  your application;
- increase ``terminationGracePeriodSeconds`` to the sum of the
  ``preStop`` duration and how long your application takes to stop;
- set ``minReadySeconds: 30`` in your ``Deployment``.

.. warning::

   These mitigations will make your pod rollouts slower. Make sure that
   your CI/CD timeouts are set accordingly if you wait for the rollout
   to succeed.
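On clusters older than 1.29, where the native ``sleep`` action is not
available, the same delay is usually implemented with an ``exec`` hook.
A minimal sketch, assuming the container image ships a ``sleep`` binary
(true for ``nginx``, not for distroless or ``scratch`` images):

.. code:: yaml

   # Container spec fragment: pre-1.29 equivalent of preStop.sleep.
   lifecycle:
     preStop:
       exec:
         # Requires a `sleep` binary inside the image.
         command: ["sleep", "30"]

If the image has no ``sleep`` binary at all, the delay has to be baked
into the application's ``STOPSIGNAL`` handler instead.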
On AWS with the AWS Load Balancer Controller (the out-of-tree one), the
final resources would look like:

.. code:: yaml

   apiVersion: v1
   kind: Namespace
   metadata:
     name: nginx
     labels:
       # Turn on readiness gate injection; this is LB-controller-specific.
       elbv2.k8s.aws/pod-readiness-gate-inject: enabled
   ---
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: nginx
   spec:
     # Wait for new pods to be in the LB pool before continuing the rollout.
     minReadySeconds: 30
     replicas: 1
     strategy:
       type: RollingUpdate
       rollingUpdate:
         maxUnavailable: 0
         maxSurge: 1
     template:
       metadata:
         labels:
           app.kubernetes.io/name: nginx
       spec:
         containers:
           - name: nginx
             image: nginx:1.28.0
             ports:
               - containerPort: 80
             readinessProbe:
               httpGet:
                 path: /
                 port: 80
             # Wait for the pod to be out of the LB pool before stopping it.
             lifecycle:
               preStop:
                 sleep:
                   seconds: 30
         # Compensate for the preStop hook making terminations slower.
         terminationGracePeriodSeconds: 60
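The ``Service`` also needs the LB to target the pods directly for the
readiness gates to be useful. A sketch of the matching ``Service``,
assuming the AWS Load Balancer Controller provisions an NLB in ``ip``
target mode (double-check the annotation names against your controller
version):

.. code:: yaml

   apiVersion: v1
   kind: Service
   metadata:
     name: nginx
     annotations:
       # Hand the Service over to the AWS Load Balancer Controller
       # instead of the legacy in-tree controller...
       service.beta.kubernetes.io/aws-load-balancer-type: external
       # ...and register pod IPs as targets (instead of instances), so
       # that target health reflects individual pods.
       service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
   spec:
     type: LoadBalancer
     selector:
       app.kubernetes.io/name: nginx
     ports:
       - protocol: TCP
         port: 80
         targetPort: 80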