Hello all,

Apologies if this is not the right forum. We have been observing some behavior when draining our Kubernetes cluster nodes, and we were hoping someone could give us some pointers on what might be happening.
In this specific example, we have a Kubernetes cluster with 5 Kafka brokers and the client nodes of our application. The Kafka topics are configured with a replication factor of 3 and a minimum in-sync replicas (ISR) setting of 2. Because of how the cluster is set up, only one Kafka broker pod runs on each node, so a drain operation affects a single broker pod at a time.

Our client application's producer targets the topic leader when sending messages, and during the restart of the Kafka brokers we noticed a pattern we would like to understand better. While draining the Kafka nodes, we saw an increase in message latency on our client application. Looking at the errors, we found multiple network errors as well as NOT_COORDINATOR and NOT_LEADER_OR_FOLLOWER responses, together with multiple retries and reconnects occurring at the same time, which appear to be a consequence of those errors.

From what we have seen, the minimum ISR of 2 is always respected, and we have not detected any message loss. Our questions are:

- Given that we are targeting the topic leader nodes, are these errors expected while the cluster nodes are being drained?
- Do you have an idea of how long a cluster usually takes to recover from a drain operation, or does it depend on the number of topics and messages being processed?
- Which operations occurring in the cluster could explain the latency increase and the error responses we are observing?

In addition to these questions, is there anything we should be doing differently on the client side to avoid this period of higher latency?

Thank you.
Best regards,
Tiago Ricardo
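For reference, the following is a minimal Python sketch of the availability arithmetic we are relying on in the setup described above; it is a simplified model, not a description of how Kafka actually places replicas. The broker IDs, the round-robin placement in `assign`, and the helper names `drain` and `simulate` are all hypothetical. It also assumes that every replica is in sync before each drain and that a drained broker has fully rejoined before the next one is taken down.

```python
from dataclasses import dataclass

REPLICATION_FACTOR = 3      # from our topic configuration
MIN_INSYNC_REPLICAS = 2     # from our topic configuration
BROKERS = [0, 1, 2, 3, 4]   # five brokers; the IDs are hypothetical


@dataclass
class Partition:
    replicas: list  # broker IDs hosting a replica of this partition
    leader: int     # broker ID currently acting as leader


def assign(partition_id):
    """Hypothetical round-robin placement: RF consecutive brokers, first one leads."""
    replicas = [
        BROKERS[(partition_id + i) % len(BROKERS)]
        for i in range(REPLICATION_FACTOR)
    ]
    return Partition(replicas=replicas, leader=replicas[0])


def drain(partition, broker):
    """Return (remaining in-sync replicas, whether the leader must move)
    while `broker` is offline. Assumes all replicas were in sync beforehand."""
    isr = [b for b in partition.replicas if b != broker]
    return isr, partition.leader == broker


def simulate(num_partitions=5):
    """Drain each broker in turn and summarise the effect per drained broker."""
    partitions = [assign(p) for p in range(num_partitions)]
    report = {}
    for broker in BROKERS:
        results = [drain(p, broker) for p in partitions]
        report[broker] = {
            # smallest ISR size seen on any partition during this drain
            "min_isr_seen": min(len(isr) for isr, _ in results),
            # number of partitions that need a new leader during this drain
            "leader_moves": sum(1 for _, moved in results if moved),
        }
    return report


if __name__ == "__main__":
    for broker, summary in simulate().items():
        print(broker, summary)
```

Under these assumptions, draining one broker at a time never takes any partition below the minimum ISR of 2, which matches what we see. Each drain does, however, remove the leader of at least one partition in this model, so a leader change is needed every time.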