Questions on high latencies observed during a rolling restart process in Kafka

Tiago Ricardo Tue, 03 Dec 2024 09:35:39 -0800

Hello all,

Apologies if this is not the right forum, but we have been observing some
unexpected behavior when draining our Kubernetes cluster nodes, and we were
hoping that someone could give us some pointers on what could be happening
here.


In this specific example, we have a Kubernetes cluster with 5 Kafka brokers
and client nodes of our application. The Kafka topics are configured with a
replication factor of 3 and a minimum in-sync replicas (min ISR) value of 2.
Our cluster setup ensures that only one Kafka broker pod is present on each
node, so when we perform a drain operation we are targeting just a single
broker pod at a time.
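
To make the numbers above concrete, here is a minimal sketch of the
availability arithmetic for a single partition. It is illustrative only and
not part of our tooling: the function name is hypothetical, and it assumes
the producer uses acks=all (the only case in which the minimum ISR setting
gates writes) and that every replica still running is in sync.

```python
def writable_with_acks_all(replication_factor: int, min_isr: int,
                           replicas_down: int) -> bool:
    """Return True if a partition can still accept acks=all writes.

    Hypothetical helper: assumes every replica that is not down is
    currently in sync, which is the best case during a drain.
    """
    if not 0 <= replicas_down <= replication_factor:
        raise ValueError("replicas_down must be between 0 and replication_factor")
    in_sync = replication_factor - replicas_down
    return in_sync >= min_isr


# Replication factor 3, min ISR 2, one broker drained at a time.
print(writable_with_acks_all(3, 2, 1))
# Same topic settings, two replicas of the partition unavailable at once.
print(writable_with_acks_all(3, 2, 2))
```

Under these assumptions, draining one broker at a time leaves every
partition writable, which matches what we see: higher latency and errors,
but no rejected writes and no message loss.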

Our client application producer is targeting the topic leader when sending
the messages, and specifically during the restart process of the Kafka
brokers we noticed a pattern we would like to understand better.

While draining the Kafka nodes, we noticed an increase in message latency
on our client application. Looking at the errors, we could see multiple
network errors as well as NOT_COORDINATOR and NOT_LEADER_OR_FOLLOWER
responses, combined with multiple retries and reconnects occurring at the
same time, which appear to be a consequence of those errors.

From our observations, the minimum ISR of 2 we defined is always respected,
and we have not detected any message loss, so our questions on this topic
are:

- Taking into account that we are targeting the topic leader nodes, are the
errors we have been observing expected while the cluster nodes are being
drained?
- Do you have an idea of how long a cluster usually takes to recover from a
drain operation, or does it depend on the number of topics and messages
being processed?
- What operations are occurring in the cluster that could explain the
latency increase and the error responses we are observing?

In addition to these questions, is there anything we should be doing
differently on the client side that could avoid this period of greater
latency?

Thank you.
Best regards,

*Tiago Ricardo*
