Hi,

As I understand it, the behavior also depends on the 'acks' value set by the producer. You can dig into the 'acks' parameter for the full details, but let's assume 'acks=1' here. With this setting, the producer considers a message successfully sent once it receives an acknowledgment from the partition leader.
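For concreteness, here is a rough sketch of how 'acks', together with the retry and timeout settings discussed below, maps onto a producer config. Judging from the error format in your log snippet, I'm assuming the confluent-kafka Python client (librdkafka); the broker address, topic name, and the exact values are only illustrative, not your actual setup:

from confluent_kafka import Producer

conf = {
    "bootstrap.servers": "broker-0:9092",  # illustrative address
    "acks": 1,                      # wait only for the partition leader's acknowledgment
    "retries": 1,                   # resend attempts after a failed request (the knob you increased)
    "request.timeout.ms": 30000,    # how long one produce request may wait for an ack
    "message.timeout.ms": 120000,   # total time to keep retrying before failing the message (2 minutes)
}

producer = Producer(conf)

def on_delivery(err, msg):
    # err is set (e.g. _MSG_TIMED_OUT) once message.timeout.ms expires without an ack
    if err is not None:
        print(f"Failed to send message: {err}. Message content: {msg.value()}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}")

producer.produce("test-topic", value=b"Message #39", on_delivery=on_delivery)
producer.flush()

With settings along these lines, a message still being retried when the new leader is elected (i.e. within message.timeout.ms) gets delivered, while a message whose timeout expires first is reported as _MSG_TIMED_OUT, as in your log.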
If the leader broker fails after the producer sends messages but before acknowledgments are received, the producer retries according to its configuration. If no acknowledgment arrives within the request and delivery timeout periods, the producer marks the message batch as failed. Assuming a delivery timeout of 2 minutes, the producer waits that long, flags the messages as failed, and moves on to sending the next batch. Since in your test a new partition leader becomes available well within that 2-minute window, the retried messages should succeed. This is the expected behavior.

Thanks & Regards,
Hemanth

On Thu, Nov 2, 2023 at 2:36 PM Slavo Valko <s.va...@partner.samsung.com> wrote:
> Hello,
>
> I've been working on testing Kafka availability in Zookeeper mode during
> single-broker shutdowns within a Kubernetes setup, and I've come across
> something interesting that I wanted to run by you.
>
> We've noticed that when a partition leader goes down, messages are not
> delivered until a new leader is elected. While we expect this to happen,
> there's a part of it that's still not adding up. The downtime, or the time
> it takes for the new leader to step up, is about a minute. But what's
> interesting is that when we increase the producer-side retries to just 1,
> all of our messages get delivered successfully.
>
> This seems a bit odd to me because, theoretically, increasing the retries
> should only resend the message, giving it an extra 10 seconds before it
> times out, while the first few messages should still have around 40 seconds
> to wait for the new leader. So, this behavior is a bit of a head-scratcher.
>
> I was wondering if you might have any insights or could point me in the
> right direction to understand why this is happening. Any help or guidance
> would be greatly appreciated.
>
> Below is a log snippet from one of the test runs:
>
> Partition leader shutdown and observation of a new partition leader being
> automatically elected, in a setup with 1 partition and a replication factor of 3.
> Thu Oct 26 21:59:51 CEST 2023 - Partition leader has been shut down
> Thu Oct 26 22:01:06 CEST 2023 - Change in partition leader detected
>
> Error messages from the producer client during the window when the partition
> leader is unelected:
> Failed to send message:
> KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}.
> Message content: Message #39 from 2023-10-26 19:59:52
>
> Failed to send message:
> KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}.
> Message content: Message #40 from 2023-10-26 19:59:53
> …
> Failed to send message:
> KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}.
> Message content: Message #97 from 2023-10-26 20:00:50
>
> Failed to send message:
> KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}.
> Message content: Message #98 from 2023-10-26 20:00:51
>
> The container clocks are a little out of sync, but both unavailability
> windows match to around one minute.
>
> Thanks a lot for your time, and looking forward to hearing from you.

--
Thanks & Regards,
Hemanth Savasere