[ https://issues.apache.org/jira/browse/KAFKA-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048206#comment-15048206 ]
Xing Huang commented on KAFKA-2960:
-----------------------------------

We use Kafka 0.8.2. The produce request timeout is the default value, which is 10000ms.

At 39:07, the controller performed a preferred leader election and sent LeaderAndIsrRequests to the replicas. Just one or two seconds later, it found that the broker hosting the new leader had failed, so it ran another leader election and sent a second batch of LeaderAndIsrRequests.

At 40:11, the related replicas processed the first LeaderAndIsrRequest. At 40:12, they processed the second LeaderAndIsrRequest. So I think the original leader went through a leader -> follower -> leader change in just two seconds.

> DelayedProduce may cause message lose during repeatly leader change
> -------------------------------------------------------------------
>
>                 Key: KAFKA-2960
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2960
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.9.0.0
>            Reporter: Xing Huang
>             Fix For: 0.9.1.0
>
>
> related to #KAFKA-1148
> When a leader replica becomes a follower and then the leader again, it may
> truncate its log while it is a follower. The second time it becomes leader,
> its ISR may shrink, and if new messages are appended at that moment, the
> DelayedProduce created during its first term as leader may be satisfied. The
> client then receives a response with no error, although the messages were
> actually lost.
> We simulated this scenario, which proved that the message loss can happen.
> According to our broker logs and client logs, it also appears to be the cause
> of a data loss we experienced recently.
> I think we should check the leader epoch when sending a response, or satisfy
> the DelayedProduce on leader change as described in #KAFKA-1148.
> We may also need a new error code to inform the producer about this error.
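
The scenario in the description, and the proposed leader epoch check, can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Kafka's actual code: the classes Partition and DelayedProduce are simplified stand-ins, the error name LEADER_EPOCH_CHANGED is hypothetical (the description only says a new error code may be needed), and the model assumes that a replica truncates its log to the high watermark when it becomes a follower and that the ISR shrinks to the leader alone after the second transition.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

NONE = "NONE"  # "no error" response
LEADER_EPOCH_CHANGED = "LEADER_EPOCH_CHANGED"  # hypothetical new error code


@dataclass
class Partition:
    """Toy model of one replica's view of a partition (not Kafka's class)."""
    leader_epoch: int = 0
    is_leader: bool = True
    log: List[str] = field(default_factory=list)
    high_watermark: int = 0  # number of committed messages

    def append(self, message: str) -> int:
        """Append a message and return the log end offset after it."""
        self.log.append(message)
        return len(self.log)

    def become_follower(self, epoch: int) -> None:
        """Assumption: a new follower truncates its log to the high watermark."""
        self.leader_epoch = epoch
        self.is_leader = False
        del self.log[self.high_watermark:]

    def become_leader(self, epoch: int) -> None:
        self.leader_epoch = epoch
        self.is_leader = True


@dataclass
class DelayedProduce:
    """Pending produce response that waits for the high watermark to advance."""
    required_offset: int
    leader_epoch: int  # epoch at the time the request was accepted

    def try_complete(self, partition: Partition, check_epoch: bool) -> Optional[str]:
        # Proposed fix: fail the request if leadership changed in the meantime.
        if check_epoch and partition.leader_epoch != self.leader_epoch:
            return LEADER_EPOCH_CHANGED
        # Offset-only check: satisfied once the high watermark passes the offset.
        if partition.is_leader and partition.high_watermark >= self.required_offset:
            return NONE
        return None  # still pending


def simulate(check_epoch: bool) -> Tuple[Optional[str], bool]:
    """Return (response for m2, whether m2 is still in the log)."""
    p = Partition()
    p.append("m1")
    p.high_watermark = 1                      # m1 is committed
    required = p.append("m2")                 # m2 is appended but not yet replicated
    delayed = DelayedProduce(required, p.leader_epoch)

    p.become_follower(epoch=1)                # leader -> follower: m2 is truncated
    p.become_leader(epoch=2)                  # follower -> leader again

    p.append("m3")                            # a new message reuses m2's offset
    p.high_watermark = len(p.log)             # assumed ISR = {leader}: HW jumps to log end

    return delayed.try_complete(p, check_epoch), "m2" in p.log


if __name__ == "__main__":
    print("offset-only check:", simulate(check_epoch=False))
    print("with epoch check: ", simulate(check_epoch=True))
```

In this model the offset-only check answers the producer of m2 with no error even though m2 is no longer in the log, while the epoch check returns the hypothetical error so the producer can retry.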