[ 
https://issues.apache.org/jira/browse/KAFKA-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048115#comment-15048115
 ] 

Guozhang Wang commented on KAFKA-2960:
--------------------------------------

[~becket_qin] I think the issue here is that when a broker becomes a follower, 
its delayed produce request is NOT cleaned up and answered with an error code to 
the producer, but still sits in the purgatory. If the producer timeout is long 
enough that the request does not time out, it can be incorrectly satisfied when 
the follower becomes leader again. For example, let's say we have two brokers:

Broker 1 is the leader with its current LEO 50, HW 50.
Broker 2 is the follower with current LEO 50, HW 50.

1) broker 1 gets one message "a" with ack = all and appends it at offset 51, so 
its LEO is now 51.
2) this produce request sits in the purgatory, waiting for broker 2 to replicate.
3) broker 1 becomes the follower and broker 2 becomes the leader.
4) broker 1 sees that broker 2's HW is 50, so it truncates message "a" and 
resets its LEO to 50.
5) broker 1 becomes the leader again and broker 2 becomes the follower again.
6) broker 1 gets another message "b" and appends it at offset 51.
7) broker 2 replicates message "b".
8) broker 1 now advances its HW to 51, which satisfies both produce requests, 
for "a" and "b", based on the offset, even though "a" was actually truncated.
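
The sequence above can be replayed with a small toy model. This is only a 
sketch in Python, not Kafka's code: the Broker and DelayedProduce classes and 
their method names are hypothetical, and the offsets follow the numbering used 
in the steps above (LEO 50, next message at offset 51).

```python
from dataclasses import dataclass, field


@dataclass
class DelayedProduce:
    message: str
    required_offset: int  # satisfied once the leader's HW reaches this offset


@dataclass
class Broker:
    leo: int = 50
    hw: int = 50
    log: dict = field(default_factory=dict)        # offset -> message
    purgatory: list = field(default_factory=list)  # pending DelayedProduce

    def append_as_leader(self, message):
        # Steps 1 and 6: append locally and park the request in the purgatory.
        self.leo += 1
        self.log[self.leo] = message
        self.purgatory.append(DelayedProduce(message, self.leo))

    def become_follower(self, leader_hw):
        # Step 4: truncate to the new leader's HW.
        # The purgatory is deliberately left untouched; that is the bug.
        self.log = {o: m for o, m in self.log.items() if o <= leader_hw}
        self.leo = leader_hw

    def replicate_from(self, leader):
        # Step 7: the follower copies whatever it is missing.
        self.log.update({o: m for o, m in leader.log.items() if o > self.leo})
        self.leo = leader.leo

    def advance_hw(self, new_hw):
        # Step 8: complete every pending request whose offset is now covered.
        self.hw = new_hw
        done = [r for r in self.purgatory if r.required_offset <= new_hw]
        self.purgatory = [r for r in self.purgatory if r.required_offset > new_hw]
        return done


b1, b2 = Broker(), Broker()
b1.append_as_leader("a")        # steps 1-2
b1.become_follower(b2.hw)       # steps 3-4
b1.append_as_leader("b")        # steps 5-6
b2.replicate_from(b1)           # step 7
acked = [r.message for r in b1.advance_hw(51)]  # step 8

lost = [m for m in acked if m not in b1.log.values()]
print("acked:", acked, "lost:", lost)
```

In this model both requests are acknowledged, while only "b" is still in the 
log at offset 51.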

[~peoplebike] I'm wondering, in your case, what produce request timeout value 
triggered this issue? And how long did it take for the original leader to 
transition to follower and back to leader again?


> DelayedProduce may cause message lose during repeatly leader change
> -------------------------------------------------------------------
>
>                 Key: KAFKA-2960
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2960
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.9.0.0
>            Reporter: Xing Huang
>             Fix For: 0.9.1.0
>
>
> related to #KAFKA-1148
> When a leader replica becomes a follower and then leader again, it may 
> truncate its log while it is a follower. But the second time it becomes 
> leader, its ISR may shrink, and if new messages are appended at this moment, 
> the DelayedProduce generated when it was leader the first time may be 
> satisfied, and the client will receive a response with no error. But the 
> messages were actually lost. 
> We simulated this scenario, which proved that the message loss can happen. It 
> also seems to be the reason for a data loss we experienced recently, according 
> to broker logs and client logs.
> I think we should check the leader epoch when sending a response, or satisfy 
> DelayedProduce on leader change as described in #KAFKA-1148.
> And we may need a new error code to inform the producer about this error. 
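
A minimal sketch of the leader epoch check suggested in the description, 
assuming each delayed request records the epoch under which it was created. 
The complete() helper, the epoch numbers, and the STALE_LEADER_EPOCH error 
name are all hypothetical, not an existing Kafka API or error code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DelayedProduce:
    message: str
    required_offset: int
    leader_epoch: int  # epoch in which this broker accepted the request


def complete(request: DelayedProduce, current_epoch: int, hw: int) -> Optional[str]:
    """Return an error name, "NONE" for success, or None to keep waiting."""
    if request.leader_epoch != current_epoch:
        # Leadership changed since the append, so the data may have been
        # truncated. The error name is a placeholder for a possible new code.
        return "STALE_LEADER_EPOCH"
    if hw >= request.required_offset:
        return "NONE"
    return None


# Assumed epochs: broker 1 led in epoch 0, broker 2 in epoch 1,
# and broker 1 leads again in epoch 2.
stale = DelayedProduce("a", 51, leader_epoch=0)
fresh = DelayedProduce("b", 51, leader_epoch=2)
results = [complete(r, current_epoch=2, hw=51) for r in (stale, fresh)]
print(results)
```

With this check the request from the earlier epoch gets an error instead of a 
success response, while the request from the current epoch completes normally.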



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
