[ https://issues.apache.org/jira/browse/KAFKA-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404404#comment-15404404 ]
Jun Rao commented on KAFKA-1211:
--------------------------------

[~fpj], for #1 and #2, there are a couple of scenarios that this proposal can fix.

a. The first one is what is described in the original jira. Currently, when the follower does truncation, it can truncate some previously committed messages. If the follower immediately becomes the leader after truncation, we will lose some previously committed messages. This is rare, but if it happens, it's bad. The proposal fixes this case by preventing the follower from unnecessarily truncating previously committed messages.

b. Another issue is that a portion of the log in different replicas may not match in certain failure cases. This can happen when unclean leader election is enabled. However, even if unclean leader election is disabled, mismatches can still happen when messages are lost due to a power outage (see KAFKA-3919). The proposal fixes this issue by making sure that the replicas are always identical.

For #3, the controller increases the leader generation every time the leader changes. The latest leader generation is persisted in ZK.

For #4, putting the leader generation in the segment file name is another possibility. One concern I had with that approach is dealing with compacted topics. After compaction, it's possible that only a small number of messages (or even just a single message) is left in a particular generation. Putting the generation id in the segment file name would force us to have tiny segments, which is not ideal.

Regarding the race condition: even with a separate checkpoint file, we can avoid it. The sequencing will be (1) the broker receives a LeaderAndIsrRequest to become leader; (2) the broker stops fetching from the current leader; (3) no new writes can happen to this replica at this point; (4) the broker writes the new leader generation and log end offset to the checkpoint file; (5) the broker marks the replica as leader; (6) new writes can happen to this replica from now on.

For #5, it depends on who becomes the new leader in that case. If A becomes the new leader (generation 3), then B and C will remove m1 and m2 and copy m3 and m4 over from A. If B becomes the new leader, A will remove m3 and m4 and copy m1 and m2 over from B. In either case, the replicas will be identical.

> Hold the produce request with ack > 1 in purgatory until replicas' HW is
> larger than the produce offset
> --------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1211
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1211
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Guozhang Wang
>            Assignee: Guozhang Wang
>             Fix For: 0.11.0.0
>
>
> Today, during leader failover, there is a window of weakness when the
> followers truncate their data before fetching from the new leader, i.e., the
> number of in-sync replicas is just 1. If the leader also fails during this
> time, produce requests with ack > 1 that have already been responded to can
> still be lost. To avoid this scenario, we would prefer to hold the produce
> request in purgatory until the replicas' HW is larger than the produce
> offset, instead of just their log end offsets.
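To make the race-free sequencing (1)-(6) in the comment above concrete, here is a minimal Java sketch of the ordering: stop fetching, persist the checkpoint, then mark the replica as leader. The class and method names (BecomeLeaderSketch, ReplicaFetcher) and the one-line "generation offset" checkpoint format are illustrative assumptions, not Kafka's actual internals.

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative sketch only; names and file format are assumptions, not Kafka code.
public class BecomeLeaderSketch {

    /** Minimal stand-in for the fetcher thread pulling from the old leader. */
    public interface ReplicaFetcher {
        void stop();
    }

    private final Path checkpointFile;
    private volatile boolean isLeader = false;

    public BecomeLeaderSketch(Path checkpointFile) {
        this.checkpointFile = checkpointFile;
    }

    // (1) Called when a LeaderAndIsrRequest asks this broker to become leader.
    public synchronized void becomeLeader(int newLeaderGeneration,
                                          long logEndOffset,
                                          ReplicaFetcher fetcher) throws IOException {
        // (2) Stop fetching from the current leader; (3) after this point no new
        // writes can reach this replica, so the log end offset is stable.
        fetcher.stop();

        // (4) Persist the new leader generation and the current log end offset
        // before the replica can accept any writes as leader.
        String entry = newLeaderGeneration + " " + logEndOffset + System.lineSeparator();
        Files.write(checkpointFile, entry.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);

        // (5) Only now mark the replica as leader; (6) writes accepted from here
        // on are always covered by the checkpointed generation.
        isLeader = true;
    }

    public boolean isLeader() {
        return isLeader;
    }
}
{code}

Because the checkpoint is flushed before the replica is marked as leader, any message written under the new generation is already covered by a checkpoint entry, which is what closes the race described above.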
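Similarly, the A/B/C example in #5 can be illustrated with a small sketch of follower truncation driven by a per-generation checkpoint. The idea of asking the new leader for the end offset of the follower's latest generation, and the NavigableMap checkpoint layout, are assumptions made for illustration, not the exact protocol.

{code:java}
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative sketch only; not Kafka code.
public class GenerationTruncationSketch {

    // Local checkpoint: leader generation -> start offset of that generation.
    private final NavigableMap<Integer, Long> generationStartOffsets = new TreeMap<>();
    private long logEndOffset;

    public GenerationTruncationSketch(long logEndOffset) {
        this.logEndOffset = logEndOffset;
    }

    public void recordGeneration(int generation, long startOffset) {
        generationStartOffsets.put(generation, startOffset);
    }

    /**
     * Truncate using the new leader's answer to "where does my latest
     * generation end on your log?", then resume fetching from that offset.
     */
    public long truncateTo(long leaderEndOffsetForLatestGeneration) {
        // Keep only what the new leader also has for this generation; e.g. when
        // B becomes leader, A drops m3/m4 here and later re-fetches m1/m2.
        final long target = Math.min(leaderEndOffsetForLatestGeneration, logEndOffset);
        logEndOffset = target;
        // Generations that start at or beyond the truncation point existed only
        // on this replica; drop them from the local checkpoint as well.
        generationStartOffsets.values().removeIf(start -> start >= target);
        return target;
    }

    public long logEndOffset() {
        return logEndOffset;
    }
}
{code}

With this rule the divergent suffix is removed before fetching resumes, so the replicas end up identical regardless of whether A or B wins the election, matching the outcome described in #5.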