[ https://issues.apache.org/jira/browse/KAFKA-10487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Gustafson resolved KAFKA-10487.
-------------------------------------
    Resolution: Fixed

> Fix edge case in Raft truncation protocol
> -----------------------------------------
>
>                 Key: KAFKA-10487
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10487
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>            Priority: Major
>
> Consider the following scenario:
> Three replicas: A, B, and C. In epoch=1, replica A is the leader and writes
> up to offset 10. The leader then fails with the high watermark at offset 8.
> Replica B had caught up to offset 10 while replica C was at offset 8. Suppose
> that C is elected with epoch=2 and immediately writes records up to offset
> 10. However, it also fails before these records become committed, and replica
> B gets elected with epoch=3 and writes records up to offset 12. The epoch
> cache on each replica will look like the following:
>
> Replica A:
> (epoch=1, start_offset=0)
>
> Replica B:
> (epoch=1, start_offset=0)
> (epoch=3, start_offset=10)
>
> Replica C:
> (epoch=1, start_offset=0)
> (epoch=2, start_offset=8)
>
> Suppose C comes back online. It will attempt to fetch at offset=10 with
> last_fetched_epoch=2. The leader B will detect log divergence and will return
> truncation_offset=10. Replica C will truncate to offset 10 (a no-op), retry
> the same fetch, and remain stuck.
>
> To fix this, I see two options:
>
> Option 1: In the case that the truncation offset equals the fetch offset, we
> can instead return the previous epoch end offset. In this example, we would
> return truncation_offset=0. The downside is that this causes unnecessary
> truncation.
>
> Option 2: Rather than returning only the truncation offset, we can have the
> leader return both the previous "diverging" epoch and its end offset. In this
> example, B would return diverging_epoch=1, end_offset=10. Replica C would
> then know to truncate to offset 8.
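
The scenario can be replayed with a small sketch. This is an illustration of the idea behind Option 2 only, not Kafka's implementation: the function names, the list-of-tuples epoch cache, and the (0, 0) fallback are all assumptions made for the example. It assumes C fetches with its last local epoch, which is 2.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: an epoch cache is a list of
# (epoch, start_offset) pairs sorted by epoch.
EpochCache = List[Tuple[int, int]]


def end_offset_for_epoch(cache: EpochCache, log_end: int,
                         epoch: int) -> Optional[Tuple[int, int]]:
    """Return (e, end_offset) for the largest cached epoch e <= epoch, or None."""
    found = None
    for i, (e, _start) in enumerate(cache):
        if e > epoch:
            break
        # An epoch ends where the next one starts, or at the log end.
        end = cache[i + 1][1] if i + 1 < len(cache) else log_end
        found = (e, end)
    return found


def leader_check_fetch(cache: EpochCache, log_end: int, fetch_offset: int,
                       last_fetched_epoch: int) -> Optional[Tuple[int, int]]:
    """Leader side: None if the fetch position is consistent,
    else (diverging_epoch, end_offset)."""
    found = end_offset_for_epoch(cache, log_end, last_fetched_epoch)
    if found is None:
        return (0, 0)  # assumption: nothing in common, so truncate everything
    epoch, end = found
    if epoch == last_fetched_epoch and fetch_offset <= end:
        return None
    return found


def follower_truncation_offset(cache: EpochCache, log_end: int,
                               diverging_epoch: int, leader_end: int) -> int:
    """Follower side: pick the truncation point from its own epoch cache."""
    found = end_offset_for_epoch(cache, log_end, diverging_epoch)
    if found is None:
        return 0
    local_epoch, local_end = found
    if local_epoch == diverging_epoch:
        return min(local_end, leader_end)
    return local_end


# Epoch caches and log end offsets of leader B and follower C from the scenario.
B_CACHE, B_LOG_END = [(1, 0), (3, 10)], 12
C_CACHE, C_LOG_END = [(1, 0), (2, 8)], 10

divergence = leader_check_fetch(B_CACHE, B_LOG_END,
                                fetch_offset=10, last_fetched_epoch=2)
# Offset-only response: truncating to the returned offset changes nothing on C.
offset_only_target = min(divergence[1], C_LOG_END)
# Epoch-aware response: C compares against its own end offset for that epoch.
epoch_aware_target = follower_truncation_offset(C_CACHE, C_LOG_END, *divergence)
print(divergence, offset_only_target, epoch_aware_target)  # (1, 10) 10 8
```

With only the offset, C truncates to 10 and makes no progress. With the diverging epoch included, C sees that its own epoch 1 ended at offset 8 and truncates there, after which a fetch at offset=8 with last_fetched_epoch=1 is accepted by B.
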
>
> The second option is what was initially specified in the Raft proposal, but
> we changed it during the discussion because we were not thinking of this case
> and we thought the response could be simplified. My inclination is to restore
> the originally specified truncation logic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)