[ 
https://issues.apache.org/jira/browse/KAFKA-10487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-10487.
-------------------------------------
    Resolution: Fixed

> Fix edge case in Raft truncation protocol
> -----------------------------------------
>
>                 Key: KAFKA-10487
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10487
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>            Priority: Major
>
> Consider the following scenario:
> Three replicas: A, B, and C. In epoch=1, replica A is the leader and writes 
> up to offset 10. The leader then fails with the high watermark at offset 8. 
> Replica B had caught up to offset 10 while replica C was at offset 8. Suppose 
> that C is elected with epoch=2 and immediately writes records up to offset 
> 10. However, it also fails before these records become committed, and replica 
> B is then elected with epoch=3 and writes records up to offset 12. The epoch 
> cache on each replica will look like the following:
> Replica A:
>   (epoch=1, start_offset=0)
> Replica B:
>   (epoch=1, start_offset=0)
>   (epoch=3, start_offset=10)
> Replica C:
>   (epoch=1, start_offset=0)
>   (epoch=2, start_offset=8)
> Suppose C comes back online. It will attempt to fetch at offset=10 with 
> last_fetched_epoch=2. The leader B will detect log divergence and will return 
> truncation_offset=10. Replica C will truncate to offset 10 (a no-op), retry 
> the same fetch, and remain stuck in this loop.
> To fix this, I see two options:
> Option 1: In the case that the truncation offset equals the fetch offset, we 
> can instead return the previous epoch end offset. In this example, we would 
> return truncation_offset=0. The downside is that this causes unnecessary 
> truncation.
> Option 2: Rather than returning only the truncation offset, we can have the 
> leader return both the previous "diverging" epoch and its end offset. In this 
> example, B would return diverging_epoch=1, end_offset=10. Replica C would 
> then know to truncate to offset 8, where epoch 1 ends in its own log.
> The second option is what was initially specified in the Raft proposal, but 
> we changed it during the discussion because we were not thinking of this case 
> and we thought the response could be simplified. My inclination is to restore 
> the originally specified truncation logic.
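
The scenario and Option 2 can be sketched with a toy model of the epoch caches. This is an illustrative sketch only, not Kafka's implementation: the function names (end_offset_for_epoch, leader_check, truncation_offset_only, follower_truncate) and the list-of-tuples cache layout are assumptions made for the example, and it assumes the leader looks up the largest cached epoch at or below the follower's last fetched epoch.

```python
# Toy model of the scenario; every name here is hypothetical.
# An epoch cache is a list of (epoch, start_offset) pairs in ascending order.

def end_offset_for_epoch(cache, log_end, epoch):
    """Return (largest cached epoch <= `epoch`, end offset of that epoch)."""
    result = (0, 0)  # nothing at or below the requested epoch
    for i, (e, start) in enumerate(cache):
        if e > epoch:
            break
        # An epoch ends where the next one starts, or at the log end.
        end = cache[i + 1][1] if i + 1 < len(cache) else log_end
        result = (e, end)
    return result

def leader_check(cache, log_end, fetch_offset, last_fetched_epoch):
    """Option 2 response: None if consistent, else (diverging_epoch, end_offset)."""
    epoch, end = end_offset_for_epoch(cache, log_end, last_fetched_epoch)
    if epoch == last_fetched_epoch and fetch_offset <= end:
        return None
    return (epoch, end)

def truncation_offset_only(cache, log_end, fetch_offset, last_fetched_epoch):
    """Simplified response: a single truncation offset, or None if consistent."""
    diverging = leader_check(cache, log_end, fetch_offset, last_fetched_epoch)
    return None if diverging is None else min(diverging[1], fetch_offset)

def follower_truncate(cache, log_end, diverging_epoch, diverging_end):
    """Truncate to the smaller of the leader's and the local end offset."""
    _, local_end = end_offset_for_epoch(cache, log_end, diverging_epoch)
    new_end = min(local_end, diverging_end)
    new_cache = [(e, s) for e, s in cache if s < new_end]
    return new_cache, new_end

# Leader B and follower C as described in the issue.
B_CACHE, B_END = [(1, 0), (3, 10)], 12
C_CACHE, C_END = [(1, 0), (2, 8)], 10

# Simplified protocol: C is told to truncate to 10, which is already its log end.
stuck_offset = truncation_offset_only(B_CACHE, B_END, 10, 2)

# Option 2: C learns that epoch 1 ends at 10 on the leader and at 8 locally.
diverging = leader_check(B_CACHE, B_END, 10, 2)
new_cache, new_end = follower_truncate(C_CACHE, C_END, *diverging)

# The retried fetch (offset 8, last fetched epoch 1) is now consistent.
retry = leader_check(B_CACHE, B_END, new_end, new_cache[-1][0])

print(stuck_offset, diverging, new_cache, new_end, retry)
```

Under these assumptions the simplified response yields truncation offset 10 (the no-op that leaves C stuck), while the Option 2 response (1, 10) lets C truncate to offset 8, after which the retried fetch no longer diverges.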



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
