Jason Gustafson created KAFKA-10487:
---------------------------------------

             Summary: Fix edge case in Raft truncation protocol
                 Key: KAFKA-10487
                 URL: https://issues.apache.org/jira/browse/KAFKA-10487
             Project: Kafka
          Issue Type: Sub-task
            Reporter: Jason Gustafson
            Assignee: Jason Gustafson


Consider the following scenario:

Three replicas: A, B, and C. In epoch=1, replica A is the leader and writes up 
to offset 10. The leader then fails with the high watermark at offset 8. 
Replica B had caught up to offset 10 while replica C was at offset 8. Suppose 
that C is elected with epoch=2 and immediately writes records up to offset 10. 
However, it also fails before these records become committed, and replica B is 
then elected with epoch=3 and writes records up to offset 12. The epoch cache 
on each replica will look like the following:

Replica A:
(epoch=1, start_offset=0)

Replica B:
(epoch=1, start_offset=0)
(epoch=3, start_offset=10)

Replica C:
(epoch=1, start_offset=0)
(epoch=2, start_offset=8)

Suppose C comes back online. It will attempt to fetch at offset=10
with last_fetched_epoch=2. The leader B will detect log divergence
and will return truncation_offset=10. Replica C will truncate to
offset 10 (a no-op) and retry the same fetch, so it never makes
progress.
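
The loop can be reproduced with a small model. This is only a sketch under assumptions, not the actual Kafka code: an epoch cache is a sorted list of (epoch, start_offset) pairs, and the helper names lookup_epoch and truncation_offset are hypothetical. The divergence check shown is one plausible reading of the truncation-offset-only protocol described above.

```python
def lookup_epoch(epoch_cache, log_end_offset, epoch):
    """Return (e, end_offset) for the largest cached epoch e <= epoch.

    Assumes the cache holds at least one entry with an epoch <= `epoch`.
    """
    found = None
    for i, (e, _start) in enumerate(epoch_cache):
        if e > epoch:
            break
        is_last = i + 1 == len(epoch_cache)
        # An epoch ends where the next one starts, or at the log end.
        end = log_end_offset if is_last else epoch_cache[i + 1][1]
        found = (e, end)
    return found


def truncation_offset(leader_cache, leader_end, fetch_offset, last_fetched_epoch):
    """Leader side: return a truncation offset on divergence, else None."""
    epoch, end = lookup_epoch(leader_cache, leader_end, last_fetched_epoch)
    if epoch != last_fetched_epoch or end < fetch_offset:
        return min(end, fetch_offset)
    return None


# Leader B and follower C from the scenario above.
b_cache, b_end = [(1, 0), (3, 10)], 12
c_end, c_last_epoch = 10, 2

for attempt in range(3):
    t = truncation_offset(b_cache, b_end, c_end, c_last_epoch)
    c_end = min(c_end, t)  # truncating to 10 changes nothing on C
    print(attempt, t, c_end)
```

Every attempt yields the same truncation offset, and C's log end offset and last fetched epoch never change, which is why the fetch repeats indefinitely.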

To fix this, I see two options:

Option 1: In the case that the truncation offset equals the fetch offset, we 
can instead return the start offset of the previous epoch. In this example, we 
would return truncation_offset=0. The downside is that this causes unnecessary 
truncation.

Option 2: Rather than returning only the truncation offset, we can have the 
leader return both the previous "diverging" epoch and its end offset. In this 
example, B would return diverging_epoch=1, end_offset=10. Replica C would then 
know to truncate to offset 8, which is where epoch 1 ends in its own log.
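
As a rough illustration of Option 2, the sketch below models how the follower could combine the leader's response with its own epoch cache. The cache representation (a sorted list of (epoch, start_offset) pairs) and the helper names lookup_epoch, diverging_epoch, and follower_truncate are assumptions made for this example and are not taken from the Kafka code.

```python
def lookup_epoch(epoch_cache, log_end_offset, epoch):
    """Return (e, end_offset) for the largest cached epoch e <= epoch.

    Assumes the cache holds at least one entry with an epoch <= `epoch`.
    """
    found = None
    for i, (e, _start) in enumerate(epoch_cache):
        if e > epoch:
            break
        is_last = i + 1 == len(epoch_cache)
        end = log_end_offset if is_last else epoch_cache[i + 1][1]
        found = (e, end)
    return found


def diverging_epoch(leader_cache, leader_end, fetch_offset, last_fetched_epoch):
    """Leader side: return (diverging_epoch, end_offset) on divergence, else None."""
    epoch, end = lookup_epoch(leader_cache, leader_end, last_fetched_epoch)
    if epoch != last_fetched_epoch or end < fetch_offset:
        return epoch, end
    return None


def follower_truncate(cache, log_end, div_epoch, div_end):
    """Follower side: truncate to the smaller of the two end offsets."""
    _, local_end = lookup_epoch(cache, log_end, div_epoch)
    new_end = min(local_end, div_end, log_end)
    # Drop epoch entries that start at or beyond the new log end.
    new_cache = [(e, s) for e, s in cache if s < new_end]
    return new_cache, new_end


# Leader B and follower C from the scenario above.
b_cache, b_end = [(1, 0), (3, 10)], 12
c_cache, c_end = [(1, 0), (2, 8)], 10

response = diverging_epoch(b_cache, b_end, c_end, c_cache[-1][0])
c_cache, c_end = follower_truncate(c_cache, c_end, *response)
retry = diverging_epoch(b_cache, b_end, c_end, c_cache[-1][0])
print(response, c_cache, c_end, retry)
```

In this model, C drops its uncommitted epoch=2 records, and the retried fetch at offset 8 with last_fetched_epoch=1 no longer reports divergence.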

The second option is what was initially specified in the Raft proposal, but we 
changed it during the discussion because we were not thinking of this case and 
we thought the response could be simplified. My inclination is to restore the 
originally specified truncation logic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
