Jason Gustafson created KAFKA-10706:
---------------------------------------

             Summary: Liveness bug in truncation protocol can lead to 
indefinite URP
                 Key: KAFKA-10706
                 URL: https://issues.apache.org/jira/browse/KAFKA-10706
             Project: Kafka
          Issue Type: Bug
            Reporter: Jason Gustafson
            Assignee: Jason Gustafson


We hit an interesting liveness condition in the truncation protocol. Broker A 
was leader in epoch 7, broker B was leader in epoch 8, and then broker A 
became leader again in epoch 9.

On broker A, we had the following state in the epoch cache:
{code}
epoch 4, start offset 3953
epoch 7, start offset 3983
epoch 9, start offset 3988
{code}

On broker B, we had the following:
{code}
epoch 4, start offset 3953
epoch 8, start offset 3983
{code}

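After A was elected in epoch 9, broker B sent epoch 8 (the latest epoch in 
its cache) in its OffsetsForLeaderEpoch request. Broker A correctly responded 
with epoch 7, the largest epoch in its cache not exceeding 8, ending at 
offset 3988. The log end offset on broker B was in fact 3983, so truncating 
to 3988 had no effect and the epoch 8 entry stayed in B's cache. Broker B 
then retried with epoch 8 again, and replication was stuck.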
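
The loop above can be reproduced with a small model. This is only a sketch 
in Python, not the broker code: the helper names (end_offset_for, truncate) 
are hypothetical, broker A's log end offset is assumed to be 3988, and the 
follower side is simplified to "truncate to the end offset the leader 
returned".

```python
from bisect import bisect_right

def end_offset_for(cache, log_end_offset, requested_epoch):
    """Leader side: largest cached epoch <= requested_epoch and its end offset.

    cache is a list of (epoch, start_offset) pairs sorted by epoch.
    """
    epochs = [epoch for epoch, _ in cache]
    i = bisect_right(epochs, requested_epoch) - 1
    if i < 0:
        return None
    # An epoch ends where the next one starts, or at the log end offset.
    end = cache[i + 1][1] if i + 1 < len(cache) else log_end_offset
    return epochs[i], end

def truncate(cache, log_end_offset, target):
    """Follower side (simplified): a target at or past the log end is a no-op."""
    if target >= log_end_offset:
        return cache, log_end_offset  # epoch cache is left untouched
    return [(e, s) for e, s in cache if s < target], target

a_cache = [(4, 3953), (7, 3983), (9, 3988)]
a_leo = 3988  # assumption: the issue does not state A's log end offset
b_cache = [(4, 3953), (8, 3983)]
b_leo = 3983

requests = []
for _ in range(3):
    requested = b_cache[-1][0]  # B always asks about its latest epoch
    requests.append(requested)
    leader_epoch, leader_end = end_offset_for(a_cache, a_leo, requested)
    b_cache, b_leo = truncate(b_cache, b_leo, leader_end)

print(requests)  # [8, 8, 8]: B never makes progress
```

Every round leaves B's cache and log end offset unchanged, so in this model 
the next request is identical to the previous one.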

When a replica becomes leader, it first inserts an entry into the epoch cache 
with the current log end offset. This ensures that it has a larger epoch in 
the cache than any epoch that could be requested by a valid replica. However, 
I think it is incorrect to keep using this epoch after becoming a follower 
again. It seems like we need symmetric logic after becoming a follower to 
remove this epoch entry.
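
One possible reading of that symmetric logic, again as a Python sketch 
rather than the actual fix: on becoming a follower, drop trailing epoch 
entries whose start offset is at or past the log end offset, since no 
records were written in those epochs. The helper names and the exact removal 
condition are assumptions for illustration, as is A's log end offset of 3988.

```python
from bisect import bisect_right

def end_offset_for(cache, log_end_offset, requested_epoch):
    """Leader side: largest cached epoch <= requested_epoch and its end offset."""
    epochs = [epoch for epoch, _ in cache]
    i = bisect_right(epochs, requested_epoch) - 1
    if i < 0:
        return None
    end = cache[i + 1][1] if i + 1 < len(cache) else log_end_offset
    return epochs[i], end

def drop_empty_trailing_epochs(cache, log_end_offset):
    """Hypothetical cleanup when a replica becomes a follower.

    Removes entries from the end of the cache that start at or beyond the
    log end offset, i.e. epochs in which this replica wrote no records.
    """
    cache = list(cache)
    while cache and cache[-1][1] >= log_end_offset:
        cache.pop()
    return cache

a_cache = [(4, 3953), (7, 3983), (9, 3988)]
a_leo = 3988  # assumption: the issue does not state A's log end offset
b_cache = [(4, 3953), (8, 3983)]
b_leo = 3983

b_cache = drop_empty_trailing_epochs(b_cache, b_leo)
requested = b_cache[-1][0]
response = end_offset_for(a_cache, a_leo, requested)

print(requested, response)  # 4 (4, 3983)
```

With the empty epoch 8 entry gone, B asks about epoch 4, and A answers with 
epoch 4 ending at offset 3983. That matches B's log end offset, so in this 
model B has nothing to truncate and can begin fetching from 3983.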



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
