Jason Gustafson created KAFKA-10706:
---------------------------------------
Summary: Liveness bug in truncation protocol can lead to indefinite URP
Key: KAFKA-10706
URL: https://issues.apache.org/jira/browse/KAFKA-10706
Project: Kafka
Issue Type: Bug
Reporter: Jason Gustafson
Assignee: Jason Gustafson

We hit an interesting liveness condition in the truncation protocol. Broker A was leader in epoch 7, broker B was leader in epoch 8, and then broker A became leader again in epoch 9. On broker A, we had the following state in the epoch cache:

{code}
epoch 4, start offset 3953
epoch 7, start offset 3983
epoch 9, start offset 3988
{code}

On broker B, we had the following:

{code}
epoch 4, start offset 3953
epoch 8, start offset 3983
{code}

After A was elected leader in epoch 9, broker B sent epoch 8 in OffsetsForLeaderEpoch. Broker A correctly responded with epoch 7 ending at offset 3988. The end offset on broker B was in fact 3983, so this truncation had no effect. Broker B then retried with epoch 8 again, and replication was stuck.

When a replica becomes leader, it first inserts an entry into the epoch cache with the current log end offset. This ensures that it has a larger epoch in the cache than any epoch that could be requested by a valid replica. However, I think it is incorrect to turn around and use this epoch when becoming a follower. It seems like we need symmetric logic after becoming a follower to remove this epoch entry.
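
The sequence described above can be reproduced with a small model. The following Python sketch is not Kafka code. It assumes a simplified truncation round in which the follower requests its latest cached epoch, truncates only if the leader's end offset is below its own log end offset, and is finished once the leader answers with the requested epoch. The function names (leader_end_offset, truncation_round, drop_empty_latest_epoch) and the leader's log end offset of 3988 are assumptions made for illustration, and the last helper is only one possible reading of the "symmetric logic" suggested in the report.

```python
from bisect import bisect_right

def leader_end_offset(cache, requested_epoch, log_end_offset):
    """Leader side: largest cached epoch <= requested_epoch and its end offset.

    cache is a list of (epoch, start_offset) pairs sorted by epoch.
    """
    epochs = [epoch for epoch, _ in cache]
    i = bisect_right(epochs, requested_epoch) - 1
    if i < 0:
        raise ValueError("no cached epoch <= requested epoch")
    # An epoch ends where the next cached epoch starts, or at the log end.
    end = cache[i + 1][1] if i + 1 < len(cache) else log_end_offset
    return epochs[i], end

def truncation_round(follower_cache, follower_leo, leader_cache, leader_leo):
    """One simplified follower round (hypothetical model, not Kafka's code)."""
    requested = follower_cache[-1][0]
    epoch, end = leader_end_offset(leader_cache, requested, leader_leo)
    target = min(end, follower_leo)
    if target < follower_leo:
        # Real truncation: shorten the log and drop epochs starting at or past it.
        follower_leo = target
        follower_cache[:] = [(e, s) for e, s in follower_cache if s < target]
    done = epoch == requested
    return requested, (epoch, end), follower_leo, done

def drop_empty_latest_epoch(cache, log_end_offset):
    """Assumed fix: drop a trailing epoch entry that covers no records."""
    if cache and cache[-1][1] >= log_end_offset:
        return cache[:-1]
    return list(cache)

broker_a = [(4, 3953), (7, 3983), (9, 3988)]  # leader in epoch 9
broker_b = [(4, 3953), (8, 3983)]             # follower, was leader in epoch 8
A_LEO = 3988  # assumption: the report does not give A's log end offset
B_LEO = 3983

# Without the fix, every round requests epoch 8 and changes nothing.
b_cache, b_leo = list(broker_b), B_LEO
rounds = []
for _ in range(3):
    requested, response, b_leo, done = truncation_round(b_cache, b_leo, broker_a, A_LEO)
    rounds.append((requested, response, done))
print(rounds)

# With the empty epoch 8 entry removed, B requests epoch 4 and completes.
fixed_cache = drop_empty_latest_epoch(broker_b, B_LEO)
fixed_round = truncation_round(fixed_cache, B_LEO, broker_a, A_LEO)
print(fixed_round)
```

In this model the unfixed follower asks for epoch 8 on every round, receives (7, 3988), and never changes its state, which matches the stuck replication described above. Once the epoch 8 entry is dropped, the follower asks for epoch 4, receives (4, 3983), and finishes without truncating anything.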