[ https://issues.apache.org/jira/browse/KAFKA-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ismael Juma updated KAFKA-6361:
-------------------------------
    Labels: reliability  (was: )

> Fast leader fail over can lead to log divergence between leader and follower
> -----------------------------------------------------------------------------
>
>                 Key: KAFKA-6361
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6361
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>              Labels: reliability
>
> We have observed an edge case in the replication failover logic which can cause a replica to permanently fall out of sync with the leader or, in the worst case, actually have localized divergence between logs. This occurs in spite of the improved truncation logic from KIP-101.
> Suppose we have brokers A and B. Initially A is the leader in epoch 1. It appends two batches: one in the range (0, 10) and the other in the range (11, 20). The first one successfully replicates to B, but the second one does not. In other words, the logs on the brokers look like this:
> {code}
> Broker A:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, 20], leader epoch: 1
> Broker B:
> 0: offsets [0, 10], leader epoch: 1
> {code}
> Broker A then has a zk session expiration and broker B is elected with epoch 2. It appends a new batch with offsets (11, n) to its local log. So we now have this:
> {code}
> Broker A:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, 20], leader epoch: 1
> Broker B:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets: [11, n], leader epoch: 2
> {code}
> Normally we expect broker A to truncate to offset 11 on becoming the follower, but before it is able to do so, broker B has its own zk session expiration and broker A again becomes leader, now with epoch 3. It then appends a new entry in the range (21, 30). The updated logs look like this:
> {code}
> Broker A:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, 20], leader epoch: 1
> 2: offsets: [21, 30], leader epoch: 3
> Broker B:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets: [11, n], leader epoch: 2
> {code}
> Now what happens next depends on the last offset of the batch appended in epoch 2. On becoming follower, broker B will send an OffsetForLeaderEpoch request to broker A with epoch 2. Broker A will respond that epoch 2 ends at offset 21. There are three cases:
> 1) n < 20: In this case, broker B will not do any truncation. It will begin fetching from offset n, which will ultimately cause an out of order offset error because broker A will return the full batch beginning from offset 11, which broker B will be unable to append.
> 2) n == 20: Again broker B does not truncate. It will fetch from offset 21 and everything will appear fine, though the logs have actually diverged.
> 3) n > 20: Broker B will attempt to truncate to offset 21. Since this is in the middle of the batch, it will truncate all the way to offset 10. It can then begin fetching from offset 11 and everything is fine.
> The case we have actually seen is the first one. The second one would likely go unnoticed in practice, and everything is fine in the third case. To work around the issue, we deleted the active segment on the replica, which allowed it to re-replicate consistently from the leader.
> I'm not sure of the best solution for this scenario. Maybe if the leader isn't aware of an epoch, it should always respond with {{UNDEFINED_EPOCH_OFFSET}} instead of using the offset of the next highest epoch. That would cause the follower to truncate using its high watermark. Or perhaps, instead of doing so, the follower could send another OffsetForLeaderEpoch request for the next earlier cached epoch and then truncate using that.
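
To make the three cases above concrete, here is a minimal sketch of the truncation decision as described in the report. It is written in Scala with illustrative names only and is not the actual Kafka code; it assumes the follower truncates to whichever is smaller, the leader's reported end offset for the requested epoch or the follower's own log end offset, and resumes fetching from that boundary.

{code}
// Hypothetical sketch of the follower-side truncation decision described in
// the report above (illustrative names, not the actual Kafka code).
object TruncationSketch {

  // leaderEpochEndOffset: end offset the leader returned for the requested
  //                       epoch (21 in the example above).
  // followerLogEndOffset: the follower's log end offset (n + 1 when the last
  //                       appended offset is n).
  def truncationOffset(leaderEpochEndOffset: Long, followerLogEndOffset: Long): Long =
    math.min(leaderEpochEndOffset, followerLogEndOffset)

  def main(args: Array[String]): Unit = {
    // Case 1: n = 15 (< 20)  -> boundary 16: no truncation; the next fetch lands
    //         inside the leader's [11, 20] batch and hits an out of order offset error.
    // Case 2: n = 20 (== 20) -> boundary 21: no truncation; the logs silently
    //         diverge over [11, 20].
    // Case 3: n = 30 (> 20)  -> boundary 21: mid-batch, so the whole epoch 2 batch
    //         is dropped and the follower refetches from offset 11.
    Seq(15L, 20L, 30L).foreach { n =>
      println(s"n = $n -> truncate/fetch boundary at offset ${truncationOffset(21L, n + 1)}")
    }
  }
}
{code}

With the leader reporting offset 21 as the end of epoch 2, only case 3 ever crosses that boundary, which is why cases 1 and 2 leave the diverged epoch 2 batch in place on the follower.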
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)