[ https://issues.apache.org/jira/browse/KAFKA-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anna Povzner reassigned KAFKA-6361:
-----------------------------------
Assignee: Anna Povzner (was: Jason Gustafson)
> Fast leader fail over can lead to log divergence between leader and follower
> ----------------------------------------------------------------------------
>
> Key: KAFKA-6361
> URL: https://issues.apache.org/jira/browse/KAFKA-6361
> Project: Kafka
> Issue Type: Bug
> Reporter: Jason Gustafson
> Assignee: Anna Povzner
> Priority: Major
> Labels: reliability
>
> We have observed an edge case in the replication failover logic which can
> cause a replica to permanently fall out of sync with the leader or, in the
> worst case, leave the leader's and follower's logs diverged over a localized
> range of offsets. This occurs in spite of the improved truncation logic from
> KIP-101.
> Suppose we have brokers A and B. Initially A is the leader in epoch 1. It
> appends two batches: one covering offsets [0, 10] and the other covering
> [11, 20]. The first one successfully replicates to B, but the second one
> does not. In other words, the logs on the brokers look like this:
> {code}
> Broker A:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, 20], leader epoch: 1
> Broker B:
> 0: offsets [0, 10], leader epoch: 1
> {code}
> Broker A then has a ZooKeeper session expiration and broker B is elected
> leader with epoch 2. It appends a new batch covering offsets [11, n] to its
> local log. So we now have this:
> {code}
> Broker A:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, 20], leader epoch: 1
> Broker B:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, n], leader epoch: 2
> {code}
> Normally we expect broker A to truncate to offset 11 on becoming the
> follower, but before it is able to do so, broker B has its own ZooKeeper
> session expiration and broker A again becomes leader, now with epoch 3. It
> then appends a new batch covering offsets [21, 30]. The updated logs look
> like this:
> {code}
> Broker A:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, 20], leader epoch: 1
> 2: offsets [21, 30], leader epoch: 3
> Broker B:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, n], leader epoch: 2
> {code}
> Now what happens next depends on the last offset of the batch appended in
> epoch 2. On becoming follower, broker B will send an OffsetForLeaderEpoch
> request to broker A with epoch 2. Broker A never saw epoch 2: its epoch
> cache contains only epochs 1 and 3, so it answers with the start offset of
> the next epoch above the one requested, i.e. that epoch 2 ends at offset 21
> (a sketch of this lookup follows the case list below). There are three
> cases:
> 1) n < 20: In this case, broker B will not do any truncation. It will begin
> fetching from offset n + 1, its log end offset. That offset falls inside
> broker A's batch [11, 20], and fetches are served at batch granularity, so
> broker A will return the full batch beginning at offset 11, which broker B
> will be unable to append, ultimately causing an out of order offset error.
> 2) n == 20: Again broker B does not truncate. It will fetch from offset 21
> and everything will appear fine, even though the logs have silently
> diverged over offsets [11, 20].
> 3) n > 20: Broker B will attempt to truncate to offset 21. Since that
> offset falls in the middle of its batch [11, n], it will truncate the whole
> batch, leaving its log ending at offset 10. It can then begin fetching from
> offset 11 and everything is fine.
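>
> To make that lookup concrete, here is a minimal, self-contained Java sketch
> of the epoch-end-offset lookup described above. The class and method names
> are hypothetical, not Kafka's actual implementation; all it assumes is that
> the leader keeps a sorted map from epoch to start offset:
> {code}
> import java.util.Map;
> import java.util.TreeMap;
>
> public class EpochEndOffsetSketch {
>     // Leader's epoch cache: epoch -> start offset of that epoch in the log
>     private final TreeMap<Integer, Long> epochStartOffsets = new TreeMap<>();
>     private final long logEndOffset;
>
>     EpochEndOffsetSketch(long logEndOffset) {
>         this.logEndOffset = logEndOffset;
>     }
>
>     void assign(int epoch, long startOffset) {
>         epochStartOffsets.put(epoch, startOffset);
>     }
>
>     // End offset for the requested epoch: the start offset of the next
>     // higher epoch the leader knows about, or the log end offset if none.
>     long endOffsetFor(int requestedEpoch) {
>         Map.Entry<Integer, Long> next = epochStartOffsets.higherEntry(requestedEpoch);
>         return next != null ? next.getValue() : logEndOffset;
>     }
>
>     public static void main(String[] args) {
>         // Broker A's cache after the failovers above: epoch 1 starting at
>         // offset 0, epoch 3 starting at offset 21, log end offset 31.
>         EpochEndOffsetSketch cache = new EpochEndOffsetSketch(31L);
>         cache.assign(1, 0L);
>         cache.assign(3, 21L);
>         // Broker B asks about epoch 2, which A never saw; A answers 21 (the
>         // start of epoch 3) with no idea how long B's epoch 2 actually was.
>         System.out.println(cache.endOffsetFor(2)); // prints 21
>     }
> }
> {code}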
> The case we have actually seen is the first one. The second one would
> likely go unnoticed in practice, and everything is fine in the third case.
> To work around the issue, we deleted the active segment on the replica,
> which allowed it to re-replicate consistently from the leader.
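>
> To see how the three outcomes fall out of the follower's truncation
> decision, here is a similarly hedged sketch that replays it for n = 15, 20,
> and 25. The batch-granularity truncation (a batch straddling the cut point
> is removed whole) is what sends case 3 back to offset 10; again, the names
> are illustrative rather than Kafka's:
> {code}
> import java.util.ArrayList;
> import java.util.List;
>
> public class FollowerTruncationSketch {
>     // A batch covering offsets [baseOffset, lastOffset] on the follower.
>     record Batch(long baseOffset, long lastOffset) {}
>
>     // Truncate so that no retained batch reaches `offset`; a batch that
>     // straddles the cut point is removed whole. Returns the next fetch offset.
>     static long truncateTo(List<Batch> log, long offset) {
>         log.removeIf(b -> b.lastOffset() >= offset);
>         return log.isEmpty() ? 0L : log.get(log.size() - 1).lastOffset() + 1;
>     }
>
>     public static void main(String[] args) {
>         for (long n : new long[] {15, 20, 25}) {
>             List<Batch> log = new ArrayList<>(List.of(
>                     new Batch(0, 10),    // epoch 1, replicated from broker A
>                     new Batch(11, n)));  // epoch 2, appended locally on broker B
>             long leaderEpochEnd = 21;    // broker A's answer for epoch 2
>             long logEndOffset = n + 1;   // broker B's log end offset
>             long fetchFrom = Math.min(leaderEpochEnd, logEndOffset);
>             if (leaderEpochEnd < logEndOffset)   // truncate only if the
>                 fetchFrom = truncateTo(log, leaderEpochEnd); // leader's epoch ended earlier
>             System.out.printf("n=%d -> fetch from offset %d%n", n, fetchFrom);
>         }
>     }
> }
> // n=15 -> fetch from offset 16 (case 1: lands inside A's batch [11, 20])
> // n=20 -> fetch from offset 21 (case 2: silent divergence)
> // n=25 -> fetch from offset 11 (case 3: truncated back through offset 10)
> {code}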
> I'm not sure of the best solution for this scenario. Maybe if the leader
> isn't aware of an epoch, it should always respond with
> {{UNDEFINED_EPOCH_OFFSET}} instead of using the start offset of the next
> higher epoch. That would cause the follower to truncate using its high
> watermark. Or perhaps, instead of doing so, the follower could send another
> OffsetForLeaderEpoch request for the next-lower epoch in its cache and then
> truncate using that.
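>
> For what it's worth, the first idea above would be a small change to the
> earlier lookup sketch; hedged the same way, and assuming
> {{UNDEFINED_EPOCH_OFFSET}} is the usual -1 sentinel:
> {code}
> static final long UNDEFINED_EPOCH_OFFSET = -1L;
>
> long endOffsetFor(int requestedEpoch) {
>     // If the leader never saw the requested epoch, refuse to guess rather
>     // than answering with the start offset of the next higher epoch; the
>     // follower would then fall back to truncating at its high watermark.
>     if (!epochStartOffsets.containsKey(requestedEpoch))
>         return UNDEFINED_EPOCH_OFFSET;
>     Map.Entry<Integer, Long> next = epochStartOffsets.higherEntry(requestedEpoch);
>     return next != null ? next.getValue() : logEndOffset;
> }
> {code}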
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)