[ https://issues.apache.org/jira/browse/KAFKA-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anna Povzner reassigned KAFKA-6361:
-----------------------------------
Assignee: Anna Povzner (was: Jason Gustafson)
> Fast leader fail over can lead to log divergence between leader and follower
> ----------------------------------------------------------------------------
>
> Key: KAFKA-6361
> URL: https://issues.apache.org/jira/browse/KAFKA-6361
> Project: Kafka
> Issue Type: Bug
> Reporter: Jason Gustafson
> Assignee: Anna Povzner
> Priority: Major
> Labels: reliability
>
> We have observed an edge case in the replication failover logic which can
> cause a replica to permanently fall out of sync with the leader or, in the
> worst case, leave the leader's and follower's logs diverged over a localized
> range of offsets. This occurs in spite of the improved truncation logic from
> KIP-101.
> Suppose we have brokers A and B. Initially A is the leader in epoch 1. It
> appends two batches: one covering offsets [0, 10] and the other covering
> [11, 20]. The first one successfully replicates to B, but the second one
> does not. In other words, the logs on the brokers look like this:
> {code}
> Broker A:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, 20], leader epoch: 1
> Broker B:
> 0: offsets [0, 10], leader epoch: 1
> {code}
> Broker A then has a ZooKeeper session expiration and broker B is elected
> leader with epoch 2. It appends a new batch covering offsets [11, n] to its
> local log. So we now have this:
> {code}
> Broker A:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, 20], leader epoch: 1
> Broker B:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, n], leader epoch: 2
> {code}
> Normally we expect broker A to truncate to offset 11 on becoming the
> follower, but before it is able to do so, broker B has its own ZooKeeper
> session expiration and broker A again becomes leader, now with epoch 3. It
> then appends a new batch covering offsets [21, 30]. The updated logs look
> like this:
> {code}
> Broker A:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, 20], leader epoch: 1
> 2: offsets [21, 30], leader epoch: 3
> Broker B:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, n], leader epoch: 2
> {code}
> Now what happens next depends on the last offset of the batch appended in
> epoch 2. On becoming follower, broker B will send an OffsetForLeaderEpoch
> request to broker A with epoch 2. Broker A never saw epoch 2: its epoch
> cache contains only epochs 1 and 3, so it answers with the start offset of
> the next epoch above the one requested, i.e. that epoch 2 ends at offset 21
> (a sketch of this lookup follows the case list below). There are three
> cases:
> 1) n < 20: In this case, broker B will not do any truncation. It will begin
> fetching from offset n + 1, its log end offset. That offset falls inside
> broker A's batch [11, 20], and fetches are served at batch granularity, so
> broker A will return the full batch beginning at offset 11, which broker B
> will be unable to append, ultimately causing an out of order offset error.
> 2) n == 20: Again broker B does not truncate. It will fetch from offset 21
> and everything will appear fine, even though the logs have silently
> diverged over offsets [11, 20].
> 3) n > 20: Broker B will attempt to truncate to offset 21. Since that
> offset falls in the middle of its batch [11, n], it will truncate the whole
> batch, leaving its log ending at offset 10. It can then begin fetching from
> offset 11 and everything is fine.
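>
> To make that lookup concrete, here is a minimal, self-contained Java sketch
> of the epoch-end-offset lookup described above. The class and method names
> are hypothetical, not Kafka's actual implementation; all it assumes is that
> the leader keeps a sorted map from epoch to start offset:
> {code}
> import java.util.Map;
> import java.util.TreeMap;
>
> public class EpochEndOffsetSketch {
>     // Leader's epoch cache: epoch -> start offset of that epoch in the log
>     private final TreeMap<Integer, Long> epochStartOffsets = new TreeMap<>();
>     private final long logEndOffset;
>
>     EpochEndOffsetSketch(long logEndOffset) {
>         this.logEndOffset = logEndOffset;
>     }
>
>     void assign(int epoch, long startOffset) {
>         epochStartOffsets.put(epoch, startOffset);
>     }
>
>     // End offset for the requested epoch: the start offset of the next
>     // higher epoch the leader knows about, or the log end offset if none.
>     long endOffsetFor(int requestedEpoch) {
>         Map.Entry<Integer, Long> next = epochStartOffsets.higherEntry(requestedEpoch);
>         return next != null ? next.getValue() : logEndOffset;
>     }
>
>     public static void main(String[] args) {
>         // Broker A's cache after the failovers above: epoch 1 starting at
>         // offset 0, epoch 3 starting at offset 21, log end offset 31.
>         EpochEndOffsetSketch cache = new EpochEndOffsetSketch(31L);
>         cache.assign(1, 0L);
>         cache.assign(3, 21L);
>         // Broker B asks about epoch 2, which A never saw; A answers 21 (the
>         // start of epoch 3) with no idea how long B's epoch 2 actually was.
>         System.out.println(cache.endOffsetFor(2)); // prints 21
>     }
> }
> {code}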
> The case we have actually seen is the first one. The second one would
> likely go unnoticed in practice, and everything is fine in the third case.
> To work around the issue, we deleted the active segment on the replica,
> which allowed it to re-replicate consistently from the leader.
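>
> To see how the three outcomes fall out of the follower's truncation
> decision, here is a similarly hedged sketch that replays it for n = 15, 20,
> and 25. The batch-granularity truncation (a batch straddling the cut point
> is removed whole) is what sends case 3 back to offset 10; again, the names
> are illustrative rather than Kafka's:
> {code}
> import java.util.ArrayList;
> import java.util.List;
>
> public class FollowerTruncationSketch {
>     // A batch covering offsets [baseOffset, lastOffset] on the follower.
>     record Batch(long baseOffset, long lastOffset) {}
>
>     // Truncate so that no retained batch reaches `offset`; a batch that
>     // straddles the cut point is removed whole. Returns the next fetch offset.
>     static long truncateTo(List<Batch> log, long offset) {
>         log.removeIf(b -> b.lastOffset() >= offset);
>         return log.isEmpty() ? 0L : log.get(log.size() - 1).lastOffset() + 1;
>     }
>
>     public static void main(String[] args) {
>         for (long n : new long[] {15, 20, 25}) {
>             List<Batch> log = new ArrayList<>(List.of(
>                     new Batch(0, 10),    // epoch 1, replicated from broker A
>                     new Batch(11, n)));  // epoch 2, appended locally on broker B
>             long leaderEpochEnd = 21;    // broker A's answer for epoch 2
>             long logEndOffset = n + 1;   // broker B's log end offset
>             long fetchFrom = Math.min(leaderEpochEnd, logEndOffset);
>             if (leaderEpochEnd < logEndOffset)   // truncate only if the
>                 fetchFrom = truncateTo(log, leaderEpochEnd); // leader's epoch ended earlier
>             System.out.printf("n=%d -> fetch from offset %d%n", n, fetchFrom);
>         }
>     }
> }
> // n=15 -> fetch from offset 16 (case 1: lands inside A's batch [11, 20])
> // n=20 -> fetch from offset 21 (case 2: silent divergence)
> // n=25 -> fetch from offset 11 (case 3: truncated back through offset 10)
> {code}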
> I'm not sure of the best solution for this scenario. Maybe if the leader
> isn't aware of an epoch, it should always respond with
> {{UNDEFINED_EPOCH_OFFSET}} instead of using the start offset of the next
> higher epoch. That would cause the follower to truncate using its high
> watermark. Or perhaps, instead of doing so, the follower could send another
> OffsetForLeaderEpoch request for the next-lower epoch in its cache and then
> truncate using that.
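>
> For what it's worth, the first idea above would be a small change to the
> earlier lookup sketch; hedged the same way, and assuming
> {{UNDEFINED_EPOCH_OFFSET}} is the usual -1 sentinel:
> {code}
> static final long UNDEFINED_EPOCH_OFFSET = -1L;
>
> long endOffsetFor(int requestedEpoch) {
>     // If the leader never saw the requested epoch, refuse to guess rather
>     // than answering with the start offset of the next higher epoch; the
>     // follower would then fall back to truncating at its high watermark.
>     if (!epochStartOffsets.containsKey(requestedEpoch))
>         return UNDEFINED_EPOCH_OFFSET;
>     Map.Entry<Integer, Long> next = epochStartOffsets.higherEntry(requestedEpoch);
>     return next != null ? next.getValue() : logEndOffset;
> }
> {code}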
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)