[ 
https://issues.apache.org/jira/browse/KAFKA-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang updated KAFKA-6880:
---------------------------------
    Description: 
Let's say we have three replicas for a partition: 1, 2, and 3.

In epoch 0, broker 1 is the leader and writes up to offset 50. Broker 2 
replicates up to offset 50, but broker 3 is a little behind at offset 40. The 
high watermark is 40. 

Suppose that broker 2 has a zk session expiration event, but fails to detect it 
or fails to reestablish a session (e.g. due to a bug like KAFKA-6879), and it 
continues fetching from broker 1.

For whatever reason, broker 3 is elected the leader for epoch 1 beginning at 
offset 40. Broker 1 detects the leader change and truncates its log to offset 
40. Some new data is appended up to offset 60, which is fully replicated to 
broker 1. Broker 2 continues fetching from broker 1 at offset 50, but gets 
NOT_LEADER_FOR_PARTITION errors, which are retriable, and hence broker 2 will 
retry.

After some time, broker 1 becomes the leader again for epoch 2. Broker 1 begins 
at offset 60. Broker 2 has not exhausted retries and is now able to fetch at 
offset 50 and append the last 10 records in order to catch up. However, because 
it did not observe the leader changes, it never saw the need to truncate its 
log. Hence offsets 40-49 still reflect the uncommitted changes from epoch 0. 
Neither KIP-101 nor KIP-279 can fix this because the tail of the log is correct.
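
The sequence above can be replayed with a small simulation. This is only a 
sketch: each log is modeled as a Python list indexed by offset whose values 
are the leader epoch under which the record was written, and the helpers 
`append`, `truncate`, and `replicate` are made up for illustration; they are 
not Kafka APIs.

```python
def append(log, epoch, up_to):
    """Append records written in `epoch` until the log end offset is `up_to`."""
    log.extend([epoch] * (up_to - len(log)))

def truncate(log, offset):
    """Drop every record at or after `offset`."""
    del log[offset:]

def replicate(follower, leader):
    """Fetch from the follower's log end offset onward, without truncating."""
    follower.extend(leader[len(follower):])

b1, b2, b3 = [], [], []

# Epoch 0: broker 1 leads and writes to 50; broker 2 keeps up, broker 3 lags at 40.
append(b1, 0, 50)
append(b2, 0, 50)
append(b3, 0, 40)

# Epoch 1: broker 3 leads from offset 40; broker 1 truncates, then catches up to 60.
truncate(b1, 40)
append(b3, 1, 60)
replicate(b1, b3)

# Epoch 2: broker 1 leads again. Broker 2 never saw the leader changes, so it
# simply resumes fetching at its own log end offset (50) and never truncates.
replicate(b2, b1)

divergent = [o for o in range(len(b1)) if b1[o] != b2[o]]
print(divergent[0], divergent[-1])  # 40 49
print(b1[50:] == b2[50:])           # True: the tail of the log matches
```

The matching tail is what makes the divergence invisible to a check that only 
looks at the end of the follower's log.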

The basic problem is that zombie replicas are not fenced properly by the leader 
epoch. We actually observed a sequence roughly like this after a broker had 
partially deadlocked from KAFKA-6879. We should add the leader epoch to fetch 
requests so that the leader can fence the zombie replicas.
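
A minimal sketch of the proposed fencing, assuming the fetch request carries 
the epoch that the follower believes is current. The `FetchRequest` and 
`Leader` classes, the `leader_epoch` field, and the `"FENCED"` / `"OK"` 
results are illustrative assumptions, not a protocol definition taken from 
this issue.

```python
from dataclasses import dataclass

@dataclass
class FetchRequest:
    replica_id: int
    fetch_offset: int
    leader_epoch: int  # assumed new field: epoch the follower believes is current

class Leader:
    def __init__(self, epoch):
        self.epoch = epoch

    def handle_fetch(self, request):
        # Reject any follower that has not observed the current leader epoch,
        # so it must refresh its state (and truncate) before fetching again.
        if request.leader_epoch != self.epoch:
            return "FENCED"
        return "OK"

leader = Leader(epoch=2)
zombie = FetchRequest(replica_id=2, fetch_offset=50, leader_epoch=0)
current = FetchRequest(replica_id=3, fetch_offset=60, leader_epoch=2)

print(leader.handle_fetch(zombie))   # FENCED
print(leader.handle_fetch(current))  # OK
```

In the scenario above, broker 2 would still be sending epoch 0 when broker 1 
is the leader for epoch 2, so its fetch at offset 50 would be rejected instead 
of silently succeeding.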

A related problem is that we currently allow such zombie replicas to be added 
to the ISR even if they are in an offline state. The problem is that the 
controller will never elect them, so being part of the ISR does not give the 
availability guarantee that is intended. This would also be fixed by verifying 
replica leader epoch in fetch requests.
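
The same epoch check could gate ISR membership. The sketch below is a 
simplification under stated assumptions: `maybe_expand_isr` is a hypothetical 
helper, and "caught up" is reduced to the fetch offset having reached the 
high watermark.

```python
def maybe_expand_isr(isr, replica_id, fetch_epoch, fetch_offset,
                     leader_epoch, high_watermark):
    """Return the ISR after considering one follower fetch (hypothetical helper)."""
    caught_up = fetch_offset >= high_watermark
    # A replica that reports a stale epoch is treated as a zombie and is
    # never added, even if its offset looks caught up.
    if fetch_epoch == leader_epoch and caught_up:
        return isr | {replica_id}
    return isr

isr = {1, 3}
print(maybe_expand_isr(isr, 2, fetch_epoch=0, fetch_offset=60,
                       leader_epoch=2, high_watermark=60))  # {1, 3}
print(maybe_expand_isr(isr, 2, fetch_epoch=2, fetch_offset=60,
                       leader_epoch=2, high_watermark=60))  # {1, 2, 3}
```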

  was:
Let's say we have three replicas for a partition: 1, 2, and 3.

In epoch 0, broker 1 is the leader and writes up to offset 50. Broker 2 
replicates up to offset 50, but broker 3 is a little behind at offset 40. The 
high watermark is 40. 

Suppose that broker 2 has a zk session expiration event, but fails to detect it 
or fails to reestablish a session (e.g. due to a bug like KAFKA-6879), and it 
continues fetching from broker 1.

For whatever reason, broker 3 is elected the leader for epoch 1 beginning at 
offset 40. Broker 1 detects the leader change and truncates its log to offset 
40. Some new data is appended up to offset 60, which is fully replicated to 
broker 2. Broker 2 continues fetching from broker 1 at offset 50, but gets 
NOT_LEADER_FOR_PARTITION errors.

After some time, broker 1 becomes the leader again for epoch 2. Broker 1 begins 
at offset 60. Broker 2 is now able to fetch at offset 50 and append the last 10 
records in order to catch up. However, because it did not observe the leader 
changes, it never saw the need to truncate its log. Hence offsets 40-49 still 
reflect the uncommitted changes from epoch 0. Neither KIP-101 nor KIP-279 can 
fix this because the tail of the log is correct.

The basic problem is that zombie replicas are not fenced properly by the leader 
epoch. We actually observed a sequence roughly like this after a broker had 
partially deadlocked from KAFKA-6879. We should add the leader epoch to fetch 
requests so that the leader can fence the zombie replicas.

A related problem is that we currently allow such zombie replicas to be added 
to the ISR even if they are in an offline state. The problem is that the 
controller will never elect them, so being part of the ISR does not give the 
availability guarantee that is intended. This would also be fixed by verifying 
replica leader epoch in fetch requests.


> Zombie replicas must be fenced
> ------------------------------
>
>                 Key: KAFKA-6880
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6880
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>            Priority: Major
>              Labels: needs-kip
>
> Let's say we have three replicas for a partition: 1, 2, and 3.
> In epoch 0, broker 1 is the leader and writes up to offset 50. Broker 2 
> replicates up to offset 50, but broker 3 is a little behind at offset 40. The 
> high watermark is 40. 
> Suppose that broker 2 has a zk session expiration event, but fails to detect 
> it or fails to reestablish a session (e.g. due to a bug like KAFKA-6879), and 
> it continues fetching from broker 1.
> For whatever reason, broker 3 is elected the leader for epoch 1 beginning at 
> offset 40. Broker 1 detects the leader change and truncates its log to offset 
> 40. Some new data is appended up to offset 60, which is fully replicated to 
> broker 1. Broker 2 continues fetching from broker 1 at offset 50, but gets 
> NOT_LEADER_FOR_PARTITION errors, which are retriable, and hence broker 2 will 
> retry.
> After some time, broker 1 becomes the leader again for epoch 2. Broker 1 
> begins at offset 60. Broker 2 has not exhausted retries and is now able to 
> fetch at offset 50 and append the last 10 records in order to catch up. 
> However, because it did not observe the leader changes, it never saw the 
> need to truncate its log. Hence offsets 40-49 still reflect the uncommitted 
> changes from epoch 0. Neither KIP-101 nor KIP-279 can fix this because the 
> tail of the log is correct.
> The basic problem is that zombie replicas are not fenced properly by the 
> leader epoch. We actually observed a sequence roughly like this after a 
> broker had partially deadlocked from KAFKA-6879. We should add the leader 
> epoch to fetch requests so that the leader can fence the zombie replicas.
> A related problem is that we currently allow such zombie replicas to be added 
> to the ISR even if they are in an offline state. The problem is that the 
> controller will never elect them, so being part of the ISR does not give the 
> availability guarantee that is intended. This would also be fixed by 
> verifying replica leader epoch in fetch requests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
