[jira] [Work started] (KAFKA-1097) Race condition while reassigning low throughput partition leads to incorrect ISR information in zookeeper

Neha Narkhede (JIRA) Fri, 01 Nov 2013 10:28:41 -0700

     [ https://issues.apache.org/jira/browse/KAFKA-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Work on KAFKA-1097 started by Neha Narkhede.

> Race condition while reassigning low throughput partition leads to incorrect 
> ISR information in zookeeper 
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1097
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1097
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.8
>            Reporter: Neha Narkhede
>            Assignee: Neha Narkhede
>            Priority: Critical
>             Fix For: 0.8.1
>
>         Attachments: KAFKA-1097.patch, KAFKA-1097_2013-10-29_10:49:45.patch, 
> KAFKA-1097_2013-10-30_21:46:00.patch, KAFKA-1097_2013-10-31_10:37:29.patch, 
> KAFKA-1097_2013-11-01_09:55:33.patch
>
>
> While moving partitions, the controller moves the old replicas through the
> following state changes:
> ONLINE -> OFFLINE -> NON_EXISTENT
> During the offline state change, the controller removes the old replica,
> writes the updated ISR to zookeeper and notifies the leader. Note that it
> does not notify the old replicas to stop fetching from the leader (to be
> fixed in KAFKA-1032). During the non-existent state change, the controller
> does not write the updated ISR or replica list to zookeeper. Right after the
> non-existent state change, the controller writes the new replica list to
> zookeeper, but does not update the ISR. So an old replica that is still
> fetching can send a fetch request after the offline state change, letting
> the leader add it back to the ISR. The problem is that if no new data comes
> in for the partition and the old replica is fully caught up, the leader
> cannot remove it from the ISR. That lets a non-existent replica live in the
> ISR at least until new data comes into the partition.
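
The race described above can be reproduced in a small simulation. This is a hypothetical, simplified model in Python, not Kafka controller or broker code: the classes `ZkPartitionState` and `Leader`, the `max_lag_messages` threshold, and the assumption that the leader re-adds any follower that fetches up to its log end offset are all illustrative. The sketch walks old replica 3 through the offline step, lets it fetch once more, and then writes the new replica list without touching the ISR.

```python
from dataclasses import dataclass, field


@dataclass
class ZkPartitionState:
    """Hypothetical stand-in for the replica list and ISR stored in ZooKeeper."""
    replicas: list
    isr: set


@dataclass
class Leader:
    """Simplified partition leader (illustrative, not Kafka's implementation)."""
    zk: ZkPartitionState
    log_end_offset: int = 0
    max_lag_messages: int = 4  # hypothetical ISR lag threshold, in messages
    fetch_offsets: dict = field(default_factory=dict)

    def append(self, n_messages):
        self.log_end_offset += n_messages

    def handle_fetch(self, replica_id, fetch_offset):
        # Assumption: any follower that catches up to the log end is (re)added
        # to the ISR, and the leader writes that ISR to ZooKeeper.
        self.fetch_offsets[replica_id] = fetch_offset
        if fetch_offset >= self.log_end_offset:
            self.zk.isr.add(replica_id)

    def maybe_shrink_isr(self):
        # Drop only replicas whose lag exceeds the threshold. On an idle
        # partition a caught-up replica has zero lag and is never dropped.
        for r in list(self.zk.isr):
            fetched = self.fetch_offsets.get(r, self.log_end_offset)
            if self.log_end_offset - fetched > self.max_lag_messages:
                self.zk.isr.discard(r)


def reassign_with_race():
    """Move replica 3 to replica 4 on partition [1, 2, 3] led by broker 1."""
    zk = ZkPartitionState(replicas=[1, 2, 3], isr={1, 2, 3})
    leader = Leader(zk, log_end_offset=100)
    for r in (2, 3):
        leader.handle_fetch(r, 100)

    # New replica 4 is added to the replica list and catches up.
    zk.replicas = [1, 2, 3, 4]
    leader.handle_fetch(4, 100)

    # ONLINE -> OFFLINE: the controller removes replica 3 from the ISR in
    # ZooKeeper, but replica 3 is not told to stop fetching.
    zk.isr.discard(3)

    # Race: replica 3 is still fully caught up and fetches again.
    leader.handle_fetch(3, 100)

    # OFFLINE -> NON_EXISTENT writes nothing; the controller then writes the
    # new replica list only, leaving the ISR untouched.
    zk.replicas = [1, 2, 4]

    # No new data has arrived, so there is no lag for the leader to act on.
    leader.maybe_shrink_isr()
    return zk, leader


if __name__ == "__main__":
    zk, _ = reassign_with_race()
    print(sorted(zk.isr - set(zk.replicas)))  # [3]
```

In this model, `maybe_shrink_isr` has nothing to act on while the partition is idle, so replica 3 stays in the ISR even though it is no longer in the replica list. Only after new data is appended, and the removed replica stops fetching while the live followers keep up, does its lag exceed the threshold and get it dropped.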



--
This message was sent by Atlassian JIRA
(v6.1#6144)
