[ https://issues.apache.org/jira/browse/KAFKA-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on KAFKA-1097 started by Neha Narkhede.

> Race condition while reassigning low throughput partition leads to incorrect
> ISR information in zookeeper
> ----------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-1097
> URL: https://issues.apache.org/jira/browse/KAFKA-1097
> Project: Kafka
> Issue Type: Bug
> Components: controller
> Affects Versions: 0.8
> Reporter: Neha Narkhede
> Assignee: Neha Narkhede
> Priority: Critical
> Fix For: 0.8.1
>
> Attachments: KAFKA-1097.patch, KAFKA-1097_2013-10-29_10:49:45.patch,
> KAFKA-1097_2013-10-30_21:46:00.patch, KAFKA-1097_2013-10-31_10:37:29.patch,
> KAFKA-1097_2013-11-01_09:55:33.patch
>
>
> While moving partitions, the controller moves the old replicas through the
> following state changes -
> ONLINE -> OFFLINE -> NON_EXISTENT
> During the offline state change, the controller removes the old replica and
> writes the updated ISR to zookeeper and notifies the leader. Note that it
> doesn't notify the old replicas to stop fetching from the leader (to be fixed
> in KAFKA-1032). During the non-existent state change, the controller does not
> write the updated ISR or replica list to zookeeper. Right after the
> non-existent state change, the controller writes the new replica list to
> zookeeper, but does not update the ISR. So an old replica can send a fetch
> request after the offline state change, essentially letting the leader add it
> back to the ISR. The problem is that if there is no new data coming in for
> the partition and the old replica is fully caught up, the leader cannot
> remove it from the ISR. That lets a non existent replica live in the ISR at
> least until new data comes in to the partition



--
This message was sent by Atlassian JIRA (v6.1#6144)
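
For readers following the sequence in the description, the interleaving can be sketched as a small toy model. This is not Kafka's controller or replica-manager code: every class and function name below is hypothetical, the broker ids are made up, the starting state (old and new replicas all assigned and in sync) is an assumption, and the ISR shrink rule is simplified to offset lag only.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy view of one partition. All names and values are hypothetical."""
    leader_log_end: int = 100
    # Assumed starting point: old replica 3 and new replica 4 are both assigned.
    assigned: set = field(default_factory=lambda: {1, 2, 3, 4})  # replica list in ZK
    isr: set = field(default_factory=lambda: {1, 2, 3, 4})       # ISR in ZK
    # Last fetch offset per follower (broker 1 is the leader).
    follower_offsets: dict = field(default_factory=lambda: {2: 100, 3: 100, 4: 100})


def controller_offline(p, replica):
    # OFFLINE: controller drops the replica from the ISR and notifies the leader,
    # but does not tell the old replica to stop fetching.
    p.isr.discard(replica)


def leader_handle_fetch(p, replica, fetch_offset):
    # A fully caught-up follower is added (back) to the ISR by the leader.
    p.follower_offsets[replica] = fetch_offset
    if fetch_offset >= p.leader_log_end:
        p.isr.add(replica)


def controller_non_existent(p, replica):
    # NON_EXISTENT: no ISR or replica-list write happens here.
    pass


def controller_write_new_assignment(p, new_replicas):
    # The replica list is rewritten, but the ISR is left untouched.
    p.assigned = set(new_replicas)


def leader_maybe_shrink_isr(p, max_lag=0):
    # Simplified rule: only followers that lag behind the leader are removed.
    lagging = {r for r in p.isr
               if r in p.follower_offsets
               and p.leader_log_end - p.follower_offsets[r] > max_lag}
    p.isr -= lagging


def simulate_race():
    p = Partition()
    controller_offline(p, 3)
    leader_handle_fetch(p, 3, 100)        # stale fetch from the old replica
    controller_non_existent(p, 3)
    controller_write_new_assignment(p, {1, 2, 4})
    leader_maybe_shrink_isr(p)            # no new data, so nobody is lagging
    return p


if __name__ == "__main__":
    p = simulate_race()
    print("replicas in ISR but not assigned:", sorted(p.isr - p.assigned))
```

In this model the old replica (broker 3) ends up in the ISR while no longer appearing in the replica list, and the shrink check has no reason to remove it because no new data has arrived.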