[ https://issues.apache.org/jira/browse/KAFKA-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238651#comment-15238651 ]
Jun Rao commented on KAFKA-3042:
--------------------------------

The issue seems to be the following. In 0.9.0, we changed the logic a bit in ReplicaManager.makeFollowers() to ensure that the new leader is in the liveBrokers of metadataCache. However, during a controller failover, the new controller first sends LeaderAndIsrRequests, followed by an UpdateMetadataRequest. So, when a broker receives a LeaderAndIsrRequest, the liveBrokers in its metadataCache may be stale and not include the leader, which causes the become-follower logic to error out.

Indeed, from broker 1's state-change log, the last UpdateMetadataRequest before the error in becoming follower came from controller 1.

{code}
[2016-04-09 00:40:52,929] TRACE Broker 1 cached leader info (LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:330,ControllerEpoch:414),ReplicationFactor:3),AllReplicas:2,1,4) for partition [tec1.usqe1.frontend.syncPing,1] in response to UpdateMetadata request sent by controller 1 epoch 414 with correlation id 877 (state.change.logger)
{code}

In controller 1's log, the last update to the live broker list is the following, and it did not include broker 4.

{code}
[2016-04-09 00:39:33,005] INFO [BrokerChangeListener on Controller 1]: Newly added brokers: , deleted brokers: 2, all live brokers: 1,3,5 (kafka.controller.ReplicaStateMachine$BrokerChangeListener)
{code}

To fix this, we should probably send an UpdateMetadataRequest before any LeaderAndIsrRequest during controller failover.
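To make the ordering problem concrete, here is a minimal toy sketch in Python. It is not Kafka's code: ToyBroker, failover, and the handler names are hypothetical, and it assumes the new leader is broker 4 and that the new controller's live broker list is 1,3,4,5 (the logs above only show the stale list 1,3,5).

```python
class ToyBroker:
    """Toy model of a broker's cached metadata (hypothetical, not Kafka's classes)."""

    def __init__(self, broker_id, live_brokers):
        self.broker_id = broker_id
        self.live_brokers = set(live_brokers)  # may be stale after a controller failover
        self.following = None

    def handle_update_metadata(self, live_brokers):
        # Refresh the cached live broker list.
        self.live_brokers = set(live_brokers)

    def handle_leader_and_isr(self, leader):
        # Mimics the check described above: refuse to become a follower
        # of a leader that is not in the cached live broker list.
        if leader not in self.live_brokers:
            return False
        self.following = leader
        return True


def failover(broker, leader, live_brokers, metadata_first):
    """Replay the two requests a new controller sends, in either order."""
    if metadata_first:
        broker.handle_update_metadata(live_brokers)
        return broker.handle_leader_and_isr(leader)
    became_follower = broker.handle_leader_and_isr(leader)
    broker.handle_update_metadata(live_brokers)
    return became_follower


# Stale cache from the old controller: live brokers 1,3,5 (broker 4 missing).
old_order = failover(ToyBroker(1, [1, 3, 5]), leader=4,
                     live_brokers=[1, 3, 4, 5], metadata_first=False)
new_order = failover(ToyBroker(1, [1, 3, 5]), leader=4,
                     live_brokers=[1, 3, 4, 5], metadata_first=True)
print(old_order, new_order)
```

With the requests in the current order the toy broker rejects the leader change; with the metadata update sent first it accepts it, which is the effect the proposed fix is after.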

> updateIsr should stop after failed several times due to zkVersion issue
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-3042
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3042
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>         Environment: jdk 1.7
> centos 6.4
>            Reporter: Jiahongchao
>         Attachments: controller.log, server.log.2016-03-23-01, state-change.log
>
>
> Sometimes one broker may repeatedly log
> "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR"
> I think this is because the broker considers itself the leader when in fact it is
> a follower.
> So after several failed tries, it needs to find out who the leader is.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)