[ 
https://issues.apache.org/jira/browse/KAFKA-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238651#comment-15238651
 ] 

Jun Rao commented on KAFKA-3042:
--------------------------------

The issue seems to be the following. In 0.9.0, we changed the logic in 
ReplicaManager.makeFollowers() to ensure that the new leader is in the 
liveBrokers of metadataCache. However, during a controller failover, the new 
controller first sends leaderAndIsr requests, followed by an 
UpdateMetadataRequest. So, when a broker receives a leaderAndIsr request, the 
liveBrokers in its metadataCache may be stale and not include the new leader, 
which causes the become-follower logic to error out. Indeed, in broker 1's 
state-change log, the last UpdateMetadataRequest before the become-follower 
error came from controller 1.

{code}
[2016-04-09 00:40:52,929] TRACE Broker 1 cached leader info 
(LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:330,ControllerEpoch:414),ReplicationFactor:3),AllReplicas:2,1,4)
 for partition [tec1.usqe1.frontend.syncPing,1] in response to UpdateMetadata request sent 
by controller 1 epoch 414 with correlation id 877 (state.change.logger)
{code}

In controller 1's log, the most recent update to the live broker list is the 
following, and it does not include broker 4.
{code}
[2016-04-09 00:39:33,005] INFO [BrokerChangeListener on Controller 1]: Newly 
added brokers: , deleted brokers: 2, all live brokers: 1,3,5 
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
{code}

To fix this, we should probably send an UpdateMetadataRequest before any 
leaderAndIsrRequest during controller failover.
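
To make the ordering problem concrete, here is a minimal toy model in Java. It 
is only a sketch, not Kafka's actual code: ToyBroker, handleUpdateMetadata and 
handleBecomeFollower are hypothetical names, and the "current" live broker list 
and the choice of broker 4 as the new leader are assumptions made for 
illustration. Only the stale list 1,3,5 is taken from the controller log above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Toy model of the request-ordering problem; none of these are Kafka classes. */
public class FailoverOrderingSketch {

    /** Hypothetical stand-in for a broker's metadata cache and follower transition. */
    static class ToyBroker {
        private final Set<Integer> liveBrokers = new HashSet<>();

        /** Models an UpdateMetadataRequest: replace the cached live broker list. */
        void handleUpdateMetadata(Set<Integer> live) {
            liveBrokers.clear();
            liveBrokers.addAll(live);
        }

        /**
         * Models a leaderAndIsr request that makes this broker a follower.
         * Returns false when the new leader is not in the cached live brokers.
         */
        boolean handleBecomeFollower(int newLeader) {
            return liveBrokers.contains(newLeader);
        }
    }

    /** Ordering described in the comment: leaderAndIsr first, metadata update second. */
    static boolean leaderAndIsrFirst(Set<Integer> stale, Set<Integer> current, int leader) {
        ToyBroker broker = new ToyBroker();
        broker.handleUpdateMetadata(stale);      // cache left over from the old controller
        boolean ok = broker.handleBecomeFollower(leader);
        broker.handleUpdateMetadata(current);    // arrives too late to help
        return ok;
    }

    /** Proposed ordering: metadata update first, leaderAndIsr second. */
    static boolean updateMetadataFirst(Set<Integer> stale, Set<Integer> current, int leader) {
        ToyBroker broker = new ToyBroker();
        broker.handleUpdateMetadata(stale);      // cache left over from the old controller
        broker.handleUpdateMetadata(current);    // refreshed before the transition
        return broker.handleBecomeFollower(leader);
    }

    public static void main(String[] args) {
        Set<Integer> stale = new HashSet<>(Arrays.asList(1, 3, 5));      // from the log above
        Set<Integer> current = new HashSet<>(Arrays.asList(1, 3, 4, 5)); // assumed
        int newLeader = 4;                                                // assumed

        System.out.println("leaderAndIsr first:   " + leaderAndIsrFirst(stale, current, newLeader));
        System.out.println("updateMetadata first: " + updateMetadataFirst(stale, current, newLeader));
    }
}
```

Under these assumptions the first ordering fails the become-follower check and 
the second passes, which is the behavior the proposed fix relies on.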

> updateIsr should stop after failed several times due to zkVersion issue
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-3042
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3042
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>         Environment: jdk 1.7
> centos 6.4
>            Reporter: Jiahongchao
>         Attachments: controller.log, server.log.2016-03-23-01, 
> state-change.log
>
>
> Sometimes a broker repeatedly logs
> "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR"
> I think this is because the broker considers itself the leader when in fact 
> it is a follower.
> So after several failed attempts, it needs to find out who the leader is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
