[jira] [Comment Edited] (KAFKA-3042) updateIsr should stop after failed several times due to zkVersion issue

Jun Rao (JIRA) Mon, 18 Apr 2016 13:55:12 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246508#comment-15246508
 ]


Jun Rao edited comment on KAFKA-3042 at 4/18/16 8:53 PM:
---------------------------------------------------------

[~fpj], yes, I was confused by that initially too. The following is my finding 
after I dug a bit deeper. When a controller receives a broker change event, it 
will also log the live brokers after processing the event. So, in the 
controller log of broker 5, you will see the following matching entries.

{code}
[2016-04-09 00:41:35,442] INFO [BrokerChangeListener on Controller 5]: Broker 
change listener fired for path /brokers/ids with children 1,4,5 
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
[2016-04-09 00:41:36,235] INFO [BrokerChangeListener on Controller 5]: Newly 
added brokers: , deleted brokers: 3, all live brokers: 1,4,5 
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
{code}

However, after 

{code}
[2016-04-09 00:37:54,079] INFO [BrokerChangeListener on Controller 5]: Broker 
change listener fired for path /brokers/ids with children 1,2,3,5 
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
{code}

you don't see the corresponding entry on "Newly added broker" afterward. The 
reason is that at 00:37:54, broker 5 has resigned as the controller and 
therefore ignored the broker change listener event. 

{code}
[2016-04-09 00:37:54,064] INFO [Controller 5]: Broker 5 resigned as the 
controller (kafka.controller.KafkaController)
{code}

Broker 5 became the controller again a bit later.

{code}
[2016-04-09 00:40:54,990] INFO [Controller 5]: Controller 5 incremented epoch 
to 416 (kafka.controller.KafkaController)
[2016-04-09 00:40:57,877] INFO [Controller 5]: Currently active brokers in the 
cluster: Set(1, 3, 4, 5) (kafka.controller.KafkaController)
{code}

As you can see, it initialized with 4 live brokers, including broker 4. It then 
sent the following LeaderAndIsrRequest to broker 1 after initialization.
{code}
[2016-04-09 00:40:58,089] TRACE Controller 5 epoch 416 sending become-follower 
LeaderAndIsr request (Leader:4,ISR:4,LeaderEpoch:364,ControllerEpoch:415) to 
broker 1 for partition [tec1.en2.frontend.syncPing,7] (state.change.logger)
{code}

The above LeaderAndIsrRequest likely includes broker 4 in the live broker list 
since there is no other indication that the live broker list has changed since 
the initialization. However, if broker 4 is not in the metadata cache of broker 
1 (which seems to be the case), it can still lead to the following error.

{code}
[2016-04-09 00:40:58,144] ERROR Broker 1 received LeaderAndIsrRequest with 
correlation id 0 from controller 5 epoch 415 for partition 
[tec1.en2.frontend.syncPing,7] but cannot become follower since the new leader 
4 is unavailable. (state.change.logger)
{code}



was (Author: junrao):
[~fpj], yes, I was confused by that initially too. The following is my finding 
after I dug a bit deeper. When a controller receives a broker change event, it 
will also log the live brokers after processing the event. So, in the 
controller log of broker 5, you will see the following matching entries.

{code}
[2016-04-09 00:41:35,442] INFO [BrokerChangeListener on Controller 5]: Broker 
change listener fired for path /brokers/ids with children 1,4,5 
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
[2016-04-09 00:41:36,235] INFO [BrokerChangeListener on Controller 5]: Newly 
added brokers: , deleted brokers: 3, all live brokers: 1,4,5 
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
{code}

However, after 

{code}
[2016-04-09 00:37:54,079] INFO [BrokerChangeListener on Controller 5]: Broker 
change listener fired for path /brokers/ids with children 1,2,3,5 
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
{code}

you don't see the corresponding entry on "Newly added broker" afterward. The 
reason is that at 00:37:54, broker 5 has resigned as the controller and 
therefore ignored the broker change listener event. 

{code}
[2016-04-09 00:37:54,064] INFO [Controller 5]: Broker 5 resigned as the 
controller (kafka.controller.KafkaController)
{code}

Broker 5 became the controller again a bit later.

{code}
[2016-04-09 00:40:54,990] INFO [Controller 5]: Controller 5 incremented epoch 
to 416 (kafka.controller.KafkaController)
[2016-04-09 00:40:57,877] INFO [Controller 5]: Currently active brokers in the 
cluster: Set(1, 3, 4, 5) (kafka.controller.KafkaController)
{code}

As you can see, it initialized with 4 live brokers, including broker 4. It then 
sent the following LeaderAndIsrRequest to broker 1 after initialization.
{code}
[2016-04-09 00:40:58,089] TRACE Controller 5 epoch 416 sending become-follower 
LeaderAndIsr request (Leader:4,ISR:4,LeaderEpoch:364,ControllerEpoch:415) to 
broker 1 for partition [tec1.en2.frontend.syncPing,7] (state.change.logger)
{code}

The above LeaderAndIsrRequest likely includes broker 4 in the live broker list 
since there is no other indication that the live broker list has changed since 
the initialization. However, if broker 4 is not the metadata cache of broker 1 
(which seems to be the case), it can still lead to the following error.

{code}
[2016-04-09 00:40:58,144] ERROR Broker 1 received LeaderAndIsrRequest with 
correlation id 0 from controller 5 epoch 415 for partition 
[tec1.en2.frontend.syncPing,7] but cannot become follower since the new leader 
4 is unavailable. (state.change.logger)
{code}


> updateIsr should stop after failed several times due to zkVersion issue
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-3042
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3042
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>         Environment: jdk 1.7
> centos 6.4
>            Reporter: Jiahongchao
>             Fix For: 0.10.0.0
>
>         Attachments: controller.log, server.log.2016-03-23-01, 
> state-change.log
>
>
> sometimes one broker may repeatly log
> "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR"
> I think this is because the broker consider itself as the leader in fact it's 
> a follower.
> So after several failed tries, it need to find out who is the leader



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (KAFKA-3042) updateIsr should stop after failed several times due to zkVersion issue

Reply via email to