[ https://issues.apache.org/jira/browse/KAFKA-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246508#comment-15246508 ]
Jun Rao edited comment on KAFKA-3042 at 4/18/16 8:53 PM: --------------------------------------------------------- [~fpj], yes, I was confused by that initially too. The following is my finding after I dug a bit deeper. When a controller receives a broker change event, it will also log the live brokers after processing the event. So, in the controller log of broker 5, you will see the following matching entries. {code} [2016-04-09 00:41:35,442] INFO [BrokerChangeListener on Controller 5]: Broker change listener fired for path /brokers/ids with children 1,4,5 (kafka.controller.ReplicaStateMachine$BrokerChangeListener) [2016-04-09 00:41:36,235] INFO [BrokerChangeListener on Controller 5]: Newly added brokers: , deleted brokers: 3, all live brokers: 1,4,5 (kafka.controller.ReplicaStateMachine$BrokerChangeListener) {code} However, after {code} [2016-04-09 00:37:54,079] INFO [BrokerChangeListener on Controller 5]: Broker change listener fired for path /brokers/ids with children 1,2,3,5 (kafka.controller.ReplicaStateMachine$BrokerChangeListener) {code} you don't see the corresponding entry on "Newly added broker" afterward. The reason is that at 00:37:54, broker 5 has resigned as the controller and therefore ignored the broker change listener event. {code} [2016-04-09 00:37:54,064] INFO [Controller 5]: Broker 5 resigned as the controller (kafka.controller.KafkaController) {code} Broker 5 became the controller again a bit later. {code} [2016-04-09 00:40:54,990] INFO [Controller 5]: Controller 5 incremented epoch to 416 (kafka.controller.KafkaController) [2016-04-09 00:40:57,877] INFO [Controller 5]: Currently active brokers in the cluster: Set(1, 3, 4, 5) (kafka.controller.KafkaController) {code} As you can see, it initialized with 4 live brokers, including broker 4. It then sent the following LeaderAndIsrRequest to broker 1 after initialization. {code} [2016-04-09 00:40:58,089] TRACE Controller 5 epoch 416 sending become-follower LeaderAndIsr request (Leader:4,ISR:4,LeaderEpoch:364,ControllerEpoch:415) to broker 1 for partition [tec1.en2.frontend.syncPing,7] (state.change.logger) {code} The above LeaderAndIsrRequest likely includes broker 4 in the live broker list since there is no other indication that the live broker list has changed since the initialization. However, if broker 4 is not in the metadata cache of broker 1 (which seems to be the case), it can still lead to the following error. {code} [2016-04-09 00:40:58,144] ERROR Broker 1 received LeaderAndIsrRequest with correlation id 0 from controller 5 epoch 415 for partition [tec1.en2.frontend.syncPing,7] but cannot become follower since the new leader 4 is unavailable. (state.change.logger) {code} was (Author: junrao): [~fpj], yes, I was confused by that initially too. The following is my finding after I dug a bit deeper. When a controller receives a broker change event, it will also log the live brokers after processing the event. So, in the controller log of broker 5, you will see the following matching entries. {code} [2016-04-09 00:41:35,442] INFO [BrokerChangeListener on Controller 5]: Broker change listener fired for path /brokers/ids with children 1,4,5 (kafka.controller.ReplicaStateMachine$BrokerChangeListener) [2016-04-09 00:41:36,235] INFO [BrokerChangeListener on Controller 5]: Newly added brokers: , deleted brokers: 3, all live brokers: 1,4,5 (kafka.controller.ReplicaStateMachine$BrokerChangeListener) {code} However, after {code} [2016-04-09 00:37:54,079] INFO [BrokerChangeListener on Controller 5]: Broker change listener fired for path /brokers/ids with children 1,2,3,5 (kafka.controller.ReplicaStateMachine$BrokerChangeListener) {code} you don't see the corresponding entry on "Newly added broker" afterward. The reason is that at 00:37:54, broker 5 has resigned as the controller and therefore ignored the broker change listener event. {code} [2016-04-09 00:37:54,064] INFO [Controller 5]: Broker 5 resigned as the controller (kafka.controller.KafkaController) {code} Broker 5 became the controller again a bit later. {code} [2016-04-09 00:40:54,990] INFO [Controller 5]: Controller 5 incremented epoch to 416 (kafka.controller.KafkaController) [2016-04-09 00:40:57,877] INFO [Controller 5]: Currently active brokers in the cluster: Set(1, 3, 4, 5) (kafka.controller.KafkaController) {code} As you can see, it initialized with 4 live brokers, including broker 4. It then sent the following LeaderAndIsrRequest to broker 1 after initialization. {code} [2016-04-09 00:40:58,089] TRACE Controller 5 epoch 416 sending become-follower LeaderAndIsr request (Leader:4,ISR:4,LeaderEpoch:364,ControllerEpoch:415) to broker 1 for partition [tec1.en2.frontend.syncPing,7] (state.change.logger) {code} The above LeaderAndIsrRequest likely includes broker 4 in the live broker list since there is no other indication that the live broker list has changed since the initialization. However, if broker 4 is not the metadata cache of broker 1 (which seems to be the case), it can still lead to the following error. {code} [2016-04-09 00:40:58,144] ERROR Broker 1 received LeaderAndIsrRequest with correlation id 0 from controller 5 epoch 415 for partition [tec1.en2.frontend.syncPing,7] but cannot become follower since the new leader 4 is unavailable. (state.change.logger) {code} > updateIsr should stop after failed several times due to zkVersion issue > ----------------------------------------------------------------------- > > Key: KAFKA-3042 > URL: https://issues.apache.org/jira/browse/KAFKA-3042 > Project: Kafka > Issue Type: Bug > Affects Versions: 0.8.2.1 > Environment: jdk 1.7 > centos 6.4 > Reporter: Jiahongchao > Fix For: 0.10.0.0 > > Attachments: controller.log, server.log.2016-03-23-01, > state-change.log > > > sometimes one broker may repeatly log > "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR" > I think this is because the broker consider itself as the leader in fact it's > a follower. > So after several failed tries, it need to find out who is the leader -- This message was sent by Atlassian JIRA (v6.3.4#6332)