[ https://issues.apache.org/jira/browse/KAFKA-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236055#comment-15236055 ]
Flavio Junqueira edited comment on KAFKA-3042 at 4/12/16 2:58 PM:
------------------------------------------------------------------

Thanks for the logs. I still don’t have an answer, but I wanted to report some progress. I tracked one partition, chosen arbitrarily, for which broker 1 was complaining about the zk version: [tec1.en2.frontend.syncPing,7]. Here are a few things that I’ve observed.

Broker 1 is the leader of the partition:
{noformat}
[2016-04-09 00:40:52,883] TRACE Broker 1 cached leader info (LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:363,ControllerEpoch:414),ReplicationFactor:3),AllReplicas:1,2,4) for partition [tec1.en2.frontend.syncPing,7] in response to UpdateMetadata request sent by controller 1 epoch 414 with correlation id 876 (state.change.logger)
{noformat}
but soon after it fails to release leadership to broker 4:
{noformat}
[2016-04-09 00:40:58,106] TRACE Broker 1 received LeaderAndIsr request (LeaderAndIsrInfo:(Leader:4,ISR:4,LeaderEpoch:364,ControllerEpoch:415),ReplicationFactor:3),AllReplicas:1,2,4) correlation id 0 from controller 5 epoch 416 for partition [tec1.en2.frontend.syncPing,7] (state.change.logger)
[2016-04-09 00:40:58,139] TRACE Broker 1 handling LeaderAndIsr request correlationId 0 from controller 5 epoch 416 starting the become-follower transition for partition [tec1.en2.frontend.syncPing,7] (state.change.logger)
[2016-04-09 00:40:58,144] ERROR Broker 1 received LeaderAndIsrRequest with correlation id 0 from controller 5 epoch 415 for partition [tec1.en2.frontend.syncPing,7] but cannot become follower since the new leader 4 is unavailable. (state.change.logger)
{noformat}
Now, a bit later in the log, the broker says that it is caching the leader info for the partition:
{noformat}
[2016-04-09 00:42:02,456] TRACE Broker 1 cached leader info (LeaderAndIsrInfo:(Leader:4,ISR:4,LeaderEpoch:364,ControllerEpoch:415),ReplicationFactor:3),AllReplicas:1,2,4) for partition [tec1.en2.frontend.syncPing,7] in response to UpdateMetadata request sent by controller 5 epoch 416 with correlation id 1473 (state.change.logger)
{noformat}
but it keeps printing the “Cached zkVersion…” errors, which indicate that the broker still believes it is the leader of the partition, or at least that the variable {{leaderReplicaIdOpt}} is still set that way.
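For context on where that message comes from: the leader updates the ISR with a conditional write to the partition’s state znode, and the write is skipped when the znode version no longer matches the broker’s cached one. Below is a minimal sketch of that pattern; the names ({{PartitionSketch}}, {{tryConditionalUpdate}}) are made up for illustration and are not the actual {{Partition}}/{{ZkUtils}} code.
{code:scala}
// Minimal sketch of the conditional-update pattern behind the
// "Cached zkVersion ... skip updating ISR" message. PartitionSketch and
// tryConditionalUpdate are illustrative stand-ins, not Kafka's real API.
case class LeaderAndIsr(leader: Int, leaderEpoch: Int, isr: List[Int], zkVersion: Int)

class PartitionSketch(localBrokerId: Int) {
  // Set while this broker believes it is the leader; the observation above is
  // that it apparently stays set when the become-follower transition fails.
  @volatile var leaderReplicaIdOpt: Option[Int] = Some(localBrokerId)
  @volatile var cachedZkVersion: Int = 0

  // tryConditionalUpdate: (proposed state, expected zkVersion) => (succeeded?, new zkVersion)
  def updateIsr(newIsr: List[Int],
                tryConditionalUpdate: (LeaderAndIsr, Int) => (Boolean, Int)): Boolean = {
    leaderReplicaIdOpt match {
      case Some(leaderId) if leaderId == localBrokerId =>
        val proposed = LeaderAndIsr(localBrokerId, 0, newIsr, cachedZkVersion)
        val (updated, newVersion) = tryConditionalUpdate(proposed, cachedZkVersion)
        if (updated) {
          cachedZkVersion = newVersion // write succeeded, remember the new znode version
        } else {
          // Someone else (the controller or the new leader) already bumped the
          // znode, so this broker's view of the partition is stale.
          println(s"Cached zkVersion $cachedZkVersion not equal to that in zookeeper, skip updating ISR")
        }
        updated
      case _ =>
        false // this broker no longer thinks it is the leader; nothing to update
    }
  }
}
{code}
As long as {{leaderReplicaIdOpt}} keeps pointing at the local broker, this path keeps being retried, which would explain the repeated messages.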
I also inspected other partitions, and the behavior doesn’t seem to be consistent. I’ve seen at least one partition in broker 2 for which the broker made the appropriate transition:
{noformat}
[2016-04-09 00:39:23,840] TRACE Broker 2 received LeaderAndIsr request (LeaderAndIsrInfo:(Leader:3,ISR:2,3,LeaderEpoch:305,ControllerEpoch:414),ReplicationFactor:3),AllReplicas:3,2,4) correlation id 535 from controller 1 epoch 414 for partition [tec1.ono_qe1.bodydata.recordings,20] (state.change.logger)
[2016-04-09 00:39:23,841] TRACE Broker 2 handling LeaderAndIsr request correlationId 535 from controller 1 epoch 414 starting the become-follower transition for partition [tec1.ono_qe1.bodydata.recordings,20] (state.change.logger)
[2016-04-09 00:39:23,841] TRACE Broker 2 stopped fetchers as part of become-follower request from controller 1 epoch 414 with correlation id 535 for partition [tec1.ono_qe1.bodydata.recordings,20] (state.change.logger)
[2016-04-09 00:39:23,856] TRACE Broker 2 truncated logs and checkpointed recovery boundaries for partition [tec1.ono_qe1.bodydata.recordings,20] as part of become-follower request with correlation id 535 from controller 1 epoch 414 (state.change.logger)
[2016-04-09 00:39:23,856] TRACE Broker 2 started fetcher to new leader as part of become-follower request from controller 1 epoch 414 with correlation id 535 for partition [tec1.ono_qe1.bodydata.recordings,20] (state.change.logger)
[2016-04-09 00:39:23,856] TRACE Broker 2 completed LeaderAndIsr request correlationId 535 from controller 1 epoch 414 for the become-follower transition for partition [tec1.ono_qe1.bodydata.recordings,20] (state.change.logger)
{noformat}
Incidentally, the state-change log of broker 2 seems to have a gap starting at {{[2016-04-09 00:39:46,246]}}. Is that when you restarted the broker?
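Comparing the two cases, the ERROR on broker 1 suggests the become-follower path bails out when the new leader is not in the live-broker set, leaving the old leader state (and {{leaderReplicaIdOpt}}) in place, whereas broker 2 went through the full transition. A rough sketch of that gate, again with simplified names rather than the actual {{ReplicaManager}} code:
{code:scala}
// Rough sketch of the become-follower gate implied by the two logs above:
// broker 1 aborts because the new leader is not alive, broker 2 completes the
// transition. Names are simplified stand-ins, not Kafka's real API.
case class PartitionState(leader: Int, isr: List[Int], leaderEpoch: Int, controllerEpoch: Int)

class FollowerTransitionSketch(brokerId: Int) {
  // Plays the role of leaderReplicaIdOpt from the discussion above.
  @volatile var leaderReplicaIdOpt: Option[Int] = Some(brokerId)

  def makeFollower(topic: String, partition: Int,
                   state: PartitionState, liveBrokers: Set[Int]): Boolean = {
    if (!liveBrokers.contains(state.leader)) {
      // Broker 1's case: the transition is aborted, so the replica keeps its
      // old leader state and keeps trying to update the ISR as if it still led.
      println(s"Broker $brokerId cannot become follower for [$topic,$partition] " +
        s"since the new leader ${state.leader} is unavailable.")
      false
    } else {
      // Broker 2's case: record the new leader; the real broker also stops
      // fetchers, truncates the log and starts fetching from the new leader.
      leaderReplicaIdOpt = Some(state.leader)
      true
    }
  }
}
{code}
If that reading is right, the later UpdateMetadata request refreshes the metadata cache but never clears the stale leader state on broker 1, which would match the behavior above.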
> updateIsr should stop after failed several times due to zkVersion issue
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-3042
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3042
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>         Environment: jdk 1.7, centos 6.4
>            Reporter: Jiahongchao
>         Attachments: controller.log, server.log.2016-03-23-01, state-change.log
>
> Sometimes one broker may repeatedly log
> "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR"
> I think this is because the broker considers itself the leader when in fact it is a follower.
> So after several failed tries, it needs to find out who the leader is.
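To make the proposal in the title concrete, one possible shape (a sketch only; the retry limit and callbacks are assumptions, not an agreed design) would be to bound the conditional-update attempts and, once they keep failing on the zkVersion check, re-read the partition state from ZooKeeper to find out who the leader actually is:
{code:scala}
// Sketch of bounding the retries of the conditional ISR update and falling
// back to re-reading the authoritative state. Illustrative only; not existing
// Kafka code and not an agreed design.
object IsrUpdateSketch {
  case class ZkPartitionState(leader: Int, isr: List[Int], zkVersion: Int)

  def updateIsrWithBoundedRetries(
      attemptConditionalUpdate: Int => Option[Int], // expected zkVersion => new zkVersion if the write succeeded
      readStateFromZk: () => ZkPartitionState,
      cachedZkVersion: Int,
      maxRetries: Int = 3): Either[ZkPartitionState, Int] = {
    var version = cachedZkVersion
    var attempts = 0
    while (attempts < maxRetries) {
      attemptConditionalUpdate(version) match {
        case Some(newVersion) =>
          return Right(newVersion) // ISR written; caller caches the new zkVersion
        case None =>
          attempts += 1
          version = readStateFromZk().zkVersion // refresh the cached version before retrying
      }
    }
    // Give up: hand back the authoritative state so the caller can notice it is
    // no longer the leader instead of logging "Cached zkVersion ..." forever.
    Left(readStateFromZk())
  }
}
{code}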