[ https://issues.apache.org/jira/browse/KAFKA-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236055#comment-15236055 ]
Flavio Junqueira edited comment on KAFKA-3042 at 4/12/16 2:58 PM:
------------------------------------------------------------------

Thanks for the logs. I still don’t have an answer, but I wanted to report some progress. I tracked one partition, chosen arbitrarily, for which broker 1 was complaining about the zk version: [tec1.en2.frontend.syncPing,7]. Here are a few things that I’ve observed.

Broker 1 is the leader of the partition:
{noformat}
[2016-04-09 00:40:52,883] TRACE Broker 1 cached leader info (LeaderAndIsrInfo:(Leader:1,ISR:1,LeaderEpoch:363,ControllerEpoch:414),ReplicationFactor:3),AllReplicas:1,2,4) for partition [tec1.en2.frontend.syncPing,7] in response to UpdateMetadata request sent by controller 1 epoch 414 with correlation id 876 (state.change.logger)
{noformat}
but soon after it fails to release leadership to broker 4:
{noformat}
[2016-04-09 00:40:58,106] TRACE Broker 1 received LeaderAndIsr request (LeaderAndIsrInfo:(Leader:4,ISR:4,LeaderEpoch:364,ControllerEpoch:415),ReplicationFactor:3),AllReplicas:1,2,4) correlation id 0 from controller 5 epoch 416 for partition [tec1.en2.frontend.syncPing,7] (state.change.logger)
[2016-04-09 00:40:58,139] TRACE Broker 1 handling LeaderAndIsr request correlationId 0 from controller 5 epoch 416 starting the become-follower transition for partition [tec1.en2.frontend.syncPing,7] (state.change.logger)
[2016-04-09 00:40:58,144] ERROR Broker 1 received LeaderAndIsrRequest with correlation id 0 from controller 5 epoch 415 for partition [tec1.en2.frontend.syncPing,7] but cannot become follower since the new leader 4 is unavailable. (state.change.logger)
{noformat}
Now, a bit later in the log, the broker says that it is caching the leader info for the partition:
{noformat}
[2016-04-09 00:42:02,456] TRACE Broker 1 cached leader info (LeaderAndIsrInfo:(Leader:4,ISR:4,LeaderEpoch:364,ControllerEpoch:415),ReplicationFactor:3),AllReplicas:1,2,4) for partition [tec1.en2.frontend.syncPing,7] in response to UpdateMetadata request sent by controller 5 epoch 416 with correlation id 1473 (state.change.logger)
{noformat}
but it keeps printing the “Cached zkVersion…” errors, which indicate that the broker still believes it is the leader of the partition, or at least that the variable {{leaderReplicaIdOpt}} is still set that way.
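For context on where that message comes from: the leader updates the ISR with a conditional write to the partition’s state znode, and the write is skipped when the znode version no longer matches the broker’s cached one. Below is a minimal sketch of that pattern; the names ({{PartitionSketch}}, {{tryConditionalUpdate}}) are made up for illustration and are not the actual {{Partition}}/{{ZkUtils}} code.
{code:scala}
// Minimal sketch of the conditional-update pattern behind the
// "Cached zkVersion ... skip updating ISR" message. PartitionSketch and
// tryConditionalUpdate are illustrative stand-ins, not Kafka's real API.
case class LeaderAndIsr(leader: Int, leaderEpoch: Int, isr: List[Int], zkVersion: Int)

class PartitionSketch(localBrokerId: Int) {
  // Set while this broker believes it is the leader; the observation above is
  // that it apparently stays set when the become-follower transition fails.
  @volatile var leaderReplicaIdOpt: Option[Int] = Some(localBrokerId)
  @volatile var cachedZkVersion: Int = 0

  // tryConditionalUpdate: (proposed state, expected zkVersion) => (succeeded?, new zkVersion)
  def updateIsr(newIsr: List[Int],
                tryConditionalUpdate: (LeaderAndIsr, Int) => (Boolean, Int)): Boolean = {
    leaderReplicaIdOpt match {
      case Some(leaderId) if leaderId == localBrokerId =>
        val proposed = LeaderAndIsr(localBrokerId, 0, newIsr, cachedZkVersion)
        val (updated, newVersion) = tryConditionalUpdate(proposed, cachedZkVersion)
        if (updated) {
          cachedZkVersion = newVersion // write succeeded, remember the new znode version
        } else {
          // Someone else (the controller or the new leader) already bumped the
          // znode, so this broker's view of the partition is stale.
          println(s"Cached zkVersion $cachedZkVersion not equal to that in zookeeper, skip updating ISR")
        }
        updated
      case _ =>
        false // this broker no longer thinks it is the leader; nothing to update
    }
  }
}
{code}
As long as {{leaderReplicaIdOpt}} keeps pointing at the local broker, this path keeps being retried, which would explain the repeated messages.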
I also inspected other partitions, and the behavior doesn’t seem to be consistent. I’ve seen at least one partition in broker 2 for which the broker made the appropriate transition:
{noformat}
[2016-04-09 00:39:23,840] TRACE Broker 2 received LeaderAndIsr request (LeaderAndIsrInfo:(Leader:3,ISR:2,3,LeaderEpoch:305,ControllerEpoch:414),ReplicationFactor:3),AllReplicas:3,2,4) correlation id 535 from controller 1 epoch 414 for partition [tec1.ono_qe1.bodydata.recordings,20] (state.change.logger)
[2016-04-09 00:39:23,841] TRACE Broker 2 handling LeaderAndIsr request correlationId 535 from controller 1 epoch 414 starting the become-follower transition for partition [tec1.ono_qe1.bodydata.recordings,20] (state.change.logger)
[2016-04-09 00:39:23,841] TRACE Broker 2 stopped fetchers as part of become-follower request from controller 1 epoch 414 with correlation id 535 for partition [tec1.ono_qe1.bodydata.recordings,20] (state.change.logger)
[2016-04-09 00:39:23,856] TRACE Broker 2 truncated logs and checkpointed recovery boundaries for partition [tec1.ono_qe1.bodydata.recordings,20] as part of become-follower request with correlation id 535 from controller 1 epoch 414 (state.change.logger)
[2016-04-09 00:39:23,856] TRACE Broker 2 started fetcher to new leader as part of become-follower request from controller 1 epoch 414 with correlation id 535 for partition [tec1.ono_qe1.bodydata.recordings,20] (state.change.logger)
[2016-04-09 00:39:23,856] TRACE Broker 2 completed LeaderAndIsr request correlationId 535 from controller 1 epoch 414 for the become-follower transition for partition [tec1.ono_qe1.bodydata.recordings,20] (state.change.logger)
{noformat}
Incidentally, the state-change log of broker 2 seems to have a gap starting at {{[2016-04-09 00:39:46,246]}}. Is that when you restarted the broker?
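Comparing the two cases, the ERROR on broker 1 suggests the become-follower path bails out when the new leader is not in the live-broker set, leaving the old leader state (and {{leaderReplicaIdOpt}}) in place, whereas broker 2 went through the full transition. A rough sketch of that gate, again with simplified names rather than the actual {{ReplicaManager}} code:
{code:scala}
// Rough sketch of the become-follower gate implied by the two logs above:
// broker 1 aborts because the new leader is not alive, broker 2 completes the
// transition. Names are simplified stand-ins, not Kafka's real API.
case class PartitionState(leader: Int, isr: List[Int], leaderEpoch: Int, controllerEpoch: Int)

class FollowerTransitionSketch(brokerId: Int) {
  // Plays the role of leaderReplicaIdOpt from the discussion above.
  @volatile var leaderReplicaIdOpt: Option[Int] = Some(brokerId)

  def makeFollower(topic: String, partition: Int,
                   state: PartitionState, liveBrokers: Set[Int]): Boolean = {
    if (!liveBrokers.contains(state.leader)) {
      // Broker 1's case: the transition is aborted, so the replica keeps its
      // old leader state and keeps trying to update the ISR as if it still led.
      println(s"Broker $brokerId cannot become follower for [$topic,$partition] " +
        s"since the new leader ${state.leader} is unavailable.")
      false
    } else {
      // Broker 2's case: record the new leader; the real broker also stops
      // fetchers, truncates the log and starts fetching from the new leader.
      leaderReplicaIdOpt = Some(state.leader)
      true
    }
  }
}
{code}
If that reading is right, the later UpdateMetadata request refreshes the metadata cache but never clears the stale leader state on broker 1, which would match the behavior above.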
> updateIsr should stop after failed several times due to zkVersion issue
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-3042
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3042
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>         Environment: jdk 1.7, centos 6.4
>            Reporter: Jiahongchao
>         Attachments: controller.log, server.log.2016-03-23-01, state-change.log
>
> Sometimes one broker may repeatedly log
> "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR"
> I think this is because the broker considers itself the leader when in fact it is a follower.
> So after several failed tries, it needs to find out who the leader is.
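To make the proposal in the title concrete, one possible shape (a sketch only; the retry limit and callbacks are assumptions, not an agreed design) would be to bound the conditional-update attempts and, once they keep failing on the zkVersion check, re-read the partition state from ZooKeeper to find out who the leader actually is:
{code:scala}
// Sketch of bounding the retries of the conditional ISR update and falling
// back to re-reading the authoritative state. Illustrative only; not existing
// Kafka code and not an agreed design.
object IsrUpdateSketch {
  case class ZkPartitionState(leader: Int, isr: List[Int], zkVersion: Int)

  def updateIsrWithBoundedRetries(
      attemptConditionalUpdate: Int => Option[Int], // expected zkVersion => new zkVersion if the write succeeded
      readStateFromZk: () => ZkPartitionState,
      cachedZkVersion: Int,
      maxRetries: Int = 3): Either[ZkPartitionState, Int] = {
    var version = cachedZkVersion
    var attempts = 0
    while (attempts < maxRetries) {
      attemptConditionalUpdate(version) match {
        case Some(newVersion) =>
          return Right(newVersion) // ISR written; caller caches the new zkVersion
        case None =>
          attempts += 1
          version = readStateFromZk().zkVersion // refresh the cached version before retrying
      }
    }
    // Give up: hand back the authoritative state so the caller can notice it is
    // no longer the leader instead of logging "Cached zkVersion ..." forever.
    Left(readStateFromZk())
  }
}
{code}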