[ 
https://issues.apache.org/jira/browse/KAFKA-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959479#comment-15959479
 ] 

Jeff Widman commented on KAFKA-3042:
------------------------------------

We hit this on 0.10.0.1. 

The root cause was a very long ZooKeeper GC pause that caused the brokers to
lose their connection to it. Producers and consumers had established their
connections to the brokers before the ZK issue, so they kept working. But the
broker logs were full of warnings about the cached zkVersion not matching, and
anything that required the controller was broken; for example, newly created
partitions didn't get leaders.

I think this log message could be made more specific to show which znodes don't 
match.

I don't know whether this message is logged whenever two znodes don't match,
but in our case the ZK GC pause led to a race in which the epoch of the
/controller znode did not match the partition's controller epoch under the
/brokers znode. I'm not sure whether this can be fixed, perhaps with the ZK
multi command, which applies a group of updates transactionally.
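
For context, the warning appears to come from a conditional (versioned) write:
the broker that believes it is the leader updates the partition state only if
the znode version it cached still matches the one in ZooKeeper. The sketch
below is a toy in-memory model of that check, not Kafka or ZooKeeper code;
the class VersionedNode and its method set_if_version are hypothetical names
used only for illustration.

```python
class VersionedNode:
    """Toy stand-in for a znode: data plus a version bumped on every write."""

    def __init__(self, data):
        self.data = data
        self.version = 0

    def set_if_version(self, data, expected_version):
        """Write only if the caller's expected version is still current."""
        if expected_version != self.version:
            return False, self.version  # stale caller, nothing written
        self.data = data
        self.version += 1
        return True, self.version


# Partition state as broker 1 (the leader) last saw it.
state = VersionedNode({"leader": 1, "isr": [1, 2, 3]})
cached_version = state.version

# Someone else (e.g. the controller) rewrites the state, but broker 1
# never learns about it, so its cached version is now stale.
state.set_if_version({"leader": 2, "isr": [2, 3]}, 0)

# Broker 1 still thinks it is the leader and tries to shrink the ISR.
ok, current = state.set_if_version({"leader": 1, "isr": [1]}, cached_version)
if not ok:
    print("Cached version %d not equal to stored version %d, skip update"
          % (cached_version, current))
```

In this model the stale writer fails on every retry until something refreshes
its cached version, which matches the repeated warning described in this
issue.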

It took us a while to realize that was what the log message meant, which is why
I'd like it to report exactly which znode paths don't match in ZooKeeper.

For us, forcing a controller re-election by deleting the /controller znode 
immediately fixed the issue without having to restart brokers. 
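
A minimal sketch of that workaround in Python, assuming a kazoo-style client
(get(path) returning (data, stat), and delete(path, version=...)). The helper
name force_controller_reelection and the host in the usage comment are
placeholders, and the JSON layout of /controller with a "brokerid" field is
what we saw on our cluster, so treat it as an assumption for other versions.

```python
import json

CONTROLLER_PATH = "/controller"


def force_controller_reelection(zk, path=CONTROLLER_PATH):
    """Delete the controller znode so the brokers elect a new controller.

    `zk` is any client with kazoo-style get() and delete() methods.
    Returns the broker id recorded in the znode before deletion.
    """
    data, stat = zk.get(path)
    old = json.loads(data.decode("utf-8"))
    # Conditional delete: fails if the znode changed since we read it,
    # so we never remove a controller that was elected in the meantime.
    zk.delete(path, version=stat.version)
    return old.get("brokerid")


# Usage with kazoo (assumed to be installed; host is a placeholder):
#   from kazoo.client import KazooClient
#   zk = KazooClient(hosts="zk1.example.com:2181")
#   zk.start()
#   print("old controller:", force_controller_reelection(zk))
#   zk.stop()
```

Reading the znode first and deleting with its version keeps the operation
from racing with a re-election that happens on its own.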

> updateIsr should stop after failed several times due to zkVersion issue
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-3042
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3042
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.8.2.1
>         Environment: jdk 1.7
> centos 6.4
>            Reporter: Jiahongchao
>            Assignee: Dong Lin
>              Labels: reliability
>             Fix For: 0.11.0.0
>
>         Attachments: controller.log, server.log.2016-03-23-01, 
> state-change.log
>
>
> Sometimes a broker may repeatedly log
> "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR"
> I think this is because the broker considers itself the leader when in fact
> it is a follower.
> So after several failed tries, it needs to find out who the leader is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
