[ https://issues.apache.org/jira/browse/KAFKA-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959479#comment-15959479 ]
Jeff Widman commented on KAFKA-3042:
------------------------------------

We hit this on 0.10.0.1. The root cause was a very long ZooKeeper GC pause that caused the brokers to lose their connection.

Producers and consumers had established their connections to the brokers before the ZooKeeper issue, so they kept working. But the broker logs were full of warnings about the cached zkVersion not matching, and anything that required the controller was broken; for example, newly created partitions had no leaders.

I don't know whether this message is logged whenever any two znodes don't match, but in our case the ZooKeeper GC pause led to a race in which the epoch of the /controller znode somehow ended up not matching the partition controller epoch under the /brokers znode. I'm not sure whether this can be fixed, perhaps with the ZooKeeper multi-command, where updates are transactional.

It took us a while to realize that this was what the log message meant, so I think the message should be made more specific and report exactly which znode paths don't match in ZooKeeper.

For us, forcing a controller re-election by deleting the /controller znode immediately fixed the issue without having to restart brokers.
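For anyone else trying to decode the warning, here is a rough in-memory sketch of the mechanism as I understand it: the broker writes the ISR conditionally against a cached znode version, and once something else has written the znode, every retry with the stale version fails. This is not Kafka's code. `VersionedNode`, `shrink_isr` and `max_attempts` are hypothetical names made up for illustration, and the class only imitates a znode instead of talking to ZooKeeper. The bounded retry followed by a leader check is the behaviour the issue asks for, not what the broker does today.

```python
class VersionedNode:
    """In-memory stand-in for a partition state znode (hypothetical, not a ZK client)."""

    def __init__(self, leader, isr):
        self.leader = leader
        self.isr = list(isr)
        self.version = 0  # bumped on every successful write, like a znode version

    def conditional_set_isr(self, isr, expected_version):
        """Write the ISR only if the caller's cached version is still current."""
        if expected_version != self.version:
            return False
        self.isr = list(isr)
        self.version += 1
        return True

    def set_leader(self, leader, isr):
        """Unconditional write, standing in for a controller leader election."""
        self.leader = leader
        self.isr = list(isr)
        self.version += 1


def shrink_isr(node, broker_id, new_isr, cached_version, max_attempts=3):
    """Attempt a conditional ISR update; stop after max_attempts failures.

    Returns (outcome, version), where outcome is "updated", "not_leader"
    or "refreshed".
    """
    for _ in range(max_attempts):
        if node.conditional_set_isr(new_isr, cached_version):
            return "updated", node.version
        # Wording copied from the message quoted in the issue.
        print(f"Cached zkVersion {cached_version} not equal to that in "
              "zookeeper, skip updating ISR")
    # Give up on blind retries: re-read the node and check who the leader is.
    if node.leader != broker_id:
        return "not_leader", node.version
    return "refreshed", node.version


# Broker 1 believes it is the leader and caches the znode version.
node = VersionedNode(leader=1, isr=[1, 2, 3])
cached = node.version

# Meanwhile leadership moves to broker 2, which bumps the version.
node.set_leader(2, [2, 3])

# Broker 1 retries with its stale version, then discovers it is a follower.
outcome, version = shrink_isr(node, broker_id=1, new_isr=[1], cached_version=cached)

# With a current version, the same call succeeds on the first attempt.
fresh = VersionedNode(leader=1, isr=[1, 2, 3])
fresh_outcome, fresh_version = shrink_isr(fresh, 1, [1, 2], fresh.version)
```

In the stale case the ISR written by the new leader is left untouched, and the broker learns it is no longer the leader instead of logging the same warning forever.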
> updateIsr should stop after failed several times due to zkVersion issue
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-3042
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3042
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.8.2.1
>         Environment: jdk 1.7
>                      centos 6.4
>            Reporter: Jiahongchao
>            Assignee: Dong Lin
>              Labels: reliability
>             Fix For: 0.11.0.0
>
>         Attachments: controller.log, server.log.2016-03-23-01, state-change.log
>
>
> Sometimes one broker may repeatedly log
> "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR"
> I think this is because the broker considers itself the leader when in fact it is
> a follower.
> So after several failed tries, it needs to find out who the leader is.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)