[ https://issues.apache.org/jira/browse/KAFKA-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ismael Juma updated KAFKA-4418:
-------------------------------
    Labels: reliability  (was: )

> Broker Leadership Election Fails If Missing ZK Path Raises Exception
> --------------------------------------------------------------------
>
>                 Key: KAFKA-4418
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4418
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.9.0.1, 0.10.0.0, 0.10.0.1
>            Reporter: Michael Pedersen
>              Labels: reliability
>
> Our Kafka cluster went down because a single node went down *and* a path in
> Zookeeper was missing for one topic (/brokers/topics/<topicname>/partitions).
> When this occurred, leadership election could not run, and produced a stack
> trace that looked like this:
>
> Failed to start preferred replica election
> org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/topics/warandpeace/partitions
>         at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47)
>         at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:995)
>         at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:675)
>         at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:671)
>         at kafka.utils.ZkUtils.getChildren(ZkUtils.scala:537)
>         at kafka.utils.ZkUtils$$anonfun$getAllPartitions$1.apply(ZkUtils.scala:817)
>         at kafka.utils.ZkUtils$$anonfun$getAllPartitions$1.apply(ZkUtils.scala:816)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>         at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>         at kafka.utils.ZkUtils.getAllPartitions(ZkUtils.scala:816)
>         at kafka.admin.PreferredReplicaLeaderElectionCommand$.main(PreferredReplicaLeaderElectionCommand.scala:64)
>         at kafka.admin.PreferredReplicaLeaderElectionCommand.main(PreferredReplicaLeaderElectionCommand.scala)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/topics/warandpeace/partitions
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at org.I0Itec.zkclient.ZkConnection.getChildren(ZkConnection.java:114)
>         at org.I0Itec.zkclient.ZkClient$4.call(ZkClient.java:678)
>         at org.I0Itec.zkclient.ZkClient$4.call(ZkClient.java:675)
>         at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:985)
>         ... 16 more
>
> I have checked through the code a bit, and have found a quick place to
> introduce a fix that would seem to allow the leadership election to continue.
> Specifically, the function at
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/utils/ZkUtils.scala#L633
> does not handle possible exceptions. Wrapping a try/catch block here would
> work, but could introduce a number of other problems:
> * If the code is used elsewhere, the exception might be needed at a higher
>   level to prevent something else.
> * Unless the exception is logged/reported somehow, no one will know this
>   problem exists, which makes debugging other problems harder.
>
> I'm sure there are other issues I'm not aware of, but those two come to mind
> quickly. What would be the best route for getting this resolved quickly?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
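
The try/catch idea raised in the report, together with its two concerns, could be sketched roughly as follows. This is only an illustration, not the actual Kafka fix: `NoNodeException` is a local stand-in for `ZkNoNodeException` so the snippet compiles without the zkclient dependency, and `childrenOrEmpty`, `fetch`, `warn`, `tree`, and `fakeFetch` are hypothetical names. The handling lives in a separate helper, so callers that need the exception can keep calling the raw lookup, and every skipped path is reported through `warn`.

```scala
// Stand-in for org.I0Itec.zkclient.exception.ZkNoNodeException
// (hypothetical; keeps the sketch free of external dependencies).
final class NoNodeException(path: String)
  extends RuntimeException(s"NoNode for $path")

object SafeChildrenSketch {
  // Hypothetical helper. `fetch` stands in for the ZooKeeper getChildren call.
  // A missing node is reported via `warn` and treated as "no children";
  // any other exception still propagates to the caller.
  def childrenOrEmpty(
      path: String,
      fetch: String => Seq[String],
      warn: String => Unit = msg => System.err.println(msg)
  ): Seq[String] =
    try fetch(path)
    catch {
      case e: NoNodeException =>
        warn(s"Skipping $path: ${e.getMessage}")
        Seq.empty
    }

  // In-memory fake of the ZooKeeper tree, for demonstration only.
  val tree: Map[String, Seq[String]] =
    Map("/brokers/topics/healthy/partitions" -> Seq("0", "1"))

  // Fake lookup that throws for any path missing from `tree`.
  val fakeFetch: String => Seq[String] =
    path => tree.getOrElse(path, throw new NoNodeException(path))
}
```

With this shape, a topic whose `partitions` node is missing contributes an empty list and one warning, while the remaining topics are still processed. Whether silently skipping such a topic is acceptable for leader election is a design question the sketch does not settle.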