[ https://issues.apache.org/jira/browse/KAFKA-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536286#comment-14536286 ]
Igor Maravić commented on KAFKA-2182:
-------------------------------------

This is a continuation of the discussion from KAFKA-2169.

It would also be interesting to see how Apache Curator handles reconnects in this case. As far as I can tell, the failure scenario would be exactly the same if we were using the ZkClient Curator bridge.

> zkClient dies if there is any exception while reconnecting
> ----------------------------------------------------------
>
>                 Key: KAFKA-2182
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2182
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.1
>            Reporter: Igor Maravić
>            Priority: Critical
>
> We, Spotify, have just been hit by a bug related to ZkClient. It made
> a whole Kafka cluster go down.
> Long story short, we restarted a TOR switch and all of the brokers in the
> cluster lost connectivity to the ZooKeeper cluster, which was living in
> another rack. Because of that, all the connections to ZooKeeper expired.
> Everything would have been fine if we had not also lost the hostname-to-IP-address
> mapping for some reason. Since we did lose that mapping, we got an
> UnknownHostException when the broker tried to reconnect. This exception got
> swallowed, and we never reconnected to ZooKeeper, which in turn made
> our cluster of brokers useless.
> If we had a "handleSessionEstablishmentError" function, the exception
> could be caught and we could simply kill the KafkaServer process and restart
> it cleanly, since this kind of exception is fatal for the Kafka cluster.
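The callback proposed in the description can be illustrated with a minimal, self-contained sketch. Only the name handleSessionEstablishmentError comes from the report; the SessionListener and Connector interfaces and the onSessionExpired method are hypothetical stand-ins, not ZkClient's actual API. The point is only that a failure while re-establishing the session is handed to the listener instead of being swallowed.

```java
import java.net.UnknownHostException;

public class ReconnectSketch {

    /** Hypothetical listener; only the error callback's name is taken from the report. */
    interface SessionListener {
        void handleNewSession();
        void handleSessionEstablishmentError(Throwable error);
    }

    /** Hypothetical stand-in for the step that opens a fresh ZooKeeper connection. */
    interface Connector {
        void connect() throws Exception;
    }

    /**
     * Called when the old session has expired. Any exception raised while
     * connecting (for example a DNS failure) is passed to the listener
     * rather than being lost on the event thread.
     */
    static void onSessionExpired(Connector connector, SessionListener listener) {
        try {
            connector.connect();
        } catch (Exception e) {
            listener.handleSessionEstablishmentError(e);
            return;
        }
        listener.handleNewSession();
    }

    public static void main(String[] args) {
        // Simulate the lost hostname-to-IP mapping from the report.
        Connector failingDns = () -> {
            throw new UnknownHostException("zookeeper1.example.invalid");
        };

        onSessionExpired(failingDns, new SessionListener() {
            @Override
            public void handleNewSession() {
                System.out.println("session re-established");
            }

            @Override
            public void handleSessionEstablishmentError(Throwable error) {
                // Assumption: a real broker would treat this as fatal and exit
                // so that a supervisor can restart the process cleanly.
                System.err.println("could not re-establish session: " + error);
            }
        });
    }
}
```

With a hook of this shape, the broker decides what to do on a failed reconnect (retry, or shut down and restart), which is the behaviour the report asks for.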
> {code}
> 2015-05-05T12:49:01.709+00:00 127.0.0.1 apache-kafka[main-EventThread] INFO zookeeper.ZooKeeper - Initiating client connection, connectString=zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@7303d690
> 2015-05-05T12:49:01.711+00:00 127.0.0.1 apache-kafka[main-EventThread] ERROR zookeeper.ClientCnxn - Error while calling watcher
> 2015-05-05T12:49:01.711+00:00 127.0.0.1 java.lang.RuntimeException: Exception while restarting zk client
> 2015-05-05T12:49:01.711+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:462)
> 2015-05-05T12:49:01.711+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkClient.process(ZkClient.java:368)
> 2015-05-05T12:49:01.711+00:00 127.0.0.1     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> 2015-05-05T12:49:01.711+00:00 127.0.0.1     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2015-05-05T12:49:01.711+00:00 127.0.0.1 Caused by: org.I0Itec.zkclient.exception.ZkException: Unable to connect to zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local
> 2015-05-05T12:49:01.711+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:66)
> 2015-05-05T12:49:01.711+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:939)
> 2015-05-05T12:49:01.711+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:458)
> 2015-05-05T12:49:01.711+00:00 127.0.0.1     ... 3 more
> 2015-05-05T12:49:01.712+00:00 127.0.0.1 Caused by: java.net.UnknownHostException: zookeeper1.spotify.net: Name or service not known
> 2015-05-05T12:49:01.712+00:00 127.0.0.1     at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
> 2015-05-05T12:49:01.712+00:00 127.0.0.1     at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
> 2015-05-05T12:49:01.712+00:00 127.0.0.1     at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
> 2015-05-05T12:49:01.712+00:00 127.0.0.1     at java.net.InetAddress.getAllByName0(InetAddress.java:1246)
> 2015-05-05T12:49:01.712+00:00 127.0.0.1     at java.net.InetAddress.getAllByName(InetAddress.java:1162)
> 2015-05-05T12:49:01.712+00:00 127.0.0.1     at java.net.InetAddress.getAllByName(InetAddress.java:1098)
> 2015-05-05T12:49:01.712+00:00 127.0.0.1     at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
> 2015-05-05T12:49:01.712+00:00 127.0.0.1     at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
> 2015-05-05T12:49:01.712+00:00 127.0.0.1     at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
> 2015-05-05T12:49:01.713+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:64)
> 2015-05-05T12:49:01.713+00:00 127.0.0.1     ... 5 more
> 2015-05-05T12:49:01.713+00:00 127.0.0.1 apache-kafka[ZkClient-EventThread-18-zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local] ERROR zkclient.ZkEventThread - Error handling event ZkEvent[Children of /config/changes changed sent to kafka.server.TopicConfigManager$ConfigChangeListener$@17638f6]
> 2015-05-05T12:49:01.713+00:00 127.0.0.1 java.lang.NullPointerException
> 2015-05-05T12:49:01.713+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95)
> 2015-05-05T12:49:01.713+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:439)
> 2015-05-05T12:49:01.713+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:436)
> 2015-05-05T12:49:01.713+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
> 2015-05-05T12:49:01.713+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:436)
> 2015-05-05T12:49:01.713+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:445)
> 2015-05-05T12:49:01.714+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:566)
> 2015-05-05T12:49:01.714+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> 2015-05-05T12:49:01.714+00:00 127.0.0.1 apache-kafka[main-EventThread] INFO zookeeper.ClientCnxn - EventThread shut down
> 2015-05-05T12:49:01.714+00:00 127.0.0.1 apache-kafka[ZkClient-EventThread-18-zookeeper1.spotify.net:2181,zookeeper2.spotify.net:2181,zookeeper3.spotify.net:2181/gabobroker-local] ERROR zkclient.ZkEventThread - Error handling event ZkEvent[Data of /controller changed sent to kafka.server.ZookeeperLeaderElector$LeaderChangeListener@18360394]
> 2015-05-05T12:49:01.714+00:00 127.0.0.1 java.lang.NullPointerException
> 2015-05-05T12:49:01.714+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95)
> 2015-05-05T12:49:01.714+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:439)
> 2015-05-05T12:49:01.714+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkClient$3.call(ZkClient.java:436)
> 2015-05-05T12:49:01.714+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
> 2015-05-05T12:49:01.714+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:436)
> 2015-05-05T12:49:01.714+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:544)
> 2015-05-05T12:49:01.714+00:00 127.0.0.1     at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)