Yes, I think so. We specifically upgraded the Kafka broker with a patch to avoid the ZK client NPEs; it seems not all of them are fixed. The Kafka broker becoming a zombie when even one ZK node is bad is especially terrible (see the note after the quoted logs below for why the client gives up on a single unresolvable host).
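For anyone else hitting this: the NPE in the quoted trace comes from zkclient's ZkConnection.readData (ZkConnection.java:117), which dereferences its internal ZooKeeper handle after the session has already been closed. This is not our actual patch, just a minimal caller-side sketch of the kind of guard involved; the class and method names here are hypothetical, not part of Kafka or zkclient:

    import org.I0Itec.zkclient.ZkClient;

    // Hypothetical caller-side guard against the NPE thrown by zkclient 0.x
    // when its internal ZooKeeper handle has already been closed. Names are
    // illustrative only; they are not part of Kafka or zkclient.
    public final class SafeZkReads {

        private SafeZkReads() {}

        /** Read a znode, treating an already-closed connection like a missing node. */
        public static <T> T readDataOrNull(ZkClient zkClient, String path) {
            try {
                // returnNullIfPathNotExists = true, which is the behaviour
                // kafka.utils.ZkUtils.readDataMaybeNull wants semantically.
                return zkClient.readData(path, true);
            } catch (NullPointerException e) {
                // ZkConnection.readData dereferences its ZooKeeper handle
                // without a null check, so a closed connection surfaces as an
                // NPE rather than a ZkException; swallow it so shutdown can
                // proceed without a spurious stack trace.
                return null;
            }
        }
    }

Note that in the quoted logs Kafka's swallowWarn did catch the NPE and the shutdown still reached the socket-server step, so a guard like this only removes the noisy NPE; it does not by itself explain the half-dead state that followed.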
On Tuesday, November 17, 2015, Mahdi Ben Hamida <ma...@signalfx.com> wrote:

> Hello,
>
> See below for my original email. I was wondering if anybody has feedback on the 4 questions I've asked. Should I go ahead and file this as a bug?
>
> Thanks.
>
> --
> Mahdi.
>
> On 11/12/15 2:37 PM, Mahdi Ben Hamida wrote:
>
>> Hi Everyone,
>>
>> We are using kafka 0.8.2.1 and we noticed that kafka/zookeeper-client were not able to gracefully handle a non-existent zookeeper instance. This caused one of our brokers to get stuck during a shutdown, and that seemed to impact the partitions for which the broker was the leader even though we had two other replicas.
>>
>> Here is a timeline of what happened (shortened for brevity, longer logs here [*]):
>>
>> We have a 7-node zookeeper cluster. Two of our nodes were decommissioned and their DNS records removed (zookeeper15 and zookeeper16). The decommissioning happened about two weeks earlier. We noticed the following in the logs:
>>
>> - Opening socket connection to server ip-10-0-0-1.ec2.internal/10.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
>> - Client session timed out, have not heard from server in 858ms for sessionid 0x1250c5c0f1f5001c, closing socket connection and attempting reconnect
>> - Opening socket connection to server ip-10.0.0.2.ec2.internal/10.0.0.2:2181. Will not attempt to authenticate using SASL (unknown error)
>> - zookeeper state changed (Disconnected)
>> - Client session timed out, have not heard from server in 2677ms for sessionid 0x1250c5c0f1f5001c, closing socket connection and attempting reconnect
>> - Opening socket connection to server ip-10.0.0.3.ec2.internal/10.0.36.107:2181. Will not attempt to authenticate using SASL (unknown error)
>> - Socket connection established to ip-10.0.0.3.ec2.internal/10.0.0.2:2181, initiating session
>> - zookeeper state changed (Expired)
>> - Initiating client connection, connectString=zookeeper21.example.com:2181,zookeeper19.example.com:2181,zookeeper22.example.com:2181,zookeeper18.example.com:2181,zookeeper20.example.com:2181,zookeeper16.example.com:2181,zookeeper15.example.com:2181/foo/kafka/central sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@3bbc39f8
>> - Unable to reconnect to ZooKeeper service, session 0x1250c5c0f1f5001c has expired, closing socket connection
>> - Unable to re-establish connection. Notifying consumer of the following exception:
>>
>> org.I0Itec.zkclient.exception.ZkException: Unable to connect to zookeeper21.example.com:2181,zookeeper19.example.com:2181,zookeeper22.example.com:2181,zookeeper18.example.com:2181,zookeeper20.example.com:2181,zookeeper16.example.com:2181,zookeeper15.example.com:2181/foo/kafka/central
>>         at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:69)
>>         at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:1176)
>>         at org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:649)
>>         at org.I0Itec.zkclient.ZkClient.process(ZkClient.java:560)
>>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
>> *Caused by: java.net.UnknownHostException: zookeeper16.example.com: unknown error*
>>         at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
>>         at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
>>         at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
>>         at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
>>         at java.net.InetAddress.getAllByName(InetAddress.java:1192)
>>         at java.net.InetAddress.getAllByName(InetAddress.java:1126)
>>         at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
>>         at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
>>         at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
>>         at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:67)
>>         ... 5 more
>>
>> That seems to have caused the following:
>>
>> [main-EventThread] [org.apache.zookeeper.ClientCnxn]: EventThread shut down
>>
>> Which in turn caused kafka to shut itself down:
>>
>> [Thread-2] [kafka.server.KafkaServer]: [Kafka Server 13], shutting down
>> [Thread-2] [kafka.server.KafkaServer]: [Kafka Server 13], Starting controlled shutdown
>>
>> However, the shutdown didn't go as expected, apparently due to an NPE in the zk client:
>>
>> 2015-11-12T12:03:40.101Z WARN [Thread-2] [kafka.utils.Utils$]:
>> *java.lang.NullPointerException*
>>         at org.I0Itec.zkclient.ZkConnection.readData(ZkConnection.java:117)
>>         at org.I0Itec.zkclient.ZkClient$10.call(ZkClient.java:992)
>>         at org.I0Itec.zkclient.ZkClient$10.call(ZkClient.java:988)
>>         at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:883)
>>         at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:988)
>>         at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:983)
>>         at kafka.utils.ZkUtils$.readDataMaybeNull(ZkUtils.scala:450)
>>         at kafka.utils.ZkUtils$.getController(ZkUtils.scala:65)
>>         at kafka.server.KafkaServer.kafka$server$KafkaServer$$controlledShutdown(KafkaServer.scala:194)
>>         at kafka.server.KafkaServer$$anonfun$shutdown$1.apply$mcV$sp(KafkaServer.scala:269)
>>         at kafka.utils.Utils$.swallow(Utils.scala:172)
>>         at kafka.utils.Logging$class.swallowWarn(Logging.scala:92)
>>         at kafka.utils.Utils$.swallowWarn(Utils.scala:45)
>>         at kafka.utils.Logging$class.swallow(Logging.scala:94)
>>         at kafka.utils.Utils$.swallow(Utils.scala:45)
>>         at kafka.server.KafkaServer.shutdown(KafkaServer.scala:269)
>>         at kafka.server.KafkaServerStartable.shutdown(KafkaServerStartable.scala:42)
>>         at kafka.Kafka$$anon$1.run(Kafka.scala:42)
>> 2015-11-12T12:03:40.106Z INFO [Thread-2] [kafka.network.SocketServer]: [Socket Server on Broker 13], Shutting down
>>
>> The kafka process continued running after this point. This is confirmed by the continuous rolling of logs:
>>
>> [ReplicaFetcherThread-3-9] [kafka.log.Log]: Rolled new log segment for 'topic-a-1' in 0 ms.
>> [ReplicaFetcherThread-0-12] [kafka.log.Log]: Rolled new log segment for 'topic-b-4' in 0 ms.
>>
>> etc.
>>
>> At this point, that broker was in a half-dead state. Our clients were still timing out enqueuing messages to it. The under-replicated partition count on the other brokers was stuck at a positive, constant value and did not make any progress. We also noticed that the JMX connector threads weren't responding, which is how we found out that the process was in bad shape. This went on for about 40 minutes until we killed the process and restarted it. Things recovered after the restart.
>>
>> 1. Is this a known kafka/zookeeper issue impacting the version we are running? If not, I'd be happy to file a bug.
>> 2. Since we had other healthy zookeeper instances (5 out of 7), is there a reason why kafka/zkclient couldn't have handled this more gracefully? My assumption was that the zookeeper client would pick other instances from the list until a healthy node is found. It is odd that it gave up this quickly.
>> 3. On the kafka side, the server decided to shut down by itself; however, the shutdown should have been a clean one. The NPE should either have been avoided, or caught and handled differently. It would be good to have some clarification on when this self-shutdown is invoked and whether we could have done things differently on our side to avoid it (for example, restarting the broker after decommissioning some of our zookeeper nodes).
>> 4. More generally, what do other kafka users do when recycling their zookeeper clusters or replacing old machines with new ones?
>>
>> Thanks.
>> --
>> Mahdi.
>>
>> [*] https://gist.github.com/mahdibh/76b230e25a3f7113349e
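On question 2: judging from the trace, the "Caused by: java.net.UnknownHostException: zookeeper16.example.com" frame is thrown from StaticHostProvider's constructor, so the ZooKeeper client resolves every host in the connect string up front and a single stale DNS name aborts construction of the whole client; zkclient's reconnect never gets to try the five healthy servers. The broker builds its client straight from zookeeper.connect, so the practical fix is to drop decommissioned hosts from that setting and restart. Purely as an illustration of the failure mode, a pre-filter over the connect string would look roughly like this (class and method names are hypothetical, not anything Kafka 0.8.2.1 actually does):

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper: drop connect-string entries that no longer resolve
    // before a client is constructed, so one decommissioned host cannot fail
    // client construction outright. Sketch only.
    public final class ZkConnectStringFilter {

        private ZkConnectStringFilter() {}

        public static String dropUnresolvableHosts(String connectString) {
            // Separate an optional chroot suffix such as /foo/kafka/central.
            int slash = connectString.indexOf('/');
            String hostList = slash >= 0 ? connectString.substring(0, slash) : connectString;
            String chroot = slash >= 0 ? connectString.substring(slash) : "";

            List<String> resolvable = new ArrayList<>();
            for (String hostPort : hostList.split(",")) {
                String host = hostPort.split(":")[0].trim();
                try {
                    InetAddress.getByName(host); // same lookup the ZK client performs
                    resolvable.add(hostPort.trim());
                } catch (UnknownHostException e) {
                    // Stale DNS entry for a decommissioned ZK node; skip it so
                    // the remaining healthy servers can still be used.
                }
            }
            return String.join(",", resolvable) + chroot;
        }
    }

Applied to the connect string in the quoted logs, this would drop zookeeper15 and zookeeper16 and keep the five healthy servers plus the /foo/kafka/central chroot.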