That does sound like a saner solution. Which GitHub repo do you submit patches to? The repo I originally posted on (https://github.com/sgroschupf/zkclient/issues/23) looks a little stale.
On Tue, Sep 24, 2013 at 11:34 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:

> Thanks for explaining the bug. This is a serious issue that we should fix
> at the zkclient level. We have submitted patches to them before and they
> were pretty helpful in releasing a new version with the patch. I think that
> will lead to a cleaner solution than trying to get around it in Kafka code
> since zkclient usage is pretty wide spread across the server and consumer
> code today.
>
> Thanks,
> Neha
>
>
> On Tue, Sep 24, 2013 at 8:28 AM, Anatoly Fayngelerin <fanat...@gmail.com> wrote:
>
> > Joel - that is exactly right. ZkClient has no way to notify consumers of
> > this situation. The session end event gets fired, however, the session
> > begin event never occurs.
> >
> > Neha - The issue manifested itself when producers were attempting to
> > discover topics/brokers. The kafka brokers had lost their ZK sessions
> > during a network outage. The outage was long enough for ZooKeeper to expire
> > the sessions corresponding to the ephemeral nodes in /broker/. The zkclient
> > bug prevented the broker from ever re-establishing the ZK session.
> > Subsequently, no zookeeper based producer was able to discover
> > topic->broker mappings.
> > The resulting exceptions looked like:
> >
> > Caused by: kafka.common.NoBrokersForPartitionException: Partition = null
> >   at kafka.producer.Producer.kafka$producer$Producer$getPartitionListForTopic(Producer.scala:167)
> >   at kafka.producer.Producer$anonfun$3.apply(Producer.scala:116)
> >   at kafka.producer.Producer$anonfun$3.apply(Producer.scala:105)
> >   at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:233)
> >   at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:233)
> >   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:34)
> >   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:33)
> >   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
> >   at scala.collection.mutable.WrappedArray.map(WrappedArray.scala:33)
> >   at kafka.producer.Producer.zkSend(Producer.scala:105)
> >   at kafka.producer.Producer.send(Producer.scala:99)
> >   at com.yieldmo.common.protobuf.ProtoKafkaWriter$class.write(ProtoKafka.scala:20)
> >   at com.yieldmo.common.protobuf.ProtoWriter.write(ProtoKafka.scala:40)
> >   at com.yieldmo.storm.bolt.KafkaProtoWriterBolt.execute(KafkaProtoWriterBolt.scala:48)
> >
> > As far as I can see, the only way to deal with this without patching
> > zkclient is to periodically check the status of the zk connection and try
> > to detect this kind of situation. I would love to hear better ideas for how
> > to handle this.
> >
> >
> > On Tue, Sep 24, 2013 at 3:31 AM, Joel Koshy <jjkosh...@gmail.com> wrote:
> >
> > > > node loss. Did the Kafka consumer not respond to rebalance events or did
> > > > the server not respond to state change events ? Also, ephemeral nodes are
> > > > lost only when sessions are expired on the zookeeper server or if clients
> > > > close the session actively, how does losing connection lead to ephemeral
> > > > node loss?
> > >
> > > My understanding of Anatoly's observation is that on session
> > > expiration, zkclient will reconnect
> > > (https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkClient.java#L458)
> > > but if the connect causes an IOException, that would effectively mean
> > > that the session will not get re-established. Anatoly, can you
> > > confirm?
> > >
> > > > On Mon, Sep 23, 2013 at 7:02 AM, Anatoly Fayngelerin <fanat...@gmail.com> wrote:
> > > >
> > > >> Hi Everyone,
> > > >>
> > > >> I've run into the following issue with the Kafka server. The zkclient lib
> > > >> seems to die silently if there is an UnknownHostException (or any
> > > >> IOException) while reconnecting the ZK session. I've filed a bug about this
> > > >> with the zkclient lib (https://github.com/sgroschupf/zkclient/issues/23).
> > > >> The ramifications for Kafka were the silent loss of all ephemeral nodes
> > > >> associated with the affected process.
> > > >>
> > > >> Has anyone faced this issue? If so, what is the recommended way of dealing
> > > >> with this?
> > > >>
> > > >> If there is no good solution available, would the community be open to a
> > > >> patch that periodically verifies ZK connectivity?
> > > >>
> > > >> Thanks,
> > > >> Anatoly
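
For reference, the "periodically verify ZK connectivity" idea raised in the thread could look roughly like the sketch below. It is only an illustration under stated assumptions: `ZkSessionWatchdog`, `SessionHandle`, `checkOnce`, `failedReconnects` and `start` are hypothetical names, not part of zkclient or Kafka, and `SessionHandle` stands in for whatever connection check and reconnect call the real client exposes.

```java
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical watchdog; none of these names come from zkclient or Kafka. */
final class ZkSessionWatchdog {

    /** Hypothetical abstraction over the ZK client used by the process. */
    public interface SessionHandle {
        boolean isConnected();
        void reconnect() throws IOException;
    }

    private final SessionHandle session;
    private final AtomicInteger failedReconnects = new AtomicInteger();

    ZkSessionWatchdog(SessionHandle session) {
        this.session = session;
    }

    /** Runs one check; returns true if the session is usable afterwards. */
    boolean checkOnce() {
        if (session.isConnected()) {
            return true; // healthy, nothing to do
        }
        try {
            session.reconnect();
            return session.isConnected();
        } catch (IOException e) {
            // The failure mode described in the thread (e.g. UnknownHostException):
            // record it and let the next scheduled check retry, rather than
            // giving up silently.
            failedReconnects.incrementAndGet();
            return false;
        }
    }

    /** Number of reconnect attempts that failed so far. */
    int failedReconnects() {
        return failedReconnects.get();
    }

    /** Schedules checkOnce() at a fixed delay; the caller shuts the executor down. */
    ScheduledExecutorService start(long periodMillis) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleWithFixedDelay(() -> {
            try {
                checkOnce();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs, so count it instead.
                failedReconnects.incrementAndGet();
            }
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return executor;
    }
}
```

A check like this only works around the problem from the outside. After a successful reconnect the process would still need to re-create its ephemeral nodes, which is why fixing the reconnect path in zkclient itself, as suggested above, is the cleaner route.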