For what it's worth, I am currently looking into a problem that sounds suspiciously related. We're seeing NoNode exceptions for the consumer node during rebalance:

Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /consumers/es_consumer/ids/es_consumer_cloudalytics-preprod-app3.s1phx1.jivehosted.com-1379900787620-f067bcb7

(This is for the local consumer node.) Looking at the code, I had a hard time figuring out how this could happen. Sadly, I didn't get the logs for the period when the problem started, so I've been stumped so far...

/Sam

On Sep 24, 2013, at 5:43 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:

> Interesting. I haven't had the chance to dive into the zkclient codebase to
> understand the root cause yet, but since you mentioned this can cause
> ephemeral node loss, I am curious to know how you detected the ephemeral
> node loss. Did the Kafka consumer not respond to rebalance events or did
> the server not respond to state change events? Also, ephemeral nodes are
> lost only when sessions are expired on the zookeeper server or if clients
> close the session actively, how does losing connection lead to ephemeral
> node loss?
>
> Thanks,
> Neha
>
>
> On Mon, Sep 23, 2013 at 7:02 AM, Anatoly Fayngelerin
> <fanat...@gmail.com> wrote:
>
>> Hi Everyone,
>>
>> I've run into the following issue with the Kafka server. The zkclient lib
>> seems to die silently if there is an UnknownHostException (or any
>> IOException) while reconnecting the ZK session. I've filed a bug about this
>> with the zkclient lib (https://github.com/sgroschupf/zkclient/issues/23).
>> The ramifications for Kafka were the silent loss of all ephemeral nodes
>> associated with the affected process.
>>
>> Has anyone faced this issue? If so, what is the recommended way of dealing
>> with this?
>>
>> If there is no good solution available, would the community be open to a
>> patch that periodically verifies ZK connectivity?
>>
>> Thanks,
>> Anatoly
>>
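
For reference, the kind of periodic check proposed in the quoted message could be sketched roughly as below. This is only an illustration, not what Kafka or zkclient actually does: `EphemeralNodeWatchdog`, `nodeExists`, and `reRegister` are hypothetical names, and the existence probe is passed in as a plain `BooleanSupplier` so the sketch does not assume any particular ZooKeeper client API.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical watchdog that periodically verifies an ephemeral node is still present. */
class EphemeralNodeWatchdog {
    // Assumption: the caller wraps its own ZK "does this path exist?" call here.
    private final BooleanSupplier nodeExists;
    // Assumption: the caller supplies whatever re-creates the ephemeral node.
    private final Runnable reRegister;
    private final AtomicInteger repairs = new AtomicInteger();

    EphemeralNodeWatchdog(BooleanSupplier nodeExists, Runnable reRegister) {
        this.nodeExists = nodeExists;
        this.reRegister = reRegister;
    }

    /** Runs one check; returns true if the node was missing and re-registration ran. */
    boolean checkOnce() {
        boolean present;
        try {
            present = nodeExists.getAsBoolean();
        } catch (RuntimeException e) {
            // The probe itself failed (e.g. no connectivity): the node state is
            // unknown, so do not re-register blindly; try again on the next tick.
            return false;
        }
        if (present) {
            return false;
        }
        reRegister.run();
        repairs.incrementAndGet();
        return true;
    }

    /** Schedules checkOnce() at a fixed delay; the caller shuts the executor down. */
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                checkOnce();
            } catch (RuntimeException e) {
                // Swallow so that one failed repair does not cancel future checks.
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }

    int repairCount() {
        return repairs.get();
    }
}
```

A check like this would only paper over the symptom (the missing node under /consumers/.../ids); it would not explain why the session or node was lost in the first place.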