Hi Guozhang,

OK, I spent some time understanding a bit more about how Kafka uses ZooKeeper
and how sessions are handled, and it seems that the change you proposed should
do the job. Thanks :-)
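In case someone finds this thread later, the change boils down to something
like this on the consumer side (only a sketch against the 0.8 high-level
consumer API; the timeout value, ZK hosts and group id below are made-up
examples, not recommendations):

    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class SessionTimeoutExample {

        // Build a high-level consumer whose ZK session survives short latency
        // spikes by raising zookeeper.session.timeout.ms above the 6000 ms
        // default, so a brief ZK hiccup doesn't expire the session and trigger
        // a rebalance in the first place.
        public static ConsumerConnector createConsumer() {
            Properties props = new Properties();
            props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181"); // example hosts
            props.put("group.id", "example-consumer-group");              // example group id
            props.put("zookeeper.session.timeout.ms", "30000");           // default is 6000
            return Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        }
    }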
But I still think that (optional?) automatic restart of a consumer could be
a good idea! ;-)

M.

Kind regards,
Michał Michalski,
michal.michal...@boxever.com


On 11 July 2014 16:18, Guozhang Wang <wangg...@gmail.com> wrote:

> Hi Michal,
>
> In your case you could try to increase the zookeeper session timeout
> value on the consumer side (default is 6 sec) and see if this is
> sufficient to cover the latency jitters.
>
> Guozhang
>
>
> On Fri, Jul 11, 2014 at 5:25 AM, Michal Michalski <
> michal.michal...@boxever.com> wrote:
>
> > Hey Guozhang,
> >
> > Thanks for the reply. I get your point on "hiding" some issues, but
> > I'd prefer to separate the recovery from reporting a failure. Also, I
> > think that if a simple restart is a possible solution, it shouldn't
> > require implementing it separately or, what's even worse, manual
> > intervention. Maybe I'll describe my problem, then, to show you my
> > point of view:
> >
> > ZK latency spiked for a few seconds, making ZK effectively dead from
> > the consumers' point of view. Then they all reconnected. As I
> > understand it, when that happened it caused rebalancing. Some consumer
> > groups succeeded, but then another spike in latency happened and - as
> > we suspect - it caused rebalancing to fail, because creation of that
> > ZK node failed at some point. Ideally, I'd like to get notified about
> > that problem (rebalancing failed after X retries etc.), so I know
> > there is an issue and I can investigate it, but then I'd like the
> > Kafka consumer (or my app) to fall back to a restart, which could
> > *possibly* make the consumer recover. If not - that's my problem
> > then ;-)
> >
> > In our case it was enough to restart the app to get the consumer
> > working again, but - as we didn't know about that behaviour before and
> > we weren't prepared for it - it required manual intervention (on a
> > Friday night, which made it even more painful ;> ) which, we believe,
> > wasn't necessary in that case and could have been handled
> > automatically.
> >
> > M.
> >
> > Kind regards,
> > Michał Michalski,
> > michal.michal...@boxever.com
> >
> >
> > On 10 July 2014 23:43, Guozhang Wang <wangg...@gmail.com> wrote:
> >
> > > Hi Michal,
> > >
> > > The rebalance will only be triggered on consumer membership or
> > > topic/partition changes. Once triggered, it will try to finish the
> > > rebalance at most rebalance.max.retries times, i.e. if it fails it
> > > will wait for rebalance.backoff.ms and then try again until the
> > > number of retries is exhausted. When that happens an exception will
> > > be thrown and the consumer may fall into a bad state.
> > >
> > > The reason we did not implement automatic restart upon rebalance
> > > failures is that it may actually "hide" some issues in the system
> > > that actually caused the rebalance failure. The general design is
> > > that if some exceptions/errors are not expected, like rebalance
> > > failures, we will let them possibly halt/kill the instance rather
> > > than automatically restart and let it go.
> > >
> > > Guozhang
> > >
> > >
> > > On Thu, Jul 10, 2014 at 2:24 AM, Michal Michalski <
> > > michal.michal...@boxever.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > Just wondering - is there any reason why rebalance.max.retries is
> > > > 4 by default? Is there any good reason why I shouldn't expect my
> > > > consumers to keep trying to rebalance for minutes (e.g. 30 retries
> > > > every 6 seconds), rather than seconds (4 retries every 2 seconds
> > > > by default)?
> > > >
> > > > Also, if my consumer fails to rebalance because of NoNodeException
> > > > (org.apache.zookeeper.KeeperException$NoNodeException:
> > > > KeeperErrorCode = NoNode for
> > > > /consumers/is-entity-modified-document-group/ids/<something>),
> > > > wouldn't it make sense for Kafka to restart it automatically once
> > > > it "uses up" all the retry attempts? Or recreate the missing ZK
> > > > node, as I believe happens on a consumer restart?
> > > >
> > > > I'm asking because that kind of error seems to be a "recoverable"
> > > > one, but - if I understand it correctly - with the current design
> > > > it requires implementing additional mechanisms or manual
> > > > intervention.
> > > >
> > > >
> > > > Kind regards,
> > > > Michał
> > >
> > >
> > > --
> > > -- Guozhang
>
>
> --
> -- Guozhang
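
PS. To make the "restart fallback" idea concrete, this is roughly the kind of
thing I have in mind (only a sketch against the 0.8 high-level consumer API;
the retry/backoff values, ZK hosts, group id and class name are made up for
illustration, and, as far as I understand, a rebalance triggered later by a
ZooKeeper watch runs on a background thread, so catching the exception like
this only covers the rebalance performed while the streams are being created):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import kafka.common.ConsumerRebalanceFailedException;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class RestartingConsumerExample {

        // Retry the rebalance for longer than the defaults (4 retries with a
        // 2000 ms backoff) and, if it still fails, report it, shut the
        // connector down and build a fresh one instead of waiting for someone
        // to restart the whole app by hand.
        public static ConsumerConnector connectWithRestart(String topic) {
            while (true) {
                Properties props = new Properties();
                props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181"); // example hosts
                props.put("group.id", "example-consumer-group");              // example group id
                props.put("zookeeper.session.timeout.ms", "30000");           // default is 6000
                props.put("rebalance.max.retries", "30");                     // default is 4
                props.put("rebalance.backoff.ms", "6000");                    // default is 2000
                ConsumerConnector connector =
                    Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
                try {
                    Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
                    topicCountMap.put(topic, 1);
                    Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                        connector.createMessageStreams(topicCountMap);
                    // ... hand the streams over to the processing threads ...
                    return connector;
                } catch (ConsumerRebalanceFailedException e) {
                    // The "notify me first" part: report the failure, then
                    // fall back to a restart. Real code should cap the number
                    // of restarts rather than loop forever.
                    System.err.println("Rebalance failed, restarting consumer: " + e);
                    connector.shutdown();
                }
            }
        }
    }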