So I think I got to the root of the problem. Thanks for pointing me in the direction of ZooKeeper data conflicts.

I turned the log level up to INFO and captured a bunch of conflict messages from the ZooKeeper client. I then did an "rmr" on the consumers/<topic name> ZooKeeper node to clear out any lingering data and fired up my consumers again. Whatever node data was present seems to have been corrupted by an earlier version of Kafka. I can now terminate consumer JVMs (I've even rebooted a machine running 4 consumers) and the topic immediately rebalances.

I'll keep testing and will follow up here if I can reproduce the error with clean ZK data.
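In case anyone else runs into corrupted consumer nodes: the cleanup was nothing more than a recursive delete in the ZooKeeper shell ("rmr <path>"). Below is a rough Java equivalent using the ZkClient library the 0.8 consumer already depends on. The connect string and node path are placeholders, not my exact values.

import org.I0Itec.zkclient.ZkClient;

// Rough equivalent of "rmr <path>" from the ZooKeeper shell. The connect
// string and path below are placeholders: point them at the node that
// actually holds the stale ownership data in your cluster.
public class ClearStaleConsumerNode {
    public static void main(String[] args) {
        // 30s session timeout, 30s connection timeout
        ZkClient zkClient = new ZkClient("zkhost:2181", 30000, 30000);
        try {
            // Deletes the node and everything under it. If the path covers
            // the group's offsets subtree, committed offsets are wiped too.
            zkClient.deleteRecursive("/consumers/trackingGroup");
        } finally {
            zkClient.close();
        }
    }
}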

On Mon, Nov 18, 2013 at 3:10 PM, Guozhang Wang <wangg...@gmail.com> wrote:

> Could you find some entries in the log with the key word "conflict"? If yes
> could you paste them here?
>
> Guozhang
>
> On Mon, Nov 18, 2013 at 2:56 PM, Drew Goya <d...@gradientx.com> wrote:
>
> > Also of note, this is all running from within a Storm topology; when I
> > kill a JVM, another is started very quickly.
> >
> > Could this be a problem with a consumer leaving and rejoining within a
> > small window?
> >
> > On Mon, Nov 18, 2013 at 2:52 PM, Drew Goya <d...@gradientx.com> wrote:
> >
> > > Hey Guozhang, I just forced the error by killing one of my consumer
> > > JVMs, and I am getting a consumer rebalance failure:
> > >
> > > 2013-11-18 22:46:54 k.c.ZookeeperConsumerConnector [ERROR]
> > > [bridgeTopology_host-1384493092466-7099d843], error during syncedRebalance
> > > kafka.common.ConsumerRebalanceFailedException:
> > > bridgeTopology_host-1384493092466-7099d843 can't rebalance after 10 retries
> > >   at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:428) ~[stormjar.jar:na]
> > >   at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anon$1.run(ZookeeperConsumerConnector.scala:355) ~[stormjar.jar:na]
> > >
> > > These are the relevant lines in my consumer properties file:
> > >
> > > rebalance.max.retries=10
> > > rebalance.backoff.ms=10000
> > >
> > > My topic has 128 partitions.
> > >
> > > Are there some other configuration settings I should be using?
> > >
> > > On Mon, Nov 18, 2013 at 2:37 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> > >
> > > > Hello Drew,
> > > >
> > > > Do you see any rebalance failure exceptions in the consumer log?
> > > >
> > > > Guozhang
> > > >
> > > > On Mon, Nov 18, 2013 at 2:14 PM, Drew Goya <d...@gradientx.com> wrote:
> > > >
> > > > > So I've run into a problem where occasionally, some partitions within
> > > > > a topic end up in a "none" owner state for a long time.
> > > > >
> > > > > I'm using the high-level consumer on several machines; each consumer
> > > > > has 4 threads.
> > > > >
> > > > > Normally when I run the ConsumerOffsetChecker, all partitions have
> > > > > owners and similar lag.
> > > > >
> > > > > Occasionally I end up in this state:
> > > > >
> > > > > trackingGroup  Events2  32  552506856  569853398  17346542  none
> > > > > trackingGroup  Events2  33  553649131  569775298  16126167  none
> > > > > trackingGroup  Events2  34  552380321  569572719  17192398  none
> > > > > trackingGroup  Events2  35  553206745  569448821  16242076  none
> > > > > trackingGroup  Events2  36  553673576  570084283  16410707  none
> > > > > trackingGroup  Events2  37  552669833  569765642  17095809  none
> > > > > trackingGroup  Events2  38  553147178  569766985  16619807  none
> > > > > trackingGroup  Events2  39  552495219  569837815  17342596  none
> > > > > trackingGroup  Events2  40  570108655  570111080  2425      trackingGroup_host6-1384385417822-23157ae8-0
> > > > > trackingGroup  Events2  41  570288505  570291068  2563      trackingGroup_host6-1384385417822-23157ae8-0
> > > > > trackingGroup  Events2  42  569929870  569932330  2460      trackingGroup_host6-1384385417822-23157ae8-0
> > > > >
> > > > > I'm at the point where I'm considering writing my own client, but
> > > > > hopefully the user group has the answer!
> > > > >
> > > > > I am using this commit of 0.8 on both the brokers and clients:
> > > > > d4553da609ea9af6db8a79faf116d1623c8a856f
> > > >
> > > > --
> > > > -- Guozhang
>
> --
> -- Guozhang
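P.S. For anyone who finds this thread later, here is a rough sketch of how the rebalance settings quoted above plug into the 0.8 high-level consumer. It is illustrative only: the ZooKeeper address is a placeholder, the topic and group names are just the ones from the ConsumerOffsetChecker output above, and the shutdown hook is my own addition. A clean shutdown deletes the consumer's ephemeral owner nodes immediately, while a killed JVM leaves them around until its ZooKeeper session expires, which is what the surviving consumers end up retrying against.

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class RebalanceSettingsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zkhost:2181");   // placeholder
        props.put("group.id", "trackingGroup");
        // The two settings from the properties file quoted above:
        props.put("rebalance.max.retries", "10");
        props.put("rebalance.backoff.ms", "10000");

        final ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // 4 streams for the 4 consumer threads running in each JVM; each
        // stream would normally be handed off to its own worker thread.
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("Events2", 4));

        // Clean shutdown releases this consumer's partition owner nodes right
        // away instead of waiting for the ZooKeeper session to time out.
        Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
            public void run() {
                connector.shutdown();
            }
        }));
    }
}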