So I think I got to the root of the problem. Thanks for pointing me in the
direction of ZooKeeper data conflicts.

I turned the log level up to INFO and captured a bunch of conflict messages
from the ZooKeeper client.

I did an "rmr" on the consumers/<topic name> ZooKeeper node to clear out
any lingering data and fired up my consumers again.

Whatever node data was present seems to have been corrupted by an earlier
version of Kafka.

I can now terminate consumer JVMs (I've even rebooted a machine running 4
consumers) and the topic immediately rebalances.
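
For reference, the retry settings quoted further down
(rebalance.max.retries=10, rebalance.backoff.ms=10000) bound roughly how long
a consumer keeps retrying before it gives up with
ConsumerRebalanceFailedException. A minimal sketch of that arithmetic,
assuming the consumer simply sleeps for the backoff between failed attempts
and ignoring the time each attempt itself takes (the function name is made up
for illustration):

```python
def rebalance_retry_window_ms(max_retries: int, backoff_ms: int) -> int:
    """Approximate total backoff time across all rebalance retries.

    Assumption: one fixed sleep of `backoff_ms` per failed attempt; the
    duration of the attempts themselves is not counted.
    """
    return max_retries * backoff_ms


# Values taken from the consumer properties quoted in this thread.
window_ms = rebalance_retry_window_ms(max_retries=10, backoff_ms=10000)
print(f"{window_ms} ms (~{window_ms / 1000:.0f} s)")
```

With those values the window comes out to about 100 seconds, which is the
budget a rebalance has to succeed within after a consumer leaves or rejoins.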

I'll keep testing and follow up here if I can replicate the error with
clean ZK data.
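
To spot the problem quickly while testing, something like the following could
flag partitions with no owner in ConsumerOffsetChecker output. This is only a
sketch: it assumes each data row has seven whitespace-separated fields in the
order group, topic, partition, offset, log size, lag, owner (as in the rows
quoted below), and the names PartitionStatus, parse_rows and
unowned_partitions are made up for illustration.

```python
from typing import List, NamedTuple


class PartitionStatus(NamedTuple):
    group: str
    topic: str
    partition: int
    offset: int
    log_size: int
    lag: int
    owner: str


def parse_rows(text: str) -> List[PartitionStatus]:
    """Parse data rows, skipping anything that is not a 7-field numeric row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Assumed layout: group topic partition offset logSize lag owner
        if len(fields) != 7 or not all(f.isdigit() for f in fields[2:6]):
            continue  # header, blank line, or unexpected format
        group, topic, pid, offset, log_size, lag, owner = fields
        rows.append(PartitionStatus(group, topic, int(pid), int(offset),
                                    int(log_size), int(lag), owner))
    return rows


def unowned_partitions(rows: List[PartitionStatus]) -> List[int]:
    """Return partition ids whose owner column reads 'none'."""
    return [r.partition for r in rows if r.owner == "none"]


# Two rows copied from the output quoted in this thread.
SAMPLE = """\
trackingGroup  Events2  39  552495219  569837815  17342596  none
trackingGroup  Events2  40  570108655  570111080  2425  trackingGroup_host6-1384385417822-23157ae8-0
"""

rows = parse_rows(SAMPLE)
print(unowned_partitions(rows))
```

Run against the full checker output after killing a consumer, an empty list
would mean every partition was picked up again.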


On Mon, Nov 18, 2013 at 3:10 PM, Guozhang Wang <wangg...@gmail.com> wrote:

> Could you find some entries in the log with the key word "conflict"? If yes
> could you paste them here?
>
> Guozhang
>
>
> On Mon, Nov 18, 2013 at 2:56 PM, Drew Goya <d...@gradientx.com> wrote:
>
> > Also of note, this is all running from within a Storm topology; when I
> > kill a JVM, another is started very quickly.
> >
> > Could this be a problem with a consumer leaving and rejoining within a
> > small window?
> >
> >
> > On Mon, Nov 18, 2013 at 2:52 PM, Drew Goya <d...@gradientx.com> wrote:
> >
> > > Hey Guozhang, I just forced the error by killing one of my consumer
> > > JVMs and I am getting a consumer rebalance failure:
> > >
> > > 2013-11-18 22:46:54 k.c.ZookeeperConsumerConnector [ERROR] [bridgeTopology_host-1384493092466-7099d843], error during syncedRebalance
> > > kafka.common.ConsumerRebalanceFailedException: bridgeTopology_host-1384493092466-7099d843 can't rebalance after 10 retries
> > > at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:428) ~[stormjar.jar:na]
> > > at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anon$1.run(ZookeeperConsumerConnector.scala:355) ~[stormjar.jar:na]
> > >
> > > These are the relevant lines in my consumer properties file:
> > >
> > > rebalance.max.retries=10
> > > rebalance.backoff.ms=10000
> > >
> > > My topic has 128 partitions
> > >
> > > Are there some other configuration settings I should be using?
> > >
> > >
> > > On Mon, Nov 18, 2013 at 2:37 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> > >
> > >> Hello Drew,
> > >>
> > >> Do you see any rebalance failure exceptions in the consumer log?
> > >>
> > >> Guozhang
> > >>
> > >>
> > >> On Mon, Nov 18, 2013 at 2:14 PM, Drew Goya <d...@gradientx.com> wrote:
> > >>
> > >> > So I've run into a problem where occasionally, some partitions
> > >> > within a topic end up in a "none" owner state for a long time.
> > >> >
> > >> > I'm using the high-level consumer on several machines, each consumer
> > >> > has 4 threads.
> > >> >
> > >> > Normally when I run the ConsumerOffsetChecker, all partitions have
> > >> > owners and similar lag.
> > >> >
> > >> > Occasionally I end up in this state:
> > >> >
> > >> > trackingGroup   Events2   32  552506856  569853398  17346542  none
> > >> > trackingGroup   Events2   33  553649131  569775298  16126167  none
> > >> > trackingGroup   Events2   34  552380321  569572719  17192398  none
> > >> > trackingGroup   Events2   35  553206745  569448821  16242076  none
> > >> > trackingGroup   Events2   36  553673576  570084283  16410707  none
> > >> > trackingGroup   Events2   37  552669833  569765642  17095809  none
> > >> > trackingGroup   Events2   38  553147178  569766985  16619807  none
> > >> > trackingGroup   Events2   39  552495219  569837815  17342596  none
> > >> > trackingGroup   Events2   40  570108655  570111080  2425      trackingGroup_host6-1384385417822-23157ae8-0
> > >> > trackingGroup   Events2   41  570288505  570291068  2563      trackingGroup_host6-1384385417822-23157ae8-0
> > >> > trackingGroup   Events2   42  569929870  569932330  2460      trackingGroup_host6-1384385417822-23157ae8-0
> > >> >
> > >> > I'm at the point where I'm considering writing my own client but
> > >> > hopefully the user group has the answer!
> > >> >
> > >> > I am using this commit of 8.0 on both the brokers and clients:
> > >> > d4553da609ea9af6db8a79faf116d1623c8a856f
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> -- Guozhang
> > >>
> > >
> > >
> >
>
>
>
> --
> -- Guozhang
>
