Hello, we have a set of JVMs that consume messages from Kafka topics. Each
JVM creates 4 ConsumerConnectors that are used by 4 separate threads.
These JVMs also use Curator's PathChildrenCache to watch a ZooKeeper
sub-tree and keep it in sync with the other JVMs. This path has several
thousand child nodes.

Everything was working perfectly until we restarted these JVMs. We restart
them every few weeks or so to roll in new code and had never had any
problems, but this time the Kafka consumers on these JVMs were unable to
rebalance partitions among themselves.

The exception:
Caused by: kafka.common.ConsumerRebalanceFailedException: group1-system01-27422-kafka-787 can't rebalance after 12 retries
    at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:432)
    at kafka.consumer.ZookeeperConsumerConnector.kafka$consumer$ZookeeperConsumerConnector$$reinitializeConsumer(ZookeeperConsumerConnector.scala:722)
    at kafka.consumer.ZookeeperConsumerConnector$WildcardStreamsHandler.<init>(ZookeeperConsumerConnector.scala:756)
    at kafka.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:145)
    at kafka.javaapi.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:96)
    at kafka.javaapi.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:100)

We then set rebalance.max.retries=16 and rebalance.backoff.ms=10000. I've
seen the Spark-Kafka issue https://issues.apache.org/jira/browse/SPARK-5505
and Jun's recommendation to increase the backoff property.
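
For reference, here is a minimal sketch of the two settings above as they
would go into the Properties object for the consumer, along with the rough
retry window they imply (retries times backoff, ignoring the time each
rebalance attempt itself takes). The class and method names are hypothetical
and only for illustration. In the real code these properties are passed to
kafka.consumer.ConsumerConfig together with the other consumer settings,
which are left out here.

```java
import java.util.Properties;

// Sketch only: holds the two rebalance settings from this message and
// computes roughly how long a consumer keeps retrying before giving up.
class RebalanceConfigSketch {

    // Property names and values are the ones quoted in this message.
    // Other consumer settings (ZooKeeper connect string, group id, ...)
    // are intentionally omitted.
    static Properties rebalanceProps() {
        Properties props = new Properties();
        props.setProperty("rebalance.max.retries", "16");
        props.setProperty("rebalance.backoff.ms", "10000");
        return props;
    }

    // Approximate time spent backing off between attempts:
    // retries * backoff, in milliseconds.
    static long retryWindowMs(Properties props) {
        long retries = Long.parseLong(props.getProperty("rebalance.max.retries"));
        long backoffMs = Long.parseLong(props.getProperty("rebalance.backoff.ms"));
        return retries * backoffMs;
    }

    public static void main(String[] args) {
        Properties props = rebalanceProps();
        // 16 retries * 10000 ms = 160000 ms, i.e. about 160 seconds.
        System.out.println("retry window (ms): " + retryWindowMs(props));
    }
}
```

So with these values each consumer should keep retrying for roughly 160
seconds before throwing ConsumerRebalanceFailedException.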

We must've tried restarting these JVMs about 20 times now, both with and
without the "rebalance.xx" properties, and every time we hit the same
issue. The one exception was the first time we applied the
"rebalance.backoff.ms=10000" property, when all 4 JVMs started
successfully. We thought that had solved everything, but when we restarted
them just to make sure, we were back to square one.

If we have only 1 thread create 1 ConsumerConnector instead of 4, it works.
This way we can have any number of JVMs running 1 ConsumerConnector each,
and they all behave well and rebalance partitions. The problem occurs only
when we try to start multiple ConsumerConnectors in the same JVM. Note that
running 4 ConsumerConnectors per JVM had been working for several months.
The ZK sub-tree for the non-Kafka part of our code was small when we
started.

Does anybody have any thoughts on this? What could be causing this issue?
Could there be a Curator/ZK client conflict with the high-level Kafka
consumer? Or is the number of nodes that our code keeps in ZK causing
problems with partition assignment in the Kafka code? We ask because the
Curator framework keeps syncing data in the background while the Kafka code
is creating ConsumerConnectors and rebalancing topics.

Thanks,
Ashwin Jayaprakash.
