Hello, we have a set of JVMs that consume messages from Kafka topics. Each JVM creates 4 ConsumerConnectors, which are used by 4 separate threads. These JVMs also use the Curator framework's PathChildrenCache to watch a ZooKeeper sub-tree and keep it in sync with the other JVMs. That path has several thousand child nodes.
Everything was working perfectly until one fine day we decided to restart these JVMs. We restart them to roll in new code every few weeks or so, and we had never had any problems bouncing them before. This time, the Kafka consumers on these JVMs were suddenly unable to rebalance partitions among themselves. The exception:

    Caused by: kafka.common.ConsumerRebalanceFailedException: group1-system01-27422-kafka-787 can't rebalance after 12 retries
        at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:432)
        at kafka.consumer.ZookeeperConsumerConnector.kafka$consumer$ZookeeperConsumerConnector$$reinitializeConsumer(ZookeeperConsumerConnector.scala:722)
        at kafka.consumer.ZookeeperConsumerConnector$WildcardStreamsHandler.<init>(ZookeeperConsumerConnector.scala:756)
        at kafka.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:145)
        at kafka.javaapi.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:96)
        at kafka.javaapi.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:100)

We then set rebalance.max.retries=16 and rebalance.backoff.ms=10000. I've seen the Spark-Kafka issue https://issues.apache.org/jira/browse/SPARK-5505 and Jun's recommendation there to increase the backoff property. We must have tried restarting these JVMs about 20 times now, both with and without the rebalance.* properties, and every time it is the same failure. The one exception was the first time we applied rebalance.backoff.ms=10000, when all 4 JVMs started cleanly. We thought that had solved everything, but when we restarted again just to make sure, we were back to square one.

If we have only 1 thread create 1 ConsumerConnector instead of 4, it works: we can run any number of JVMs with 1 ConsumerConnector each and they all behave well and rebalance partitions.
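For completeness, this is the shape of the consumer configuration we were testing with. The rebalance values are the ones mentioned above; the group.id and zookeeper.connect values below are placeholders, not our real settings:

```properties
# Kafka 0.8 high-level consumer properties (group/ZK values are placeholders)
group.id=group1
zookeeper.connect=zk-host:2181
# Settings we added while debugging the rebalance failures
rebalance.max.retries=16
rebalance.backoff.ms=10000
```

If I understand the Kafka FAQ correctly, the usual guidance for ConsumerRebalanceFailedException is that rebalance.max.retries * rebalance.backoff.ms should exceed zookeeper.session.timeout.ms, which these values satisfy by a wide margin against the default 6000 ms session timeout.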
The problem occurs only when we try to start multiple ConsumerConnectors in the same JVM. I'd like to stress that the 4-ConsumerConnector setup had been working for several months, and that the ZK sub-tree for the non-Kafka part of our code was small when we started. Does anybody have any thoughts on what could be causing this? Could there be a Curator/ZK client conflict with the high-level Kafka consumer? Or is the number of nodes our code keeps in ZK interfering with partition assignment in the Kafka code, given that the Curator framework keeps syncing data in the background while the Kafka code is creating ConsumerConnectors and rebalancing topics?

Thanks,
Ashwin Jayaprakash.