Neha,

Looks like an issue with the consumer rebalance not being able to complete successfully. We were able to reproduce the issue on a topic with 30 partitions, 3 consumer processes (p1, p2 and p3), and the properties rebalance.max.retries=40 and rebalance.backoff.ms=10000 (10s).
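For reference, this is roughly how each of the three consumer processes was configured. It is only a minimal sketch against the 0.8 high-level consumer Java API; the zookeeper.connect and group.id values here are placeholders, not our actual settings:

import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.javaapi.consumer.ConsumerConnector;

public class ReproConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk-host:2181");  // placeholder
        props.put("group.id", "repro-group");            // placeholder
        // The two rebalance settings used in the repro described above:
        props.put("rebalance.max.retries", "40");
        props.put("rebalance.backoff.ms", "10000");      // 10s between retries

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        // ... create message streams for the 30-partition topic and consume ...
    }
}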
Before process p3 was started, partition ownership was as expected:

0-14  -> owner p1
15-29 -> owner p2

When process p3 started, a rebalance was triggered. Process p3 successfully acquired ownership of partitions 20-29, as expected per the rebalance algorithm. However, process p2, while trying to acquire ownership of partitions 10-19, saw the rebalance fail after 40 retries. Attaching the logs from process p2 and process p1. They show that while p2 was attempting to rebalance, it was trying to acquire ownership of partitions 10-14, which were owned by process p1. However, at the same time process p1 did not get any event telling it to give up ownership of partitions 10-14. We were expecting a rebalance to have been triggered in p1 as well, but it wasn't, and hence p1 never gave up ownership. Is our assumption correct or incorrect? And if the rebalance does get triggered in p1, how can we verify that apart from the logs? The logs on p1 did not show anything.

From p2's log:

2014-11-03 06:57:36 k.c.ZookeeperConsumerConnector [INFO] [topic_consumerIdString], waiting for the partition ownership to be deleted: 11

During and after the failed rebalance on process p2, partition ownership was as below:

0-14  -> owner p1
15-19 -> none
20-29 -> owner p3

This left the consumers in an inconsistent state: 5 partitions were never consumed from, and the partition ownership was not balanced. However, there was no conflict in creating the ephemeral node, which was the case last time. Just to note that the ephemeral-node conflict we were seeing earlier also appeared after a rebalance failed, so my hunch is that fixing the rebalance failure will fix that issue as well.

-Thanks,
Mohit
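P.S. To make the expected end state explicit: with 30 partitions and 3 live consumers, the range-assignment arithmetic should give each consumer a contiguous block of 10 partitions, i.e. p1 -> 0-9, p2 -> 10-19, p3 -> 20-29. The sketch below is just our understanding of that arithmetic (not the actual Kafka code):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RangeAssignmentSketch {

    // Returns consumer index -> list of partition ids it should own.
    static Map<Integer, List<Integer>> assign(int numPartitions, int numConsumers) {
        Map<Integer, List<Integer>> owned = new LinkedHashMap<>();
        int perConsumer = numPartitions / numConsumers;  // 30 / 3 = 10
        int withExtra = numPartitions % numConsumers;    // 30 % 3 = 0
        for (int c = 0; c < numConsumers; c++) {
            int start = perConsumer * c + Math.min(c, withExtra);
            int count = perConsumer + (c < withExtra ? 1 : 0);
            List<Integer> parts = new ArrayList<>();
            for (int p = start; p < start + count; p++) {
                parts.add(p);
            }
            owned.put(c, parts);
        }
        return owned;
    }

    public static void main(String[] args) {
        // Prints the expected ownership for the repro setup:
        // consumer 0 -> partitions 0-9, consumer 1 -> 10-19, consumer 2 -> 20-29,
        // whereas the observed state was 0-14 -> p1, 15-19 -> none, 20-29 -> p3.
        System.out.println(assign(30, 3));
    }
}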