Hi,

Recently we started facing issues in our stage/production setup, where we run 3 brokers in a cluster with ~800 partitions across 130 topics. Most of the topics have high-level consumers; in total, more than 200 consumer groups are running and listening to these topics.

1. After some time, consumers stop consuming from the partitions in some groups. Once we restart them they start consuming again, but the lag shown is much higher than before the restart. We run the consumers on multiple servers (we took a thread dump and the consumer threads were alive). Sometimes, even after restarting consumer A for a topic that has only one partition, the consumer reports "No broker partition available", and the *ConsumerOffsetChecker* tool also shows no owner for the group. However, when we look into the ZooKeeper consumer nodes, consumers from other machines are registered under /consumers/<group>/ids, and those znodes' timestamps are up to date. Only after killing that other consumer B does consumer A start getting messages.

   How can ConsumerOffsetChecker show no owner even though some consumer is alive and holds the znode? Is ZooKeeper unable to handle the rebalances because of the 200 consumer groups?

2. After a restart, the lag is higher than before. Why do the committed offset values also change (the lag increased after the restart)?

Has anyone else faced the same issue? Please give some suggestions on how to bring it back to a stable state.

Note: we have checked the FAQ and set the rebalance-related configuration accordingly (see the sketch in the P.S. below). We are using 0.8.2.1.

Regards,
Madhukar
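
P.S. For context, here is a minimal sketch of how our high-level consumers are configured with respect to the rebalance settings from the FAQ. The property values and the group name are placeholders, not our actual production settings; the intent is only to show that rebalance.max.retries * rebalance.backoff.ms exceeds zookeeper.session.timeout.ms, as the FAQ suggests.

    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class ConsumerConfigSketch {
        public static void main(String[] args) {
            // Placeholder values, not our real production settings.
            Properties props = new Properties();
            props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181");
            props.put("group.id", "example-group");

            // Rebalance settings per the Kafka FAQ: rebalance.max.retries *
            // rebalance.backoff.ms should be greater than
            // zookeeper.session.timeout.ms, so that retries can outlast an
            // expired session of another consumer in the group.
            props.put("zookeeper.session.timeout.ms", "6000");
            props.put("rebalance.backoff.ms", "2000");
            props.put("rebalance.max.retries", "10"); // 10 * 2000 ms > 6000 ms

            // Offsets are committed automatically at this interval.
            props.put("auto.commit.enable", "true");
            props.put("auto.commit.interval.ms", "60000");

            ConsumerConnector connector =
                    Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            // ... create message streams and consume as usual ...
            connector.shutdown();
        }
    }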
Most of the topics has High level consumers. In total more than 200 Consumers groups are running and listening for the topics. 1. We stared getting problems like, After sometime consumer stopped consuming from the partitions in some group. Once we restarted it started consuming but lag was showing too much(which was more than before restart). We are running consumer in multiple servers.(Took the thread dump ConsumerThreads were alive). Even sometime after restarting consumer A for the topic which has 1 partition the consumer show like "No broker partition available". Using *ConsumerOffsetChecker* tools also it was showing no owner for the group. Once we looking into Zookeepers consumer node it shows some other machines consumers are also registered in /consumers/<group>/id and that znode time was also up-to-date. After kill that consumer B only A started getting message. ** How ConsumerOffsetChecker was not showing even though some consumer is alive and using zknode :( ** is Zookeeper is not able to handle the re-balances because of 200 consumers? 2. After restart the lag was more than the previous. ** Why committed offset values also get changed(as lag was increased after restart)? Do any one of you also faced the same issue? Please give some suggestion to bring it back to stable state. Note: we have checked the FAQ and set the condition properly for re-balance issue. We are using 0.8.2.1. Regards, Madhukar