Hi,

Recently we started facing some issues in our stage/production setup, where
we are running a cluster of 3 brokers with ~800 partitions across 130 topics.

Most of the topics are consumed by high-level consumers. In total, more than
200 consumer groups are running and listening to the topics.

1. We started seeing problems where, after some time, consumers in some
groups stopped consuming from their partitions. Once we restarted them they
started consuming again, but the lag shown was very high (more than before
the restart). We run the consumers on multiple servers. (We took thread
dumps; the ConsumerThreads were alive.)

Sometimes, even after restarting consumer A for a topic that has only 1
partition, the consumer shows something like "No broker partition available".
The *ConsumerOffsetChecker* tool also showed no owner for the group. When we
looked into ZooKeeper's consumer nodes, it showed that consumers from some
other machines were also registered under /consumers/<group>/ids, and that
znode's timestamp was up to date. Only after killing that consumer B did
consumer A start getting messages.

** How could ConsumerOffsetChecker show no owner even though some consumer
was alive and holding the znode? :(
** Is ZooKeeper unable to handle the rebalances because of the 200+ consumer
groups?
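
For reference, this is roughly how we inspected the group state in ZooKeeper;
the connection string, group, and topic names below are placeholders, and the
paths are the standard 0.8 high-level consumer layout:

import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkGroupCheck {
    public static void main(String[] args) throws Exception {
        // Wait until the session is actually connected before issuing reads.
        final CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        String group = "my-group";   // placeholder group id
        String topic = "my-topic";   // placeholder topic name

        // Registered consumer instances: one ephemeral znode per live consumer.
        List<String> ids = zk.getChildren("/consumers/" + group + "/ids", false);
        System.out.println("registered consumer ids: " + ids);

        // Owner of partition 0: the znode data is the owning consumer thread id.
        String ownerPath = "/consumers/" + group + "/owners/" + topic + "/0";
        if (zk.exists(ownerPath, false) == null) {
            System.out.println("no owner for partition 0");
        } else {
            byte[] owner = zk.getData(ownerPath, false, null);
            System.out.println("owner of partition 0: " + new String(owner, "UTF-8"));
        }

        zk.close();
    }
}

In the problematic case the ids list contained a consumer from another
machine, while the owners path for the partition was missing.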

2. After a restart, the lag was higher than before.

** Why did the committed offset values also change (given that the lag
increased after the restart)?

Has anyone else faced the same issue? Please give some suggestions to bring
it back to a stable state.


Note: we have checked the FAQ and set the rebalance-related settings to
satisfy the recommended condition. We are using Kafka 0.8.2.1.
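
For reference, the rebalance-related consumer settings we applied look
roughly like the sketch below; the values and the group/ZooKeeper names are
illustrative, not our exact production values:

import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.javaapi.consumer.ConsumerConnector;

public class ConsumerSetup {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181");
        props.put("group.id", "my-group");

        // FAQ condition: rebalance.max.retries * rebalance.backoff.ms should
        // exceed zookeeper.session.timeout.ms, so that every consumer in the
        // group has re-registered before the rebalance retries are exhausted.
        props.put("zookeeper.session.timeout.ms", "6000");
        props.put("rebalance.backoff.ms", "2000");
        props.put("rebalance.max.retries", "10");

        props.put("auto.commit.interval.ms", "1000");
        props.put("auto.offset.reset", "largest");

        ConsumerConnector connector =
            Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        // ... create message streams and consume as usual ...
        connector.shutdown();
    }
}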


Regards,
Madhukar
