I have a situation in which the current rebalancing algorithm seems to
be extremely sub-optimal.

I have a topic with 100 partitions, and up to 100 separate consumers.
Processing each message on this topic takes between 1 and 20 minutes,
depending on the message.

If any of the 100 consumers dies or drops out of the group, there is a
huge amount of idle time: many consumers (up to 99 of them) finish
their current work and then sit idle, just waiting for the rebalance
to complete.

In addition, with 100 consumers, it's not unusual for one to die for
one reason or another, so these stop-the-world rebalances are
happening all the time, making the entire system slow to a snail's
pace.

It surprises me that rebalance is so inefficient. I would have thought
that partitions would just be assigned/unassigned to consumers in
real-time without waiting for the entire consumer group to quiesce.

Is there anything I can do to improve matters?

Regards,
Raman
