I've found 
https://cwiki.apache.org/confluence/display/KAFKA/Incremental+Cooperative+Rebalancing:+Support+and+Policies
and 
https://cwiki.apache.org/confluence/display/KAFKA/Incremental+Cooperative+Rebalancing+for+Streams.
This is *exactly* what I need, right down to the Kubernetes pod
restart case. The number of issues with the current approach to
rebalancing elucidated in these documents is downright scary, and now
I am not surprised I am having tonnes of issues.

Are there any plans to start implementing delayed imbalance and
standby bootstrap?

Are there any short-term best practices that can help alleviate these
issues? My main problems right now are the "Instance Bounce" and
"Instance Failover" scenarios, and according to this wiki page,
num.standby.replicas should help with at least the former. Can someone
explain what this does?
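
For reference, this is roughly how I would expect to set it, assuming
it is an ordinary Kafka Streams configuration property. The class
name, application id, bootstrap address, and the value of 1 are all
placeholders of mine, not taken from the wiki pages:

```java
import java.util.Properties;

// Sketch only: builds the Properties I would hand to the Streams app.
// Everything except the "num.standby.replicas" key is a placeholder.
public class StandbyConfigSketch {

    static Properties streamsProps() {
        Properties props = new Properties();
        // Hypothetical application id and broker address.
        props.put("application.id", "my-streams-app");
        props.put("bootstrap.servers", "localhost:9092");
        // The setting in question; 1 is a guess at a starting value.
        props.put("num.standby.replicas", "1");
        return props;
    }

    public static void main(String[] args) {
        // Print the resulting configuration for inspection.
        streamsProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

I have not verified that this helps with either scenario; it is only
meant to show which property I am asking about.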

Regards,
Raman

On Fri, Jul 19, 2019 at 12:53 PM Raman Gupta <rocketra...@gmail.com> wrote:
>
> I have a situation in which the current rebalancing algorithm seems to
> be extremely sub-optimal.
>
> I have a topic with 100 partitions, and up to 100 separate consumers.
> Processing each message on this topic takes between 1 and 20 minutes,
> depending on the message.
>
> If any of the 100 consumers dies or drops out of the group, there is a
> huge amount of idle time as many consumers (up to 99 of them) finish
> their work and sit around idle, just waiting for the rebalance to
> complete.
>
> In addition, with 100 consumers, it's not unusual for one to die for
> one reason or another, so these stop-the-world rebalances are
> happening all the time, making the entire system slow to a snail's
> pace.
>
> It surprises me that rebalance is so inefficient. I would have thought
> that partitions would just be assigned/unassigned to consumers in
> real-time without waiting for the entire consumer group to quiesce.
>
> Is there anything I can do to improve matters?
>
> Regards,
> Raman
