I am also interested in learning how others are handling this.

I also support several services where the average processing time is about
20 seconds per message but the p99 is about 20 minutes, and the
stop-the-world rebalancing is very painful.

On Fri, Jul 19, 2019, 11:38 AM Raman Gupta <rocketra...@gmail.com> wrote:

> I've found
> https://cwiki.apache.org/confluence/display/KAFKA/Incremental+Cooperative+Rebalancing:+Support+and+Policies
> and
> https://cwiki.apache.org/confluence/display/KAFKA/Incremental+Cooperative+Rebalancing+for+Streams
> .
> This is *exactly* what I need, right down to the Kubernetes pod
> restart case. The number of issues with the current approach to
> rebalancing elucidated in these documents is downright scary, and now
> I am not surprised I am having tonnes of issues.
>
> Are there any plans to start implementing delayed imbalance and
> standby bootstrap?
>
> Are there any short-term best practices that can help alleviate these
> issues? My main problem right now is the "Instance Bounce" and
> "Instance Failover" scenarios, and according to this wiki page,
> num.standby.replicas should help with at least the former. Can someone
> explain what this does?
>
> Regards,
> Raman
>
> On Fri, Jul 19, 2019 at 12:53 PM Raman Gupta <rocketra...@gmail.com>
> wrote:
> >
> > I have a situation in which the current rebalancing algorithm seems to
> > be extremely sub-optimal.
> >
> > I have a topic with 100 partitions, and up to 100 separate consumers.
> > Processing each message on this topic takes between 1 and 20 minutes,
> > depending on the message.
> >
> > If any of the 100 consumers dies or drops out of the group, there is a
> > huge amount of idle time as many consumers (up to 99 of them) finish
> > their work and sit around idle, just waiting for the rebalance to
> > complete.
> >
> > In addition, with 100 consumers, it's not unusual for one to die for
> > one reason or another, so these stop-the-world rebalances are
> > happening all the time, making the entire system slow to a snail's
> > pace.
> >
> > It surprises me that rebalance is so inefficient. I would have thought
> > that partitions would just be assigned/unassigned to consumers in
> > real-time without waiting for the entire consumer group to quiesce.
> >
> > Is there anything I can do to improve matters?
> >
> > Regards,
> > Raman
>
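
For anyone reading the archive: num.standby.replicas, mentioned in the
quoted message, is a Kafka Streams setting (default 0), not a plain
consumer setting. It asks Streams to keep that many extra copies of each
task's local state on other instances, kept up to date from the changelog
topics, so a task that moves after a bounce or failure can resume with
less state restoration. A minimal sketch of setting it follows. The class
name, application id, and broker address are placeholders of mine, not
from this thread, and the property keys are written as plain strings so
the snippet compiles without the Kafka jars.

```java
import java.util.Properties;

// Hypothetical helper class, for illustration only.
public class StandbyConfigSketch {

    // Builds a Kafka Streams configuration with standby replicas enabled.
    // appId and brokers are placeholder values supplied by the caller.
    public static Properties build(String appId, String brokers, int standbys) {
        Properties props = new Properties();
        props.setProperty("application.id", appId);
        props.setProperty("bootstrap.servers", brokers);
        // Number of standby copies of each task's state to maintain on
        // other instances (Kafka Streams default is 0).
        props.setProperty("num.standby.replicas", Integer.toString(standbys));
        return props;
    }

    public static void main(String[] args) {
        // Placeholder values; substitute your own application id and brokers.
        Properties props = build("my-streams-app", "localhost:9092", 1);
        props.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

These properties would be handed to the KafkaStreams constructor along
with the topology. Standbys only help applications that have state
stores, and each standby costs extra disk and changelog read traffic on
the instance hosting it.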
