Re: [DISCUSS] KIP-134: Delay initial consumer group rebalance

Damian Guy Tue, 28 Mar 2017 01:49:55 -0700

Matthias,

Yes i know.


Thanks,
Damian
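
The distinction Matthias draws in the quoted message below (delay the rebalance after an explicit LeaveGroupRequest, but rebalance immediately when a member's session times out) can be sketched roughly as follows. This is only an illustration of the proposal under discussion, not Kafka's coordinator code; the names `MemberRemoval` and `delay_before_rebalance` are hypothetical.

```python
from enum import Enum


class MemberRemoval(Enum):
    """Why a member is being removed from the group (hypothetical names)."""
    LEAVE_GROUP = "leave_group"          # member sent a LeaveGroupRequest
    SESSION_TIMEOUT = "session_timeout"  # member stopped heartbeating


def delay_before_rebalance(reason, configured_delay_ms):
    """Return how long the coordinator would wait before rebalancing.

    Assumption: the delay applies only to a clean leave, because more
    members may be about to leave as well; a crashed member (session
    timeout) triggers an immediate rebalance.
    """
    if reason is MemberRemoval.LEAVE_GROUP:
        return configured_delay_ms
    return 0
```

Under these assumptions, `delay_before_rebalance(MemberRemoval.SESSION_TIMEOUT, 3000)` yields no delay, while a clean leave waits the configured 3000 ms.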

On Mon, 27 Mar 2017 at 18:17 Matthias J. Sax <[email protected]> wrote:

> Damian,
>
> about "rebalance immediately" on timeout -- I guess, that's a different
> case as no LeaveGroupRequest will be sent. Thus, the broker should be
> able to distinguish both cases easily, and apply the delay only if it
> received the LeaveGroupRequest but not if a consumer times out.
>
> Does this make sense?
>
> -Matthias
>
> On 3/27/17 1:56 AM, Damian Guy wrote:
> > @Becket
> >
> > Thanks for the feedback. Yes, i like the idea of extending the delay as
> > each new consumer joins the group. Though, i think this could be done
> with
> > either a consumer or broker side config. But i get your point that some
> > consumers in the group can be misconfigured.
> >
> > @Matthias & @Eno - yes we could probably do something similar if the
> > member has sent the LeaveGroupRequest. I'm not sure it would be valid if
> > the member crashed; in that case session.timeout would come into play and
> > we'd probably want to rebalance immediately. I'd be interested in hearing
> > thoughts from other core kafka folks on this one.
> >
> > Thanks,
> > Damian
> >
> >
> >
> > On Fri, 24 Mar 2017 at 23:01 Becket Qin <[email protected]> wrote:
> >
> >> Hi Matthias,
> >>
> >> Yes, that was what I was thinking. We will keep delaying it until either
> >> reaching the rebalance timeout or no new consumer joins in that small
> >> delay, which is configured on the broker side.
> >>
> >> Thanks,
> >>
> >> Jiangjie (Becket) Qin
> >>
> >> On Fri, Mar 24, 2017 at 1:39 PM, Matthias J. Sax <[email protected]
> >
> >> wrote:
> >>
> >>> @Becket:
> >>>
> >>> I am not sure, if I understand this correctly. Instead of applying a
> >>> fixed delay, that starts when the first consumer of an (empty) group
> >>> joins, you suggest to re-trigger/re-set the delay each time a new
> >>> consumer joins?
> >>>
> >>> This sounds like a good strategy to me, if the config is on the broker
> >>> side.
> >>>
> >>> @Eno:
> >>>
> >>> I think that's a valid point and I like this idea!
> >>>
> >>>
> >>> -Matthias
> >>>
> >>>
> >>> On 3/24/17 1:23 PM, Eno Thereska wrote:
> >>>> Thanks Damian,
> >>>>
> >>>> This KIP deals with the initial phase only. What about the cases when
> >>> several consumers leave a group? Won't there be several expensive
> >>> rebalances then as well? I'm wondering if it makes sense for the delay
> to
> >>> hold anytime the "set" of consumers in a group changes, be it addition
> to
> >>> the group or removal from group.
> >>>>
> >>>> Thanks
> >>>> Eno
> >>>>
> >>>>
> >>>>> On 24 Mar 2017, at 20:04, Becket Qin <[email protected]> wrote:
> >>>>>
> >>>>> Thanks for the KIP, Damian.
> >>>>>
> >>>>> My two cents on this. It seems there are two things worth thinking
> >>>>> about here:
> >>>>>
> >>>>> 1. Better rebalance timing. We will try to rebalance only when all the
> >>>>> consumers in a group have joined. The challenge would be that someone
> >>>>> has to define what ALL consumers means; it could be either a time or a
> >>>>> number of consumers, etc.
> >>>>>
> >>>>> 2. Avoid frequent rebalance. For example, if there are 100 consumers
> >> in
> >>> a
> >>>>> group, today, in the worst case, we may end up with 100 rebalances
> >> even
> >>> if
> >>>>> all the consumers joined the group in a reasonably small amount of
> >> time.
> >>>>> Frequent rebalance is also a bad thing for brokers.
> >>>>>
> >>>>> Having a client side configuration may solve problem 1 better because
> >>> each
> >>>>> consumer group can potentially configure their own timing. However,
> it
> >>> does
> >>>>> not really prevent frequent rebalance in general because some of the
> >>>>> consumers can be misconfigured. (This may have something to do with
> >>> KIP-124
> >>>>> as well. But if quota is applied on the JoinGroup/SyncGroup request
> it
> >>> may
> >>>>> cause some unwanted cascading effects.)
> >>>>>
> >>>>> Having a broker side configuration may result in less flexibility for
> >>> each
> >>>>> consumer group, but it can prevent frequent rebalance better. I think
> >>> with
> >>>>> some reasonable design, the rebalance timing issue can be resolved on
> >>> the
> >>>>> broker side as well. Matthias had a good point on extending the delay
> >>> when
> >>>>> a new consumer joins a group (we actually did something similar to
> >> batch
> >>>>> ISR change propagation). For example, let's say on the broker side,
> we
> >>> will
> >>>>> always delay 2 seconds each time we see a new consumer joining a
> >>> consumer
> >>>>> group. This would probably work for most of the consumer groups and
> >> will
> >>>>> also limit the rebalance frequency to protect the brokers.
> >>>>>
> >>>>> I am not sure about the streams use case here, but if something like
> 2
> >>>>> seconds of delay is acceptable for streams, I would prefer adding the
> >>>>> configuration to the broker so that we can address both problems.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Jiangjie (Becket) Qin
> >>>>>
> >>>>>
> >>>>> On Fri, Mar 24, 2017 at 5:30 AM, Damian Guy <[email protected]>
> >>> wrote:
> >>>>>
> >>>>>> Thanks for the feedback.
> >>>>>>
> >>>>>> Ewen: I'm happy to make it a client side config. Other than the
> >>> protocol
> >>>>>> bump i think the effort is almost the same. Personally i see no
> other
> >>>>>> issues, but based on discussions with others this is what we came up
> >>> with.
> >>>>>>
> >>>>>> True, it can probably be tested easily via an integration test.
> >>>>>>
> >>>>>> Matthias: Yes i agree, the delay could be extended as each new
> member
> >>> joins
> >>>>>> the group.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Damian
> >>>>>>
> >>>>>> On Fri, 24 Mar 2017 at 05:14 Ewen Cheslack-Postava <
> >> [email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> I have the same initial response as Ismael re: broker vs consumer
> >>>>>> settings.
> >>>>>>> The global setting seems questionable.
> >>>>>>>
> >>>>>>> Could we maybe summarize what the impact of making this a client
> >>> config
> >>>>>>> would be? Protocol bump is obvious, but is there any other
> >> significant
> >>>>>>> issue? For the protocol bump in particular, I think this change is
> >>>>>>> currently really critical for streams; it will be valuable
> >> elsewhere,
> >>> but
> >>>>>>> the immediate demand is streams, so a protocol bump while being
> >>> backwards
> >>>>>>> compatible wouldn't affect any other clients. Is this still
> actually
> >>>>>>> compatible with different clients given that they would now expect
> >>>>>>> different timeouts? (I think it's strictly compatible if you wait
> >> for
> >>>>>>> responses, but if you enforce any client side timeouts, I'm not so
> >>> sure.)
> >>>>>>>
> >>>>>>> re: test plan, I'm sure this will come as a surprise, but is the
> >>> system
> >>>>>>> test even necessary? Validating # of rebalances seems messy as
> other
> >>>>>> things
> >>>>>>> can cause rebalances (though admittedly not in a "clean" case). But
> >>>>>> really
> >>>>>>> it seems like an integration test could validate this by making
> sure
> >>>>>> only 1
> >>>>>>> rebalance occurred when 2 members joined with a sufficient time
> gap.
> >>>>>>>
> >>>>>>> -Ewen
> >>>>>>>
> >>>>>>> On Thu, Mar 23, 2017 at 3:53 PM, Matthias J. Sax <
> >>> [email protected]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Thanks for the KIP Damian!
> >>>>>>>>
> >>>>>>>> My two cents:
> >>>>>>>>
> >>>>>>>> - we should have an explicit parameter for this -- implicit settings
> >>>>>>>> are always tricky (the "importance" of this parameter would be LOW)
> >>>>>>>>
> >>>>>>>> - the config should be different for each consumer group:
> >>>>>>>>   * assume you have a stateless app, you want to rebalance
> >>>>>>>> immediately
> >>>>>>>>   * if you start up in a virtualized environment using some tools
> >>>>>>>> like Mesos you might need a different value than on bare metal (no VM
> >>>>>>>> to be started)
> >>>>>>>>   * it also depends on how many consumer instances you expect -- it's
> >>>>>>>> harder to start up 100 instances in 3 seconds than 5
> >>>>>>>>
> >>>>>>>> - the default value should be zero
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> One more thought: what about scaling scenarios? If a consumer group
> >>>>>>>> has 10 instances and should be scaled up to 20, it would make sense to
> >>>>>>>> do this with a single rebalance, too. Thus, I am wondering if it would
> >>>>>>>> make sense to apply this delay each time a new consumer joins the
> >>>>>>>> group, even if the group is not empty?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> -Matthias
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 3/23/17 10:19 AM, Damian Guy wrote:
> >>>>>>>>> Thanks Guozhang - i think another problem with this is that it is
> >>>>>>>>> overloading session.timeout.ms to mean multiple things. I'm not sure
> >>>>>>>>> that is a good thing.
> >>>>>>>>>
> >>>>>>>>> On Thu, 23 Mar 2017 at 17:14 Guozhang Wang <[email protected]>
> >>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> The downside of it, though, is that although it "hides" this so that
> >>>>>>>>>> most users do not need to be aware of it, by default the session
> >>>>>>>>>> timeout, i.e. the rebalance timeout, is 10 seconds, which could
> >>>>>>>>>> arguably be too long.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Guozhang
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Mar 23, 2017 at 10:12 AM, Guozhang Wang <
> >>> [email protected]
> >>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Just throwing another alternative idea here: we can consider
> >> using
> >>>>>>> the
> >>>>>>>>>>> rebalance timeout value which is already included in the join
> >>>>>> request
> >>>>>>>>>>> protocol (and on the current Java client it is always written
> as
> >>>>>> the
> >>>>>>>>>>> session timeout value), that the first member joining will
> >> always
> >>>>>>> force
> >>>>>>>>>> the
> >>>>>>>>>>> coordinator to wait that long. By doing this we do not need to
> >>> bump
> >>>>>>> up
> >>>>>>>>>> the
> >>>>>>>>>>> protocol either.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Guozhang
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Mar 23, 2017 at 5:49 AM, Damian Guy <
> >> [email protected]
> >>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Ismael,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Mostly to avoid the protocol bump.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I agree that it may be difficult to choose the right delay for
> >>> all
> >>>>>>>>>>>> consumer
> >>>>>>>>>>>> groups, but we wanted to make this something that most users
> >>> don't
> >>>>>>>>>> really
> >>>>>>>>>>>> need to think about, i.e., a small enough default delay that
> >>> works
> >>>>>>> in
> >>>>>>>>>> the
> >>>>>>>>>>>> majority of cases. However it would be much more flexible as a
> >>>>>>>> consumer
> >>>>>>>>>>>> config, which i'm happy to pursue if this change is worthy of
> a
> >>>>>>>> protocol
> >>>>>>>>>>>> bump.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Damian
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, 23 Mar 2017 at 12:35 Ismael Juma <[email protected]>
> >>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for the KIP, Damian. It makes sense to avoid multiple
> >>>>>>>>>> rebalances
> >>>>>>>>>>>>> during start-up. One issue with having this as a broker
> config
> >>> is
> >>>>>>>> that
> >>>>>>>>>>>> it
> >>>>>>>>>>>>> may be difficult to choose the right delay for all consumer
> >>>>>> groups.
> >>>>>>>>>> Can
> >>>>>>>>>>>> you
> >>>>>>>>>>>>> elaborate a little more on why the first alternative (add a
> >>>>>>> consumer
> >>>>>>>>>>>>> config) was rejected? We bump protocol versions regularly
> >> (when
> >>>>>> it
> >>>>>>>>>> makes
> >>>>>>>>>>>>> sense), so it would be good to get a bit more detail.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Ismael
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Mar 23, 2017 at 12:24 PM, Damian Guy <
> >>>>>> [email protected]
> >>>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi All,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I've prepared a KIP to add a configurable delay to the
> >> initial
> >>>>>>>>>>>> consumer
> >>>>>>>>>>>>>> group rebalance.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Please have a look here:
> >>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-134%3A+Delay+initial+consumer+group+rebalance
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> Damian
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> BTW, i apologize if this appears twice. Seems the first one may
> >>>>>>>>>>>>>> not have made it.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> -- Guozhang
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> -- Guozhang
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>
> >>>
> >>
> >
>
>
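
The delay-extension idea discussed in this thread (re-arm a short broker-side delay each time a new member joins, bounded by the rebalance timeout) can be modelled roughly as below. This is a minimal sketch for reasoning about the timing, not Kafka's coordinator code; the function name `rebalance_time` and its parameters are hypothetical, and times are plain seconds.

```python
def rebalance_time(join_times, delay, rebalance_timeout):
    """Return the time at which the coordinator would start the rebalance.

    join_times: sorted arrival times (seconds) of JoinGroup requests,
                beginning with the first member of an empty group.
    delay: hypothetical broker-side delay, re-armed on every new join.
    rebalance_timeout: upper bound on the wait, measured from the first join.
    """
    if not join_times:
        raise ValueError("at least one join is required")

    first = join_times[0]
    hard_deadline = first + rebalance_timeout
    deadline = min(first + delay, hard_deadline)

    for t in join_times[1:]:
        if t >= deadline:
            # The delay expired before this member arrived, so it would
            # trigger a separate, later rebalance.
            break
        # A member joined within the window: extend the delay, but never
        # past the rebalance timeout.
        deadline = min(t + delay, hard_deadline)

    return deadline
```

For example, with a 2 second delay, a 10 second rebalance timeout, and joins at 0.0, 1.0 and 2.5 seconds, this model starts the single rebalance at 4.5 seconds. A member arriving at 5.0 seconds after a lone first join would miss the window, which closes at 2.0 seconds.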
