Matthias, Yes i know.
Thanks, Damian On Mon, 27 Mar 2017 at 18:17 Matthias J. Sax <matth...@confluent.io> wrote: > Damian, > > about "rebalance immediately" on timeout -- I guess, that's a different > case as no LeaveGroupRequest will be sent. Thus, the broker should be > able to distinguish both cases easily, and apply the delay only if it > received the LeaveGroupRequest but not if a consumer times out. > > Does this make sense? > > -Matthias > > On 3/27/17 1:56 AM, Damian Guy wrote: > > @Becket > > > > Thanks for the feedback. Yes, i like the idea of extending the delay as > > each new consumer joins the group. Though, i think this could be done > with > > either a consumer or broker side config. But i get your point that some > > consumers in the group can be misconfigured. > > > > @Matthias & @Eno - yes we could probably do something similar if the > member > > has sent the LeaveGroupRequest. I'm not sure it would be valid if the > > member crashed, hence session.timeout would come into play, we'd probably > > want to rebalance immediately. I'd be interested in hearing thoughts from > > other core kafka folks on this one. > > > > Thanks, > > Damian > > > > > > > > On Fri, 24 Mar 2017 at 23:01 Becket Qin <becket....@gmail.com> wrote: > > > >> Hi Matthias, > >> > >> Yes, that was what I was thinking. We will keep delay it until either > >> reaching the rebalance timeout or no new consumer joins in that small > delay > >> which is configured on the broker side. > >> > >> Thanks, > >> > >> Jiangjie (Becket) Qin > >> > >> On Fri, Mar 24, 2017 at 1:39 PM, Matthias J. Sax <matth...@confluent.io > > > >> wrote: > >> > >>> @Becket: > >>> > >>> I am not sure, if I understand this correctly. Instead of applying a > >>> fixed delay, that starts when the first consumer of an (empty) group > >>> joins, you suggest to re-trigger/re-set the delay each time a new > >>> consumer joins? > >>> > >>> This sound like a good strategy to me, if the config is on the broker > >> side. > >>> > >>> @Eno: > >>> > >>> I think that's a valid point and I like this idea! > >>> > >>> > >>> -Matthias > >>> > >>> > >>> On 3/24/17 1:23 PM, Eno Thereska wrote: > >>>> Thanks Damian, > >>>> > >>>> This KIP deals with the initial phase only. What about the cases when > >>> several consumers leave a group? Won't there be several expensive > >>> rebalances then as well? I'm wondering if it makes sense for the delay > to > >>> hold anytime the "set" of consumers in a group changes, be it addition > to > >>> the group or removal from group. > >>>> > >>>> Thanks > >>>> Eno > >>>> > >>>> > >>>>> On 24 Mar 2017, at 20:04, Becket Qin <becket....@gmail.com> wrote: > >>>>> > >>>>> Thanks for the KIP, Damian. > >>>>> > >>>>> My two cents on this. It seems there are two things worth thinking > >> here: > >>>>> > >>>>> 1. Better rebalance timing. We will try to rebalance only when all > the > >>>>> consumers in a group have joined. The challenge would be someone has > >> to > >>>>> define what does ALL consumers mean, it could either be a time or > >>> number of > >>>>> consumers, etc. > >>>>> > >>>>> 2. Avoid frequent rebalance. For example, if there are 100 consumers > >> in > >>> a > >>>>> group, today, in the worst case, we may end up with 100 rebalances > >> even > >>> if > >>>>> all the consumers joined the group in a reasonably small amount of > >> time. > >>>>> Frequent rebalance is also a bad thing for brokers. > >>>>> > >>>>> Having a client side configuration may solve problem 1 better because > >>> each > >>>>> consumer group can potentially configure their own timing. However, > it > >>> does > >>>>> not really prevent frequent rebalance in general because some of the > >>>>> consumers can be misconfigured. (This may have something to do with > >>> KIP-124 > >>>>> as well. But if quota is applied on the JoinGroup/SyncGroup request > it > >>> may > >>>>> cause some unwanted cascading effects.) > >>>>> > >>>>> Having a broker side configuration may result in less flexibility for > >>> each > >>>>> consumer group, but it can prevent frequent rebalance better. I think > >>> with > >>>>> some reasonable design, the rebalance timing issue can be resolved on > >>> the > >>>>> broker side as well. Matthias had a good point on extending the delay > >>> when > >>>>> a new consumer joins a group (we actually did something similar to > >> batch > >>>>> ISR change propagation). For example, let's say on the broker side, > we > >>> will > >>>>> always delay 2 seconds each time we see a new consumer joining a > >>> consumer > >>>>> group. This would probably work for most of the consumer groups and > >> will > >>>>> also limit the rebalance frequency to protect the brokers. > >>>>> > >>>>> I am not sure about the streams use case here, but if something like > 2 > >>>>> seconds of delay is acceptable for streams, I would prefer adding the > >>>>> configuration to the broker so that we can address both problems. > >>>>> > >>>>> Thanks, > >>>>> > >>>>> Jiangjie (Becket) Qin > >>>>> > >>>>> > >>>>> On Fri, Mar 24, 2017 at 5:30 AM, Damian Guy <damian....@gmail.com> > >>> wrote: > >>>>> > >>>>>> Thanks for the feedback. > >>>>>> > >>>>>> Ewen: I'm happy to make it a client side config. Other than the > >>> protocol > >>>>>> bump i think the effort is almost the same. Personally i see no > other > >>>>>> issues, but based on discussions with others this is what we came up > >>> with. > >>>>>> > >>>>>> True, it can probably be tested easily via an integration test. > >>>>>> > >>>>>> Matthias: Yes i agree, the delay could be extended as each new > member > >>> joins > >>>>>> the group. > >>>>>> > >>>>>> Thanks, > >>>>>> Damian > >>>>>> > >>>>>> On Fri, 24 Mar 2017 at 05:14 Ewen Cheslack-Postava < > >> e...@confluent.io> > >>>>>> wrote: > >>>>>> > >>>>>>> I have the same initial response as Ismael re: broker vs consumer > >>>>>> settings. > >>>>>>> The global setting seems questionable. > >>>>>>> > >>>>>>> Could we maybe summarize what the impact of making this a client > >>> config > >>>>>>> would be? Protocol bump is obvious, but is there any other > >> significant > >>>>>>> issue? For the protocol bump in particular, I think this change is > >>>>>>> currently really critical for streams; it will be valuable > >> elsewhere, > >>> but > >>>>>>> the immediate demand is streams, so a protocol bump while being > >>> backwards > >>>>>>> compatible wouldn't affect any other clients. Is this still > actually > >>>>>>> compatible with different clients given that they would now expect > >>>>>>> different timeouts? (I think it's strictly compatible if you wait > >> for > >>>>>>> responses, but if you enforce any client side timeouts, I'm not so > >>> sure.) > >>>>>>> > >>>>>>> re: test plan, I'm sure this will come as a surprise, but is the > >>> system > >>>>>>> test even necessary? Validating # of rebalances seems messy as > other > >>>>>> things > >>>>>>> can cause rebalances (though admittedly not in a "clean" case). But > >>>>>> really > >>>>>>> it seems like an integration test could validate this by making > sure > >>>>>> only 1 > >>>>>>> rebalance occurred when 2 members joined with a sufficient time > gap. > >>>>>>> > >>>>>>> -Ewen > >>>>>>> > >>>>>>> On Thu, Mar 23, 2017 at 3:53 PM, Matthias J. Sax < > >>> matth...@confluent.io> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> Thanks for the KIP Damian! > >>>>>>>> > >>>>>>>> My two cents: > >>>>>>>> > >>>>>>>> - we should have an explicit parameter for this -- implicit > setting > >>>>>> are > >>>>>>>> always tricky (the "importance" of this parameter would be LOW) > >>>>>>>> > >>>>>>>> - the config should be different for each consumer group: > >>>>>>>> * assume you have a stateless app, you want to rebalance > >>> immediately > >>>>>>>> * if you start-up in an visualized environment using some tools > >>> like > >>>>>>>> Mesos you might need a different value that on bare metal (no VM > to > >>> be > >>>>>>>> started) > >>>>>>>> * it also depends, how many consumer instanced you expect -- > it's > >>>>>>>> harder to start up 100 instances in 3 seconds than 5 > >>>>>>>> > >>>>>>>> - the default value should be zero > >>>>>>>> > >>>>>>>> > >>>>>>>> One more thought: what about scaling scenarios? If a consumer > group > >>> has > >>>>>>>> 10 instanced and should be scaled up to 20, it would make sense to > >> do > >>>>>>>> this with a single rebalance, too. Thus, I am wondering, if it > >> would > >>>>>>>> make sense to apply this delay each time a new consumer joins > >> group, > >>>>>>>> even if the group is not empty? > >>>>>>>> > >>>>>>>> > >>>>>>>> -Matthias > >>>>>>>> > >>>>>>>> > >>>>>>>> On 3/23/17 10:19 AM, Damian Guy wrote: > >>>>>>>>> Thanks Gouzhang - i think another problem with this is that is > >>>>>>>> overloading > >>>>>>>>> session.timeout.ms to mean multiple things. I'm not sure that is > >> a > >>>>>>> good > >>>>>>>>> thing. > >>>>>>>>> > >>>>>>>>> On Thu, 23 Mar 2017 at 17:14 Guozhang Wang <wangg...@gmail.com> > >>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> The downside of it, though, is that although it "hides" this > from > >>>>>> most > >>>>>>>> of > >>>>>>>>>> the users needing to be aware of it, by default session timeout > >>> i.e. > >>>>>>> the > >>>>>>>>>> rebalance timeout is 10 seconds which could arguably too long. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Guozhang > >>>>>>>>>> > >>>>>>>>>> On Thu, Mar 23, 2017 at 10:12 AM, Guozhang Wang < > >>> wangg...@gmail.com > >>>>>>> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> Just throwing another alternative idea here: we can consider > >> using > >>>>>>> the > >>>>>>>>>>> rebalance timeout value which is already included in the join > >>>>>> request > >>>>>>>>>>> protocol (and on the current Java client it is always written > as > >>>>>> the > >>>>>>>>>>> session timeout value), that the first member joining will > >> always > >>>>>>> force > >>>>>>>>>> the > >>>>>>>>>>> coordinator to wait that long. By doing this we do not need to > >>> bump > >>>>>>> up > >>>>>>>>>> the > >>>>>>>>>>> protocol either. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Guozhang > >>>>>>>>>>> > >>>>>>>>>>> On Thu, Mar 23, 2017 at 5:49 AM, Damian Guy < > >> damian....@gmail.com > >>>> > >>>>>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Hi Ismael, > >>>>>>>>>>>> > >>>>>>>>>>>> Mostly to avoid the protocol bump. > >>>>>>>>>>>> > >>>>>>>>>>>> I agree that it may be difficult to choose the right delay for > >>> all > >>>>>>>>>>>> consumer > >>>>>>>>>>>> groups, but we wanted to make this something that most users > >>> don't > >>>>>>>>>> really > >>>>>>>>>>>> need to think about, i.e., a small enough default delay that > >>> works > >>>>>>> in > >>>>>>>>>> the > >>>>>>>>>>>> majority of cases. However it would be much more flexible as a > >>>>>>>> consumer > >>>>>>>>>>>> config, which i'm happy to pursue if this change is worthy of > a > >>>>>>>> protocol > >>>>>>>>>>>> bump. > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks, > >>>>>>>>>>>> Damian > >>>>>>>>>>>> > >>>>>>>>>>>> On Thu, 23 Mar 2017 at 12:35 Ismael Juma <ism...@juma.me.uk> > >>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Thanks for the KIP, Damian. It makes sense to avoid multiple > >>>>>>>>>> rebalances > >>>>>>>>>>>>> during start-up. One issue with having this as a broker > config > >>> is > >>>>>>>> that > >>>>>>>>>>>> it > >>>>>>>>>>>>> may be difficult to choose the right delay for all consumer > >>>>>> groups. > >>>>>>>>>> Can > >>>>>>>>>>>> you > >>>>>>>>>>>>> elaborate a little more on why the first alternative (add a > >>>>>>> consumer > >>>>>>>>>>>>> config) was rejected? We bump protocol versions regularly > >> (when > >>>>>> it > >>>>>>>>>> makes > >>>>>>>>>>>>> sense), so it would be good to get a bit more detail. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>> Ismael > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Thu, Mar 23, 2017 at 12:24 PM, Damian Guy < > >>>>>> damian....@gmail.com > >>>>>>>> > >>>>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi All, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I've prepared a KIP to add a configurable delay to the > >> initial > >>>>>>>>>>>> consumer > >>>>>>>>>>>>>> group rebalance. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Please have look here: > >>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP- > >>>>>>>>>>>>>> 134%3A+Delay+initial+consumer+group+rebalance > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>> Damian > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> BTW, i apologize if this appears twice. Seems the first one > >> may > >>>>>>> have > >>>>>>>>>>>> not > >>>>>>>>>>>>>> made it. > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> -- > >>>>>>>>>>> -- Guozhang > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> -- > >>>>>>>>>> -- Guozhang > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>> > >>> > >>> > >> > > > >