Thanks for the KIP Boyang and great to see the progress on solving the rebalance issues with this and KIP-345.
Thanks, Mayuresh On Mon, Dec 3, 2018 at 4:57 AM Stanislav Kozlovski <stanis...@confluent.io> wrote: > Everything sounds good to me. > > On Sun, Dec 2, 2018 at 1:24 PM Boyang Chen <bche...@outlook.com> wrote: > > > In fact, it's probably better to move KIP-394< > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-394%3A+Require+member.id+for+initial+join+group+request > > > > to the vote stage first, so that it's easier to finalize the timeline and > > smooth the rollout plan for KIP-345. Jason and Stanislav, since you two > > involve most in this KIP, could you let me know if there is still any > > unclarity we want to resolve before moving to vote? > > > > Best, > > Boyang > > ________________________________ > > From: Boyang Chen <bche...@outlook.com> > > Sent: Saturday, December 1, 2018 10:53 AM > > To: dev@kafka.apache.org > > Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group > > request > > > > Thanks Jason for the reply! Since the overall motivation and design is > > pretty clear, I will go ahead to start implementation and we could > discuss > > the underlying details in the PR. > > > > Best, > > Boyang > > ________________________________ > > From: Matthias J. Sax <matth...@confluent.io> > > Sent: Saturday, December 1, 2018 3:12 AM > > To: dev@kafka.apache.org > > Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group > > request > > > > SGTM. > > > > On 11/30/18 10:17 AM, Jason Gustafson wrote: > > > Using the session expiration logic we already have seems like the > > simplest > > > option (this is probably a one or two line change). The rejoin should > be > > > quick anyway, so I don't think it's worth optimizing for unjoined new > > > members. Just my two cents. This is more of an implementation detail, > so > > > need not necessarily be resolved here. > > > > > > -Jason > > > > > > On Fri, Nov 30, 2018 at 12:56 AM Boyang Chen <bche...@outlook.com> > > wrote: > > > > > >> Thanks Matthias for the question. I'm thinking of having a separate > hash > > >> set called `registeredMemberIds` which > > >> will be cleared out every time a group finishes one round of > rebalance. > > >> Since storing one id is pretty trivial, using > > >> purgatory to track the id removal is a bit wasteful in my opinion. > > >> ________________________________ > > >> From: Matthias J. Sax <matth...@confluent.io> > > >> Sent: Friday, November 30, 2018 10:26 AM > > >> To: dev@kafka.apache.org > > >> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join > > group > > >> request > > >> > > >> Thanks! Makes sense. > > >> > > >> I missed that fact, that the `member.id` is added on the second > > >> joinGroup request that contains the `member.id`. > > >> > > >> However, it seems there is another race condition for this design: > > >> > > >> If two consumers join at the same time, it it possible that the broker > > >> assigns the same `member.id` to both (because none of them have > joined > > >> the group yet--ie, second joinGroup request not sent yet--, the > > >> `member.id` is not store broker side yes and broker cannot check for > > >> duplicates when creating a new `member.id`. > > >> > > >> The probability might be fairly low thought. However, what Stanislav > > >> proposed, to add the `member.id` directly, and remove it after > > >> `session.timeout.ms` sound like a save option that avoids this issue. > > >> > > >> Thoughts? > > >> > > >> > > >> -Matthias > > >> > > >> On 11/28/18 8:15 PM, Boyang Chen wrote: > > >>> Thanks Matthias for the question, and Stanislav for the explanation! > > >>> > > >>> For the scenario described, we will never let a member join the > > >> GroupMetadata map > > >>> if it uses UNKNOWN_MEMBER_ID. So the workflow will be like this: > > >>> > > >>> 1. Group is empty. Consumer c1 started. Join with > UNKNOWN_MEMBER_ID; > > >>> 2. Broker rejects while allocating a member.id to c1 in response > > (c1 > > >> protocol version is current); > > >>> 3. c1 handles the error and rejoins with assigned member.id; > > >>> 4. Broker stores c1 in its group metadata; > > >>> 5. Consumer c2 started. Join with UNKNOWN_MEMBER_ID; > > >>> 6. Broker rejects while allocating a member.id to c2 in response > > (c2 > > >> protocol version is current); > > >>> 7. c2 fails to get the response/crashes in the middle; > > >>> 8. After certain time, c2 restarts a join request with > > >> UNKNOWN_MEMBER_ID; > > >>> > > >>> As you could see, c2 will repeat step 6~8 until successfully send > back > > a > > >> join group request with allocated id. > > >>> By then broker will include c2 within the broker metadata map. > > >>> > > >>> Does this sound clear to you? > > >>> > > >>> Best, > > >>> Boyang > > >>> ________________________________ > > >>> From: Stanislav Kozlovski <stanis...@confluent.io> > > >>> Sent: Wednesday, November 28, 2018 7:39 PM > > >>> To: dev@kafka.apache.org > > >>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join > > >> group request > > >>> > > >>> Hey Matthias, > > >>> > > >>> I think the notion is to have the `session.timeout.ms` to start > > ticking > > >>> when the broker responds with the member.id. Then, the broker would > > >>> properly expire consumers and not hold too many stale ones. > > >>> This isn't mentioned in the KIP though so it is worth to wait for > > Boyang > > >> to > > >>> confirm > > >>> > > >>> On Wed, Nov 28, 2018 at 3:10 AM Matthias J. Sax < > matth...@confluent.io > > > > > >>> wrote: > > >>> > > >>>> Thanks for the KIP Boyang. > > >>>> > > >>>> I guess I am missing something, but I am still learning more details > > >>>> about the rebalance protocol, so maybe you can help me out? > > >>>> > > >>>> Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup > > request. > > >>>> The broker generates a `member.id` and sends it back via > > >>>> `MEMBER_ID_REQUIRED` error response. This response might never reach > > the > > >>>> client or the client fails before it can send the second joinGroup > > >>>> request. Thus, a client would need to start over with a new > > >>>> UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs > to > > >>>> generate a new `member.id` again. > > >>>> > > >>>> So it seems the problem is moved, but not resolved? The motivation > of > > >>>> the KIP is: > > >>>> > > >>>>> The edge case is that if initial join group request keeps failing > due > > >> to > > >>>> connection timeout, or the consumer keeps restarting, > > >>>> > > >>>> From my understanding, this KIP move the issue from the first to the > > >>>> second joinGroup request (or broker joinGroup response). > > >>>> > > >>>> But maybe I am missing something. Can you help me out? > > >>>> > > >>>> > > >>>> -Matthias > > >>>> > > >>>> > > >>>> On 11/27/18 6:00 PM, Boyang Chen wrote: > > >>>>> Thanks Stanislav and Jason for the suggestions! > > >>>>> > > >>>>> > > >>>>>> Thanks for the KIP. Looks good overall. I think we will need to > bump > > >> the > > >>>>>> version of the JoinGroup protocol in order to indicate > compatibility > > >>>> with > > >>>>>> the new behavior. The coordinator needs to know when it is safe to > > >>>> assume > > >>>>>> the client will handle the error code. > > >>>>>> > > >>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS > > >> error > > >>>>>> code. When the client sees this error code, it will take the > > memberId > > >>>> from > > >>>>>> the response and rejoin. We'd still need the protocol bump since > > older > > >>>>>> consumers do not have this logic. > > >>>>> > > >>>>> I will add the join group protocol version change to the KIP. > > Meanwhile > > >>>> I feel for > > >>>>> understandability it's better to define a separate error code since > > >>>> REBALANCE_IN_PROGRESS > > >>>>> is not the actual cause of the returned error. > > >>>>> > > >>>>>> One small question I have is now that we have one and a half > > >> round-trips > > >>>>>> needed to join in a rebalance (1 full RT addition), is it worth it > > to > > >>>>>> consider increasing the default value of ` > > >>>> group.initial.rebalance.delay.ms`? > > >>>>> I guess we could keep it for now. After KIP-345 and incremental > > >>>> cooperative rebalancing > > >>>>> work we should be safe to deprecate ` > > group.initial.rebalance.delay.ms > > >> `. > > >>>> Also one round trip > > >>>>> shouldn't increase the latency too much IMO. > > >>>>> > > >>>>> Best, > > >>>>> Boyang > > >>>>> ________________________________ > > >>>>> From: Stanislav Kozlovski <stanis...@confluent.io> > > >>>>> Sent: Wednesday, November 28, 2018 2:32 AM > > >>>>> To: dev@kafka.apache.org > > >>>>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join > > >>>> group request > > >>>>> > > >>>>> Hi Boyang, > > >>>>> > > >>>>> The KIP looks very good. > > >>>>> One small question I have is now that we have one and a half > > >> round-trips > > >>>>> needed to join in a rebalance (1 full RT addition), is it worth it > to > > >>>>> consider increasing the default value of ` > > >>>> group.initial.rebalance.delay.ms`? > > >>>>> > > >>>>> Best, > > >>>>> Stanislav > > >>>>> > > >>>>> On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson < > ja...@confluent.io> > > >>>> wrote: > > >>>>> > > >>>>>> Hi Boyang, > > >>>>>> > > >>>>>> Thanks for the KIP. Looks good overall. I think we will need to > bump > > >> the > > >>>>>> version of the JoinGroup protocol in order to indicate > compatibility > > >>>> with > > >>>>>> the new behavior. The coordinator needs to know when it is safe to > > >>>> assume > > >>>>>> the client will handle the error code. > > >>>>>> > > >>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS > > >> error > > >>>>>> code. When the client sees this error code, it will take the > > memberId > > >>>> from > > >>>>>> the response and rejoin. We'd still need the protocol bump since > > older > > >>>>>> consumers do not have this logic. > > >>>>>> > > >>>>>> Thanks, > > >>>>>> Jason > > >>>>>> > > >>>>>> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bche...@outlook.com> > > >>>> wrote: > > >>>>>> > > >>>>>>> Hey friends, > > >>>>>>> > > >>>>>>> > > >>>>>>> I would like to start a discussion thread for KIP-394 which is > > trying > > >>>> to > > >>>>>>> mitigate broker cache bursting issue due to anonymous join group > > >>>>>> requests: > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>> > > >> > > > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&data=02%7C01%7C%7C3ca95629be9e42b1f00108d657383bfd%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636792296362447479&sdata=3BuPVUH5v3hMYe%2FMgpSsNftTwb5DsHDlm2lN%2FVUR0T8%3D&reserved=0 > > >>>>>>> > > >>>>>>> > > >>>>>>> Thanks! > > >>>>>>> > > >>>>>>> Boyang > > >>>>>>> > > >>>>>> > > >>>>> > > >>>>> > > >>>>> -- > > >>>>> Best, > > >>>>> Stanislav > > >>>>> > > >>>> > > >>>> > > >>> > > >>> -- > > >>> Best, > > >>> Stanislav > > >>> > > >> > > >> > > > > > > > > > -- > Best, > Stanislav > -- -Regards, Mayuresh R. Gharat (862) 250-7125