Using the session expiration logic we already have seems like the simplest option (this is probably a one or two line change). The rejoin should be quick anyway, so I don't think it's worth optimizing for unjoined new members. Just my two cents. This is more of an implementation detail, so need not necessarily be resolved here.
-Jason On Fri, Nov 30, 2018 at 12:56 AM Boyang Chen <bche...@outlook.com> wrote: > Thanks Matthias for the question. I'm thinking of having a separate hash > set called `registeredMemberIds` which > will be cleared out every time a group finishes one round of rebalance. > Since storing one id is pretty trivial, using > purgatory to track the id removal is a bit wasteful in my opinion. > ________________________________ > From: Matthias J. Sax <matth...@confluent.io> > Sent: Friday, November 30, 2018 10:26 AM > To: dev@kafka.apache.org > Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group > request > > Thanks! Makes sense. > > I missed that fact, that the `member.id` is added on the second > joinGroup request that contains the `member.id`. > > However, it seems there is another race condition for this design: > > If two consumers join at the same time, it it possible that the broker > assigns the same `member.id` to both (because none of them have joined > the group yet--ie, second joinGroup request not sent yet--, the > `member.id` is not store broker side yes and broker cannot check for > duplicates when creating a new `member.id`. > > The probability might be fairly low thought. However, what Stanislav > proposed, to add the `member.id` directly, and remove it after > `session.timeout.ms` sound like a save option that avoids this issue. > > Thoughts? > > > -Matthias > > On 11/28/18 8:15 PM, Boyang Chen wrote: > > Thanks Matthias for the question, and Stanislav for the explanation! > > > > For the scenario described, we will never let a member join the > GroupMetadata map > > if it uses UNKNOWN_MEMBER_ID. So the workflow will be like this: > > > > 1. Group is empty. Consumer c1 started. Join with UNKNOWN_MEMBER_ID; > > 2. Broker rejects while allocating a member.id to c1 in response (c1 > protocol version is current); > > 3. c1 handles the error and rejoins with assigned member.id; > > 4. Broker stores c1 in its group metadata; > > 5. Consumer c2 started. Join with UNKNOWN_MEMBER_ID; > > 6. Broker rejects while allocating a member.id to c2 in response (c2 > protocol version is current); > > 7. c2 fails to get the response/crashes in the middle; > > 8. After certain time, c2 restarts a join request with > UNKNOWN_MEMBER_ID; > > > > As you could see, c2 will repeat step 6~8 until successfully send back a > join group request with allocated id. > > By then broker will include c2 within the broker metadata map. > > > > Does this sound clear to you? > > > > Best, > > Boyang > > ________________________________ > > From: Stanislav Kozlovski <stanis...@confluent.io> > > Sent: Wednesday, November 28, 2018 7:39 PM > > To: dev@kafka.apache.org > > Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join > group request > > > > Hey Matthias, > > > > I think the notion is to have the `session.timeout.ms` to start ticking > > when the broker responds with the member.id. Then, the broker would > > properly expire consumers and not hold too many stale ones. > > This isn't mentioned in the KIP though so it is worth to wait for Boyang > to > > confirm > > > > On Wed, Nov 28, 2018 at 3:10 AM Matthias J. Sax <matth...@confluent.io> > > wrote: > > > >> Thanks for the KIP Boyang. > >> > >> I guess I am missing something, but I am still learning more details > >> about the rebalance protocol, so maybe you can help me out? > >> > >> Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup request. > >> The broker generates a `member.id` and sends it back via > >> `MEMBER_ID_REQUIRED` error response. This response might never reach the > >> client or the client fails before it can send the second joinGroup > >> request. Thus, a client would need to start over with a new > >> UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs to > >> generate a new `member.id` again. > >> > >> So it seems the problem is moved, but not resolved? The motivation of > >> the KIP is: > >> > >>> The edge case is that if initial join group request keeps failing due > to > >> connection timeout, or the consumer keeps restarting, > >> > >> From my understanding, this KIP move the issue from the first to the > >> second joinGroup request (or broker joinGroup response). > >> > >> But maybe I am missing something. Can you help me out? > >> > >> > >> -Matthias > >> > >> > >> On 11/27/18 6:00 PM, Boyang Chen wrote: > >>> Thanks Stanislav and Jason for the suggestions! > >>> > >>> > >>>> Thanks for the KIP. Looks good overall. I think we will need to bump > the > >>>> version of the JoinGroup protocol in order to indicate compatibility > >> with > >>>> the new behavior. The coordinator needs to know when it is safe to > >> assume > >>>> the client will handle the error code. > >>>> > >>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS > error > >>>> code. When the client sees this error code, it will take the memberId > >> from > >>>> the response and rejoin. We'd still need the protocol bump since older > >>>> consumers do not have this logic. > >>> > >>> I will add the join group protocol version change to the KIP. Meanwhile > >> I feel for > >>> understandability it's better to define a separate error code since > >> REBALANCE_IN_PROGRESS > >>> is not the actual cause of the returned error. > >>> > >>>> One small question I have is now that we have one and a half > round-trips > >>>> needed to join in a rebalance (1 full RT addition), is it worth it to > >>>> consider increasing the default value of ` > >> group.initial.rebalance.delay.ms`? > >>> I guess we could keep it for now. After KIP-345 and incremental > >> cooperative rebalancing > >>> work we should be safe to deprecate `group.initial.rebalance.delay.ms > `. > >> Also one round trip > >>> shouldn't increase the latency too much IMO. > >>> > >>> Best, > >>> Boyang > >>> ________________________________ > >>> From: Stanislav Kozlovski <stanis...@confluent.io> > >>> Sent: Wednesday, November 28, 2018 2:32 AM > >>> To: dev@kafka.apache.org > >>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join > >> group request > >>> > >>> Hi Boyang, > >>> > >>> The KIP looks very good. > >>> One small question I have is now that we have one and a half > round-trips > >>> needed to join in a rebalance (1 full RT addition), is it worth it to > >>> consider increasing the default value of ` > >> group.initial.rebalance.delay.ms`? > >>> > >>> Best, > >>> Stanislav > >>> > >>> On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <ja...@confluent.io> > >> wrote: > >>> > >>>> Hi Boyang, > >>>> > >>>> Thanks for the KIP. Looks good overall. I think we will need to bump > the > >>>> version of the JoinGroup protocol in order to indicate compatibility > >> with > >>>> the new behavior. The coordinator needs to know when it is safe to > >> assume > >>>> the client will handle the error code. > >>>> > >>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS > error > >>>> code. When the client sees this error code, it will take the memberId > >> from > >>>> the response and rejoin. We'd still need the protocol bump since older > >>>> consumers do not have this logic. > >>>> > >>>> Thanks, > >>>> Jason > >>>> > >>>> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bche...@outlook.com> > >> wrote: > >>>> > >>>>> Hey friends, > >>>>> > >>>>> > >>>>> I would like to start a discussion thread for KIP-394 which is trying > >> to > >>>>> mitigate broker cache bursting issue due to anonymous join group > >>>> requests: > >>>>> > >>>>> > >>>>> > >>>> > >> > https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&data=02%7C01%7C%7C046b858d134f4c1c576108d655277552%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636790025286277394&sdata=QyBzqnislL%2B9fK1mXaRuJ0xpi9Y2JDvHrM881hzjq3A%3D&reserved=0 > >>>>> > >>>>> > >>>>> Thanks! > >>>>> > >>>>> Boyang > >>>>> > >>>> > >>> > >>> > >>> -- > >>> Best, > >>> Stanislav > >>> > >> > >> > > > > -- > > Best, > > Stanislav > > > >