Thanks Jason for the reply! Since the overall motivation and design is pretty clear, I will go ahead to start implementation and we could discuss the underlying details in the PR.
Best, Boyang ________________________________ From: Matthias J. Sax <matth...@confluent.io> Sent: Saturday, December 1, 2018 3:12 AM To: dev@kafka.apache.org Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group request SGTM. On 11/30/18 10:17 AM, Jason Gustafson wrote: > Using the session expiration logic we already have seems like the simplest > option (this is probably a one or two line change). The rejoin should be > quick anyway, so I don't think it's worth optimizing for unjoined new > members. Just my two cents. This is more of an implementation detail, so > need not necessarily be resolved here. > > -Jason > > On Fri, Nov 30, 2018 at 12:56 AM Boyang Chen <bche...@outlook.com> wrote: > >> Thanks Matthias for the question. I'm thinking of having a separate hash >> set called `registeredMemberIds` which >> will be cleared out every time a group finishes one round of rebalance. >> Since storing one id is pretty trivial, using >> purgatory to track the id removal is a bit wasteful in my opinion. >> ________________________________ >> From: Matthias J. Sax <matth...@confluent.io> >> Sent: Friday, November 30, 2018 10:26 AM >> To: dev@kafka.apache.org >> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group >> request >> >> Thanks! Makes sense. >> >> I missed that fact, that the `member.id` is added on the second >> joinGroup request that contains the `member.id`. >> >> However, it seems there is another race condition for this design: >> >> If two consumers join at the same time, it it possible that the broker >> assigns the same `member.id` to both (because none of them have joined >> the group yet--ie, second joinGroup request not sent yet--, the >> `member.id` is not store broker side yes and broker cannot check for >> duplicates when creating a new `member.id`. >> >> The probability might be fairly low thought. However, what Stanislav >> proposed, to add the `member.id` directly, and remove it after >> `session.timeout.ms` sound like a save option that avoids this issue. >> >> Thoughts? >> >> >> -Matthias >> >> On 11/28/18 8:15 PM, Boyang Chen wrote: >>> Thanks Matthias for the question, and Stanislav for the explanation! >>> >>> For the scenario described, we will never let a member join the >> GroupMetadata map >>> if it uses UNKNOWN_MEMBER_ID. So the workflow will be like this: >>> >>> 1. Group is empty. Consumer c1 started. Join with UNKNOWN_MEMBER_ID; >>> 2. Broker rejects while allocating a member.id to c1 in response (c1 >> protocol version is current); >>> 3. c1 handles the error and rejoins with assigned member.id; >>> 4. Broker stores c1 in its group metadata; >>> 5. Consumer c2 started. Join with UNKNOWN_MEMBER_ID; >>> 6. Broker rejects while allocating a member.id to c2 in response (c2 >> protocol version is current); >>> 7. c2 fails to get the response/crashes in the middle; >>> 8. After certain time, c2 restarts a join request with >> UNKNOWN_MEMBER_ID; >>> >>> As you could see, c2 will repeat step 6~8 until successfully send back a >> join group request with allocated id. >>> By then broker will include c2 within the broker metadata map. >>> >>> Does this sound clear to you? >>> >>> Best, >>> Boyang >>> ________________________________ >>> From: Stanislav Kozlovski <stanis...@confluent.io> >>> Sent: Wednesday, November 28, 2018 7:39 PM >>> To: dev@kafka.apache.org >>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join >> group request >>> >>> Hey Matthias, >>> >>> I think the notion is to have the `session.timeout.ms` to start ticking >>> when the broker responds with the member.id. Then, the broker would >>> properly expire consumers and not hold too many stale ones. >>> This isn't mentioned in the KIP though so it is worth to wait for Boyang >> to >>> confirm >>> >>> On Wed, Nov 28, 2018 at 3:10 AM Matthias J. Sax <matth...@confluent.io> >>> wrote: >>> >>>> Thanks for the KIP Boyang. >>>> >>>> I guess I am missing something, but I am still learning more details >>>> about the rebalance protocol, so maybe you can help me out? >>>> >>>> Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup request. >>>> The broker generates a `member.id` and sends it back via >>>> `MEMBER_ID_REQUIRED` error response. This response might never reach the >>>> client or the client fails before it can send the second joinGroup >>>> request. Thus, a client would need to start over with a new >>>> UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs to >>>> generate a new `member.id` again. >>>> >>>> So it seems the problem is moved, but not resolved? The motivation of >>>> the KIP is: >>>> >>>>> The edge case is that if initial join group request keeps failing due >> to >>>> connection timeout, or the consumer keeps restarting, >>>> >>>> From my understanding, this KIP move the issue from the first to the >>>> second joinGroup request (or broker joinGroup response). >>>> >>>> But maybe I am missing something. Can you help me out? >>>> >>>> >>>> -Matthias >>>> >>>> >>>> On 11/27/18 6:00 PM, Boyang Chen wrote: >>>>> Thanks Stanislav and Jason for the suggestions! >>>>> >>>>> >>>>>> Thanks for the KIP. Looks good overall. I think we will need to bump >> the >>>>>> version of the JoinGroup protocol in order to indicate compatibility >>>> with >>>>>> the new behavior. The coordinator needs to know when it is safe to >>>> assume >>>>>> the client will handle the error code. >>>>>> >>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS >> error >>>>>> code. When the client sees this error code, it will take the memberId >>>> from >>>>>> the response and rejoin. We'd still need the protocol bump since older >>>>>> consumers do not have this logic. >>>>> >>>>> I will add the join group protocol version change to the KIP. Meanwhile >>>> I feel for >>>>> understandability it's better to define a separate error code since >>>> REBALANCE_IN_PROGRESS >>>>> is not the actual cause of the returned error. >>>>> >>>>>> One small question I have is now that we have one and a half >> round-trips >>>>>> needed to join in a rebalance (1 full RT addition), is it worth it to >>>>>> consider increasing the default value of ` >>>> group.initial.rebalance.delay.ms`? >>>>> I guess we could keep it for now. After KIP-345 and incremental >>>> cooperative rebalancing >>>>> work we should be safe to deprecate `group.initial.rebalance.delay.ms >> `. >>>> Also one round trip >>>>> shouldn't increase the latency too much IMO. >>>>> >>>>> Best, >>>>> Boyang >>>>> ________________________________ >>>>> From: Stanislav Kozlovski <stanis...@confluent.io> >>>>> Sent: Wednesday, November 28, 2018 2:32 AM >>>>> To: dev@kafka.apache.org >>>>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join >>>> group request >>>>> >>>>> Hi Boyang, >>>>> >>>>> The KIP looks very good. >>>>> One small question I have is now that we have one and a half >> round-trips >>>>> needed to join in a rebalance (1 full RT addition), is it worth it to >>>>> consider increasing the default value of ` >>>> group.initial.rebalance.delay.ms`? >>>>> >>>>> Best, >>>>> Stanislav >>>>> >>>>> On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <ja...@confluent.io> >>>> wrote: >>>>> >>>>>> Hi Boyang, >>>>>> >>>>>> Thanks for the KIP. Looks good overall. I think we will need to bump >> the >>>>>> version of the JoinGroup protocol in order to indicate compatibility >>>> with >>>>>> the new behavior. The coordinator needs to know when it is safe to >>>> assume >>>>>> the client will handle the error code. >>>>>> >>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS >> error >>>>>> code. When the client sees this error code, it will take the memberId >>>> from >>>>>> the response and rejoin. We'd still need the protocol bump since older >>>>>> consumers do not have this logic. >>>>>> >>>>>> Thanks, >>>>>> Jason >>>>>> >>>>>> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bche...@outlook.com> >>>> wrote: >>>>>> >>>>>>> Hey friends, >>>>>>> >>>>>>> >>>>>>> I would like to start a discussion thread for KIP-394 which is trying >>>> to >>>>>>> mitigate broker cache bursting issue due to anonymous join group >>>>>> requests: >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>> >> https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&data=02%7C01%7C%7C046b858d134f4c1c576108d655277552%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636790025286277394&sdata=QyBzqnislL%2B9fK1mXaRuJ0xpi9Y2JDvHrM881hzjq3A%3D&reserved=0 >>>>>>> >>>>>>> >>>>>>> Thanks! >>>>>>> >>>>>>> Boyang >>>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Best, >>>>> Stanislav >>>>> >>>> >>>> >>> >>> -- >>> Best, >>> Stanislav >>> >> >> >