Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Mayuresh Gharat Wed, 05 Dec 2018 07:52:20 -0800

Thanks for the KIP Boyang and great to see the progress on solving the
rebalance issues with this and KIP-345.


Thanks,

Mayuresh

On Mon, Dec 3, 2018 at 4:57 AM Stanislav Kozlovski <[email protected]>
wrote:

> Everything sounds good to me.
>
> On Sun, Dec 2, 2018 at 1:24 PM Boyang Chen <[email protected]> wrote:
>
> > In fact, it's probably better to move KIP-394<
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-394%3A+Require+member.id+for+initial+join+group+request
> >
> > to the vote stage first, so that it's easier to finalize the timeline and
> > smooth the rollout plan for KIP-345. Jason and Stanislav, since you two
> > involve most in this KIP, could you let me know if there is still any
> > unclarity we want to resolve before moving to vote?
> >
> > Best,
> > Boyang
> > ________________________________
> > From: Boyang Chen <[email protected]>
> > Sent: Saturday, December 1, 2018 10:53 AM
> > To: [email protected]
> > Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group
> > request
> >
> > Thanks Jason for the reply! Since the overall motivation and design is
> > pretty clear, I will go ahead to start implementation and we could
> discuss
> > the underlying details in the PR.
> >
> > Best,
> > Boyang
> > ________________________________
> > From: Matthias J. Sax <[email protected]>
> > Sent: Saturday, December 1, 2018 3:12 AM
> > To: [email protected]
> > Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group
> > request
> >
> > SGTM.
> >
> > On 11/30/18 10:17 AM, Jason Gustafson wrote:
> > > Using the session expiration logic we already have seems like the
> > simplest
> > > option (this is probably a one or two line change). The rejoin should
> be
> > > quick anyway, so I don't think it's worth optimizing for unjoined new
> > > members. Just my two cents. This is more of an implementation detail,
> so
> > > need not necessarily be resolved here.
> > >
> > > -Jason
> > >
> > > On Fri, Nov 30, 2018 at 12:56 AM Boyang Chen <[email protected]>
> > wrote:
> > >
> > >> Thanks Matthias for the question. I'm thinking of having a separate
> hash
> > >> set called `registeredMemberIds` which
> > >> will be cleared out every time a group finishes one round of
> rebalance.
> > >> Since storing one id is pretty trivial, using
> > >> purgatory to track the id removal is a bit wasteful in my opinion.
> > >> ________________________________
> > >> From: Matthias J. Sax <[email protected]>
> > >> Sent: Friday, November 30, 2018 10:26 AM
> > >> To: [email protected]
> > >> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
> > group
> > >> request
> > >>
> > >> Thanks! Makes sense.
> > >>
> > >> I missed that fact, that the `member.id` is added on the second
> > >> joinGroup request that contains the `member.id`.
> > >>
> > >> However, it seems there is another race condition for this design:
> > >>
> > >> If two consumers join at the same time, it it possible that the broker
> > >> assigns the same `member.id` to both (because none of them have
> joined
> > >> the group yet--ie, second joinGroup request not sent yet--, the
> > >> `member.id` is not store broker side yes and broker cannot check for
> > >> duplicates when creating a new `member.id`.
> > >>
> > >> The probability might be fairly low thought. However, what Stanislav
> > >> proposed, to add the `member.id` directly, and remove it after
> > >> `session.timeout.ms` sound like a save option that avoids this issue.
> > >>
> > >> Thoughts?
> > >>
> > >>
> > >> -Matthias
> > >>
> > >> On 11/28/18 8:15 PM, Boyang Chen wrote:
> > >>> Thanks Matthias for the question, and Stanislav for the explanation!
> > >>>
> > >>> For the scenario described, we will never let a member join the
> > >> GroupMetadata map
> > >>> if it uses UNKNOWN_MEMBER_ID. So the workflow will be like this:
> > >>>
> > >>>   1.  Group is empty. Consumer c1 started. Join with
> UNKNOWN_MEMBER_ID;
> > >>>   2.  Broker rejects while allocating a member.id to c1 in response
> > (c1
> > >> protocol version is current);
> > >>>   3.  c1 handles the error and rejoins with assigned member.id;
> > >>>   4.  Broker stores c1 in its group metadata;
> > >>>   5.  Consumer c2 started. Join with UNKNOWN_MEMBER_ID;
> > >>>   6.  Broker rejects while allocating a member.id to c2 in response
> > (c2
> > >> protocol version is current);
> > >>>   7.  c2 fails to get the response/crashes in the middle;
> > >>>   8.  After certain time, c2 restarts a join request with
> > >> UNKNOWN_MEMBER_ID;
> > >>>
> > >>> As you could see, c2 will repeat step 6~8 until successfully send
> back
> > a
> > >> join group request with allocated id.
> > >>> By then broker will include c2 within the broker metadata map.
> > >>>
> > >>> Does this sound clear to you?
> > >>>
> > >>> Best,
> > >>> Boyang
> > >>> ________________________________
> > >>> From: Stanislav Kozlovski <[email protected]>
> > >>> Sent: Wednesday, November 28, 2018 7:39 PM
> > >>> To: [email protected]
> > >>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
> > >> group request
> > >>>
> > >>> Hey Matthias,
> > >>>
> > >>> I think the notion is to have the `session.timeout.ms` to start
> > ticking
> > >>> when the broker responds with the member.id. Then, the broker would
> > >>> properly expire consumers and not hold too many stale ones.
> > >>> This isn't mentioned in the KIP though so it is worth to wait for
> > Boyang
> > >> to
> > >>> confirm
> > >>>
> > >>> On Wed, Nov 28, 2018 at 3:10 AM Matthias J. Sax <
> [email protected]
> > >
> > >>> wrote:
> > >>>
> > >>>> Thanks for the KIP Boyang.
> > >>>>
> > >>>> I guess I am missing something, but I am still learning more details
> > >>>> about the rebalance protocol, so maybe you can help me out?
> > >>>>
> > >>>> Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup
> > request.
> > >>>> The broker generates a `member.id` and sends it back via
> > >>>> `MEMBER_ID_REQUIRED` error response. This response might never reach
> > the
> > >>>> client or the client fails before it can send the second joinGroup
> > >>>> request. Thus, a client would need to start over with a new
> > >>>> UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs
> to
> > >>>> generate a new `member.id` again.
> > >>>>
> > >>>> So it seems the problem is moved, but not resolved? The motivation
> of
> > >>>> the KIP is:
> > >>>>
> > >>>>> The edge case is that if initial join group request keeps failing
> due
> > >> to
> > >>>> connection timeout, or the consumer keeps restarting,
> > >>>>
> > >>>> From my understanding, this KIP move the issue from the first to the
> > >>>> second joinGroup request (or broker joinGroup response).
> > >>>>
> > >>>> But maybe I am missing something. Can you help me out?
> > >>>>
> > >>>>
> > >>>> -Matthias
> > >>>>
> > >>>>
> > >>>> On 11/27/18 6:00 PM, Boyang Chen wrote:
> > >>>>> Thanks Stanislav and Jason for the suggestions!
> > >>>>>
> > >>>>>
> > >>>>>> Thanks for the KIP. Looks good overall. I think we will need to
> bump
> > >> the
> > >>>>>> version of the JoinGroup protocol in order to indicate
> compatibility
> > >>>> with
> > >>>>>> the new behavior. The coordinator needs to know when it is safe to
> > >>>> assume
> > >>>>>> the client will handle the error code.
> > >>>>>>
> > >>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
> > >> error
> > >>>>>> code. When the client sees this error code, it will take the
> > memberId
> > >>>> from
> > >>>>>> the response and rejoin. We'd still need the protocol bump since
> > older
> > >>>>>> consumers do not have this logic.
> > >>>>>
> > >>>>> I will add the join group protocol version change to the KIP.
> > Meanwhile
> > >>>> I feel for
> > >>>>> understandability it's better to define a separate error code since
> > >>>> REBALANCE_IN_PROGRESS
> > >>>>> is not the actual cause of the returned error.
> > >>>>>
> > >>>>>> One small question I have is now that we have one and a half
> > >> round-trips
> > >>>>>> needed to join in a rebalance (1 full RT addition), is it worth it
> > to
> > >>>>>> consider increasing the default value of `
> > >>>> group.initial.rebalance.delay.ms`?
> > >>>>> I guess we could keep it for now. After KIP-345 and incremental
> > >>>> cooperative rebalancing
> > >>>>> work we should be safe to deprecate `
> > group.initial.rebalance.delay.ms
> > >> `.
> > >>>> Also one round trip
> > >>>>> shouldn't increase the latency too much IMO.
> > >>>>>
> > >>>>> Best,
> > >>>>> Boyang
> > >>>>> ________________________________
> > >>>>> From: Stanislav Kozlovski <[email protected]>
> > >>>>> Sent: Wednesday, November 28, 2018 2:32 AM
> > >>>>> To: [email protected]
> > >>>>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
> > >>>> group request
> > >>>>>
> > >>>>> Hi Boyang,
> > >>>>>
> > >>>>> The KIP looks very good.
> > >>>>> One small question I have is now that we have one and a half
> > >> round-trips
> > >>>>> needed to join in a rebalance (1 full RT addition), is it worth it
> to
> > >>>>> consider increasing the default value of `
> > >>>> group.initial.rebalance.delay.ms`?
> > >>>>>
> > >>>>> Best,
> > >>>>> Stanislav
> > >>>>>
> > >>>>> On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <
> [email protected]>
> > >>>> wrote:
> > >>>>>
> > >>>>>> Hi Boyang,
> > >>>>>>
> > >>>>>> Thanks for the KIP. Looks good overall. I think we will need to
> bump
> > >> the
> > >>>>>> version of the JoinGroup protocol in order to indicate
> compatibility
> > >>>> with
> > >>>>>> the new behavior. The coordinator needs to know when it is safe to
> > >>>> assume
> > >>>>>> the client will handle the error code.
> > >>>>>>
> > >>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
> > >> error
> > >>>>>> code. When the client sees this error code, it will take the
> > memberId
> > >>>> from
> > >>>>>> the response and rejoin. We'd still need the protocol bump since
> > older
> > >>>>>> consumers do not have this logic.
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Jason
> > >>>>>>
> > >>>>>> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <[email protected]>
> > >>>> wrote:
> > >>>>>>
> > >>>>>>> Hey friends,
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> I would like to start a discussion thread for KIP-394 which is
> > trying
> > >>>> to
> > >>>>>>> mitigate broker cache bursting issue due to anonymous join group
> > >>>>>> requests:
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>
> > >>
> >
> https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&amp;data=02%7C01%7C%7C3ca95629be9e42b1f00108d657383bfd%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636792296362447479&amp;sdata=3BuPVUH5v3hMYe%2FMgpSsNftTwb5DsHDlm2lN%2FVUR0T8%3D&amp;reserved=0
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Thanks!
> > >>>>>>>
> > >>>>>>> Boyang
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>> --
> > >>>>> Best,
> > >>>>> Stanislav
> > >>>>>
> > >>>>
> > >>>>
> > >>>
> > >>> --
> > >>> Best,
> > >>> Stanislav
> > >>>
> > >>
> > >>
> > >
> >
> >
>
> --
> Best,
> Stanislav
>


-- 
-Regards,
Mayuresh R. Gharat
(862) 250-7125

Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Reply via email to