Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Boyang Chen Fri, 30 Nov 2018 18:54:02 -0800

Thanks Jason for the reply! Since the overall motivation and design is pretty 
clear, I will go ahead to start implementation and we could discuss the 
underlying details in the PR.


Best,
Boyang
________________________________
From: Matthias J. Sax <matth...@confluent.io>
Sent: Saturday, December 1, 2018 3:12 AM
To: dev@kafka.apache.org
Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group request

SGTM.

On 11/30/18 10:17 AM, Jason Gustafson wrote:
> Using the session expiration logic we already have seems like the simplest
> option (this is probably a one or two line change). The rejoin should be
> quick anyway, so I don't think it's worth optimizing for unjoined new
> members. Just my two cents. This is more of an implementation detail, so
> need not necessarily be resolved here.
>
> -Jason
>
> On Fri, Nov 30, 2018 at 12:56 AM Boyang Chen <bche...@outlook.com> wrote:
>
>> Thanks Matthias for the question. I'm thinking of having a separate hash
>> set called `registeredMemberIds` which
>> will be cleared out every time a group finishes one round of rebalance.
>> Since storing one id is pretty trivial, using
>> purgatory to track the id removal is a bit wasteful in my opinion.
>> ________________________________
>> From: Matthias J. Sax <matth...@confluent.io>
>> Sent: Friday, November 30, 2018 10:26 AM
>> To: dev@kafka.apache.org
>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group
>> request
>>
>> Thanks! Makes sense.
>>
>> I missed that fact, that the `member.id` is added on the second
>> joinGroup request that contains the `member.id`.
>>
>> However, it seems there is another race condition for this design:
>>
>> If two consumers join at the same time, it it possible that the broker
>> assigns the same `member.id` to both (because none of them have joined
>> the group yet--ie, second joinGroup request not sent yet--, the
>> `member.id` is not store broker side yes and broker cannot check for
>> duplicates when creating a new `member.id`.
>>
>> The probability might be fairly low thought. However, what Stanislav
>> proposed, to add the `member.id` directly, and remove it after
>> `session.timeout.ms` sound like a save option that avoids this issue.
>>
>> Thoughts?
>>
>>
>> -Matthias
>>
>> On 11/28/18 8:15 PM, Boyang Chen wrote:
>>> Thanks Matthias for the question, and Stanislav for the explanation!
>>>
>>> For the scenario described, we will never let a member join the
>> GroupMetadata map
>>> if it uses UNKNOWN_MEMBER_ID. So the workflow will be like this:
>>>
>>>   1.  Group is empty. Consumer c1 started. Join with UNKNOWN_MEMBER_ID;
>>>   2.  Broker rejects while allocating a member.id to c1 in response (c1
>> protocol version is current);
>>>   3.  c1 handles the error and rejoins with assigned member.id;
>>>   4.  Broker stores c1 in its group metadata;
>>>   5.  Consumer c2 started. Join with UNKNOWN_MEMBER_ID;
>>>   6.  Broker rejects while allocating a member.id to c2 in response (c2
>> protocol version is current);
>>>   7.  c2 fails to get the response/crashes in the middle;
>>>   8.  After certain time, c2 restarts a join request with
>> UNKNOWN_MEMBER_ID;
>>>
>>> As you could see, c2 will repeat step 6~8 until successfully send back a
>> join group request with allocated id.
>>> By then broker will include c2 within the broker metadata map.
>>>
>>> Does this sound clear to you?
>>>
>>> Best,
>>> Boyang
>>> ________________________________
>>> From: Stanislav Kozlovski <stanis...@confluent.io>
>>> Sent: Wednesday, November 28, 2018 7:39 PM
>>> To: dev@kafka.apache.org
>>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
>> group request
>>>
>>> Hey Matthias,
>>>
>>> I think the notion is to have the `session.timeout.ms` to start ticking
>>> when the broker responds with the member.id. Then, the broker would
>>> properly expire consumers and not hold too many stale ones.
>>> This isn't mentioned in the KIP though so it is worth to wait for Boyang
>> to
>>> confirm
>>>
>>> On Wed, Nov 28, 2018 at 3:10 AM Matthias J. Sax <matth...@confluent.io>
>>> wrote:
>>>
>>>> Thanks for the KIP Boyang.
>>>>
>>>> I guess I am missing something, but I am still learning more details
>>>> about the rebalance protocol, so maybe you can help me out?
>>>>
>>>> Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup request.
>>>> The broker generates a `member.id` and sends it back via
>>>> `MEMBER_ID_REQUIRED` error response. This response might never reach the
>>>> client or the client fails before it can send the second joinGroup
>>>> request. Thus, a client would need to start over with a new
>>>> UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs to
>>>> generate a new `member.id` again.
>>>>
>>>> So it seems the problem is moved, but not resolved? The motivation of
>>>> the KIP is:
>>>>
>>>>> The edge case is that if initial join group request keeps failing due
>> to
>>>> connection timeout, or the consumer keeps restarting,
>>>>
>>>> From my understanding, this KIP move the issue from the first to the
>>>> second joinGroup request (or broker joinGroup response).
>>>>
>>>> But maybe I am missing something. Can you help me out?
>>>>
>>>>
>>>> -Matthias
>>>>
>>>>
>>>> On 11/27/18 6:00 PM, Boyang Chen wrote:
>>>>> Thanks Stanislav and Jason for the suggestions!
>>>>>
>>>>>
>>>>>> Thanks for the KIP. Looks good overall. I think we will need to bump
>> the
>>>>>> version of the JoinGroup protocol in order to indicate compatibility
>>>> with
>>>>>> the new behavior. The coordinator needs to know when it is safe to
>>>> assume
>>>>>> the client will handle the error code.
>>>>>>
>>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
>> error
>>>>>> code. When the client sees this error code, it will take the memberId
>>>> from
>>>>>> the response and rejoin. We'd still need the protocol bump since older
>>>>>> consumers do not have this logic.
>>>>>
>>>>> I will add the join group protocol version change to the KIP. Meanwhile
>>>> I feel for
>>>>> understandability it's better to define a separate error code since
>>>> REBALANCE_IN_PROGRESS
>>>>> is not the actual cause of the returned error.
>>>>>
>>>>>> One small question I have is now that we have one and a half
>> round-trips
>>>>>> needed to join in a rebalance (1 full RT addition), is it worth it to
>>>>>> consider increasing the default value of `
>>>> group.initial.rebalance.delay.ms`?
>>>>> I guess we could keep it for now. After KIP-345 and incremental
>>>> cooperative rebalancing
>>>>> work we should be safe to deprecate `group.initial.rebalance.delay.ms
>> `.
>>>> Also one round trip
>>>>> shouldn't increase the latency too much IMO.
>>>>>
>>>>> Best,
>>>>> Boyang
>>>>> ________________________________
>>>>> From: Stanislav Kozlovski <stanis...@confluent.io>
>>>>> Sent: Wednesday, November 28, 2018 2:32 AM
>>>>> To: dev@kafka.apache.org
>>>>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
>>>> group request
>>>>>
>>>>> Hi Boyang,
>>>>>
>>>>> The KIP looks very good.
>>>>> One small question I have is now that we have one and a half
>> round-trips
>>>>> needed to join in a rebalance (1 full RT addition), is it worth it to
>>>>> consider increasing the default value of `
>>>> group.initial.rebalance.delay.ms`?
>>>>>
>>>>> Best,
>>>>> Stanislav
>>>>>
>>>>> On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <ja...@confluent.io>
>>>> wrote:
>>>>>
>>>>>> Hi Boyang,
>>>>>>
>>>>>> Thanks for the KIP. Looks good overall. I think we will need to bump
>> the
>>>>>> version of the JoinGroup protocol in order to indicate compatibility
>>>> with
>>>>>> the new behavior. The coordinator needs to know when it is safe to
>>>> assume
>>>>>> the client will handle the error code.
>>>>>>
>>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
>> error
>>>>>> code. When the client sees this error code, it will take the memberId
>>>> from
>>>>>> the response and rejoin. We'd still need the protocol bump since older
>>>>>> consumers do not have this logic.
>>>>>>
>>>>>> Thanks,
>>>>>> Jason
>>>>>>
>>>>>> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bche...@outlook.com>
>>>> wrote:
>>>>>>
>>>>>>> Hey friends,
>>>>>>>
>>>>>>>
>>>>>>> I would like to start a discussion thread for KIP-394 which is trying
>>>> to
>>>>>>> mitigate broker cache bursting issue due to anonymous join group
>>>>>> requests:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>> https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&amp;data=02%7C01%7C%7C046b858d134f4c1c576108d655277552%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636790025286277394&amp;sdata=QyBzqnislL%2B9fK1mXaRuJ0xpi9Y2JDvHrM881hzjq3A%3D&amp;reserved=0
>>>>>>>
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Boyang
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best,
>>>>> Stanislav
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Best,
>>> Stanislav
>>>
>>
>>
>

Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Reply via email to