+1 (non-binding)
> On Mar 19, 2020, at 7:27 PM, Sanjana Kaundinya <skaundi...@gmail.com> wrote:
>
> Ah yes that makes sense. I’ll update the KIP to reflect this.
>
> Thanks,
> Sanjana
>
> On Thu, Mar 19, 2020 at 5:48 PM Guozhang Wang <wangg...@gmail.com> wrote:
>
>> Following the formula you have in the KIP, if it is simply:
>>
>> MIN(retry.backoff.max.ms, (retry.backoff.ms * 2**(failures - 1)) * random(0.8, 1.2))
>>
>> then the behavior would stay consistent at retry.backoff.max.ms.
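A minimal Java sketch of the formula quoted above, for readers following along (the class and method names are hypothetical, not from the KIP or the client code). It applies the jitter before the MIN cap, which is why the behavior settles at retry.backoff.max.ms once the exponential term passes the cap:

    import java.util.concurrent.ThreadLocalRandom;

    // Hypothetical sketch of the proposed backoff computation; not the KIP's actual code.
    final class ExponentialBackoffSketch {
        private final long initialMs;  // retry.backoff.ms
        private final long maxMs;      // retry.backoff.max.ms
        private static final double JITTER = 0.2;

        ExponentialBackoffSketch(long initialMs, long maxMs) {
            this.initialMs = initialMs;
            this.maxMs = maxMs;
        }

        long backoffMs(int failures) {
            // retry.backoff.ms * 2**(failures - 1), jittered by random(0.8, 1.2), capped at the max
            double exp = initialMs * Math.pow(2, Math.max(0, failures - 1));
            double jittered = exp * ThreadLocalRandom.current().nextDouble(1 - JITTER, 1 + JITTER);
            return (long) Math.min(maxMs, jittered);
        }

        public static void main(String[] args) {
            ExponentialBackoffSketch backoff = new ExponentialBackoffSketch(100, 1000);
            for (int failures = 1; failures <= 6; failures++) {
                System.out.println("failures=" + failures + " -> " + backoff.backoffMs(failures) + " ms");
            }
        }
    }

With the defaults discussed in this thread (100 ms initial, 1000 ms max), backoffMs(5) and beyond always lands on the 1000 ms cap, which is the consistency Guozhang describes.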
>>
>>
>> Guozhang
>>
>> On Thu, Mar 19, 2020 at 5:46 PM Sanjana Kaundinya <skaundi...@gmail.com>
>> wrote:
>>
>>> If that’s the case then what should we base the starting point as?
>>> Currently in the KIP the starting point is retry.backoff.ms and it
>>> exponentially goes up to retry.backoff.max.ms. If retry.backoff.max.ms is
>>> smaller than retry.backoff.ms then that could pose a bit of a problem
>>> there right?
>>>
>>> On Mar 19, 2020, 5:44 PM -0700, Guozhang Wang <wangg...@gmail.com>,
>> wrote:
>>>> Thanks Sanjana, I did not capture the part that Jason referred to, so
>>>> that's my bad :P
>>>>
>>>> Regarding your last statement, I actually feel that instead of taking the
>>>> larger of the two, we should respect "retry.backoff.max.ms" even if it is
>>>> smaller than "retry.backoff.ms". I do not have a very strong rationale
>>>> except it is logically more aligned to the config names.
>>>>
>>>>
>>>> Guozhang
>>>>
>>>>
>>>> On Thu, Mar 19, 2020 at 5:39 PM Sanjana Kaundinya <
>> skaundi...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hey Jason and Guozhang,
>>>>>
>>>>> Jason is right, I took this inspiration from KIP-144 (
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts
>>>>> ), which had the same logic in order to preserve the existing behavior. In
>>>>> this case however, if we are thinking of completely eliminating the static
>>>>> backoff behavior, we can do that and, as Jason mentioned, put it in the
>>>>> release notes and not add any special logic. In addition I agree that we
>>>>> should take the larger of `retry.backoff.ms` and `retry.backoff.max.ms`.
>>>>> I'll update the KIP to reflect this and make it clear that the old static
>>>>> retry backoff is getting replaced by the new dynamic retry backoff.
>>>>>
>>>>> Thanks,
>>>>> Sanjana
>>>>> On Thu, Mar 19, 2020 at 4:23 PM Jason Gustafson <ja...@confluent.io>
>>>>> wrote:
>>>>>
>>>>>> Hey Guozhang,
>>>>>>
>>>>>> I was referring to this:
>>>>>>
>>>>>>> For users who have not set retry.backoff.ms explicitly, the default
>>>>>>> behavior will change so that the backoff will grow up to 1000 ms. For
>>>>>>> users who have set retry.backoff.ms explicitly, the behavior will
>>>>>>> remain the same as they could have specific requirements.
>>>>>>
>>>>>> I took this to mean that for users who have overridden
>>>>>> `retry.backoff.ms` to 50ms (say), we will change the default
>>>>>> `retry.backoff.max.ms` to 50ms as well in order to preserve existing
>>>>>> backoff behavior. Is that not right? In any case, I agree that we can
>>>>>> use the maximum of the two values as the effective
>>>>>> `retry.backoff.max.ms` to handle the case when the configured value of
>>>>>> `retry.backoff.ms` is larger than the default of 1s.
>>>>>>
>>>>>> -Jason
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <wangg...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>>> Hey Jason,
>>>>>>>
>>>>>>> My understanding is a bit different here: even if the user has
>>>>>>> explicitly overridden "retry.backoff.ms", the exponential mechanism
>>>>>>> still triggers and the backoff would be increased up to
>>>>>>> "retry.backoff.max.ms"; and if the specified "retry.backoff.ms" is
>>>>>>> already larger than "retry.backoff.max.ms", we would still take
>>>>>>> "retry.backoff.max.ms".
>>>>>>>
>>>>>>> So if the user does override the "retry.backoff.ms" to a value larger
>>>>>>> than 1s and is not aware of the new config, she would be surprised to
>>>>>>> see the specified value seemingly not being respected, but she could
>>>>>>> still learn that afterwards by reading the release notes introducing
>>>>>>> this KIP anyways.
>>>>>>>
>>>>>>>
>>>>>>> Guozhang
>>>>>>>
>>>>>>> On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <
>>> ja...@confluent.io>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Sanjana,
>>>>>>>>
>>>>>>>> The KIP looks good to me. I had just one question about the default
>>>>>>>> behavior. As I understand, if the user has specified `retry.backoff.ms`
>>>>>>>> explicitly, then we will not apply the default max backoff. As such,
>>>>>>>> there's no way to get the benefit of this feature if you are providing
>>>>>>>> a `retry.backoff.ms` unless you also provide `retry.backoff.max.ms`.
>>>>>>>> That makes sense if you assume the user is unaware of the new
>>>>>>>> configuration, but it is surprising otherwise. Since it's not a
>>>>>>>> semantic change and since the default you're proposing of 1s is fairly
>>>>>>>> low already, I wonder if it's good enough to mention the new
>>>>>>>> configuration in the release notes and not add any special logic. What
>>>>>>>> do you think?
>>>>>>>>
>>>>>>>> -Jason
>>>>>>>>
>>>>>>>> On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <
>>>>>> skaundi...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thank you for the comments Guozhang.
>>>>>>>>>
>>>>>>>>> I’ll leave this KIP out for discussion till the end of the week and
>>>>>>>>> then start a vote for this early next week.
>>>>>>>>>
>>>>>>>>> Sanjana
>>>>>>>>>
>>>>>>>>> On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <
>>> wangg...@gmail.com
>>>>>> ,
>>>>>>>> wrote:
>>>>>>>>>> Hello Sanjana,
>>>>>>>>>>
>>>>>>>>>> Thanks for the proposed KIP, I think that makes a lot of sense -- as
>>>>>>>>>> you mentioned in the motivation, we've indeed seen many issues with
>>>>>>>>>> regard to the frequent retries. With bounded exponential backoff, in
>>>>>>>>>> the scenario where there's a long connectivity issue we would
>>>>>>>>>> effectively reduce the request load by a factor of 10 given the
>>>>>>>>>> default configs (10 retries per second at a static 100 ms backoff vs.
>>>>>>>>>> roughly 1 per second once the backoff reaches the 1000 ms cap).
>>>>>>>>>>
>>>>>>>>>> For the higher-level Streams client and Connect frameworks, today we
>>>>>>>>>> also have retry logic, but it's used in a slightly different way. For
>>>>>>>>>> example in Streams, we tend to handle the retry logic at the
>>>>>>>>>> thread-level and hence very likely we'd like to change that mechanism
>>>>>>>>>> in KIP-572 anyways. For producer / consumer / admin clients, I think
>>>>>>>>>> just applying this behavioral change across these clients makes a lot
>>>>>>>>>> of sense. So I think we can just leave Streams / Connect out of the
>>>>>>>>>> scope of this KIP, to be addressed in separate discussions.
>>>>>>>>>>
>>>>>>>>>> I do not have further comments about this KIP :) LGTM.
>>>>>>>>>>
>>>>>>>>>> Guozhang
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <
>>>>>>>> skaundi...@gmail.com
>>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for the feedback Boyang.
>>>>>>>>>>>
>>>>>>>>>>> If there’s anyone else who has feedback regarding this KIP, I would
>>>>>>>>>>> really appreciate hearing it!
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Sanjana
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <
>>>>>> bche...@outlook.com>
>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sounds great!
>>>>>>>>>>>>
>>>>>>>>>>>> ________________________________
>>>>>>>>>>>> From: Sanjana Kaundinya <skaundi...@gmail.com>
>>>>>>>>>>>> Sent: Tuesday, March 17, 2020 5:54:35 PM
>>>>>>>>>>>> To: dev@kafka.apache.org <dev@kafka.apache.org>
>>>>>>>>>>>> Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for
>>> Kafka
>>>>>>>> Clients
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the explanation Boyang. One of the most common problems
>>>>>>>>>>>> that we have in Kafka is with respect to metadata fetches. For
>>>>>>>>>>>> example, if there is a broker failure, all clients start to fetch
>>>>>>>>>>>> metadata at the same time and it often takes a while for the
>>>>>>>>>>>> metadata to converge. In a high-load cluster, there are also issues
>>>>>>>>>>>> where the volume of metadata has made convergence slower.
>>>>>>>>>>>>
>>>>>>>>>>>> For this case, exponential backoff helps as it reduces the retry
>>>>>>>>>>>> rate and spaces out how often clients will retry, thereby bringing
>>>>>>>>>>>> down the time for convergence. Something that Jason mentioned that
>>>>>>>>>>>> would be a great addition here is that the backoff should be
>>>>>>>>>>>> “jittered”, as it was in KIP-144 with respect to the exponential
>>>>>>>>>>>> reconnect backoff. This would help prevent the clients from being
>>>>>>>>>>>> synchronized on when they retry, thereby spacing out the number of
>>>>>>>>>>>> requests being sent to the broker at the same time.
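To make the jitter point concrete, here is a small Java toy (illustrative only, not from the KIP; the numbers match the defaults discussed in this thread). It prints the retry timestamps of five clients that all fail at the same instant; with the random factor in [0.8, 1.2) their schedules drift apart instead of hitting the broker simultaneously:

    import java.util.Random;

    // Toy simulation: backoff doubles from 100 ms up to a 1000 ms cap,
    // with each step multiplied by a random factor in [0.8, 1.2).
    public class JitterDemo {
        public static void main(String[] args) {
            final long initialMs = 100, maxMs = 1000;
            Random random = new Random();
            for (int client = 0; client < 5; client++) {
                long t = 0;
                StringBuilder schedule = new StringBuilder("client " + client + " retries at:");
                for (int failures = 0; failures < 6; failures++) {
                    double jitter = 0.8 + 0.4 * random.nextDouble();
                    t += (long) Math.min(maxMs, initialMs * Math.pow(2, failures) * jitter);
                    schedule.append(' ').append(t).append("ms");
                }
                System.out.println(schedule);  // the timestamps diverge across clients
            }
        }
    }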
>>>>>>>>>>>>
>>>>>>>>>>>> I’ll add this example to the KIP and flesh out more of the details
>>>>>>>>>>>> so it’s more clear.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <
>>>>>>>>> reluctanthero...@gmail.com
>>>>>>>>>>>> ,
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Thanks for the reply Sanjana. I guess I would like to rephrase my
>>>>>>>>>>>>> questions 2 and 3 as my previous response is a little bit
>>>>>>>>>>>>> unactionable.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My specific point is that exponential backoff is not a silver
>>>>>>>>>>>>> bullet and we should consider using it to solve known problems,
>>>>>>>>>>>>> instead of making holistic changes to all clients in the Kafka
>>>>>>>>>>>>> ecosystem. I do like the exponential backoff idea and believe this
>>>>>>>>>>>>> would be of great value, but maybe we should focus on proposing
>>>>>>>>>>>>> some existing modules that are suffering from static retry, and
>>>>>>>>>>>>> only change them in this first KIP. If in the future, some other
>>>>>>>>>>>>> component users believe they are also suffering, we could get more
>>>>>>>>>>>>> minor KIPs to change the behavior as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Boyang
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
>>>>>>>>>>> skaundi...@gmail.com
>>>>>>>>>>>>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the feedback Boyang, I will revise the KIP with the
>>>>>>>>>>>>>> mathematical relations as per your suggestion. To address your
>>>>>>>>>>>>>> feedback:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Currently, with the default of 100 ms per retry backoff, in 1
>>>>>>>>>>>>>> second we would have 10 retries. In the case of using an
>>>>>>>>>>>>>> exponential backoff, we would have a total of 4 retries in 1
>>>>>>>>>>>>>> second. Thus we have less than half the number of retries in the
>>>>>>>>>>>>>> same timeframe and can lessen broker pressure. This calculation
>>>>>>>>>>>>>> is done as follows (using the formula laid out in the KIP):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Try 1 at time 0 ms, failures = 0, next retry in 100 ms (default
>>>>>>>>>>>>>> retry ms is initially 100 ms)
>>>>>>>>>>>>>> Try 2 at time 100 ms, failures = 1, next retry in 200 ms
>>>>>>>>>>>>>> Try 3 at time 300 ms, failures = 2, next retry in 400 ms
>>>>>>>>>>>>>> Try 4 at time 700 ms, failures = 3, next retry in 800 ms
>>>>>>>>>>>>>> Try 5 at time 1500 ms, failures = 4, next retry in 1000 ms
>>>>>>>>>>>>>> (default max retry ms is 1000 ms)
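As a quick, deterministic sanity check of the 10-vs-4 numbers above (no jitter, hypothetical class name), counting attempts inside a one-second window for the static and the exponential schedules:

    public class BackoffCount {
        public static void main(String[] args) {
            // Static schedule: a retry every 100 ms.
            int staticTries = 0;
            for (long t = 0; t < 1000; t += 100) {
                staticTries++;                 // tries at 0, 100, ..., 900 ms
            }

            // Exponential schedule: backoff doubles from 100 ms, capped at 1000 ms.
            int expTries = 0;
            long backoff = 100;
            for (long t = 0; t < 1000; t += backoff, backoff = Math.min(1000, backoff * 2)) {
                expTries++;                    // tries at 0, 100, 300, 700 ms
            }

            System.out.println(staticTries + " static tries vs " + expTries + " exponential tries in 1 s");
        }
    }

This prints "10 static tries vs 4 exponential tries in 1 s", matching the schedule above.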
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For 2 and 3, could you elaborate more about what you mean with
>>>>>>>>>>>>>> respect to client timeouts? I’m not very familiar with the
>>>>>>>>>>>>>> Streams framework, so would love to get more insight into how
>>>>>>>>>>>>>> that currently works, with respect to producer transactions, so I
>>>>>>>>>>>>>> can appropriately update the KIP to address these scenarios.
>>>>>>>>>>>>>> On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
>>>>>>>>>>>> reluctanthero...@gmail.com>,
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> Thanks for the KIP Sanjana. I think the motivation is good, but
>>>>>>>>>>>>>>> it lacks more quantitative analysis. For instance:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. How many retries are we saving by applying exponential retry
>>>>>>>>>>>>>>> vs static retry? There should be some mathematical relations
>>>>>>>>>>>>>>> between the static retry ms, the initial exponential retry ms,
>>>>>>>>>>>>>>> and the max exponential retry ms in a given time interval.
>>>>>>>>>>>>>>> 2. How does this affect the client timeout? With exponential
>>>>>>>>>>>>>>> retry, the client will time out more easily at a parent-level
>>>>>>>>>>>>>>> caller; for instance Streams attempts to retry initializing
>>>>>>>>>>>>>>> producer transactions within a given 5 minute interval. With
>>>>>>>>>>>>>>> exponential retry this mechanism could experience more frequent
>>>>>>>>>>>>>>> timeouts, which we should be careful with.
>>>>>>>>>>>>>>> 3. With regards to #2, we should have a more detailed checklist
>>>>>>>>>>>>>>> of all the existing static retry scenarios, and adjust the
>>>>>>>>>>>>>>> initial exponential retry ms to make sure we won't easily time
>>>>>>>>>>>>>>> out at a higher level due to too few attempts.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Boyang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Mar 13, 2020 at 4:38 PM Sanjana
>> Kaundinya <
>>>>>>>>>>>> skaundi...@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I’ve written a KIP about introducing exponential backoff for
>>>>>>>>>>>>>>>> Kafka clients. Would appreciate any feedback on this.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Sanjana
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> -- Guozhang
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> -- Guozhang
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> -- Guozhang
>>>
>>
>>
>> --
>> -- Guozhang
>>