Hey Sanjana,

My understanding of the update is that a Producer/Consumer/Admin client user who provides no config at all would by default get a starting retry.backoff.ms of 100 ms and a retry.backoff.max.ms of 1000 ms, correct? And if I have already overridden retry.backoff.ms to 2000 ms, for instance, would I still be getting the default retry.backoff.max.ms here?
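To make my question concrete, here is a quick Python sketch of the formula as it is quoted later in this thread. The function name and the defaults-as-arguments are just mine for illustration, and whether the cap should simply win when retry.backoff.ms is set above it is exactly the behavior I am asking about:

```python
import random

def backoff_ms(failures, retry_backoff_ms=100, retry_backoff_max_ms=1000):
    # MIN(retry.backoff.max.ms, retry.backoff.ms * 2**(failures - 1) * random(0.8, 1.2))
    # Because the jittered term sits inside the MIN, once the cap is reached
    # the backoff stays constant at exactly retry.backoff.max.ms.
    return min(retry_backoff_max_ms,
               retry_backoff_ms * 2 ** (failures - 1) * random.uniform(0.8, 1.2))

# With the defaults, the first retry waits ~100 ms (80-120 ms with jitter) and
# the wait is capped at 1000 ms. With retry.backoff.ms overridden to 2000 ms,
# the default cap of 1000 ms wins from the very first retry.
```

Read this way, an override of retry.backoff.ms = 2000 ms combined with the default cap degenerates into a static 1000 ms backoff, which is why it smells like a misconfiguration to me.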
I guess my question would be whether we should just reject a config with retry.backoff.ms > retry.backoff.max.ms in the first place, as that looks like a misconfiguration to me. My second question is whether we should allow a fallback to static backoff if the user wants it, or whether we should ship this as an opt-in feature. Let me know your thoughts.

Boyang

On Mon, Mar 23, 2020 at 11:38 AM Cheng Tan <c...@confluent.io> wrote:
> +1 (non-binding)
>
>
> On Mar 19, 2020, at 7:27 PM, Sanjana Kaundinya <skaundi...@gmail.com>
> wrote:
> >
> > Ah yes that makes sense. I’ll update the KIP to reflect this.
> >
> > Thanks,
> > Sanjana
> >
> > On Thu, Mar 19, 2020 at 5:48 PM Guozhang Wang <wangg...@gmail.com>
> wrote:
> >
> >> Following the formula you have in the KIP, if it is simply:
> >>
> >> MIN(retry.backoff.max.ms, (retry.backoff.ms * 2**(failures - 1)) * random(
> >> 0.8, 1.2))
> >>
> >> then the behavior would stay consistent at retry.backoff.max.ms.
> >>
> >>
> >> Guozhang
> >>
> >> On Thu, Mar 19, 2020 at 5:46 PM Sanjana Kaundinya <skaundi...@gmail.com
> >
> >> wrote:
> >>
> >>> If that’s the case then what should we base the starting point as?
> >>> Currently in the KIP the starting point is retry.backoff.ms and it
> >>> exponentially goes up to retry.backoff.max.ms. If retry.backoff.max.ms
> >> is
> >>> smaller than retry.backoff.ms then that could pose a bit of a problem
> >>> there right?
> >>>
> >>> On Mar 19, 2020, 5:44 PM -0700, Guozhang Wang <wangg...@gmail.com>,
> >> wrote:
> >>>> Thanks Sanjana, I did not capture the part that Jason referred to, so
> >>>> that's my bad :P
> >>>>
> >>>> Regarding your last statement, I actually feel that instead of take
> the
> >>>> larger of the two, we should respect "retry.backoff.max.ms" even if
> it
> >>> is
> >>>> smaller than "retry.backoff.ms". I do not have a very strong
> rationale
> >>>> except it is logically more aligned to the config names.
> >>>>
> >>>>
> >>>> Guozhang
> >>>>
> >>>>
> >>>> On Thu, Mar 19, 2020 at 5:39 PM Sanjana Kaundinya <
> >> skaundi...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hey Jason and Guozhang,
> >>>>>
> >>>>> Jason is right, I took this inspiration from KIP-144 (
> >>>>>
> >>>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts
> >>>>> )
> >>>>> which had the same logic in order to preserve the existing behavior.
> >> In
> >>>>> this case however, if we are thinking to completely eliminate the
> >>> static
> >>>>> backoff behavior, we can do that and as Jason mentioned put it in the
> >>>>> release notes and not add any special logic. In addition I agree that
> >>> we
> >>>>> should take the larger of the two of `retry.backoff.ms` and `
> >>>>> retry.backoff.max.ms`. I'll update the KIP to reflect this and make
> >> it
> >>>>> clear that the old static retry backoff is getting replaced by the
> >> new
> >>>>> dynamic retry backoff.
> >>>>>
> >>>>> Thanks,
> >>>>> Sanjana
> >>>>> On Thu, Mar 19, 2020 at 4:23 PM Jason Gustafson <ja...@confluent.io>
> >>>>> wrote:
> >>>>>
> >>>>>> Hey Guozhang,
> >>>>>>
> >>>>>> I was referring to this:
> >>>>>>
> >>>>>>> For users who have not set retry.backoff.ms explicitly, the
> >>> default
> >>>>>> behavior will change so that the backoff will grow up to 1000 ms.
> >> For
> >>>>> users
> >>>>>> who have set retry.backoff.ms explicitly, the behavior will remain
> >>> the
> >>>>>> same
> >>>>>> as they could have specific requirements.
> >>>>>>
> >>>>>> I took this to mean that for users who have overridden `
> >>> retry.backoff.ms
> >>>>> `
> >>>>>> to 50ms (say), we will change the default `retry.backoff.max.ms`
> >> to
> >>> 50ms
> >>>>>> as
> >>>>>> well in order to preserve existing backoff behavior. Is that not
> >>> right?
> >>>>> In
> >>>>>> any case, I agree that we can use the maximum of the two values as
> >>> the
> >>>>>> effective `retry.backoff.max.ms` to handle the case when the
> >>> configured
> >>>>>> value of `retry.backoff.ms` is larger than the default of 1s.
> >>>>>>
> >>>>>> -Jason
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <wangg...@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>>> Hey Jason,
> >>>>>>>
> >>>>>>> My understanding is a bit different here: even if user has an
> >>> explicit
> >>>>>>> overridden "retry.backoff.ms", the exponential mechanism still
> >>>>> triggers
> >>>>>>> and
> >>>>>>> the backoff would be increased till "retry.backoff.max.ms"; and
> >>> if the
> >>>>>>> specified "retry.backoff.ms" is already larger than the "
> >>>>>>> retry.backoff.max.ms", we would still take "retry.backoff.max.ms
> >> ".
> >>>>>>>
> >>>>>>> So if the user does override the "retry.backoff.ms" to a value
> >>> larger
> >>>>>> than
> >>>>>>> 1s and is not aware of the new config, she would be surprised to
> >>> see
> >>>>> the
> >>>>>>> specified value seemingly not being respected, but she could
> >> still
> >>>>> learn
> >>>>>>> that afterwards by reading the release notes introducing this KIP
> >>>>>> anyways.
> >>>>>>>
> >>>>>>>
> >>>>>>> Guozhang
> >>>>>>>
> >>>>>>> On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <
> >>> ja...@confluent.io>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Sanjana,
> >>>>>>>>
> >>>>>>>> The KIP looks good to me. I had just one question about the
> >>> default
> >>>>>>>> behavior. As I understand, if the user has specified `
> >>>>> retry.backoff.ms
> >>>>>> `
> >>>>>>>> explicitly, then we will not apply the default max backoff. As
> >>> such,
> >>>>>>>> there's no way to get the benefit of this feature if you are
> >>>>> providing
> >>>>>> a
> >>>>>>> `
> >>>>>>>> retry.backoff.ms` unless you also provide `
> >> retry.backoff.max.ms
> >>> `.
> >>>>> That
> >>>>>>>> makes sense if you assume the user is unaware of the new
> >>>>> configuration,
> >>>>>>> but
> >>>>>>>> it is surprising otherwise. Since it's not a semantic change
> >> and
> >>>>> since
> >>>>>>> the
> >>>>>>>> default you're proposing of 1s is fairly low already, I wonder
> >> if
> >>>>> it's
> >>>>>>> good
> >>>>>>>> enough to mention the new configuration in the release notes
> >> and
> >>> not
> >>>>>> add
> >>>>>>>> any special logic. What do you think?
> >>>>>>>>
> >>>>>>>> -Jason
> >>>>>>>>
> >>>>>>>> On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <
> >>>>>> skaundi...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Thank you for the comments Guozhang.
> >>>>>>>>>
> >>>>>>>>> I’ll leave this KIP out for discussion till the end of the
> >>> week and
> >>>>>>> then
> >>>>>>>>> start a vote for this early next week.
> >>>>>>>>>
> >>>>>>>>> Sanjana
> >>>>>>>>>
> >>>>>>>>> On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <
> >>> wangg...@gmail.com
> >>>>>> ,
> >>>>>>>> wrote:
> >>>>>>>>>> Hello Sanjana,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for the proposed KIP, I think that makes a lot of
> >>> sense --
> >>>>>> as
> >>>>>>>> you
> >>>>>>>>>> mentioned in the motivation, we've indeed seen many issues
> >>> with
> >>>>>>> regard
> >>>>>>>> to
> >>>>>>>>>> the frequent retries, with bounded exponential backoff in
> >> the
> >>>>>>> scenario
> >>>>>>>>>> where there's a long connectivity issue we would
> >> effectively
> >>>>> reduce
> >>>>>>> the
> >>>>>>>>>> request load by 10 given the default configs.
> >>>>>>>>>>
> >>>>>>>>>> For higher-level Streams client and Connect frameworks,
> >>> today we
> >>>>>> also
> >>>>>>>>> have
> >>>>>>>>>> a retry logic but that's used in a slightly different way.
> >>> For
> >>>>>>> example
> >>>>>>>> in
> >>>>>>>>>> Streams, we tend to handle the retry logic at the
> >>> thread-level
> >>>>> and
> >>>>>>>> hence
> >>>>>>>>>> very likely we'd like to change that mechanism in KIP-572
> >>>>> anyways.
> >>>>>>> For
> >>>>>>>>>> producer / consumer / admin clients, I think just applying
> >>> this
> >>>>>>>>> behavioral
> >>>>>>>>>> change across these clients makes lot of sense. So I think
> >>> can
> >>>>> just
> >>>>>>>> leave
> >>>>>>>>>> the Streams / Connect out of the scope of this KIP to be
> >>>>> addressed
> >>>>>> in
> >>>>>>>>>> separate discussions.
> >>>>>>>>>>
> >>>>>>>>>> I do not have further comments about this KIP :) LGTM.
> >>>>>>>>>>
> >>>>>>>>>> Guozhang
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <
> >>>>>>>> skaundi...@gmail.com
> >>>>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Thanks for the feedback Boyang.
> >>>>>>>>>>>
> >>>>>>>>>>> If there’s anyone else who has feedback regarding this
> >> KIP,
> >>>>> would
> >>>>>>>>> really
> >>>>>>>>>>> appreciate it hearing it!
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Sanjana
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <
> >>>>>> bche...@outlook.com>
> >>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Sounds great!
> >>>>>>>>>>>>
> >>>>>>>>>>>> Get Outlook for iOS<https://aka.ms/o0ukef>
> >>>>>>>>>>>> ________________________________
> >>>>>>>>>>>> From: Sanjana Kaundinya <skaundi...@gmail.com>
> >>>>>>>>>>>> Sent: Tuesday, March 17, 2020 5:54:35 PM
> >>>>>>>>>>>> To: dev@kafka.apache.org <dev@kafka.apache.org>
> >>>>>>>>>>>> Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for
> >>> Kafka
> >>>>>>>> Clients
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks for the explanation Boyang. One of the most
> >> common
> >>>>>>> problems
> >>>>>>>>> that
> >>>>>>>>>>> we
> >>>>>>>>>>>> have in Kafka is with respect to metadata fetches. For
> >>> example,
> >>>>>>> if
> >>>>>>>>> there
> >>>>>>>>>>> is
> >>>>>>>>>>>> a broker failure, all clients start to fetch metadata
> >> at
> >>> the
> >>>>>> same
> >>>>>>>>> time
> >>>>>>>>>>> and
> >>>>>>>>>>>> it often takes a while for the metadata to converge.
> >> In a
> >>>>> high
> >>>>>>> load
> >>>>>>>>>>>> cluster, there are also issues where the volume of
> >>> metadata
> >>>>> has
> >>>>>>>> made
> >>>>>>>>>>>> convergence of metadata slower.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For this case, exponential backoff helps as it reduces
> >>> the
> >>>>>> retry
> >>>>>>>>> rate and
> >>>>>>>>>>>> spaces out how often clients will retry, thereby
> >> bringing
> >>>>> down
> >>>>>>> the
> >>>>>>>>> time
> >>>>>>>>>>> for
> >>>>>>>>>>>> convergence. Something that Jason mentioned that would
> >>> be a
> >>>>>> great
> >>>>>>>>>>> addition
> >>>>>>>>>>>> here would be if the backoff should be “jittered” as it
> >>> was
> >>>>> in
> >>>>>>>>> KIP-144
> >>>>>>>>>>> with
> >>>>>>>>>>>> respect to exponential reconnect backoff. This would
> >> help
> >>>>>> prevent
> >>>>>>>> the
> >>>>>>>>>>>> clients from being synchronized on when they retry,
> >>> thereby
> >>>>>>> spacing
> >>>>>>>>> out
> >>>>>>>>>>> the
> >>>>>>>>>>>> number of requests being sent to the broker at the same
> >>> time.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I’ll add this example to the KIP and flush out more of
> >>> the
> >>>>>>> details
> >>>>>>>> -
> >>>>>>>>> so
> >>>>>>>>>>>> it’s more clear.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <
> >>>>>>>>> reluctanthero...@gmail.com
> >>>>>>>>>>>> ,
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>> Thanks for the reply Sanjana. I guess I would like to
> >>>>>> rephrase
> >>>>>>> my
> >>>>>>>>>>>> question
> >>>>>>>>>>>>> 2 and 3 as my previous response is a little bit
> >>>>> unactionable.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> My specific point is that exponential backoff is not
> >> a
> >>>>> silver
> >>>>>>>>> bullet
> >>>>>>>>>>> and
> >>>>>>>>>>>> we
> >>>>>>>>>>>>> should consider using it to solve known problems,
> >>> instead
> >>>>> of
> >>>>>>>>> making the
> >>>>>>>>>>>>> holistic changes to all clients in Kafka ecosystem.
I > >>> do > >>>>> like > >>>>>>> the > >>>>>>>>>>>>> exponential backoff idea and believe this would be of > >>> great > >>>>>>>> value, > >>>>>>>>> but > >>>>>>>>>>>>> maybe we should focus on proposing some existing > >>> modules > >>>>> that > >>>>>>> are > >>>>>>>>>>>> suffering > >>>>>>>>>>>>> from static retry, and only change them in this first > >>> KIP. > >>>>> If > >>>>>>> in > >>>>>>>>> the > >>>>>>>>>>>>> future, some other component users believe they are > >>> also > >>>>>>>>> suffering, we > >>>>>>>>>>>>> could get more minor KIPs to change the behavior as > >>> well. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Boyang > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya < > >>>>>>>>>>> skaundi...@gmail.com > >>>>>>>>>>>>> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks for the feedback Boyang, I will revise the > >> KIP > >>>>> with > >>>>>>> the > >>>>>>>>>>>>>> mathematical relations as per your suggestion. To > >>> address > >>>>>>> your > >>>>>>>>>>>> feedback: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 1. Currently, with the default of 100 ms per retry > >>>>> backoff, > >>>>>>> in > >>>>>>>> 1 > >>>>>>>>>>> second > >>>>>>>>>>>>>> we would have 10 retries. In the case of using an > >>>>>> exponential > >>>>>>>>>>> backoff, > >>>>>>>>>>>> we > >>>>>>>>>>>>>> would have a total of 4 retries in 1 second. Thus > >> we > >>> have > >>>>>>> less > >>>>>>>>> than > >>>>>>>>>>>> half of > >>>>>>>>>>>>>> the amount of retries in the same timeframe and can > >>>>> lessen > >>>>>>>> broker > >>>>>>>>>>>> pressure. 
> >>>>>>>>>>>>>> This calculation is done as following (using the
> >>> formula
> >>>>>> laid
> >>>>>>> out in
> >>>>>>>>> the
> >>>>>>>>>>>> KIP:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Try 1 at time 0 ms, failures = 0, next retry in 100
> >>> ms
> >>>>>>> (default
> >>>>>>>>> retry
> >>>>>>>>>>>> ms
> >>>>>>>>>>>>>> is initially 100 ms)
> >>>>>>>>>>>>>> Try 2 at time 100 ms, failures = 1, next retry in
> >>> 200 ms
> >>>>>>>>>>>>>> Try 3 at time 300 ms, failures = 2, next retry in
> >>> 400 ms
> >>>>>>>>>>>>>> Try 4 at time 700 ms, failures = 3, next retry in
> >>> 800 ms
> >>>>>>>>>>>>>> Try 5 at time 1500 ms, failures = 4, next retry in
> >>> 1000
> >>>>> ms
> >>>>>>>>> (default
> >>>>>>>>>>> max
> >>>>>>>>>>>>>> retry ms is 1000 ms)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> For 2 and 3, could you elaborate more about what
> >> you
> >>> mean
> >>>>>>> with
> >>>>>>>>>>> respect
> >>>>>>>>>>>> to
> >>>>>>>>>>>>>> client timeouts? I’m not very familiar with the
> >>> Streams
> >>>>>>>>> framework, so
> >>>>>>>>>>>> would
> >>>>>>>>>>>>>> love to get more insight to how that currently
> >> works,
> >>>>> with
> >>>>>>>>> respect to
> >>>>>>>>>>>>>> producer transactions, so I can appropriately
> >> update
> >>> the
> >>>>>> KIP
> >>>>>>> to
> >>>>>>>>>>> address
> >>>>>>>>>>>>>> these scenarios. On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> >>>>>>>>>>>> reluctanthero...@gmail.com>,
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>> Thanks for the KIP Sanjana. I think the
> >> motivation
> >>> is
> >>>>>> good,
> >>>>>>>> but
> >>>>>>>>>>> lack
> >>>>>>>>>>>> of
> >>>>>>>>>>>>>>> more quantitative analysis. For instance:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 1. How much retries we are saving by applying the
> >>>>>>> exponential
> >>>>>>>>> retry
> >>>>>>>>>>>> vs
> >>>>>>>>>>>>>>> static retry?
There should be some mathematical
> >>>>> relations
> >>>>>>> between
> >>>>>>>>> the
> >>>>>>>>>>>>>>> static retry ms, the initial exponential retry
> >> ms,
> >>> the
> >>>>>> max
> >>>>>>>>>>>> exponential
> >>>>>>>>>>>>>>> retry ms in a given time interval.
> >>>>>>>>>>>>>>> 2. How does this affect the client timeout? With
> >>>>>>> exponential
> >>>>>>>>> retry,
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>> client shall be getting easier to timeout on a
> >>> parent
> >>>>>> level
> >>>>>>>>> caller,
> >>>>>>>>>>>> for
> >>>>>>>>>>>>>>> instance stream attempts to retry initializing
> >>> producer
> >>>>>>>>>>> transactions
> >>>>>>>>>>>> with
> >>>>>>>>>>>>>>> given 5 minute interval. With exponential retry
> >>> this
> >>>>>>>> mechanism
> >>>>>>>>>>> could
> >>>>>>>>>>>>>>> experience more frequent timeout which we should
> >> be
> >>>>>> careful
> >>>>>>>>> with.
> >>>>>>>>>>>>>>> 3. With regards to #2, we should have more
> >> detailed
> >>>>>>> checklist
> >>>>>>>>> of
> >>>>>>>>>>> all
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>> existing static retry scenarios, and adjust the
> >>> initial
> >>>>>>>>> exponential
> >>>>>>>>>>>> retry
> >>>>>>>>>>>>>>> ms to make sure we won't get easily timeout in
> >> high
> >>>>> level
> >>>>>>> due
> >>>>>>>>> to
> >>>>>>>>>>> too
> >>>>>>>>>>>> few
> >>>>>>>>>>>>>>> attempts.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Boyang
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Fri, Mar 13, 2020 at 4:38 PM Sanjana
> >> Kaundinya <
> >>>>>>>>>>>> skaundi...@gmail.com>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Everyone,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I’ve written a KIP about introducing
> >> exponential
> >>>>>> backoff
> >>>>>>>> for
> >>>>>>>>>>> Kafka
> >>>>>>>>>>>>>>>> clients. Would appreciate any feedback on this.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>> Sanjana
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> -- Guozhang
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> -- Guozhang
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> -- Guozhang
> >>>
> >>
> >>
> >> --
> >> -- Guozhang
> >>
> >
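For completeness, Sanjana's worked schedule quoted above can be reproduced with a deterministic sketch of the same formula (the jitter factor is dropped here, and the helper name is mine, not from the KIP):

```python
def retry_schedule(tries, retry_backoff_ms=100, retry_backoff_max_ms=1000):
    # Deterministic form of the bounded exponential backoff: the wait after
    # the n-th failure is min(retry.backoff.max.ms, retry.backoff.ms * 2**(n - 1)).
    schedule, t = [], 0
    for failures in range(tries):
        schedule.append(t)  # time at which this try happens, in ms
        t += min(retry_backoff_max_ms, retry_backoff_ms * 2 ** failures)
    return schedule

# Matches the worked example: tries at 0, 100, 300, 700 and 1500 ms, i.e.
# 4 tries in the first second versus 10 with the old static 100 ms backoff.
```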