Following the formula you have in the KIP, if it is simply:

MIN(retry.backoff.max.ms, (retry.backoff.ms * 2**(failures - 1)) * random(0.8, 1.2))

then the behavior would stay consistent at retry.backoff.max.ms.

Guozhang

On Thu, Mar 19, 2020 at 5:46 PM Sanjana Kaundinya <skaundi...@gmail.com> wrote:

> If that’s the case then what should we base the starting point as?
> Currently in the KIP the starting point is retry.backoff.ms and it
> exponentially goes up to retry.backoff.max.ms. If retry.backoff.max.ms is
> smaller than retry.backoff.ms then that could pose a bit of a problem
> there right?
>
> On Mar 19, 2020, 5:44 PM -0700, Guozhang Wang <wangg...@gmail.com>, wrote:
> > Thanks Sanjana, I did not capture the part that Jason referred to, so
> > that's my bad :P
> >
> > Regarding your last statement, I actually feel that instead of take the
> > larger of the two, we should respect "retry.backoff.max.ms" even if it is
> > smaller than "retry.backoff.ms". I do not have a very strong rationale
> > except it is logically more aligned to the config names.
> >
> > Guozhang
> >
> > On Thu, Mar 19, 2020 at 5:39 PM Sanjana Kaundinya <skaundi...@gmail.com> wrote:
> >
> > > Hey Jason and Guozhang,
> > >
> > > Jason is right, I took this inspiration from KIP-144
> > > (https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts)
> > > which had the same logic in order to preserve the existing behavior. In
> > > this case however, if we are thinking to completely eliminate the static
> > > backoff behavior, we can do that and as Jason mentioned put it in the
> > > release notes and not add any special logic. In addition I agree that we
> > > should take the larger of the two of `retry.backoff.ms` and
> > > `retry.backoff.max.ms`. I'll update the KIP to reflect this and make it
> > > clear that the old static retry backoff is getting replaced by the new
> > > dynamic retry backoff.
> > >
> > > Thanks,
> > > Sanjana
> > >
> > > On Thu, Mar 19, 2020 at 4:23 PM Jason Gustafson <ja...@confluent.io> wrote:
> > >
> > > > Hey Guozhang,
> > > >
> > > > I was referring to this:
> > > >
> > > > > For users who have not set retry.backoff.ms explicitly, the default
> > > > > behavior will change so that the backoff will grow up to 1000 ms. For
> > > > > users who have set retry.backoff.ms explicitly, the behavior will
> > > > > remain the same as they could have specific requirements.
> > > >
> > > > I took this to mean that for users who have overridden `retry.backoff.ms`
> > > > to 50ms (say), we will change the default `retry.backoff.max.ms` to 50ms
> > > > as well in order to preserve existing backoff behavior. Is that not right?
> > > > In any case, I agree that we can use the maximum of the two values as the
> > > > effective `retry.backoff.max.ms` to handle the case when the configured
> > > > value of `retry.backoff.ms` is larger than the default of 1s.
> > > >
> > > > -Jason
> > > >
> > > > On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <wangg...@gmail.com> wrote:
> > > >
> > > > > Hey Jason,
> > > > >
> > > > > My understanding is a bit different here: even if user has an explicit
> > > > > overridden "retry.backoff.ms", the exponential mechanism still triggers
> > > > > and the backoff would be increased till "retry.backoff.max.ms"; and if
> > > > > the specified "retry.backoff.ms" is already larger than the
> > > > > "retry.backoff.max.ms", we would still take "retry.backoff.max.ms".
> > > > >
> > > > > So if the user does override the "retry.backoff.ms" to a value larger
> > > > > than 1s and is not aware of the new config, she would be surprised to see
> > > > > the specified value seemingly not being respected, but she could still
> > > > > learn that afterwards by reading the release notes introducing this KIP
> > > > > anyways.
> > > > >
> > > > > Guozhang
> > > > >
> > > > > On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <ja...@confluent.io> wrote:
> > > > >
> > > > > > Hi Sanjana,
> > > > > >
> > > > > > The KIP looks good to me. I had just one question about the default
> > > > > > behavior. As I understand, if the user has specified `retry.backoff.ms`
> > > > > > explicitly, then we will not apply the default max backoff. As such,
> > > > > > there's no way to get the benefit of this feature if you are providing a
> > > > > > `retry.backoff.ms` unless you also provide `retry.backoff.max.ms`. That
> > > > > > makes sense if you assume the user is unaware of the new configuration,
> > > > > > but it is surprising otherwise. Since it's not a semantic change and
> > > > > > since the default you're proposing of 1s is fairly low already, I wonder
> > > > > > if it's good enough to mention the new configuration in the release notes
> > > > > > and not add any special logic. What do you think?
> > > > > >
> > > > > > -Jason
> > > > > >
> > > > > > On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <skaundi...@gmail.com> wrote:
> > > > > >
> > > > > > > Thank you for the comments Guozhang.
> > > > > > >
> > > > > > > I’ll leave this KIP out for discussion till the end of the week and
> > > > > > > then start a vote for this early next week.
> > > > > > >
> > > > > > > Sanjana
> > > > > > >
> > > > > > > On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <wangg...@gmail.com>, wrote:
> > > > > > > > Hello Sanjana,
> > > > > > > >
> > > > > > > > Thanks for the proposed KIP, I think that makes a lot of sense -- as you
> > > > > > > > mentioned in the motivation, we've indeed seen many issues with regard to
> > > > > > > > the frequent retries, with bounded exponential backoff in the scenario
> > > > > > > > where there's a long connectivity issue we would effectively reduce the
> > > > > > > > request load by 10 given the default configs.
> > > > > > > >
> > > > > > > > For higher-level Streams client and Connect frameworks, today we also have
> > > > > > > > a retry logic but that's used in a slightly different way. For example in
> > > > > > > > Streams, we tend to handle the retry logic at the thread-level and hence
> > > > > > > > very likely we'd like to change that mechanism in KIP-572 anyways. For
> > > > > > > > producer / consumer / admin clients, I think just applying this behavioral
> > > > > > > > change across these clients makes lot of sense. So I think can just leave
> > > > > > > > the Streams / Connect out of the scope of this KIP to be addressed in
> > > > > > > > separate discussions.
> > > > > > > >
> > > > > > > > I do not have further comments about this KIP :) LGTM.
> > > > > > > >
> > > > > > > > Guozhang
> > > > > > > >
> > > > > > > > On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <skaundi...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Thanks for the feedback Boyang.
> > > > > > > > >
> > > > > > > > > If there’s anyone else who has feedback regarding this KIP, would really
> > > > > > > > > appreciate it hearing it!
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Sanjana
> > > > > > > > >
> > > > > > > > > On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <bche...@outlook.com> wrote:
> > > > > > > > >
> > > > > > > > > > Sounds great!
> > > > > > > > > >
> > > > > > > > > > ________________________________
> > > > > > > > > > From: Sanjana Kaundinya <skaundi...@gmail.com>
> > > > > > > > > > Sent: Tuesday, March 17, 2020 5:54:35 PM
> > > > > > > > > > To: dev@kafka.apache.org <dev@kafka.apache.org>
> > > > > > > > > > Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients
> > > > > > > > > >
> > > > > > > > > > Thanks for the explanation Boyang. One of the most common problems that we
> > > > > > > > > > have in Kafka is with respect to metadata fetches. For example, if there is
> > > > > > > > > > a broker failure, all clients start to fetch metadata at the same time and
> > > > > > > > > > it often takes a while for the metadata to converge. In a high load
> > > > > > > > > > cluster, there are also issues where the volume of metadata has made
> > > > > > > > > > convergence of metadata slower.
> > > > > > > > > >
> > > > > > > > > > For this case, exponential backoff helps as it reduces the retry rate and
> > > > > > > > > > spaces out how often clients will retry, thereby bringing down the time for
> > > > > > > > > > convergence.
> > > > > > > > > > Something that Jason mentioned that would be a great addition
> > > > > > > > > > here would be if the backoff should be “jittered” as it was in KIP-144 with
> > > > > > > > > > respect to exponential reconnect backoff. This would help prevent the
> > > > > > > > > > clients from being synchronized on when they retry, thereby spacing out the
> > > > > > > > > > number of requests being sent to the broker at the same time.
> > > > > > > > > >
> > > > > > > > > > I’ll add this example to the KIP and flush out more of the details - so
> > > > > > > > > > it’s more clear.
> > > > > > > > > >
> > > > > > > > > > On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <reluctanthero...@gmail.com>, wrote:
> > > > > > > > > > > Thanks for the reply Sanjana. I guess I would like to rephrase my question
> > > > > > > > > > > 2 and 3 as my previous response is a little bit unactionable.
> > > > > > > > > > >
> > > > > > > > > > > My specific point is that exponential backoff is not a silver bullet and we
> > > > > > > > > > > should consider using it to solve known problems, instead of making the
> > > > > > > > > > > holistic changes to all clients in Kafka ecosystem. I do like the
> > > > > > > > > > > exponential backoff idea and believe this would be of great value, but
> > > > > > > > > > > maybe we should focus on proposing some existing modules that are suffering
> > > > > > > > > > > from static retry, and only change them in this first KIP.
> > > > > > > > > > > If in the future, some other component users believe they are also
> > > > > > > > > > > suffering, we could get more minor KIPs to change the behavior as well.
> > > > > > > > > > >
> > > > > > > > > > > Boyang
> > > > > > > > > > >
> > > > > > > > > > > On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <skaundi...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks for the feedback Boyang, I will revise the KIP with the
> > > > > > > > > > > > mathematical relations as per your suggestion. To address your feedback:
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Currently, with the default of 100 ms per retry backoff, in 1 second
> > > > > > > > > > > > we would have 10 retries. In the case of using an exponential backoff, we
> > > > > > > > > > > > would have a total of 4 retries in 1 second. Thus we have less than half of
> > > > > > > > > > > > the amount of retries in the same timeframe and can lessen broker pressure.
> > > > > > > > > > > > This calculation is done as following (using the formula laid out in the
> > > > > > > > > > > > KIP):
> > > > > > > > > > > >
> > > > > > > > > > > > Try 1 at time 0 ms, failures = 0, next retry in 100 ms (default retry ms
> > > > > > > > > > > > is initially 100 ms)
> > > > > > > > > > > > Try 2 at time 100 ms, failures = 1, next retry in 200 ms
> > > > > > > > > > > > Try 3 at time 300 ms, failures = 2, next retry in 400 ms
> > > > > > > > > > > > Try 4 at time 700 ms, failures = 3, next retry in 800 ms
> > > > > > > > > > > > Try 5 at time 1500 ms, failures = 4, next retry in 1000 ms (default max
> > > > > > > > > > > > retry ms is 1000 ms)
> > > > > > > > > > > >
> > > > > > > > > > > > For 2 and 3, could you elaborate more about what you mean with respect to
> > > > > > > > > > > > client timeouts? I’m not very familiar with the Streams framework, so would
> > > > > > > > > > > > love to get more insight to how that currently works, with respect to
> > > > > > > > > > > > producer transactions, so I can appropriately update the KIP to address
> > > > > > > > > > > > these scenarios.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <reluctanthero...@gmail.com>, wrote:
> > > > > > > > > > > > > Thanks for the KIP Sanjana. I think the motivation is good, but lack of
> > > > > > > > > > > > > more quantitative analysis. For instance:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. How much retries we are saving by applying the exponential retry vs
> > > > > > > > > > > > > static retry?
> > > > > > > > > > > > > There should be some mathematical relations between the
> > > > > > > > > > > > > static retry ms, the initial exponential retry ms, the max exponential
> > > > > > > > > > > > > retry ms in a given time interval.
> > > > > > > > > > > > > 2. How does this affect the client timeout? With exponential retry, the
> > > > > > > > > > > > > client shall be getting easier to timeout on a parent level caller, for
> > > > > > > > > > > > > instance stream attempts to retry initializing producer transactions with
> > > > > > > > > > > > > given 5 minute interval. With exponential retry this mechanism could
> > > > > > > > > > > > > experience more frequent timeout which we should be careful with.
> > > > > > > > > > > > > 3. With regards to #2, we should have more detailed checklist of all the
> > > > > > > > > > > > > existing static retry scenarios, and adjust the initial exponential retry
> > > > > > > > > > > > > ms to make sure we won't get easily timeout in high level due to too few
> > > > > > > > > > > > > attempts.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Boyang
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Mar 13, 2020 at 4:38 PM Sanjana Kaundinya <skaundi...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Everyone,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I’ve written a KIP about introducing exponential backoff for Kafka
> > > > > > > > > > > > > > clients. Would appreciate any feedback on this.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Sanjana
> > > > > > > >
> > > > > > > > --
> > > > > > > > -- Guozhang
> > > > >
> > > > > --
> > > > > -- Guozhang
> >
> > --
> > -- Guozhang
>
--
-- Guozhang
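
For readers following the arithmetic in this thread, here is a minimal Python sketch of the backoff formula quoted at the top and of the retry timeline Sanjana worked through. It is an illustration only, not the Kafka client implementation. The names `backoff_ms`, `attempt_times_ms`, `base_ms` and `max_ms` are made up for this example; `base_ms` stands in for retry.backoff.ms and `max_ms` for retry.backoff.max.ms. It also assumes `failures` counts the failed attempts so far (1 after the first failure), which is the indexing that reproduces the 100, 200, 400, 800, 1000 ms waits in the timeline.

```python
import random


def backoff_ms(failures, base_ms=100, max_ms=1000, jitter=True):
    """MIN(max_ms, base_ms * 2**(failures - 1) * random(0.8, 1.2)).

    Assumption: failures is the number of failed attempts so far,
    so it is 1 when computing the wait after the first failure.
    """
    exponential = base_ms * 2 ** (failures - 1)
    # Jitter spreads clients out so they do not all retry in lockstep.
    factor = random.uniform(0.8, 1.2) if jitter else 1.0
    return min(max_ms, exponential * factor)


def attempt_times_ms(attempts, base_ms=100, max_ms=1000):
    """Times (ms) at which each attempt is made, with jitter disabled."""
    times, now = [], 0
    for n in range(1, attempts + 1):
        times.append(now)
        # Attempt n has just failed, so n failures have occurred so far.
        now += backoff_ms(n, base_ms, max_ms, jitter=False)
    return times


if __name__ == "__main__":
    print(attempt_times_ms(5))
    # If max_ms is configured below base_ms, the cap always wins,
    # even at the low end of the jitter range.
    print(backoff_ms(1, base_ms=100, max_ms=50))
```

With the defaults of 100 ms and 1000 ms, four attempts fall inside the first second, against ten for a static 100 ms backoff. When the maximum is set below the initial backoff, every wait equals the maximum, which is the "consistent at retry.backoff.max.ms" behavior described at the top of the thread.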