Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Guozhang Wang Thu, 19 Mar 2020 17:49:24 -0700

Following the formula you have in the KIP, if it is simply:

MIN(retry.backoff.max.ms, (retry.backoff.ms * 2**(failures - 1)) * random(0.8, 1.2))


then the behavior would stay consistent at retry.backoff.max.ms.
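
A minimal sketch of the formula above in Python, for illustration only; it
is not the KIP's implementation. The function name backoff_ms, the parameter
names base_ms and max_ms (standing in for retry.backoff.ms and
retry.backoff.max.ms), and the injectable jitter argument are assumptions
made here. It takes the formula literally, so failures starts at 1:

```python
import random


def backoff_ms(failures, base_ms=100.0, max_ms=1000.0, jitter=None):
    """Backoff in ms after `failures` consecutive failures (failures >= 1).

    base_ms and max_ms are hypothetical stand-ins for retry.backoff.ms and
    retry.backoff.max.ms. `jitter` can be fixed for reproducible output;
    by default it is drawn uniformly from [0.8, 1.2].
    """
    if failures < 1:
        raise ValueError("failures must be >= 1")
    if jitter is None:
        jitter = random.uniform(0.8, 1.2)
    # The cap is applied last, so max_ms wins even when it is below base_ms.
    return min(max_ms, base_ms * 2 ** (failures - 1) * jitter)


if __name__ == "__main__":
    # Jitter fixed at 1.0 so the schedule is deterministic.
    for n in range(1, 6):
        print(n, backoff_ms(n, jitter=1.0))
    # retry.backoff.ms larger than retry.backoff.max.ms: the cap is returned.
    print(backoff_ms(1, base_ms=5000.0, max_ms=1000.0))
```

With the jitter fixed at 1.0 and the defaults, failures 1 through 5 give 100,
200, 400, 800 and 1000 ms. With base_ms=5000 and max_ms=1000, every call
returns 1000, because even the lowest jittered value (4000) exceeds the cap.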


Guozhang

On Thu, Mar 19, 2020 at 5:46 PM Sanjana Kaundinya <skaundi...@gmail.com>
wrote:

> If that’s the case then what should we use as the starting point?
> Currently in the KIP the starting point is retry.backoff.ms and it
> exponentially goes up to retry.backoff.max.ms. If retry.backoff.max.ms is
> smaller than retry.backoff.ms then that could pose a bit of a problem
> there, right?
>
> On Mar 19, 2020, 5:44 PM -0700, Guozhang Wang <wangg...@gmail.com>, wrote:
> > Thanks Sanjana, I did not capture the part that Jason referred to, so
> > that's my bad :P
> >
> > Regarding your last statement, I actually feel that instead of taking the
> > larger of the two, we should respect "retry.backoff.max.ms" even if it is
> > smaller than "retry.backoff.ms". I do not have a very strong rationale
> > except that it is logically more aligned with the config names.
> >
> >
> > Guozhang
> >
> >
> > On Thu, Mar 19, 2020 at 5:39 PM Sanjana Kaundinya <skaundi...@gmail.com>
> > wrote:
> >
> > > Hey Jason and Guozhang,
> > >
> > > Jason is right, I took this inspiration from KIP-144 (
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts
> > > ) which had the same logic in order to preserve the existing behavior. In
> > > this case however, if we are thinking of completely eliminating the static
> > > backoff behavior, we can do that and, as Jason mentioned, put it in the
> > > release notes and not add any special logic. In addition I agree that we
> > > should take the larger of the two of `retry.backoff.ms` and
> > > `retry.backoff.max.ms`. I'll update the KIP to reflect this and make it
> > > clear that the old static retry backoff is getting replaced by the new
> > > dynamic retry backoff.
> > >
> > > Thanks,
> > > Sanjana
> > > On Thu, Mar 19, 2020 at 4:23 PM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hey Guozhang,
> > > >
> > > > I was referring to this:
> > > >
> > > > > For users who have not set retry.backoff.ms explicitly, the default
> > > > > behavior will change so that the backoff will grow up to 1000 ms. For
> > > > > users who have set retry.backoff.ms explicitly, the behavior will
> > > > > remain the same as they could have specific requirements.
> > > >
> > > > I took this to mean that for users who have overridden
> > > > `retry.backoff.ms` to 50ms (say), we will change the default
> > > > `retry.backoff.max.ms` to 50ms as well in order to preserve existing
> > > > backoff behavior. Is that not right? In any case, I agree that we can
> > > > use the maximum of the two values as the effective
> > > > `retry.backoff.max.ms` to handle the case when the configured value of
> > > > `retry.backoff.ms` is larger than the default of 1s.
> > > >
> > > > -Jason
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <wangg...@gmail.com>
> > > wrote:
> > > >
> > > > > Hey Jason,
> > > > >
> > > > > My understanding is a bit different here: even if the user has
> > > > > explicitly overridden "retry.backoff.ms", the exponential mechanism
> > > > > still triggers and the backoff would be increased till
> > > > > "retry.backoff.max.ms"; and if the specified "retry.backoff.ms" is
> > > > > already larger than "retry.backoff.max.ms", we would still take
> > > > > "retry.backoff.max.ms".
> > > > >
> > > > > So if the user does override "retry.backoff.ms" to a value larger
> > > > > than 1s and is not aware of the new config, she would be surprised to
> > > > > see the specified value seemingly not being respected, but she could
> > > > > still learn that afterwards by reading the release notes introducing
> > > > > this KIP anyways.
> > > > >
> > > > >
> > > > > Guozhang
> > > > >
> > > > > On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <
> ja...@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > Hi Sanjana,
> > > > > >
> > > > > > The KIP looks good to me. I had just one question about the default
> > > > > > behavior. As I understand, if the user has specified
> > > > > > `retry.backoff.ms` explicitly, then we will not apply the default
> > > > > > max backoff. As such, there's no way to get the benefit of this
> > > > > > feature if you are providing a `retry.backoff.ms` unless you also
> > > > > > provide `retry.backoff.max.ms`. That makes sense if you assume the
> > > > > > user is unaware of the new configuration, but it is surprising
> > > > > > otherwise. Since it's not a semantic change and since the default
> > > > > > you're proposing of 1s is fairly low already, I wonder if it's good
> > > > > > enough to mention the new configuration in the release notes and
> > > > > > not add any special logic. What do you think?
> > > > > >
> > > > > > -Jason
> > > > > >
> > > > > > On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <
> > > > skaundi...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Thank you for the comments Guozhang.
> > > > > > >
> > > > > > > I’ll leave this KIP out for discussion till the end of the week
> > > > > > > and then start a vote for this early next week.
> > > > > > >
> > > > > > > Sanjana
> > > > > > >
> > > > > > > On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <
> wangg...@gmail.com
> > > > ,
> > > > > > wrote:
> > > > > > > > Hello Sanjana,
> > > > > > > >
> > > > > > > > > Thanks for the proposed KIP, I think that makes a lot of sense --
> > > > > > > > > as you mentioned in the motivation, we've indeed seen many issues
> > > > > > > > > with regard to the frequent retries. With bounded exponential
> > > > > > > > > backoff, in the scenario where there's a long connectivity issue,
> > > > > > > > > we would effectively reduce the request load by a factor of 10
> > > > > > > > > given the default configs.
> > > > > > > > >
> > > > > > > > > For higher-level Streams client and Connect frameworks, today we
> > > > > > > > > also have a retry logic but that's used in a slightly different
> > > > > > > > > way. For example, in Streams we tend to handle the retry logic at
> > > > > > > > > the thread level, and hence very likely we'd like to change that
> > > > > > > > > mechanism in KIP-572 anyways. For producer / consumer / admin
> > > > > > > > > clients, I think just applying this behavioral change across
> > > > > > > > > these clients makes a lot of sense. So I think we can just leave
> > > > > > > > > Streams / Connect out of the scope of this KIP to be addressed in
> > > > > > > > > separate discussions.
> > > > > > > >
> > > > > > > > I do not have further comments about this KIP :) LGTM.
> > > > > > > >
> > > > > > > > Guozhang
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <
> > > > > > skaundi...@gmail.com
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks for the feedback Boyang.
> > > > > > > > >
> > > > > > > > > > If there’s anyone else who has feedback regarding this KIP, I
> > > > > > > > > > would really appreciate hearing it!
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Sanjana
> > > > > > > > >
> > > > > > > > > On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <
> > > > bche...@outlook.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Sounds great!
> > > > > > > > > >
> > > > > > > > > > ________________________________
> > > > > > > > > > From: Sanjana Kaundinya <skaundi...@gmail.com>
> > > > > > > > > > Sent: Tuesday, March 17, 2020 5:54:35 PM
> > > > > > > > > > To: dev@kafka.apache.org <dev@kafka.apache.org>
> > > > > > > > > > > Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients
> > > > > > > > > >
> > > > > > > > > > > Thanks for the explanation Boyang. One of the most common
> > > > > > > > > > > problems that we have in Kafka is with respect to metadata
> > > > > > > > > > > fetches. For example, if there is a broker failure, all clients
> > > > > > > > > > > start to fetch metadata at the same time and it often takes a
> > > > > > > > > > > while for the metadata to converge. In a high load cluster,
> > > > > > > > > > > there are also issues where the volume of metadata has made
> > > > > > > > > > > convergence of metadata slower.
> > > > > > > > > > >
> > > > > > > > > > > For this case, exponential backoff helps as it reduces the
> > > > > > > > > > > retry rate and spaces out how often clients will retry, thereby
> > > > > > > > > > > bringing down the time for convergence. Something that Jason
> > > > > > > > > > > mentioned that would be a great addition here is for the
> > > > > > > > > > > backoff to be “jittered”, as it was in KIP-144 with respect to
> > > > > > > > > > > exponential reconnect backoff. This would help prevent the
> > > > > > > > > > > clients from being synchronized on when they retry, thereby
> > > > > > > > > > > spacing out the number of requests being sent to the broker at
> > > > > > > > > > > the same time.
> > > > > > > > > > >
> > > > > > > > > > > I’ll add this example to the KIP and flesh out more of the
> > > > > > > > > > > details so it’s clearer.
> > > > > > > > > >
> > > > > > > > > > On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <
> > > > > > > reluctanthero...@gmail.com
> > > > > > > > > > ,
> > > > > > > > > > wrote:
> > > > > > > > > > > > Thanks for the reply Sanjana. I guess I would like to rephrase
> > > > > > > > > > > > my questions 2 and 3, as my previous response was not very
> > > > > > > > > > > > actionable.
> > > > > > > > > > > >
> > > > > > > > > > > > My specific point is that exponential backoff is not a silver
> > > > > > > > > > > > bullet and we should consider using it to solve known
> > > > > > > > > > > > problems, instead of making holistic changes to all clients in
> > > > > > > > > > > > the Kafka ecosystem. I do like the exponential backoff idea
> > > > > > > > > > > > and believe this would be of great value, but maybe we should
> > > > > > > > > > > > focus on proposing some existing modules that are suffering
> > > > > > > > > > > > from static retry, and only change them in this first KIP. If
> > > > > > > > > > > > in the future some other component users believe they are
> > > > > > > > > > > > also suffering, we could get more minor KIPs to change the
> > > > > > > > > > > > behavior as well.
> > > > > > > > > > >
> > > > > > > > > > > Boyang
> > > > > > > > > > >
> > > > > > > > > > > On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
> > > > > > > > > skaundi...@gmail.com
> > > > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the feedback Boyang, I will revise the KIP with the
> > > > > > > > > > > > > mathematical relations as per your suggestion. To address your
> > > > > > > > > > > > > feedback:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. Currently, with the default of 100 ms per retry backoff, in
> > > > > > > > > > > > > 1 second we would have 10 retries. In the case of using an
> > > > > > > > > > > > > exponential backoff, we would have a total of 4 retries in 1
> > > > > > > > > > > > > second. Thus we have less than half the number of retries in
> > > > > > > > > > > > > the same timeframe and can lessen broker pressure. This
> > > > > > > > > > > > > calculation is done as follows (using the formula laid out in
> > > > > > > > > > > > > the KIP):
> > > > > > > > > > > > >
> > > > > > > > > > > > > Try 1 at time 0 ms, failures = 0, next retry in 100 ms
> > > > > > > > > > > > > (default retry ms is initially 100 ms)
> > > > > > > > > > > > > Try 2 at time 100 ms, failures = 1, next retry in 200 ms
> > > > > > > > > > > > > Try 3 at time 300 ms, failures = 2, next retry in 400 ms
> > > > > > > > > > > > > Try 4 at time 700 ms, failures = 3, next retry in 800 ms
> > > > > > > > > > > > > Try 5 at time 1500 ms, failures = 4, next retry in 1000 ms
> > > > > > > > > > > > > (default max retry ms is 1000 ms)
> > > > > > > > > > > > >
> > > > > > > > > > > > > For 2 and 3, could you elaborate more about what you mean with
> > > > > > > > > > > > > respect to client timeouts? I’m not very familiar with the
> > > > > > > > > > > > > Streams framework, so I would love to get more insight into
> > > > > > > > > > > > > how that currently works with respect to producer
> > > > > > > > > > > > > transactions, so I can appropriately update the KIP to address
> > > > > > > > > > > > > these scenarios.
> > > > > > > > > > > > On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> > > > > > > > > > reluctanthero...@gmail.com>,
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > Thanks for the KIP Sanjana. I think the motivation is good,
> > > > > > > > > > > > > > but it lacks quantitative analysis. For instance:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. How many retries are we saving by applying exponential
> > > > > > > > > > > > > > retry vs static retry? There should be some mathematical
> > > > > > > > > > > > > > relation between the static retry ms, the initial
> > > > > > > > > > > > > > exponential retry ms, and the max exponential retry ms in a
> > > > > > > > > > > > > > given time interval.
> > > > > > > > > > > > > > 2. How does this affect the client timeout? With exponential
> > > > > > > > > > > > > > retry, the client is more likely to time out in a
> > > > > > > > > > > > > > parent-level caller; for instance, Streams attempts to retry
> > > > > > > > > > > > > > initializing producer transactions with a given 5 minute
> > > > > > > > > > > > > > interval. With exponential retry this mechanism could
> > > > > > > > > > > > > > experience more frequent timeouts, which we should be
> > > > > > > > > > > > > > careful with.
> > > > > > > > > > > > > > 3. With regard to #2, we should have a more detailed
> > > > > > > > > > > > > > checklist of all the existing static retry scenarios, and
> > > > > > > > > > > > > > adjust the initial exponential retry ms to make sure we
> > > > > > > > > > > > > > won't easily time out at a higher level due to too few
> > > > > > > > > > > > > > attempts.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Boyang
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Mar 13, 2020 at 4:38 PM Sanjana Kaundinya <
> > > > > > > > > > skaundi...@gmail.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Everyone,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I’ve written a KIP about introducing exponential backoff for
> > > > > > > > > > > > > > > Kafka clients. Would appreciate any feedback on this.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Sanjana
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > -- Guozhang
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > > >
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
>


-- 
-- Guozhang
