Hi Konstantine,

Thanks for the insightful feedback. I’ll address it here as well as update the 
KIP accordingly.

I think it is important to call out the fact that we are leaving out Connect
and Streams in the proposed changes, so that they can be addressed in a future
KIP. As you pointed out, Kafka Connect does use ConsumerNetworkClient and
Metadata for its rebalancing protocol, so the switch to exponential backoff
would affect the WorkerGroupMember that uses these classes. Any Kafka client
that uses these classes would get exponential backoff instead of the current
static backoff.

That being said, although Kafka Connect will be affected through those two
classes, not all uses of the backoff config are being replaced here. As you
correctly stated, classes such as AbstractCoordinator, ConsumerCoordinator, and
the Heartbeat thread would keep using the static backoff behavior - no changes
will be made with respect to rebalancing.

With respect to Compatibility, I will add to that section the things I’ve
mentioned above: the effects on Kafka Connect, and the fact that nothing
related to the rebalance protocol changes. In addition, the reason
retry.backoff.max.ms shouldn’t default to the same value as retry.backoff.ms is
that a user who isn’t aware of this feature and doesn’t set it wouldn’t get
exponential backoff. Instead, it’s important that we provide this as a default
feature for all Kafka clients. Likewise, fixing retry.backoff.max.ms at 1000 ms
unconditionally wouldn’t give users the flexibility to tune their clients to
their environments.
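
To make the defaults concrete, here is a minimal sketch in Python (not the
actual client code, which is Java). It assumes the defaults discussed in this
thread (100 ms and 1000 ms) and follows Guozhang's suggestion of logging a
warning rather than rejecting the config. The helper name
resolve_backoff_config is hypothetical and only for illustration.

```python
import logging

log = logging.getLogger("backoff_config_sketch")

# Defaults discussed in this thread.
DEFAULT_RETRY_BACKOFF_MS = 100
DEFAULT_RETRY_BACKOFF_MAX_MS = 1000


def resolve_backoff_config(user_config):
    """Return (retry.backoff.ms, retry.backoff.max.ms) after applying defaults.

    Hypothetical helper for illustration only; it is not a Kafka API.
    """
    backoff_ms = user_config.get("retry.backoff.ms", DEFAULT_RETRY_BACKOFF_MS)
    max_ms = user_config.get("retry.backoff.max.ms", DEFAULT_RETRY_BACKOFF_MAX_MS)
    if backoff_ms > max_ms:
        # Warn rather than reject: the client can still run, but the backoff
        # cannot grow beyond max_ms.
        log.warning(
            "retry.backoff.ms (%d) > retry.backoff.max.ms (%d); "
            "backoff will be capped at retry.backoff.max.ms",
            backoff_ms, max_ms,
        )
    return backoff_ms, max_ms
```

A user who sets nothing gets (100, 1000), so exponential backoff applies by
default, while a user who only overrides retry.backoff.ms to 2000 gets
(2000, 1000) plus a warning.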

Finally, yes, you are correct: in order to have exponential backoff, we
actually do need both configs, with retry.backoff.ms < retry.backoff.max.ms. I
will update the KIP to reflect that as well as incorporate the wording change
you have suggested.
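
As a rough illustration of how the two configs interact, here is a small
Python sketch of the formula quoted further down in this thread,
MIN(retry.backoff.max.ms, retry.backoff.ms * 2**(failures - 1) * random(0.8, 1.2)).
It assumes failures is counted from 1, and the function name retry_backoff_ms
is hypothetical; the real implementation may differ.

```python
import random


def retry_backoff_ms(failures, backoff_ms=100, max_ms=1000, jitter=0.2):
    """Backoff in ms to wait after `failures` consecutive failed attempts.

    Follows MIN(max, backoff * 2**(failures - 1) * random(1 - jitter, 1 + jitter)),
    with `failures` counted from 1. Hypothetical helper, not a Kafka API.
    """
    if failures < 1:
        raise ValueError("failures must be >= 1")
    exponential = backoff_ms * 2 ** (failures - 1)
    # Jitter spreads out retries so clients do not all retry in lockstep.
    jittered = exponential * random.uniform(1 - jitter, 1 + jitter)
    return min(max_ms, jittered)


# With jitter disabled, the default configs give 100, 200, 400, 800, 1000 ms.
schedule = [retry_backoff_ms(n, jitter=0) for n in range(1, 6)]

# With retry.backoff.ms = 2000 > retry.backoff.max.ms = 1000, every backoff is
# capped at 1000 ms, i.e. a constant backoff with no exponential increase.
capped = [retry_backoff_ms(n, backoff_ms=2000, max_ms=1000) for n in range(1, 6)]
```

The second list shows the case from your wording suggestion: when
retry.backoff.ms is well above retry.backoff.max.ms, the cap applies from the
first retry onward.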

Thanks,
Sanjana

On Mar 25, 2020, 10:50 AM -0700, Konstantine Karantasis 
<konstant...@confluent.io>, wrote:
> Hi Sanjana and thanks for the KIP!
>
> Sorry for the late response, but I still have a few questions that you
> might find useful.
>
> The KIP currently does not mention Kafka Connect at all. I have read
> the discussion above where it'd been decided to leave Connect and Streams
> out of the proposed changes, but I feel this should be called out
> explicitly. At the same time, Kafka Connect is also a Kafka client that
> uses ConsumerNetworkClient and Metadata for its rebalancing protocol. It's
> not clear to me whether changes in those classes will affect Connect
> workers. Do you think it's worth clarifying that?
>
> Additionally, you might also want to add a section specifically to mention
> how this new config affects the places where the current config
> retry.backoff.ms is used today to back-off during rebalancing. Is
> exponential backoff going to replace the old config in those places as
> well? And if it does, should we add a mention that a very high value of the
> new retry.backoff.max.ms might affect how quickly a consumer or worker
> rejoins their group after it experiences a temporary network partitioning
> from the broker coordinator?
>
> Places that explicitly use retry.backoff.ms at the moment include the
> AbstractCoordinator, the ConsumerCoordinator and the Heartbeat thread. By
> reading the previous discussion, I understand that these classes might keep
> using the old static backoff. Even if that's the case, I think it's worth
> mentioning that in the KIP for reference.
>
> In the rejected alternatives section, you mention that "existing behavior
> is always maintained: for reasons explained in the compatibility section.".
> However, the Compatibility section says that there are no compatibility
> concerns. I'd suggest extending the compatibility section to help a bit
> more in explaining why the alternatives were rejected. Also, in the
> compatibility section you mention that the new config (retry.backoff.max.ms)
> will replace the old one (retry.backoff.ms), but from reading at the
> beginning, I understand that in order to have exponential increments, you
> actually need both configs, with retry.backoff.ms < retry.backoff.max.ms.
> Should the mention around replacement be removed?
>
> Finally, I have a minor suggestion that might help explain the following
> sentence better:
>
> "If retry.backoff.ms is set to be greater than retry.backoff.max.ms, then
> retry.backoff.max.ms will be used as a **constant backoff from the
> beginning without exponential increase**." (highlighting the difference
> only for reference here). Unless I misunderstood how the new backoff will
> be used when it's smaller than the value of the old config, in which case
> it might help clarifying a bit more as well.
>
>
> Thanks for the KIP!
> Really looking forward to more robust retries in Kafka clients
>
> Konstantine
>
>
> On Tue, Mar 24, 2020 at 9:56 AM Guozhang Wang <wangg...@gmail.com> wrote:
>
> > In Kafka clients, there are cases where we log a warning when overriding
> > some conflicting configs and in some other cases we throw and let the
> > brokers die during startup --- you can check the
> > `postProcessParsedConfig` function in Producer/ConsumerConfig for such
> > logic.
> >
> > I think for this case, it is sufficient to log a warning if we find the
> > `max` < `backoff`.
> >
> >
> > Guozhang
> >
> > On Mon, Mar 23, 2020 at 9:18 PM Boyang Chen <reluctanthero...@gmail.com>
> > wrote:
> >
> > > Got it, although I would still like to be aware of the actual backoff I
> > > will be using in production, having the app crash seems like an
> > > over-reaction. I don't think I have further questions :)
> > >
> > > On Mon, Mar 23, 2020 at 7:36 PM Sanjana Kaundinya <skaundi...@gmail.com>
> > > wrote:
> > >
> > > > Hey Sanjana,
> > > >
> > > > Hey Boyang,
> > > >
> > > > If a user provides no config at all then, as you mentioned, they will
> > > > by default be able to make use of the exponential backoff feature
> > > > introduced by the KIP. If backoff.ms is overridden to 2000 ms, the
> > > > lesser of the max and the computed backoff will be chosen, so in this
> > > > case the max will be chosen, as it is 1000 ms. As Guozhang mentioned,
> > > > if the user configures something like this, they would notice that the
> > > > behavior is not in line with what they expect, and would see the KIP
> > > > and release notes and know to configure backoff.ms < max.backoff.ms.
> > > > I’m not sure it’s a big enough error to reject the configuration when
> > > > it’s configured like this, as the clients could still run in either
> > > > case.
> > > >
> > > > To answer your second question, we are making the dynamic backoff the
> > > > default and not allowing for static backoff (unless they set
> > > > backoff.ms > max.backoff.ms, which would in a sense be static). We will
> > > > include this information in the release notes to make sure users are
> > > > aware of this behavior change.
> > > >
> > > > Thanks,
> > > > Sanjana
> > > >
> > > > On Mon, Mar 23, 2020 at 6:37 PM Boyang Chen <
> > reluctanthero...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hey Sanjana,
> > > > >
> > > > > my understanding with the update is that if a user provides no config
> > > > > at all, a Producer/Consumer/Admin client user would by default enjoy
> > > > > a starting backoff.ms of 100 ms and a max.backoff.ms of 1000 ms? If I
> > > > > already override backoff.ms to 2000 ms, for instance, will I be
> > > > > choosing the default max.backoff here?
> > > > >
> > > > > I guess my question would be whether we should just reject a config
> > > > > with backoff.ms > max.backoff.ms in the first place, as this looks
> > > > > like misconfiguration to me.
> > > > >
> > > > > Second question is whether we allow fallback to static backoffs if
> > > > > the user wants to do so, or we should just ship this as an opt-in
> > > > > feature?
> > > > >
> > > > > Let me know your thoughts.
> > > > >
> > > > > Boyang
> > > > >
> > > > > On Mon, Mar 23, 2020 at 11:38 AM Cheng Tan <c...@confluent.io>
> > wrote:
> > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > > On Mar 19, 2020, at 7:27 PM, Sanjana Kaundinya <
> > > skaundi...@gmail.com
> > > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > Ah yes that makes sense. I’ll update the KIP to reflect this.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Sanjana
> > > > > > >
> > > > > > > On Thu, Mar 19, 2020 at 5:48 PM Guozhang Wang <
> > wangg...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Following the formula you have in the KIP, if it is simply:
> > > > > > > >
> > > > > > > > MIN(retry.backoff.max.ms, (retry.backoff.ms * 2**(failures - 1)) * random(0.8, 1.2))
> > > > > > > >
> > > > > > > > then the behavior would stay consistent at retry.backoff.max.ms.
> > > > > > > >
> > > > > > > >
> > > > > > > > Guozhang
> > > > > > > >
> > > > > > > > On Thu, Mar 19, 2020 at 5:46 PM Sanjana Kaundinya <
> > > > > skaundi...@gmail.com
> > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > If that’s the case then what should we base the starting point
> > > > > > > > > on? Currently in the KIP the starting point is retry.backoff.ms
> > > > > > > > > and it exponentially goes up to retry.backoff.max.ms. If
> > > > > > > > > retry.backoff.max.ms is smaller than retry.backoff.ms then that
> > > > > > > > > could pose a bit of a problem there, right?
> > > > > > > > >
> > > > > > > > > On Mar 19, 2020, 5:44 PM -0700, Guozhang Wang <
> > > wangg...@gmail.com
> > > > > ,
> > > > > > > > wrote:
> > > > > > > > > > Thanks Sanjana, I did not capture the part that Jason referred
> > > > > > > > > > to, so that's my bad :P
> > > > > > > > > >
> > > > > > > > > > Regarding your last statement, I actually feel that instead of
> > > > > > > > > > taking the larger of the two, we should respect
> > > > > > > > > > "retry.backoff.max.ms" even if it is smaller than
> > > > > > > > > > "retry.backoff.ms". I do not have a very strong rationale
> > > > > > > > > > except it is logically more aligned to the config names.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Guozhang
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, Mar 19, 2020 at 5:39 PM Sanjana Kaundinya <
> > > > > > > > skaundi...@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey Jason and Guozhang,
> > > > > > > > > > >
> > > > > > > > > > > Jason is right, I took this inspiration from KIP-144
> > > > > > > > > > > (https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts)
> > > > > > > > > > > which had the same logic in order to preserve the existing
> > > > > > > > > > > behavior. In this case however, if we are thinking to
> > > > > > > > > > > completely eliminate the static backoff behavior, we can do
> > > > > > > > > > > that and as Jason mentioned put it in the release notes and
> > > > > > > > > > > not add any special logic. In addition I agree that we should
> > > > > > > > > > > take the larger of the two of `retry.backoff.ms` and
> > > > > > > > > > > `retry.backoff.max.ms`. I'll update the KIP to reflect this
> > > > > > > > > > > and make it clear that the old static retry backoff is
> > > > > > > > > > > getting replaced by the new dynamic retry backoff.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Sanjana
> > > > > > > > > > > On Thu, Mar 19, 2020 at 4:23 PM Jason Gustafson <
> > > > > ja...@confluent.io>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hey Guozhang,
> > > > > > > > > > > >
> > > > > > > > > > > > I was referring to this:
> > > > > > > > > > > >
> > > > > > > > > > > > > For users who have not set retry.backoff.ms explicitly, the
> > > > > > > > > > > > > default behavior will change so that the backoff will grow
> > > > > > > > > > > > > up to 1000 ms. For users who have set retry.backoff.ms
> > > > > > > > > > > > > explicitly, the behavior will remain the same as they could
> > > > > > > > > > > > > have specific requirements.
> > > > > > > > > > > >
> > > > > > > > > > > > I took this to mean that for users who have overridden
> > > > > > > > > > > > `retry.backoff.ms` to 50ms (say), we will change the default
> > > > > > > > > > > > `retry.backoff.max.ms` to 50ms as well in order to preserve
> > > > > > > > > > > > existing backoff behavior. Is that not right? In any case, I
> > > > > > > > > > > > agree that we can use the maximum of the two values as the
> > > > > > > > > > > > effective `retry.backoff.max.ms` to handle the case when the
> > > > > > > > > > > > configured value of `retry.backoff.ms` is larger than the
> > > > > > > > > > > > default of 1s.
> > > > > > > > > > > >
> > > > > > > > > > > > -Jason
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <
> > > > wangg...@gmail.com
> > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hey Jason,
> > > > > > > > > > > > >
> > > > > > > > > > > > > My understanding is a bit different here: even if the user
> > > > > > > > > > > > > has explicitly overridden "retry.backoff.ms", the
> > > > > > > > > > > > > exponential mechanism still triggers and the backoff would
> > > > > > > > > > > > > be increased till "retry.backoff.max.ms"; and if the
> > > > > > > > > > > > > specified "retry.backoff.ms" is already larger than the
> > > > > > > > > > > > > "retry.backoff.max.ms", we would still take
> > > > > > > > > > > > > "retry.backoff.max.ms".
> > > > > > > > > > > > >
> > > > > > > > > > > > > So if the user does override the "retry.backoff.ms" to a
> > > > > > > > > > > > > value larger than 1s and is not aware of the new config,
> > > > > > > > > > > > > she would be surprised to see the specified value seemingly
> > > > > > > > > > > > > not being respected, but she could still learn that
> > > > > > > > > > > > > afterwards by reading the release notes introducing this
> > > > > > > > > > > > > KIP anyways.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Guozhang
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <
> > > > > > > > > ja...@confluent.io>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Sanjana,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The KIP looks good to me. I had just one question about the
> > > > > > > > > > > > > > default behavior. As I understand, if the user has
> > > > > > > > > > > > > > specified `retry.backoff.ms` explicitly, then we will not
> > > > > > > > > > > > > > apply the default max backoff. As such, there's no way to
> > > > > > > > > > > > > > get the benefit of this feature if you are providing a
> > > > > > > > > > > > > > `retry.backoff.ms` unless you also provide
> > > > > > > > > > > > > > `retry.backoff.max.ms`. That makes sense if you assume the
> > > > > > > > > > > > > > user is unaware of the new configuration, but it is
> > > > > > > > > > > > > > surprising otherwise. Since it's not a semantic change and
> > > > > > > > > > > > > > since the default you're proposing of 1s is fairly low
> > > > > > > > > > > > > > already, I wonder if it's good enough to mention the new
> > > > > > > > > > > > > > configuration in the release notes and not add any special
> > > > > > > > > > > > > > logic. What do you think?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -Jason
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <
> > > > > > > > > > > > skaundi...@gmail.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thank you for the comments Guozhang.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I’ll leave this KIP out for discussion till the end of the
> > > > > > > > > > > > > > > week and then start a vote for this early next week.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Sanjana
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <
> > > > > > > > > wangg...@gmail.com
> > > > > > > > > > > > ,
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > Hello Sanjana,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the proposed KIP, I think that makes a lot of
> > > > > > > > > > > > > > > > sense -- as you mentioned in the motivation, we've indeed
> > > > > > > > > > > > > > > > seen many issues with regard to the frequent retries. With
> > > > > > > > > > > > > > > > bounded exponential backoff, in the scenario where there's
> > > > > > > > > > > > > > > > a long connectivity issue we would effectively reduce the
> > > > > > > > > > > > > > > > request load by a factor of 10 given the default configs.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > For higher-level Streams client and Connect frameworks,
> > > > > > > > > > > > > > > > today we also have a retry logic but that's used in a
> > > > > > > > > > > > > > > > slightly different way. For example in Streams, we tend to
> > > > > > > > > > > > > > > > handle the retry logic at the thread-level and hence very
> > > > > > > > > > > > > > > > likely we'd like to change that mechanism in KIP-572
> > > > > > > > > > > > > > > > anyways. For producer / consumer / admin clients, I think
> > > > > > > > > > > > > > > > just applying this behavioral change across these clients
> > > > > > > > > > > > > > > > makes a lot of sense. So I think we can just leave the
> > > > > > > > > > > > > > > > Streams / Connect out of the scope of this KIP to be
> > > > > > > > > > > > > > > > addressed in separate discussions.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I do not have further comments about this KIP :) LGTM.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I do not have further comments about this KIP 
> > > > > > > > > > > > > > > > :) LGTM.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Guozhang
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, Mar 18, 2020 at 12:09 AM Sanjana 
> > > > > > > > > > > > > > > > Kaundinya <
> > > > > > > > > > > > > > skaundi...@gmail.com
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks for the feedback Boyang.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > If there’s anyone else who has feedback regarding this
> > > > > > > > > > > > > > > > > KIP, I would really appreciate hearing it!
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > Sanjana
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <
> > > > > > > > > > > > bche...@outlook.com>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Sounds great!
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Get Outlook for iOS<https://aka.ms/o0ukef>
> > > > > > > > > > > > > > > > > > ________________________________
> > > > > > > > > > > > > > > > > > From: Sanjana Kaundinya <skaundi...@gmail.com>
> > > > > > > > > > > > > > > > > > Sent: Tuesday, March 17, 2020 5:54:35 PM
> > > > > > > > > > > > > > > > > > To: dev@kafka.apache.org <dev@kafka.apache.org>
> > > > > > > > > > > > > > > > > > Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks for the explanation Boyang. One of the most common
> > > > > > > > > > > > > > > > > > problems that we have in Kafka is with respect to metadata
> > > > > > > > > > > > > > > > > > fetches. For example, if there is a broker failure, all
> > > > > > > > > > > > > > > > > > clients start to fetch metadata at the same time and it
> > > > > > > > > > > > > > > > > > often takes a while for the metadata to converge. In a
> > > > > > > > > > > > > > > > > > high load cluster, there are also issues where the volume
> > > > > > > > > > > > > > > > > > of metadata has made convergence of metadata slower.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > For this case, exponential backoff helps as it reduces the
> > > > > > > > > > > > > > > > > > retry rate and spaces out how often clients will retry,
> > > > > > > > > > > > > > > > > > thereby bringing down the time for convergence. Something
> > > > > > > > > > > > > > > > > > that Jason mentioned that would be a great addition here
> > > > > > > > > > > > > > > > > > would be if the backoff were “jittered” as it was in
> > > > > > > > > > > > > > > > > > KIP-144 with respect to exponential reconnect backoff.
> > > > > > > > > > > > > > > > > > This would help prevent the clients from being
> > > > > > > > > > > > > > > > > > synchronized on when they retry, thereby spacing out the
> > > > > > > > > > > > > > > > > > number of requests being sent to the broker at the same
> > > > > > > > > > > > > > > > > > time.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I’ll add this example to the KIP and flesh out more of the
> > > > > > > > > > > > > > > > > > details so it’s clearer.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Mar 17, 2020, 1:24 PM -0700, Boyang Chen 
> > > > > > > > > > > > > > > > > > <
> > > > > > > > > > > > > > > reluctanthero...@gmail.com
> > > > > > > > > > > > > > > > > > ,
> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > Thanks for the reply Sanjana. I guess I would like to
> > > > > > > > > > > > > > > > > > > rephrase my questions 2 and 3, as my previous response was
> > > > > > > > > > > > > > > > > > > a little bit unactionable.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > My specific point is that exponential backoff is not a
> > > > > > > > > > > > > > > > > > > silver bullet and we should consider using it to solve
> > > > > > > > > > > > > > > > > > > known problems, instead of making holistic changes to all
> > > > > > > > > > > > > > > > > > > clients in the Kafka ecosystem. I do like the exponential
> > > > > > > > > > > > > > > > > > > backoff idea and believe this would be of great value, but
> > > > > > > > > > > > > > > > > > > maybe we should focus on proposing some existing modules
> > > > > > > > > > > > > > > > > > > that are suffering from static retry, and only change them
> > > > > > > > > > > > > > > > > > > in this first KIP. If in the future some other component
> > > > > > > > > > > > > > > > > > > users believe they are also suffering, we could get more
> > > > > > > > > > > > > > > > > > > minor KIPs to change the behavior as well.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Boyang
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Sun, Mar 15, 2020 at 12:07 AM Sanjana 
> > > > > > > > > > > > > > > > > > > Kaundinya <
> > > > > > > > > > > > > > > > > skaundi...@gmail.com
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thanks for the feedback Boyang, I will 
> > > > > > > > > > > > > > > > > > > > revise the
> > > > > > > > KIP
> > > > > > > > > > > with
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > mathematical relations as per your 
> > > > > > > > > > > > > > > > > > > > suggestion. To
> > > > > > > > > address
> > > > > > > > > > > > > your
> > > > > > > > > > > > > > > > > > feedback:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > 1. Currently, with the default of 100 
> > > > > > > > > > > > > > > > > > > > ms per retry
> > > > > > > > > > > backoff,
> > > > > > > > > > > > > in
> > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > > > > second
> > > > > > > > > > > > > > > > > > > > we would have 10 retries. In the case 
> > > > > > > > > > > > > > > > > > > > of using an
> > > > > > > > > > > > exponential
> > > > > > > > > > > > > > > > > backoff,
> > > > > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > > > > would have a total of 4 retries in 1 
> > > > > > > > > > > > > > > > > > > > second. Thus
> > > > > > > > we
> > > > > > > > > have
> > > > > > > > > > > > > less
> > > > > > > > > > > > > > > than
> > > > > > > > > > > > > > > > > > half of
> > > > > > > > > > > > > > > > > > > > the amount of retries in the same 
> > > > > > > > > > > > > > > > > > > > timeframe and can
> > > > > > > > > > > lessen
> > > > > > > > > > > > > > broker
> > > > > > > > > > > > > > > > > > pressure.
> > > > > > > > > > > > > > > > > > > > This calculation is done as following 
> > > > > > > > > > > > > > > > > > > > (using the
> > > > > > > > > formula
> > > > > > > > > > > > laid
> > > > > > > > > > > > > > > out in
> > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > KIP:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Try 1 at time 0 ms, failures = 0,
> > > > > > > > > > > > > > > > > > > > > > next retry in 100 ms (default retry ms
> > > > > > > > > > > > > > > > > > > > > > is initially 100 ms)
> > > > > > > > > > > > > > > > > > > > > > Try 2 at time 100 ms, failures = 1,
> > > > > > > > > > > > > > > > > > > > > > next retry in 200 ms
> > > > > > > > > > > > > > > > > > > > > > Try 3 at time 300 ms, failures = 2,
> > > > > > > > > > > > > > > > > > > > > > next retry in 400 ms
> > > > > > > > > > > > > > > > > > > > > > Try 4 at time 700 ms, failures = 3,
> > > > > > > > > > > > > > > > > > > > > > next retry in 800 ms
> > > > > > > > > > > > > > > > > > > > > > Try 5 at time 1500 ms, failures = 4,
> > > > > > > > > > > > > > > > > > > > > > next retry in 1000 ms (default max
> > > > > > > > > > > > > > > > > > > > > > retry ms is 1000 ms)
> > > > > > > > > > > > > > > > > > > >
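The retry schedule quoted above can be reproduced with a short sketch. It assumes the delay is min(initial * 2^failures, max), which is inferred from the numbers in the schedule rather than copied from the KIP, and it ignores any jitter the KIP may apply. The names backoff_ms and attempt_times are hypothetical helpers, not Kafka client APIs, and the 1-second window is treated as the half-open interval [0, 1000) ms.

```python
def backoff_ms(failures, initial_ms=100, max_ms=1000):
    """Delay before the next retry after `failures` consecutive failures.

    Assumed formula (no jitter): min(initial_ms * 2**failures, max_ms).
    """
    return min(initial_ms * 2 ** failures, max_ms)


def attempt_times(window_ms, delay_fn):
    """Start times (ms) of every attempt that begins within [0, window_ms)."""
    times, now, failures = [], 0, 0
    while now < window_ms:
        times.append(now)
        now += delay_fn(failures)  # wait before the next attempt
        failures += 1
    return times


# Exponential backoff with the defaults from the schedule above.
exponential = attempt_times(1000, backoff_ms)
# Static backoff: always wait 100 ms between attempts.
static = attempt_times(1000, lambda failures: 100)

print(exponential)  # [0, 100, 300, 700]
print(len(static))  # 10
```

Under these assumptions, exponential backoff starts 4 attempts in the first second against 10 for a static 100 ms backoff, which matches the "fewer than half" estimate.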
> > > > > > > > > > > > > > > > > > > > > > For 2 and 3, could you elaborate more
> > > > > > > > > > > > > > > > > > > > > > about what you mean with respect to
> > > > > > > > > > > > > > > > > > > > > > client timeouts? I’m not very familiar
> > > > > > > > > > > > > > > > > > > > > > with the Streams framework, so I would
> > > > > > > > > > > > > > > > > > > > > > love to get more insight into how that
> > > > > > > > > > > > > > > > > > > > > > currently works with respect to
> > > > > > > > > > > > > > > > > > > > > > producer transactions, so I can
> > > > > > > > > > > > > > > > > > > > > > appropriately update the KIP to address
> > > > > > > > > > > > > > > > > > > > > > these scenarios.
> > > > > > > > > > > > > > > > > > > > > > On Mar 13, 2020, 7:15 PM -0700, Boyang Chen
> > > > > > > > > > > > > > > > > > > > > > <reluctanthero...@gmail.com>, wrote:
> > > > > > > > > > > > > > > > > > > > > > > Thanks for the KIP, Sanjana. I think
> > > > > > > > > > > > > > > > > > > > > > > the motivation is good, but it lacks
> > > > > > > > > > > > > > > > > > > > > > > quantitative analysis. For instance:
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > 1. How many retries are we saving by
> > > > > > > > > > > > > > > > > > > > > > > applying exponential retry vs. static
> > > > > > > > > > > > > > > > > > > > > > > retry? There should be some
> > > > > > > > > > > > > > > > > > > > > > > mathematical relation between the
> > > > > > > > > > > > > > > > > > > > > > > static retry ms, the initial
> > > > > > > > > > > > > > > > > > > > > > > exponential retry ms, and the max
> > > > > > > > > > > > > > > > > > > > > > > exponential retry ms in a given time
> > > > > > > > > > > > > > > > > > > > > > > interval.
> > > > > > > > > > > > > > > > > > > > > > > 2. How does this affect the client
> > > > > > > > > > > > > > > > > > > > > > > timeout? With exponential retry, the
> > > > > > > > > > > > > > > > > > > > > > > client will time out more easily at a
> > > > > > > > > > > > > > > > > > > > > > > parent-level caller; for instance,
> > > > > > > > > > > > > > > > > > > > > > > Streams attempts to retry initializing
> > > > > > > > > > > > > > > > > > > > > > > producer transactions within a given
> > > > > > > > > > > > > > > > > > > > > > > 5-minute interval. With exponential
> > > > > > > > > > > > > > > > > > > > > > > retry this mechanism could experience
> > > > > > > > > > > > > > > > > > > > > > > more frequent timeouts, which we should
> > > > > > > > > > > > > > > > > > > > > > > be careful with.
> > > > > > > > > > > > > > > > > > > > > > > 3. With regard to #2, we should have a
> > > > > > > > > > > > > > > > > > > > > > > more detailed checklist of all the
> > > > > > > > > > > > > > > > > > > > > > > existing static retry scenarios, and
> > > > > > > > > > > > > > > > > > > > > > > adjust the initial exponential retry
> > > > > > > > > > > > > > > > > > > > > > > ms to make sure we won't easily time
> > > > > > > > > > > > > > > > > > > > > > > out at a high level due to too few
> > > > > > > > > > > > > > > > > > > > > > > attempts.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Boyang
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > On Fri, Mar 13, 2020 at 4:38 PM Sanjana
> > > > > > > > > > > > > > > > > > > > > > > Kaundinya <skaundi...@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Hi Everyone,
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > I’ve written a KIP about introducing
> > > > > > > > > > > > > > > > > > > > > > > > exponential backoff for Kafka clients.
> > > > > > > > > > > > > > > > > > > > > > > > Would appreciate any feedback on this.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > > > > > > Sanjana
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > -- Guozhang
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > -- Guozhang
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > -- Guozhang
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > -- Guozhang
> > > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
