Re: [DISCUSS] KIP-526: Reduce Producer Metadata Lookups for Large Number of Topics

Brian Byrne Thu, 19 Dec 2019 15:38:06 -0800

Hello everyone,

For all interested, please take a look at the proposed algorithm as I'd
like to get more feedback. I'll call for a vote once the break is over.


Thanks,
Brian

On Mon, Dec 9, 2019 at 10:18 PM Guozhang Wang <[email protected]> wrote:

> Sounds good, I agree that should not make a big difference in practice.
>
> On Mon, Dec 9, 2019 at 2:07 PM Brian Byrne <[email protected]> wrote:
>
> > Hi Guozhang,
> >
> > I see, we agree on the topic threshold not applying to urgent topics, but
> > differ slightly on what should be considered urgent. I would argue that
> we
> > should consider topics nearing the metadata.max.age.ms to be urgent
> since
> > they may still be well within the metadata.expiry.ms. That is, the
> client
> > still considers these topics to be relevant (not expired), but doesn't
> want
> > to incur the latency bubble of having to wait for the metadata to be
> > re-fetched if it's stale. This could be a frequent case if the
> > metadata.max.age.ms << metadata.expiry.ms.
> >
> > In practice, I wouldn't expect this to make a noticeable difference so I
> > don't have a strong leaning, but the current behavior today is to
> > aggressively refresh the metadata of stale topics by ensuring a refresh
> is
> > triggered before that metadata.max.age.ms duration elapses.
> >
> > Thanks,
> > Brian
> >
> >
> > On Mon, Dec 9, 2019 at 11:57 AM Guozhang Wang <[email protected]>
> wrote:
> >
> > > Hello Brian,
> > >
> > > Thanks for your explanation, could you then update the wiki page for
> the
> > > algorithm part since when I read it, I thought it was different from
> the
> > > above, e.g. urgent topics should not be added just because of max.age
> > > expiration, but should only be added if there are sending data pending.
> > >
> > >
> > > Guozhang
> > >
> > > On Mon, Dec 9, 2019 at 10:57 AM Brian Byrne <[email protected]>
> wrote:
> > >
> > > > Hi Guozhang,
> > > >
> > > > Thanks for the feedback!
> > > >
> > > > On Sun, Dec 8, 2019 at 6:25 PM Guozhang Wang <[email protected]>
> > wrote:
> > > >
> > > > > 1. The addition of *metadata.expiry.ms <http://metadata.expiry.ms>
> > > > *should
> > > > > be included in the public interface. Also its semantics needs more
> > > > > clarification (since previously it is hard-coded internally we do
> not
> > > > need
> > > > > to explain it publicly, but now with the configurable value we do
> > > need).
> > > > >
> > > >
> > > > This was an oversight. Done.
> > > >
> > > >
> > > > > 2. There are a couple of hard-coded parameters like 25 and 0.5 in
> the
> > > > > proposal, maybe we need to explain why these magic values makes
> sense
> > > in
> > > > > common scenarios.
> > > > >
> > > >
> > > > So these are pretty fuzzy numbers, and seemed to be a decent balance
> > > > between trade-offs. I've updated the target size to account for
> setups
> > > with
> > > > a large number of topics or a shorter refresh time, as well as added
> > some
> > > > light rationale.
> > > >
> > > >
> > > > > 3. In the Urgent set condition, do you actually mean "with no
> cached
> > > > > metadata AND there are existing data buffered for the topic"?
> > > > >
> > > >
> > > > Yes, fixed.
> > > >
> > > >
> > > >
> > > > > One concern I have is whether or not we may introduce a regression,
> > > > > especially during producer startup such that since we only require
> up
> > > to
> > > > 25
> > > > > topics each request, it may cause the send data to be buffered more
> > > time
> > > > > than now due to metadata not available. I understand this is a
> > > > acknowledged
> > > > > trade-off in our design but any regression that may surface to
> users
> > > need
> > > > > to be very carefully considered. I'm wondering, e.g. if we can
> tweak
> > > our
> > > > > algorithm for the Urgent set, e.g. to consider those with non
> cached
> > > > > metadata have higher priority than those who have elapsed max.age
> but
> > > not
> > > > > yet have been called for sending. More specifically:
> > > > >
> > > > > Urgent: topics that have been requested for sending but no cached
> > > > metadata,
> > > > > and topics that have send request failed with e.g. NOT_LEADER.
> > > > > Non-urgent: topics that are not in Urgent but have expired max.age.
> > > > >
> > > > > Then when sending metadata, we always send ALL in the urgent (i.e.
> > > ignore
> > > > > the size limit), and only when they do not exceed the size limit,
> > > > consider
> > > > > fill in more topics from Non-urgent up to the size limit.
> > > > >
> > > >
> > > > I think we're on the same page here. Urgent topics ignore the target
> > > > metadata RPC size, and are not bound by it in any way, i.e. if
> there's
> > > 100
> > > > urgent topics, we'll fetch all 100 in a single RPC. Like you say,
> > > however,
> > > > if a topic transitions to urgent and there's several non-urgent ones,
> > > we'll
> > > > piggyback the non-urgent updates up to the target size.
> > > >
> > > > Thanks,
> > > >  Brian
> > > >
> > > >
> > > > On Wed, Nov 20, 2019 at 7:00 PM deng ziming <
> [email protected]>
> > > > > wrote:
> > > > >
> > > > > > I think it's ok, and you can add another issue about
> `asynchronous
> > > > > > metadata` if `topic expiry` is not enough.
> > > > > >
> > > > > >
> > > > > > On Thu, Nov 21, 2019 at 6:20 AM Brian Byrne <[email protected]
> >
> > > > wrote:
> > > > > >
> > > > > > > Hello all,
> > > > > > >
> > > > > > > I've refactored the KIP to remove implementing asynchronous
> > > metadata
> > > > > > > fetching in the producer during send(). It's now exclusively
> > > focused
> > > > on
> > > > > > > reducing the topic metadata fetch payload and proposes adding a
> > new
> > > > > > > configuration flag to control topic expiry behavior. Please
> take
> > a
> > > > look
> > > > > > > when possible.
> > > > > > >
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-526
> > > > > > > %3A+Reduce+Producer+Metadata+Lookups+for+Large+Number+of+Topics
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Brian
> > > > > > >
> > > > > > > On Fri, Oct 4, 2019 at 10:04 AM Brian Byrne <
> [email protected]
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Lucas, Guozhang,
> > > > > > > >
> > > > > > > > Thank you for the comments. Good point on
> > > METADATA_MAX_AGE_CONFIG -
> > > > > it
> > > > > > > > looks like the ProducerMetadata was differentiating between
> > > expiry
> > > > > and
> > > > > > > > refresh, but it should be unnecessary to do so once the cost
> of
> > > > > > fetching
> > > > > > > a
> > > > > > > > single topic's metadata is reduced.
> > > > > > > >
> > > > > > > > I've updated the rejected alternatives and removed the config
> > > > > > variables.
> > > > > > > >
> > > > > > > > Brian
> > > > > > > >
> > > > > > > > On Fri, Oct 4, 2019 at 9:20 AM Guozhang Wang <
> > [email protected]
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > >> Hello Brian,
> > > > > > > >>
> > > > > > > >> Thanks for the KIP.
> > > > > > > >>
> > > > > > > >> I think using asynchronous metadata update to address 1)
> > > metadata
> > > > > > update
> > > > > > > >> blocking send, but for other issues, currently at producer
> we
> > do
> > > > > have
> > > > > > a
> > > > > > > >> configurable `METADATA_MAX_AGE_CONFIG` similar to consumer,
> by
> > > > > default
> > > > > > > is
> > > > > > > >> 5min. So maybe we do not need to introduce new configs here,
> > but
> > > > > only
> > > > > > > >> change the semantics of that config from global expiry
> (today
> > we
> > > > > just
> > > > > > > >> enforce a full metadata update for the whole cluster) to
> > > > > single-topic
> > > > > > > >> expiry, and we can also extend its expiry deadline whenever
> > that
> > > > > > > metadata
> > > > > > > >> is successfully used to send a produce request.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> Guozhang
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Thu, Oct 3, 2019 at 6:51 PM Lucas Bradstreet <
> > > > [email protected]
> > > > > >
> > > > > > > >> wrote:
> > > > > > > >>
> > > > > > > >> > Hi Brian,
> > > > > > > >> >
> > > > > > > >> > This looks great, and should help reduce blocking and high
> > > > > metadata
> > > > > > > >> request
> > > > > > > >> > volumes when the producer is sending to large numbers of
> > > topics,
> > > > > > > >> especially
> > > > > > > >> > at low volumes. I think the approach to make metadata
> > fetching
> > > > > > > >> asynchronous
> > > > > > > >> > and batch metadata requests together will help
> > significantly.
> > > > > > > >> >
> > > > > > > >> > The only other approach I can think of is to allow users
> to
> > > > supply
> > > > > > the
> > > > > > > >> > producer with the expected topics upfront, allowing the
> > > producer
> > > > > to
> > > > > > > >> perform
> > > > > > > >> > a single initial metadata request before any sends occur.
> I
> > > see
> > > > no
> > > > > > > real
> > > > > > > >> > advantages to this approach compared to the async method
> > > you’ve
> > > > > > > >> proposed,
> > > > > > > >> > but maybe we could add it to the rejected alternatives
> > > section?
> > > > > > > >> >
> > > > > > > >> > Thanks,
> > > > > > > >> >
> > > > > > > >> > Lucas
> > > > > > > >> >
> > > > > > > >> > On Fri, 20 Sep 2019 at 11:46, Brian Byrne <
> > > [email protected]>
> > > > > > > wrote:
> > > > > > > >> >
> > > > > > > >> > > I've updated the 'Proposed Changes' to include two new
> > > > producer
> > > > > > > >> > > configuration variables: topic.expiry.ms and
> > > topic.refresh.ms
> > > > .
> > > > > > > Please
> > > > > > > >> > take
> > > > > > > >> > > a look.
> > > > > > > >> > >
> > > > > > > >> > > Thanks,
> > > > > > > >> > > Brian
> > > > > > > >> > >
> > > > > > > >> > > On Tue, Sep 17, 2019 at 12:59 PM Brian Byrne <
> > > > > [email protected]
> > > > > > >
> > > > > > > >> > wrote:
> > > > > > > >> > >
> > > > > > > >> > > > Dev team,
> > > > > > > >> > > >
> > > > > > > >> > > > Requesting discussion for improvement to the producer
> > when
> > > > > > dealing
> > > > > > > >> > with a
> > > > > > > >> > > > large number of topics.
> > > > > > > >> > > >
> > > > > > > >> > > > KIP:
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-526%3A+Reduce+Producer+Metadata+Lookups+for+Large+Number+of+Topics
> > > > > > > >> > > >
> > > > > > > >> > > > JIRA:
> https://issues.apache.org/jira/browse/KAFKA-8904
> > > > > > > >> > > >
> > > > > > > >> > > > Thoughts and feedback would be appreciated.
> > > > > > > >> > > >
> > > > > > > >> > > > Thanks,
> > > > > > > >> > > > Brian
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> --
> > > > > > > >> -- Guozhang
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-526: Reduce Producer Metadata Lookups for Large Number of Topics

Reply via email to