Hello everyone, For all interested, please take a look at the proposed algorithm as I'd like to get more feedback. I'll call for a vote once the break is over.
Thanks, Brian On Mon, Dec 9, 2019 at 10:18 PM Guozhang Wang <wangg...@gmail.com> wrote: > Sounds good, I agree that should not make a big difference in practice. > > On Mon, Dec 9, 2019 at 2:07 PM Brian Byrne <bby...@confluent.io> wrote: > > > Hi Guozhang, > > > > I see, we agree on the topic threshold not applying to urgent topics, but > > differ slightly on what should be considered urgent. I would argue that > we > > should consider topics nearing the metadata.max.age.ms to be urgent > since > > they may still be well within the metadata.expiry.ms. That is, the > client > > still considers these topics to be relevant (not expired), but doesn't > want > > to incur the latency bubble of having to wait for the metadata to be > > re-fetched if it's stale. This could be a frequent case if the > > metadata.max.age.ms << metadata.expiry.ms. > > > > In practice, I wouldn't expect this to make a noticeable difference so I > > don't have a strong leaning, but the current behavior today is to > > aggressively refresh the metadata of stale topics by ensuring a refresh > is > > triggered before that metadata.max.age.ms duration elapses. > > > > Thanks, > > Brian > > > > > > On Mon, Dec 9, 2019 at 11:57 AM Guozhang Wang <wangg...@gmail.com> > wrote: > > > > > Hello Brian, > > > > > > Thanks for your explanation, could you then update the wiki page for > the > > > algorithm part since when I read it, I thought it was different from > the > > > above, e.g. urgent topics should not be added just because of max.age > > > expiration, but should only be added if there are sending data pending. > > > > > > > > > Guozhang > > > > > > On Mon, Dec 9, 2019 at 10:57 AM Brian Byrne <bby...@confluent.io> > wrote: > > > > > > > Hi Guozhang, > > > > > > > > Thanks for the feedback! > > > > > > > > On Sun, Dec 8, 2019 at 6:25 PM Guozhang Wang <wangg...@gmail.com> > > wrote: > > > > > > > > > 1. The addition of *metadata.expiry.ms <http://metadata.expiry.ms> > > > > *should > > > > > be included in the public interface. Also its semantics needs more > > > > > clarification (since previously it is hard-coded internally we do > not > > > > need > > > > > to explain it publicly, but now with the configurable value we do > > > need). > > > > > > > > > > > > > This was an oversight. Done. > > > > > > > > > > > > > 2. There are a couple of hard-coded parameters like 25 and 0.5 in > the > > > > > proposal, maybe we need to explain why these magic values makes > sense > > > in > > > > > common scenarios. > > > > > > > > > > > > > So these are pretty fuzzy numbers, and seemed to be a decent balance > > > > between trade-offs. I've updated the target size to account for > setups > > > with > > > > a large number of topics or a shorter refresh time, as well as added > > some > > > > light rationale. > > > > > > > > > > > > > 3. In the Urgent set condition, do you actually mean "with no > cached > > > > > metadata AND there are existing data buffered for the topic"? > > > > > > > > > > > > > Yes, fixed. > > > > > > > > > > > > > > > > > One concern I have is whether or not we may introduce a regression, > > > > > especially during producer startup such that since we only require > up > > > to > > > > 25 > > > > > topics each request, it may cause the send data to be buffered more > > > time > > > > > than now due to metadata not available. I understand this is a > > > > acknowledged > > > > > trade-off in our design but any regression that may surface to > users > > > need > > > > > to be very carefully considered. I'm wondering, e.g. if we can > tweak > > > our > > > > > algorithm for the Urgent set, e.g. to consider those with non > cached > > > > > metadata have higher priority than those who have elapsed max.age > but > > > not > > > > > yet have been called for sending. More specifically: > > > > > > > > > > Urgent: topics that have been requested for sending but no cached > > > > metadata, > > > > > and topics that have send request failed with e.g. NOT_LEADER. > > > > > Non-urgent: topics that are not in Urgent but have expired max.age. > > > > > > > > > > Then when sending metadata, we always send ALL in the urgent (i.e. > > > ignore > > > > > the size limit), and only when they do not exceed the size limit, > > > > consider > > > > > fill in more topics from Non-urgent up to the size limit. > > > > > > > > > > > > > I think we're on the same page here. Urgent topics ignore the target > > > > metadata RPC size, and are not bound by it in any way, i.e. if > there's > > > 100 > > > > urgent topics, we'll fetch all 100 in a single RPC. Like you say, > > > however, > > > > if a topic transitions to urgent and there's several non-urgent ones, > > > we'll > > > > piggyback the non-urgent updates up to the target size. > > > > > > > > Thanks, > > > > Brian > > > > > > > > > > > > On Wed, Nov 20, 2019 at 7:00 PM deng ziming < > dengziming1...@gmail.com> > > > > > wrote: > > > > > > > > > > > I think it's ok, and you can add another issue about > `asynchronous > > > > > > metadata` if `topic expiry` is not enough. > > > > > > > > > > > > > > > > > > On Thu, Nov 21, 2019 at 6:20 AM Brian Byrne <bby...@confluent.io > > > > > > wrote: > > > > > > > > > > > > > Hello all, > > > > > > > > > > > > > > I've refactored the KIP to remove implementing asynchronous > > > metadata > > > > > > > fetching in the producer during send(). It's now exclusively > > > focused > > > > on > > > > > > > reducing the topic metadata fetch payload and proposes adding a > > new > > > > > > > configuration flag to control topic expiry behavior. Please > take > > a > > > > look > > > > > > > when possible. > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-526 > > > > > > > %3A+Reduce+Producer+Metadata+Lookups+for+Large+Number+of+Topics > > > > > > > > > > > > > > Thanks, > > > > > > > Brian > > > > > > > > > > > > > > On Fri, Oct 4, 2019 at 10:04 AM Brian Byrne < > bby...@confluent.io > > > > > > > > wrote: > > > > > > > > > > > > > > > Lucas, Guozhang, > > > > > > > > > > > > > > > > Thank you for the comments. Good point on > > > METADATA_MAX_AGE_CONFIG - > > > > > it > > > > > > > > looks like the ProducerMetadata was differentiating between > > > expiry > > > > > and > > > > > > > > refresh, but it should be unnecessary to do so once the cost > of > > > > > > fetching > > > > > > > a > > > > > > > > single topic's metadata is reduced. > > > > > > > > > > > > > > > > I've updated the rejected alternatives and removed the config > > > > > > variables. > > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > On Fri, Oct 4, 2019 at 9:20 AM Guozhang Wang < > > wangg...@gmail.com > > > > > > > > > > wrote: > > > > > > > > > > > > > > > >> Hello Brian, > > > > > > > >> > > > > > > > >> Thanks for the KIP. > > > > > > > >> > > > > > > > >> I think using asynchronous metadata update to address 1) > > > metadata > > > > > > update > > > > > > > >> blocking send, but for other issues, currently at producer > we > > do > > > > > have > > > > > > a > > > > > > > >> configurable `METADATA_MAX_AGE_CONFIG` similar to consumer, > by > > > > > default > > > > > > > is > > > > > > > >> 5min. So maybe we do not need to introduce new configs here, > > but > > > > > only > > > > > > > >> change the semantics of that config from global expiry > (today > > we > > > > > just > > > > > > > >> enforce a full metadata update for the whole cluster) to > > > > > single-topic > > > > > > > >> expiry, and we can also extend its expiry deadline whenever > > that > > > > > > > metadata > > > > > > > >> is successfully used to send a produce request. > > > > > > > >> > > > > > > > >> > > > > > > > >> Guozhang > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> On Thu, Oct 3, 2019 at 6:51 PM Lucas Bradstreet < > > > > lu...@confluent.io > > > > > > > > > > > > > >> wrote: > > > > > > > >> > > > > > > > >> > Hi Brian, > > > > > > > >> > > > > > > > > >> > This looks great, and should help reduce blocking and high > > > > > metadata > > > > > > > >> request > > > > > > > >> > volumes when the producer is sending to large numbers of > > > topics, > > > > > > > >> especially > > > > > > > >> > at low volumes. I think the approach to make metadata > > fetching > > > > > > > >> asynchronous > > > > > > > >> > and batch metadata requests together will help > > significantly. > > > > > > > >> > > > > > > > > >> > The only other approach I can think of is to allow users > to > > > > supply > > > > > > the > > > > > > > >> > producer with the expected topics upfront, allowing the > > > producer > > > > > to > > > > > > > >> perform > > > > > > > >> > a single initial metadata request before any sends occur. > I > > > see > > > > no > > > > > > > real > > > > > > > >> > advantages to this approach compared to the async method > > > you’ve > > > > > > > >> proposed, > > > > > > > >> > but maybe we could add it to the rejected alternatives > > > section? > > > > > > > >> > > > > > > > > >> > Thanks, > > > > > > > >> > > > > > > > > >> > Lucas > > > > > > > >> > > > > > > > > >> > On Fri, 20 Sep 2019 at 11:46, Brian Byrne < > > > bby...@confluent.io> > > > > > > > wrote: > > > > > > > >> > > > > > > > > >> > > I've updated the 'Proposed Changes' to include two new > > > > producer > > > > > > > >> > > configuration variables: topic.expiry.ms and > > > topic.refresh.ms > > > > . > > > > > > > Please > > > > > > > >> > take > > > > > > > >> > > a look. > > > > > > > >> > > > > > > > > > >> > > Thanks, > > > > > > > >> > > Brian > > > > > > > >> > > > > > > > > > >> > > On Tue, Sep 17, 2019 at 12:59 PM Brian Byrne < > > > > > bby...@confluent.io > > > > > > > > > > > > > > >> > wrote: > > > > > > > >> > > > > > > > > > >> > > > Dev team, > > > > > > > >> > > > > > > > > > > >> > > > Requesting discussion for improvement to the producer > > when > > > > > > dealing > > > > > > > >> > with a > > > > > > > >> > > > large number of topics. > > > > > > > >> > > > > > > > > > > >> > > > KIP: > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-526%3A+Reduce+Producer+Metadata+Lookups+for+Large+Number+of+Topics > > > > > > > >> > > > > > > > > > > >> > > > JIRA: > https://issues.apache.org/jira/browse/KAFKA-8904 > > > > > > > >> > > > > > > > > > > >> > > > Thoughts and feedback would be appreciated. > > > > > > > >> > > > > > > > > > > >> > > > Thanks, > > > > > > > >> > > > Brian > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > >> -- > > > > > > > >> -- Guozhang > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > -- Guozhang > > > > > > > > > > > > > > > > > > -- > > > -- Guozhang > > > > > > > > -- > -- Guozhang >