Sounds good. +1 from me.

On Thu, Jan 23, 2020 at 9:00 AM Brian Byrne <bby...@confluent.io> wrote:
Thanks Jason,

I'm in favor of the latter: metadata.max.idle.ms. I agree that describing it as a "period" is inaccurate. With metadata.max.idle.ms, it also aligns with metadata.max.age.ms for determining the refresh period (which is an actual period).

I've updated the docs.

Thanks,
Brian

On Wed, Jan 22, 2020 at 6:19 PM Jason Gustafson <ja...@confluent.io> wrote:

Thanks for the proposal. Looks good overall. I wanted to suggest a possible name change. I was considering something like `idle.metadata.expiration.ms` or maybe `metadata.max.idle.ms`. Thoughts?

-Jason

On Tue, Jan 21, 2020 at 11:38 AM Guozhang Wang <wangg...@gmail.com> wrote:

Got it.

I was proposing that we do the "delayed async batch", but I think your argument for complexity and pushing it out of the scope is convincing, so instead I propose we do the synchronous mini batching still, but obviously it is already there :) I'm +1 on the current proposal scope.

Guozhang

On Tue, Jan 21, 2020 at 10:16 AM Brian Byrne <bby...@confluent.io> wrote:

Hi Guozhang,

Ah, sorry, I misunderstood. Actually, this is solved for us today. How the producer works is that it maintains at most one inflight metadata fetch request at any time, where each request is tagged with the current (monotonically increasing) request version. This version is bumped whenever a new topic is encountered, and metadata fetching will continue to process while the latest metadata response's version is below the current version.

So if a metadata request is in flight, and a number of threads produce to new topics, they'll be added to the working set, but the next metadata request won't take place until the outstanding one returns. So their updates will be batched together. As you suggest, we can have a simple list that tracks unknown topics to isolate new vs. old topics.

Thanks,
Brian
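For illustration, here is a minimal sketch of the version-tagging behavior described above (hypothetical names, not the actual producer Metadata class): produce threads bump a request version when they hit a new topic, only one fetch is ever in flight, and the fetch is simply repeated until the response version catches up, so topics discovered concurrently ride the same follow-up request.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch only: "at most one inflight metadata fetch, tagged with a
// monotonically increasing request version", as described above.
class MetadataFetchState {
    private int requestVersion = 0;       // bumped when a new topic is encountered
    private int lastSeenVersion = 0;      // version the latest response was sent with
    private boolean inflight = false;     // at most one metadata request outstanding
    private final Set<String> workingSet = new HashSet<>();

    // Called from the send path when an unknown topic is hit.
    synchronized void requestUpdateForNewTopic(String topic) {
        workingSet.add(topic);
        requestVersion++;                 // any response older than this is insufficient
    }

    // Called by the network thread to decide whether another fetch is needed.
    synchronized boolean shouldSendFetch() {
        return !inflight && lastSeenVersion < requestVersion;
    }

    // Snapshot of topics for the next fetch; returns the version it was built at.
    synchronized int beginFetch(Set<String> out) {
        inflight = true;
        out.addAll(workingSet);
        return requestVersion;
    }

    // Called when the metadata response for 'sentVersion' returns. If new topics
    // arrived while the fetch was in flight, shouldSendFetch() becomes true again
    // and all of them go out together in the next request.
    synchronized void handleResponse(int sentVersion) {
        inflight = false;
        lastSeenVersion = Math.max(lastSeenVersion, sentVersion);
    }
}
```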
On Tue, Jan 21, 2020 at 10:04 AM Guozhang Wang <wangg...@gmail.com> wrote:

Hi Brian,

I think I buy the complexity and extra end-to-end-latency argument :) I'm fine with delaying the asynchronous topic fetching to future work and keeping the current KIP's scope as-is for now. Under that case, can we consider just a minor implementation detail (since it is not affecting public APIs we probably do not even need to list it, but just thinking out loud here):

In your proposal, when we request a topic of unknown metadata, we are going to directly set the topic name as that singleton in the request. I'm wondering, for the scenario that KAFKA-8904 described, whether the producer#send calls for thousands of new topics are triggered sequentially by a single thread or by concurrent threads. If it's the latter, and we expect in such scenarios we may have multiple topics being requested within a very short time, then we can probably do something like this internally in a synchronized manner:

1) put the topic name into a list, as "unknown topics", then
2) exhaust the list, and put all topics from that list into the request; if the list is empty, it means it has been emptied by another thread, so we skip sending a new request and just wait for the returned metadata refresh.

In most cases the list would just be a singleton with the one topic that thread has just enqueued, but under extreme scenarios it can help batch a few topic names (of course, I'm thinking about very extreme cases here, assuming that's what we've seen in KAFKA-8904). Since these two steps are very lightweight, doing that in a synchronized block would not hurt the concurrency too much.

Guozhang
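A minimal sketch of steps 1) and 2) above (hypothetical names, not actual producer code): produce threads add unknown topics, and whichever thread builds the next request drains everything that has accumulated; an empty drain means another thread already picked the topics up.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the "collect unknown topics, then exhaust the list" idea above.
class UnknownTopicBuffer {
    private final Set<String> unknownTopics = new HashSet<>();

    // Step 1: a produce thread hit a topic with no cached metadata.
    synchronized void add(String topic) {
        unknownTopics.add(topic);
    }

    // Step 2: exhaust the list when building the next metadata request.
    // An empty result means another thread already drained it, so the caller
    // skips sending and simply waits for the in-flight refresh to return.
    synchronized List<String> drain() {
        List<String> batch = new ArrayList<>(unknownTopics);
        unknownTopics.clear();
        return batch;
    }
}
```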
On Tue, Jan 21, 2020 at 9:39 AM Brian Byrne <bby...@confluent.io> wrote:

Hi Guozhang,

Your understanding of the rationale is accurate, and what you suggest is completely plausible; however, I have a slightly different take on the situation.

When the KIP was originally drafted, making KafkaProducer#send asynchronous was one element of the changes (this is a little more general than (a), but has similar implications). As you're aware, doing so would allow new topics to aggregate since the producer could continue to push new records, whereas today the producer thread is blocked waiting for resolution.

However, there were concerns about changing client behavior unexpectedly in this manner, and the change isn't as trivial as one would hope. For example, we'd have to introduce an intermediate queue of records for topics without metadata, and have that play well with the buffer pool which ensures the memory limit isn't exceeded. A side effect is that a producer could hit 'memory full' conditions more easily, which could have unintended consequences if, say, the model was set up such that different producer threads produced to a disjoint set of topics. Where one producer thread was blocked waiting for new metadata, it could now push enough data to block all producer threads due to memory limits, so we'd need to be careful here.

For case (a) described, another concern would be adding a new source of latency (possibly seconds) for new topics. Not a huge issue, but it is new behavior to existing clients and adds to the complexity of verifying no major regressions.

It also wouldn't resolve all cases we're interested in. One behavior we're witnessing is the following: a producer produces to a very large number of topics (several thousand), but the period between consecutive records for a topic can often be beyond the current hard-coded expiry of 5 minutes. Therefore, when the producer does submit a request for this topic after 5 minutes, it has to request all of the topic metadata again. While batching new topics in the start-up case would definitely help, here it would likely be lost effort.

Being able to increase the metadata eviction timeout for the above case would improve things, but there's no perfect value when consecutive produce times can be modelled as a probability distribution. Rather, the better solution could be what we discussed earlier, where we'd lower the eviction timeout, but make the cost of fetching uncached topic metadata much smaller.

The changes in the KIP were meant to improve the general case without affecting external client behavior, and then the plan was to fix the asynchronous send in the next iteration, if necessary. Point (b) is along the lines of the latest revision: only send requests for uncached topics, if they exist, otherwise request the full working set. Piggy-backing was originally included, but removed because its utility was in doubt.

So to summarize, you're correct in that asynchronous topic fetching would be a big improvement, and should be an item of future work. However, what this KIP proposes should be the safest/easiest set of changes to resolve the current pain points. Please let me know if you agree/disagree with this assessment.

Thanks,
Brian

On Mon, Jan 20, 2020 at 10:52 AM Guozhang Wang <wangg...@gmail.com> wrote:

Hello Brian,

I looked at the new proposal again, and I'd like to reason about its rationale from the listed motivations in your wiki:

1) More RPCs: we may send metadata requests more frequently than appropriate. This is especially the case during producer start-up, where the more topics it needs to send to, the more metadata requests it needs to send. This is the original reported issue in KAFKA-8904.

2) Large RPCs: we include all topics in the work set when sending a metadata request. But I think our conjecture (as Colin has pointed out) is that this alone is fine most of the time, assuming e.g. you are sending such a large RPC only once every 10 minutes. It is only because of 1), where you are sending large RPCs too frequently, that this becomes a common issue.

3) We want to have a configurable eviction period rather than hard-coded values. I consider it a semi-orthogonal motivation compared with 1) / 2), but we wanted to piggy-back this fix along with the KIP.
So from there, 1) and 2) do not contradict each other, since our belief is that a large RPC is usually okay as long as it is not large-and-frequent, and we actually prefer large-infrequent RPCs over small-frequent RPCs over large-and-frequent RPCs (of course).

The current proposal basically tries to untangle 2) from 1), i.e. for the scenario of KAFKA-8904 it would result in small-frequent RPCs during startup rather than large-and-frequent RPCs. But I'm wondering why we don't just do even better and make it large-infrequent RPCs? More specifically, just like Lucas suggested in the ticket:

a. When there's a new topic with unknown metadata enqueued, instead of requesting metadata immediately, just delay it for a short period (no more than seconds), hoping that more unknown topics would be requested in the period; during this period we would not know which partition the record would go to, of course, so we buffer it in a different manner.

b. When we are about to send a metadata request, if there are unknown topic(s) -- consider them "urgent topics" -- just send them without other topics; otherwise, send the work set in the request. If we want to go even fancier, we can still piggy-back some non-urgent topics along with urgent ones, but it is more complicated to reason about the trade-off, so a simpler approach is fine too.

c. Fixing 3) with a new config, which is relatively orthogonal to a) and b).

Guozhang

On Tue, Jan 14, 2020 at 10:39 AM Brian Byrne <bby...@confluent.io> wrote:

Hello all,

After further offline discussion, I've removed any efforts to control metadata RPC sizes. There are now only two items proposed in this KIP:

(1) When encountering a new topic, only issue a metadata request for that particular topic. For all other cases, continue as it does today with a full working set refresh.

(2) Introduce the client configuration flag "metadata.eviction.period.ms" to control the cache eviction duration. I've reset the default back to the current (hard-coded) value of 5 minutes since we can identify cases where changing it would cause surprises.

The votes have been cleared. My apologies for continually interrupting and making changes to the KIP, but hopefully this is an agreeable minimum solution to move forward.

Thanks,
Brian
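As a usage sketch only, the proposed flag would be supplied like any other producer setting; note that the eviction key name here is the one proposed at this point in the thread, and it is renamed later in the discussion (metadata.max.idle.ms):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerMetadataConfigExample {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Existing setting: how often cached metadata is refreshed (default 5 minutes).
        props.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, "300000");
        // Proposed setting (name as of this point in the thread; later renamed to
        // metadata.max.idle.ms): how long unused topic metadata is kept before eviction.
        props.put("metadata.eviction.period.ms", "300000");
        return props;
    }
}
```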
On Mon, Jan 6, 2020 at 5:23 PM Colin McCabe <cmcc...@apache.org> wrote:

On Mon, Jan 6, 2020, at 14:40, Brian Byrne wrote:
> So the performance of a metadata RPC that occurs once every 10 seconds should not be measured strictly in CPU cost, but rather by the effect on the 95-99%. The larger the request is, the more opportunity there is to put a burst stress on the producer and broker, and the larger the response payload to push through the control plane socket. Maybe that's not at 5k topics, but there are groups that are at 10k+ topics and pushing further.

KAFKA-7019 made reading the metadata lock-free. There is no a priori reason to prefer lots of small requests to a few big requests (within reason!). In fact, it's quite the opposite: when we make lots of small requests, it uses more network bandwidth than when we make a few big ones. There are a few reasons for this: the request and response headers have a fixed overhead, one big array takes less space when serialized than several small ones, etc. There is also TCP and IP overhead, etc.

The broker can only push a few tens of thousands of metadata requests a second, due to the overhead of message processing. This is why almost all of the admin commands support batching. So if you need to create 1,000 topics, you make one request, not 1,000 requests, for example.

It's definitely reasonable to limit the number of topics included per metadata request. But the reason for this is not improving performance, but preventing certain bad corner cases that happen when RPCs get too big. For example, one problem that can happen when a metadata response gets too big is that the client could time out before it finishes reading the response. Or if the response got way too big, it could even exceed the maximum response size.

So I think the limit should be pretty high here. We might also consider putting the limit in terms of number of partitions rather than number of topics, since that's what really matters here (this is harder to implement, I realize...). If I had to put a rough number on it, I'd say we don't want more than about 50 MB of response data. This is vaguely in line with how we do fetch responses as well (although I think the limit there is higher).
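For illustration only, a sketch of what a partition-based limit could look like (hypothetical helper, not an existing client API): topics are greedily packed into requests until an estimated partition budget is reached, instead of capping the topic count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: split a topic working set into metadata requests bounded by an
// estimated partition count rather than a fixed number of topics.
class MetadataRequestChunker {
    static List<List<String>> chunkByPartitionBudget(Map<String, Integer> partitionsByTopic,
                                                     int maxPartitionsPerRequest) {
        List<List<String>> requests = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int budget = 0;
        for (Map.Entry<String, Integer> e : partitionsByTopic.entrySet()) {
            int partitions = e.getValue();
            // Start a new request once adding this topic would exceed the budget.
            if (!current.isEmpty() && budget + partitions > maxPartitionsPerRequest) {
                requests.add(current);
                current = new ArrayList<>();
                budget = 0;
            }
            current.add(e.getKey());
            budget += partitions;
        }
        if (!current.isEmpty())
            requests.add(current);
        return requests;
    }
}
```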
We should also keep in mind that anyone with a wildcard subscription is making full metadata requests, which will return information about every topic in the system.

> There's definitely weight to the metadata RPCs. Looking at a previous local, non-loaded test I ran, I calculate about 2 microseconds per partition of latency to the producer. At 10,000 topics with 100 partitions each, that's a full 2-second bubble in the best case. I can rerun a more targeted performance test, but I feel that's missing the point.

If the metadata is fetched in the background, there should be no impact on producer latency, right?

It would be good to talk more about the importance of background metadata fetching in the KIP. The fact that we don't do this is actually a big problem with the current implementation. As I understand it, when the metadata gets too old, we slam on the brakes and wait for a metadata fetch to complete, rather than starting the metadata fetch BEFORE we need it. It's just bad scheduling.

best,
Colin

On Mon, Jan 6, 2020 at 1:31 PM Colin McCabe <cmcc...@apache.org> wrote:

On Mon, Jan 6, 2020, at 13:07, Brian Byrne wrote:
> Hi Colin,
>
> Thanks again for the feedback!
>
> On Mon, Jan 6, 2020 at 12:07 PM Colin McCabe <cmcc...@apache.org> wrote:
> > Metadata requests don't (always) go to the controller, right? We should fix the wording in this section.
>
> You're correct, s/controller/broker(s)/.
>
> > I feel like "Proposed Changes" should come before "Public Interfaces" here. The new configuration won't make sense to the reader until he or she has read the "changes" section. Also, it's not clear from the name that "metadata evict" refers to a span of time. What do you think about "metadata.eviction.period.ms" as a configuration name?
> Sure, makes sense. Updated order and config name.
>
> > Where is the "10 seconds" coming from here? The current default for metadata.max.age.ms is 5 * 60 * 1000 ms, which implies that we want to refresh every 5 minutes. Definitely not every 10 seconds.
>
> The 10 seconds is another arbitrary value to prevent a large number of RPCs if the target batch size were fixed at 20. For example, if there are 5,000 topics with a 5-minute interval, then instead of issuing an RPC every 1.2 seconds with a batch size of 20, it would issue an RPC every 10 seconds with a batch size of 167.
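To spell out the arithmetic behind those numbers (a worked example, not client code):

```java
public class RefreshRateMath {
    public static void main(String[] args) {
        int topics = 5_000;
        double refreshIntervalSec = 300.0;                  // 5-minute refresh interval
        double topicsPerSec = topics / refreshIntervalSec;  // ~16.7 topics must refresh per second

        double secondsPerRpcAtBatch20 = 20 / topicsPerSec;  // ~1.2 s between RPCs at batch size 20
        double batchSizeAt10Seconds = topicsPerSec * 10;    // ~167 topics per RPC at a 10 s cadence

        System.out.printf("RPC every %.1f s at batch size 20%n", secondsPerRpcAtBatch20);
        System.out.printf("Batch size of %.0f at one RPC every 10 s%n", batchSizeAt10Seconds);
    }
}
```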
Hmm. This will lead to many more RPCs compared to the current situation of issuing an RPC every 5 minutes with 5,000 topics, right? See below for more discussion.

> > Stepping back a little bit, it seems like the big problem you identified is the O(N^2) behavior of producing to X, then Y, then Z, etc., where each new produce to a fresh topic triggers a metadata request with all the preceding topics included.
> >
> > Of course we need to send out a metadata request before producing to X, then Y, then Z, but that request could just specify X, or just specify Y, etc. In other words, we could decouple the routine metadata fetch, which happens on a 5 minute timer, from the need to fetch metadata for something specific right now.
> >
> > I guess my question is, do we really need to allow routine metadata fetches to "piggyback" on the emergency metadata fetches? It adds a lot of complexity, and we don't have any benchmarks proving that it's better. Also, as I understand it, whether we piggyback or not, the number of metadata fetches is the same, right?
>
> So it's possible to do as you suggest, but I would argue that it'd be more complex with how the code is structured, and it wouldn't add any extra complexity. The derived metadata classes effectively respond to a notification that a metadata RPC is going to be issued, where they return the metadata request structure with topics to refresh, which is decoupled from what generated the event (new topic, stale metadata, periodic refresh alarm). There is also a strict implementation detail that only one metadata request can be outstanding at any time, which lends itself to consolidating complexity in the base metadata class and having the derived classes use the "listen for next update" model.
>
> By maintaining a list of topics ordered by their last metadata refresh time (0 for new topics), it's a matter of pulling from the front of the list to see which topics should be included in the next update. Always include all urgent topics, then include non-urgent topics (i.e. those that need to be refreshed soon-ish) up to the target batch size.
>
> The number of metadata fetches could be reduced. Assuming a target batch size of 20, a new topic might also refresh 19 other "refresh soon" topics in the same RPC, as opposed to those 19 being handled in a subsequent RPC.
>
> Although to counter the above, the batching/piggybacking logic isn't necessarily about reducing the total number of RPCs, but rather about distributing the load more evenly over time. For example, if a large number of topics need to be refreshed at approximately the same time (common for startup cases that hit a large number of topics), the updates are more evenly distributed to avoid a flood.
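A minimal sketch of that ordered-refresh selection (hypothetical structure, not the actual producer internals): topics are kept ordered by last refresh time, urgent topics always go out, and "refresh soon" topics top up the batch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: pick topics for the next metadata request from a queue ordered by
// last refresh time (0 for brand-new topics): urgent topics first, then
// soon-to-be-stale topics up to a target batch size.
class RefreshBatcher {
    static class TopicState {
        final String topic;
        final long lastRefreshMs;   // 0 means never refreshed, i.e. urgent
        TopicState(String topic, long lastRefreshMs) {
            this.topic = topic;
            this.lastRefreshMs = lastRefreshMs;
        }
    }

    static PriorityQueue<TopicState> newQueue() {
        return new PriorityQueue<>(Comparator.comparingLong((TopicState t) -> t.lastRefreshMs));
    }

    static List<String> nextBatch(PriorityQueue<TopicState> byLastRefresh,
                                  long nowMs, long staleThresholdMs, int targetBatchSize) {
        List<String> batch = new ArrayList<>();
        while (!byLastRefresh.isEmpty()) {
            TopicState head = byLastRefresh.peek();
            boolean urgent = head.lastRefreshMs == 0;
            boolean refreshSoon = nowMs - head.lastRefreshMs >= staleThresholdMs;
            // Always take urgent topics; take "refresh soon" topics only while
            // there is room left in the target batch size.
            if (urgent || (refreshSoon && batch.size() < targetBatchSize)) {
                batch.add(byLastRefresh.poll().topic);
            } else {
                break;
            }
        }
        return batch;
    }
}
```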
It wouldn't be a flood in the current case, right? It would just be a single metadata request for a lot of topics.

Let's compare the two cases. In the current scenario, we have 1 metadata request every 5 minutes. This request is for 5,000 topics (let's say). In the new scenario, we have a request every 10 seconds for 167 topics each.

Which do you think will be more expensive? I think the second scenario certainly will, because of the overhead of 30x as many requests sent over the wire. Metadata accesses are now lockless, so the big metadata request just isn't that much of a problem. I bet if you benchmark it, sending back metadata for 167 topics won't be that much cheaper than sending back metadata for 5k. Certainly not 30x cheaper. There will eventually be a point where we need to split metadata requests, but it's definitely not at 5,000 topics.

regards,
Colin

On Mon, Jan 6, 2020, at 10:26, Lucas Bradstreet wrote:

+1 (non-binding)

On Thu, Jan 2, 2020 at 11:15 AM Brian Byrne <bby...@confluent.io> wrote:

Hello all,

After further discussion and improvements, I'd like to reinstate the voting process.

The updated KIP:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-526%3A+Reduce+Producer+Metadata+Lookups+for+Large+Number+of+Topics

The continued discussion:
https://lists.apache.org/thread.html/b2f8f830ef04587144cf0840c7d4811bbf0a14f3c459723dbc5acf9e@%3Cdev.kafka.apache.org%3E

I'd be happy to address any further comments/feedback.
Thanks,
Brian

On Mon, Dec 9, 2019 at 11:02 PM Guozhang Wang <wangg...@gmail.com> wrote:

With the concluded summary on the other discussion thread, I'm +1 on the proposal.

Thanks Brian!

On Tue, Nov 19, 2019 at 8:00 PM deng ziming <dengziming1...@gmail.com> wrote:

> For new (uncached) topics, one problem here is that we don't know which partition to map a record to in the event that it has a key or custom partitioner, so the RecordAccumulator wouldn't know which batch/broker it belongs to. We'd need an intermediate record queue that subsequently moved the records into RecordAccumulators once metadata resolution was complete. For known topics, we don't currently block at all in waitOnMetadata.

You are right, I forgot this fact, and the intermediate record queue will help, but I have some questions.

If we add an intermediate record queue in KafkaProducer, when should we move the records into RecordAccumulators? Only NetworkClient is aware of the MetadataResponse. Here is the hierarchical structure of the related classes:

KafkaProducer
Accumulator
Sender
NetworkClient
metadataUpdater.handleCompletedMetadataResponse

So:
1. should we also add a metadataUpdater to KafkaProducer?
2. what if the topic really does not exist? Will the intermediate record queue become too large?
3. should we `block` when the intermediate record queue is too large? And will this again bring back the blocking problem?
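A rough sketch of the intermediate queue being discussed (hypothetical, simplified types; not an existing KafkaProducer facility): records for topics with unresolved metadata are parked per topic and handed over once the metadata response for that topic is processed.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.BiConsumer;

// Sketch: park records whose topic has no cached metadata, then flush them to
// the accumulator once the MetadataResponse covering that topic arrives.
class PendingRecordQueue<R> {
    private final Map<String, Queue<R>> pendingByTopic = new HashMap<>();

    synchronized void park(String topic, R record) {
        pendingByTopic.computeIfAbsent(topic, t -> new ArrayDeque<>()).add(record);
    }

    // Invoked when metadata for 'topic' has been resolved; 'appendToAccumulator'
    // stands in for the partition-then-append step that needs the metadata.
    synchronized void onMetadataResolved(String topic, BiConsumer<String, R> appendToAccumulator) {
        Queue<R> parked = pendingByTopic.remove(topic);
        if (parked == null)
            return;
        for (R record : parked)
            appendToAccumulator.accept(topic, record);
    }

    // Question 3 above is essentially about whether to cap this number and
    // block (or fail) the send path when the cap is hit.
    synchronized int parkedCount() {
        int total = 0;
        for (Queue<R> q : pendingByTopic.values())
            total += q.size();
        return total;
    }
}
```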
On Wed, Nov 20, 2019 at 12:40 AM Brian Byrne <bby...@confluent.io> wrote:

Hi Deng,

Thanks for the feedback.

On Mon, Nov 18, 2019 at 6:56 PM deng ziming <dengziming1...@gmail.com> wrote:
> hi, I reviewed the current code. The ProducerMetadata maintains an expiry threshold for every topic, and every time we write to a topic we set the expiry time to -1 to indicate it should be updated. This does work to reduce the size of the topic working set, but the producer will continue fetching metadata for these topics in every metadata request for the full expiry duration.

Indeed, you are correct, I terribly misread the code here. Fortunately this was only a minor optimization in the KIP that's no longer necessary.
we maintain a refresh threshold > > for > > > > > every > > > > > > > > topic > > > > > > > > > > > which > > > > > > > > > > > > > is for > > > > > > > > > > > > > > > > > > example > > > > > > > > > > > > > > > > > > > 0.8 * expiry_threshold, and when we > send > > > > > > > > > `MetadataRequest` > > > > > > > > > > > to > > > > > > > > > > > > > > > brokers > > > > > > > > > > > > > > > > > we > > > > > > > > > > > > > > > > > > > just request unknownLeaderTopics + > > > > > > > > > unknownPartitionTopics + > > > > > > > > > > > > > topics > > > > > > > > > > > > > > > > > > > reach refresh threshold. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Right, this is similar to what I > suggested, > > > > with > > > > > a > > > > > > > > larger > > > > > > > > > > > window > > > > > > > > > > > > > on > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > "staleness" that permits for batching to > an > > > > > > > appropriate > > > > > > > > > size > > > > > > > > > > > > > (except > > > > > > > > > > > > > > > if > > > > > > > > > > > > > > > > > > there's any unknown topics, you'd want to > > > issue > > > > > the > > > > > > > > > request > > > > > > > > > > > > > > > > immediately). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. we don't invoke > > > > > > KafkaProducer#waitOnMetadata > > > > > > > > > when we > > > > > > > > > > > > > call > > > > > > > > > > > > > > > > > > > KafkaProducer#send because of we just > > send > > > > data > > > > > > to > > > > > > > > > > > > > > > RecordAccumulator, > > > > > > > > > > > > > > > > > and > > > > > > > > > > > > > > > > > > > before we send data to brokers we will > > > invoke > > > > > > > > > > > > > > > > > RecordAccumulator#ready(), > > > > > > > > > > > > > > > > > > so > > > > > > > > > > > > > > > > > > > we can only invoke waitOnMetadata to > > block > > > > when > > > > > > > > (number > > > > > > > > > > > topics > > > > > > > > > > > > > > > > > > > reach refresh threshold)>(number of all > > > known > > > > > > > > > topics)*0.2. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > For new (uncached) topics, one problem > here > > > is > > > > > that > > > > > > > we > > > > > > > > > don't > > > > > > > > > > > know > > > > > > > > > > > > > > > which > > > > > > > > > > > > > > > > > > partition to map a record to in the event > > > that > > > > it > > > > > > > has a > > > > > > > > > key > > > > > > > > > > > or > > > > > > > > > > > > > custom > > > > > > > > > > > > > > > > > > partitioner, so the RecordAccumulator > > > wouldn't > > > > > know > > > > > > > > which > > > > > > > > > > > > > > > batch/broker > > > > > > > > > > > > > > > > it > > > > > > > > > > > > > > > > > > belongs. We'd need an intermediate record > > > queue > > > > > > that > > > > > > > > > > > subsequently > > > > > > > > > > > > > > > moved > > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > records into RecordAccumulators once > > metadata > > > > > > > > resolution > > > > > > > > > was > > > > > > > > > > > > > > > complete. > > > > > > > > > > > > > > > > > For > > > > > > > > > > > > > > > > > > known topics, we don't currently block at > > all > > > > in > > > > > > > > > > > waitOnMetadata. 
The last major point of minimizing producer startup metadata RPCs may still need to be improved, but this would be a large improvement on the current situation.

Thanks,
Brian

> I think the above 2 ways are enough to solve the current problem.
>
> On Tue, Nov 19, 2019 at 3:20 AM Colin McCabe <cmcc...@apache.org> wrote:

On Mon, Nov 18, 2019, at 10:05, Brian Byrne wrote:
> On Fri, Nov 15, 2019 at 5:08 PM Colin McCabe <cmcc...@apache.org> wrote:
> > Two seconds doesn't seem like a reasonable amount of time to leave for the metadata fetch. Fetching halfway through the expiration period seems more reasonable. It also doesn't require us to create a new configuration key, which is nice.
> >
> > Another option is to just do the metadata fetch every metadata.max.age.ms, but not expire the topic until we can't fetch the metadata for 2 * metadata.max.age.ms.
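A tiny sketch combining the two ideas above (illustrative constants only, not a real client policy): refresh once half of metadata.max.age.ms has passed, and only treat a topic's metadata as expired after twice that long without a successful fetch.

```java
// Sketch of "fetch halfway through the expiration period, expire at 2x":
class MetadataAgePolicy {
    static final long METADATA_MAX_AGE_MS = 5 * 60 * 1000L;  // current default, 5 minutes

    static boolean shouldRefresh(long nowMs, long lastSuccessfulFetchMs) {
        // Start renewing well before expiry so a slow fetch doesn't lose the topic.
        return nowMs - lastSuccessfulFetchMs >= METADATA_MAX_AGE_MS / 2;
    }

    static boolean shouldExpire(long nowMs, long lastSuccessfulFetchMs) {
        // Only drop the topic after a long stretch without any successful fetch.
        return nowMs - lastSuccessfulFetchMs >= 2 * METADATA_MAX_AGE_MS;
    }
}
```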
> I'd expect two seconds to be reasonable in the common case. Keep in mind that this doesn't affect correctness, and a control operation returning cached metadata should be on the order of milliseconds.

Hi Brian,

Thanks again for the KIP.

I think the issue here is not the common case, but the uncommon case where the metadata fetch takes longer than expected. In that case, we don't want to be in the position of having our metadata expire because we waited too long to renew it.

This is one reason why I think that the metadata expiration time should be longer than the metadata refresh time. In fact, it might be worth having two separate configuration keys for these two values. I could imagine a user who is having trouble with metadata expiration wanting to increase the metadata expiration time, but without increasing the metadata refresh period. In a sense, the metadata expiration time is like the ZK session expiration time. You might want to turn it up if the cluster is experiencing load spikes.
> But to the general point, defining the algorithm would mean enforcing it to fair accuracy, whereas if the suggestion is that it'll be performed at a reasonable time, it allows for batching and other optimizations. Perhaps I shouldn't be regarding what's defined in a KIP as contractual in these cases, but you could consider a first implementation to collect topics whose metadata has exceeded (metadata.max.age.ms / 2), and send the batch once a constituent topic's metadata is near expiry, or a sufficient number of topics have been collected (10? 100? 1000?).

I'm concerned that if we change the metadata caching strategy without discussing it first, it may improve certain workloads but make others worse. We need to be concrete about what the proposed strategy is so that we can really evaluate it.

> > We should be specific about what happens if the first few metadata fetches fail. Do we use exponential backoff to decide when to resend?
> > It seems like we really should, for all the usual reasons (reduce the load on brokers, ride out temporary service disruptions, etc.). Maybe we could have an exponential retry backoff for each broker (in other words, we should try to contact a different broker before applying the backoff). I think this already sort of happens with the disconnect timeout, but we might need a more general solution.
>
> I don't plan to change this behavior. Currently it retries after a fixed value of 'retry.backoff.ms' (defaults to 100 ms). It's possible that different brokers are attempted, but I haven't dug into it.

I think it's critical to understand what the current behavior is before we try to change it. The difference between retrying the same broker and trying a different one has a large impact on cluster load and latency. For what it's worth, I believe the behavior is the second one, but it has been a while since I checked. Let's figure this out.
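A sketch of the per-broker exponential backoff idea raised above (hypothetical helper, not the NetworkClient implementation): each broker keeps its own failure count, so the caller can try a different broker while a failed one is still backing off.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Sketch: exponential retry backoff tracked per broker, so a different broker
// can be contacted before the backoff on a failed one has elapsed.
class PerBrokerBackoff {
    private static final long BASE_BACKOFF_MS = 100;   // same order as retry.backoff.ms
    private static final long MAX_BACKOFF_MS = 10_000;

    private static class State {
        int failures = 0;
        long nextAttemptMs = 0;
    }

    private final Map<Integer, State> states = new HashMap<>();

    synchronized boolean canSend(int brokerId, long nowMs) {
        return nowMs >= states.computeIfAbsent(brokerId, id -> new State()).nextAttemptMs;
    }

    synchronized void onFailure(int brokerId, long nowMs) {
        State s = states.computeIfAbsent(brokerId, id -> new State());
        s.failures++;
        long backoff = Math.min(MAX_BACKOFF_MS, BASE_BACKOFF_MS << Math.min(s.failures, 6));
        // Add jitter so many clients don't retry in lockstep.
        backoff += ThreadLocalRandom.current().nextLong(backoff / 2 + 1);
        s.nextAttemptMs = nowMs + backoff;
    }

    synchronized void onSuccess(int brokerId) {
        states.remove(brokerId);
    }
}
```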
> > Thanks for the clarification. Fully asynchronous is the way to go, I agree. I'm having trouble understanding how timeouts are handled in the KIP. It seems like if we can't fetch the metadata within the designated metadata timeout, the future / callback should receive a TimeoutException, right? We do not want the send call to be deferred forever if metadata can't be fetched. Eventually it should fail if it can't be performed.
> >
> > I do think this is something that will have to be mentioned in the compatibility section. There is some code out there that is probably prepared to handle a timeout exception from the send function, which may need to be updated to check for a timeout from the future or callback.
>
> Correct, a timeout exception would be delivered in the future. Sure, I can add that note to the KIP.

Thanks.

best,
Colin
> It seems like this is an existing problem. You may fire off a lot of send
> calls that get blocked because the broker that is the leader for a certain
> partition is not responding. I'm not sure that we need to do anything
> special here. On the other hand, we could make the case for a generic "max
> number of outstanding sends" configuration to prevent surprise OOMs in the
> existing cases, plus the new one we're adding. But this feels like a bit
> of a scope expansion.

Right, this is an existing problem, however the asynchronous send could
cause unexpected behavior. For example, if a client pinned topics/partitions
to individual send threads, then memory couldn't be exhausted by a single
topic since a blocking send would prevent further records from being
buffered on that topic. The compromise could be that we only ever permit one
outstanding record batch for a topic, which would keep the code simple and
wouldn't permit a single topic to consume all available memory.
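For illustration, a "max number of outstanding sends" cap could also be
approximated today at the application level with a semaphore around send().
The class and the limit below are hypothetical, not a proposed producer
configuration.

    import java.util.concurrent.Semaphore;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    class BoundedSender {
        private final KafkaProducer<String, String> producer;
        private final Semaphore permits;

        BoundedSender(KafkaProducer<String, String> producer, int maxOutstandingSends) {
            this.producer = producer;
            this.permits = new Semaphore(maxOutstandingSends);
        }

        void send(ProducerRecord<String, String> record) throws InterruptedException {
            permits.acquire();  // blocks once the cap on in-flight sends is reached
            producer.send(record, (metadata, exception) -> permits.release());
        }
    }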
> They may be connected, but I'm not sure they should be the same. Perhaps
> expiry should be 4x the typical fetch rate, for example.

That's true. You could also make the case for a faster expiry than refresh
as well. Makes sense to separate this out.

> Hmm.... are you sure this is an N^2 problem? If you have T1 and T2, you
> fetch metadata for T1 and T2, right? I guess you could argue that we often
> fetch metadata for partitions we don't care about, but that doesn't make
> it O(N^2). Maybe there's something about the implementation that I'm
> missing.
My apologies, I left out the context. One issue the KIP is trying to resolve
is the metadata storm that's caused by producers starting up. In the worst
case, where the send call is only performed from a single thread (i.e. no
possible batching), fetching metadata for 1K topics will generate 1K RPCs,
with payload 1+2+...+1K topics' metadata. Being smart about the topics being
refreshed would still generate 1K RPCs for 1 topic each, and asynchronous
behavior would permit batching (note steady-state refreshing doesn't require
the asynchronous behavior to batch).
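To put a rough number on that worst case (illustrative arithmetic, not a
measurement from the KIP): resolving N = 1,000 topics one at a time returns

    1 + 2 + ... + 1000 = 1000 * 1001 / 2 = 500,500

topic entries of metadata across those RPCs, which is where the quadratic
payload growth comes from even though the number of RPCs is only linear in
the number of topics.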
> In general, we need to take advantage of batching to do this well (one
> reason why I think we should steer clear of ultra-granular timeout
> tracking). It's best to do 1 RPC asking for 10 topics, not 10 RPCs asking
> for a single topic each, even if that means some of the topic timeouts are
> not *exactly* aligned with the configured value.

Absolutely, agreed.

Thanks,
Brian

best,
Colin

Thanks,
Brian

On Mon, Nov 11, 2019 at 11:47 AM Colin McCabe <cmcc...@apache.org> wrote:

Hi Brian,

Thanks for the KIP.

Starting the metadata fetch before we need the result is definitely a great
idea. This way, the metadata fetch can be done in parallel with all the
other stuff the producer is doing, rather than forcing the producer to
periodically come to a halt while metadata is fetched.

Maybe I missed it, but there seemed to be some details missing here. When do
we start the metadata fetch? For example, if topic metadata expires every 5
minutes, perhaps we should wait 4 minutes, then start fetching the new
metadata, which we would expect to arrive by the 5 minute deadline.
Or perhaps we should start the fetch even earlier, around the 2.5 minute
mark. In any case, there should be some discussion about what the actual
policy is. Given that metadata.max.age.ms is configurable, maybe that policy
needs to be expressed in terms of a percentage of the refresh period rather
than in terms of an absolute delay.

The KIP correctly points out that the current metadata fetching policy
causes us to "[block] in a function that's advertised as asynchronous."
However, the KIP doesn't seem to spell out whether we will continue to block
if metadata can't be found, or if this will be abolished. Clearly, starting
the metadata fetch early will reduce blocking in the common case, but will
there still be blocking in the uncommon case where the early fetch doesn't
succeed in time?
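One way to express the policy being asked about is as a fraction of
metadata.max.age.ms rather than an absolute delay. A minimal sketch,
assuming a made-up fraction of 0.8 (i.e. start refreshing at 4 minutes of a
5 minute max age):

    class ProactiveRefreshPolicy {
        // Hypothetical constant; the actual fraction would need to be decided.
        private static final double REFRESH_FRACTION = 0.8;

        // True once enough of the metadata age has elapsed to begin a refresh.
        static boolean shouldStartRefresh(long lastRefreshMs, long metadataMaxAgeMs, long nowMs) {
            long refreshAtMs = lastRefreshMs + (long) (metadataMaxAgeMs * REFRESH_FRACTION);
            return nowMs >= refreshAtMs;
        }
    }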
> To address (2), the producer currently maintains an expiry threshold for
> every topic, which is used to remove a topic from the working set at a
> future time (currently hard-coded to 5 minutes, this should be modified to
> use metadata.max.age.ms). While this does work to reduce the size of the
> topic working set, the producer will continue fetching metadata for these
> topics in every metadata request for the full expiry duration. This logic
> can be made more intelligent by managing the expiry from when the topic
> was last used, enabling the expiry duration to be reduced to improve cases
> where a large number of topics are touched intermittently.
Can you clarify this part a bit? It seems like we have a metadata expiration
policy now for topics, and we will have one after this KIP, but they will be
somewhat different. But it's not clear to me what the differences are.

In general, if load is a problem, we should probably consider adding some
kind of jitter on the client side. There are definitely cases where people
start up a lot of clients at the same time in parallel and there is a
thundering herd problem with metadata updates. Adding jitter would spread
the load across time rather than creating a spike every 5 minutes in this
case.

best,
Colin
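The jitter suggestion could be as simple as randomizing each client's
refresh interval around metadata.max.age.ms. A sketch, assuming an arbitrary
+/- 20% spread:

    import java.util.concurrent.ThreadLocalRandom;

    class RefreshJitter {
        // Returns metadata.max.age.ms scaled by a random factor in [0.8, 1.2),
        // so a fleet of producers started together won't refresh in lockstep.
        static long jitteredRefreshIntervalMs(long metadataMaxAgeMs) {
            double factor = 0.8 + ThreadLocalRandom.current().nextDouble() * 0.4;
            return (long) (metadataMaxAgeMs * factor);
        }
    }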
On Fri, Nov 8, 2019, at 08:59, Ismael Juma wrote:

I think this KIP affects when we block, which is actually user-visible
behavior. Right?

Ismael

On Fri, Nov 8, 2019, 8:50 AM Brian Byrne <bby...@confluent.io> wrote:

Hi Guozhang,

Regarding metadata expiry, no access times other than the initial lookup[1]
are used for determining when to expire producer metadata. Therefore,
frequently used topics' metadata will be aged out and subsequently refreshed
(in a blocking manner) every five minutes, and infrequently used topics will
be retained for a minimum of five minutes and currently refetched on every
metadata update during that time period. The sentence is suggesting that we
could reduce the expiry time to improve the latter without affecting (rather
slightly improving) the former.

Keep in mind that in almost all cases, I wouldn't anticipate much of a
difference with producer behavior, and the extra logic can be implemented to
have insignificant cost. It's the large/dynamic topic corner cases that
we're trying to improve.
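The "expiry from when the topic was last used" idea could be tracked with a
per-topic last-access timestamp, along these lines. This is a rough sketch
with hypothetical names, not the producer's actual metadata implementation:

    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class TopicExpiryTracker {
        private final long expiryMs;   // e.g. derived from metadata.max.age.ms
        private final Map<String, Long> lastUsedMs = new ConcurrentHashMap<>();

        TopicExpiryTracker(long expiryMs) {
            this.expiryMs = expiryMs;
        }

        // Called on every send()/metadata access for the topic.
        void recordUse(String topic, long nowMs) {
            lastUsedMs.put(topic, nowMs);
        }

        // Drop topics that have not been used recently from the working set,
        // so they stop being included in metadata requests.
        void expireStaleTopics(long nowMs) {
            Iterator<Map.Entry<String, Long>> it = lastUsedMs.entrySet().iterator();
            while (it.hasNext()) {
                if (nowMs - it.next().getValue() > expiryMs) {
                    it.remove();
                }
            }
        }
    }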
It'd be convenient if the KIP is no longer necessary. You're right in that
there are no public API changes and the behavioral changes are entirely
internal. I'd be happy to continue the discussion around the KIP, but unless
otherwise objected, it can be retired.

[1] Not entirely accurate: it's actually the first time when the client
calculates whether to retain the topic in its metadata.
Thanks,
Brian

On Thu, Nov 7, 2019 at 4:48 PM Guozhang Wang <wangg...@gmail.com> wrote:

Hello Brian,

Could you elaborate a bit more on this sentence: "This logic can be made
more intelligent by managing the expiry from when the topic was last used,
enabling the expiry duration to be reduced to improve cases where a large
number of topics are touched intermittently." Not sure I fully understand
the proposal.

Also since now this KIP did not make any public API changes and the
behavioral changes are not considered a public API contract (i.e.
how we maintain the topic metadata in the producer cache is never committed
to users), I wonder if we still need a KIP for the proposed change any more?

Guozhang

On Thu, Nov 7, 2019 at 12:43 PM Brian Byrne <bby...@confluent.io> wrote:

Hello all,

I'd like to propose a vote for a producer change to improve producer
behavior when dealing with a large number of topics, in part by reducing the
amount of metadata fetching performed.
The full KIP is provided here:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-526%3A+Reduce+Producer+Metadata+Lookups+for+Large+Number+of+Topics

And the discussion thread:

https://lists.apache.org/thread.html/b2f8f830ef04587144cf0840c7d4811bbf0a14f3c459723dbc5acf9e@%3Cdev.kafka.apache.org%3E

Thanks,
Brian

-- 
-- Guozhang