Adding to Crispin. The new micro-benchmark shows improvements of 88% in p99.9 with the KIP changes Vs baseline(& rejected alternate). Its hypothesised improvements are seen as KIP avoids a full metadata refresh(Vs baseline/rejected alternate), and the full metadata refresh can be slow due to metadata reconvergence delay at the server(post leadership-change of partitions). Extending this logic, KIP changes would be beneficial in scenarios where full metadata refresh can be slow. Potential example would be, metadata RPC is slowed due to head-of-line blocking by another slow RPC in front, say a produce RPC(possibly slow due to churn in the ISR).
This new benchmark is in addition to the previously done benchmark of roll simulation, where improvements upto 5% were seen. Please do a review, as the voting thread is live. Thanks! On Wed, Sep 20, 2023 at 4:43 PM Crispin Bernier <cbern...@confluent.io.invalid> wrote: > Hi, > > I've updated the KIP with benchmark results focusing more directly on the > redirect case, please review and +1 in the voting thread if convinced. > > Thank you, > Crispin > > On Wed, Jul 26, 2023 at 11:13 PM Luke Chen <show...@gmail.com> wrote: > > > Thanks for adding the benchmark results, Crispin! > > IMO, 2~5% performance improvement is small, but given the change is > small, > > cost is also small (only append endpoint info when > NOT_LEADER_OR_FOLLOWER.. > > etc error), I think it is worth doing it. > > > > Thank you. > > Luke > > > > On Wed, Jul 26, 2023 at 12:33 AM Ismael Juma <m...@ismaeljuma.com> wrote: > > > > > Thanks Crispin! > > > > > > Ismael > > > > > > On Mon, Jul 24, 2023 at 8:16 PM Crispin Bernier > > > <cbern...@confluent.io.invalid> wrote: > > > > > > > I updated the wiki to include both results along with their average. > > > > > > > > Thank you, > > > > Crispin > > > > > > > > On Mon, Jul 24, 2023 at 10:58 AM Ismael Juma <m...@ismaeljuma.com> > > wrote: > > > > > > > > > Hi Crispin, > > > > > > > > > > One additional question, the wiki says "The results are averaged > > over 2 > > > > > runs.". Can you please provide some measure of variance in the > > > > > distribution, i.e. were both results similar to each other for both > > > > cases? > > > > > > > > > > Ismael > > > > > > > > > > On Fri, Jul 21, 2023 at 11:31 AM Ismael Juma <m...@ismaeljuma.com> > > > wrote: > > > > > > > > > > > Thanks for the update Crispin - very helpful to have actual > > > performance > > > > > > data. 2-5% for the default configuration is a bit on the low side > > for > > > > > this > > > > > > kind of proposal. > > > > > > > > > > > > Ismael > > > > > > > > > > > > On Thu, Jul 20, 2023 at 11:33 PM Crispin Bernier > > > > > > <cbern...@confluent.io.invalid> wrote: > > > > > > > > > > > >> Benchmark numbers have been posted on the KIP, please review. > > > > > >> > > > > > >> On 2023/07/20 13:03:00 Mayank Shekhar Narula wrote: > > > > > >> > Jun > > > > > >> > > > > > > >> > Thanks for the feedback. > > > > > >> > > > > > > >> > Numbers to follow. > > > > > >> > > > > > > >> > If we don't plan to > > > > > >> > > bump up the FetchResponse version, we could just remove the > > > > > reference > > > > > >> to > > > > > >> > > version 16. > > > > > >> > > > > > > >> > Fixed. > > > > > >> > > > > > > >> > On Thu, Jul 20, 2023 at 1:28 AM Jun Rao > > > <ju...@confluent.io.invalid > > > > > > > > > > >> wrote: > > > > > >> > > > > > > >> > > Hi, Mayank, > > > > > >> > > > > > > > >> > > Thanks for the KIP. I agree with others that it would be > > useful > > > to > > > > > >> see the > > > > > >> > > performance results. Otherwise, just a minor comment. If we > > > don't > > > > > >> plan to > > > > > >> > > bump up the FetchResponse version, we could just remove the > > > > > reference > > > > > >> to > > > > > >> > > version 16. > > > > > >> > > > > > > > >> > > Jun > > > > > >> > > > > > > > >> > > On Wed, Jul 19, 2023 at 2:31 PM Mayank Shekhar Narula < > > > > > >> > > mayanks.nar...@gmail.com> wrote: > > > > > >> > > > > > > > >> > > > Luke > > > > > >> > > > > > > > > >> > > > Thanks for the interest in the KIP. > > > > > >> > > > > > > > > >> > > > But what if the consumer was fetching from the follower? > > > > > >> > > > > > > > > >> > > > We already include `PreferredReadReplica` in the fetch > > > response. > > > > > >> > > > > Should we put the node info of PreferredReadReplica > under > > > this > > > > > >> case, > > > > > >> > > > > instead of the leader's info? > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > PreferredReadReplica is the decided on the leader. Looking > > at > > > > the > > > > > >> Java > > > > > >> > > > client code, AbstractFetch::selectReadReplica, first fetch > > > > request > > > > > >> goes > > > > > >> > > to > > > > > >> > > > Leader of the partition -> Sends back PreferredReadReplica > > -> > > > > Next > > > > > >> fetch > > > > > >> > > > uses PreferredReadReplica. So as long as leader is > > available, > > > > > >> > > > PreferredReadReplica would be found in subsequent fetches. > > > > > >> > > > > > > > > >> > > > Also, under this case, should we include the leader's info > > in > > > > the > > > > > >> > > response? > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > In this case, I think the follower would fail the fetch if > > it > > > > > knows > > > > > >> a > > > > > >> > > > different leader. If the follower knows a newer leader, it > > > would > > > > > >> return > > > > > >> > > new > > > > > >> > > > leader information in the response, for the client to act > > on. > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > Will we include the leader/node info in the response when > > > having > > > > > >> > > > > `UNKNOWN_LEADER_EPOCH` error? > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > My understanding is UNKNOWN_LEADER_EPOCH when a request > > from a > > > > > >> client > > > > > >> > > has a > > > > > >> > > > newer epoch than the broker. So the client is already up > to > > > date > > > > > on > > > > > >> new > > > > > >> > > > leader information, it's the broker that has the catching > up > > > to > > > > > do. > > > > > >> I > > > > > >> > > think > > > > > >> > > > there might be some optimisations to make sure the broker > > > > > refreshes > > > > > >> its > > > > > >> > > > metadata quickly, so it can quickly recover to handle > > requests > > > > > that > > > > > >> > > > previously returned UNKNOWN_LEADER_EPOCH. But this work is > > > > outside > > > > > >> the > > > > > >> > > > scope of this KIP, as for now this KIP focusses on > > client-side > > > > > >> > > > optimisations. > > > > > >> > > > > > > > > >> > > > Mayank > > > > > >> > > > > > > > > >> > > > On Tue, Jul 18, 2023 at 8:51 AM Luke Chen < > sh...@gmail.com> > > > > > wrote: > > > > > >> > > > > > > > > >> > > > > Hi Mayank, > > > > > >> > > > > > > > > > >> > > > > Thanks for the KIP! > > > > > >> > > > > > > > > > >> > > > > Some questions: > > > > > >> > > > > 1. I can see most of the cases we only care about > consumer > > > > fetch > > > > > >> from > > > > > >> > > the > > > > > >> > > > > leader. > > > > > >> > > > > But what if the consumer was fetching from the follower? > > > > > >> > > > > We already include `PreferredReadReplica` in the fetch > > > > response. > > > > > >> > > > > Should we put the node info of PreferredReadReplica > under > > > this > > > > > >> case, > > > > > >> > > > > instead of the leader's info? > > > > > >> > > > > Also, under this case, should we include the leader's > info > > > in > > > > > the > > > > > >> > > > response? > > > > > >> > > > > > > > > > >> > > > > 2. Will we include the leader/node info in the response > > when > > > > > >> having > > > > > >> > > > > `UNKNOWN_LEADER_EPOCH` error? > > > > > >> > > > > I think it's fine we ignore the `UNKNOWN_LEADER_EPOCH` > > error > > > > > >> since when > > > > > >> > > > > this happens, the node might have some error which > should > > > > > refresh > > > > > >> the > > > > > >> > > > > metadata. On the other hand, it might also be good if we > > can > > > > > heal > > > > > >> the > > > > > >> > > > node > > > > > >> > > > > soon to do produce/consume works. > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > Thank you. > > > > > >> > > > > Luke > > > > > >> > > > > > > > > > >> > > > > On Tue, Jul 18, 2023 at 2:00 AM Philip Nee < > > ph...@gmail.com > > > > > > > > > >> > > wrote: > > > > > >> > > > > > > > > > >> > > > > > Hey Mayank: > > > > > >> > > > > > > > > > > >> > > > > > For #1: I think fetch and produce behave a bit > > differently > > > > on > > > > > >> > > metadata. > > > > > >> > > > > > Maybe it is worth highlighting the changes for each > > client > > > > in > > > > > >> detail. > > > > > >> > > > In > > > > > >> > > > > > producer did you mean by the metadata timeout before > > > sending > > > > > out > > > > > >> > > > produce > > > > > >> > > > > > requests? For consumer: I think for fetches it > requires > > > user > > > > > to > > > > > >> retry > > > > > >> > > > if > > > > > >> > > > > > the position does not exist on the leader. I don't > have > > > the > > > > > >> detail on > > > > > >> > > > top > > > > > >> > > > > > of my head, but I think we should lay out these > > behavioral > > > > > >> changes. > > > > > >> > > > > > > > > > > >> > > > > > For #3: Thanks for the clarification. > > > > > >> > > > > > > > > > > >> > > > > > On Mon, Jul 17, 2023 at 10:39 AM Mayank Shekhar > Narula < > > > > > >> > > > > > mayanks.nar...@gmail.com> wrote: > > > > > >> > > > > > > > > > > >> > > > > > > Philip > > > > > >> > > > > > > > > > > > >> > > > > > > 1. Good call out about "poll" behaviour, my > > > understanding > > > > is > > > > > >> the > > > > > >> > > > same. > > > > > >> > > > > I > > > > > >> > > > > > am > > > > > >> > > > > > > assuming it's about the motivation of the KIP. There > > > with > > > > > >> async, my > > > > > >> > > > > > > intention was to convey that the client doesn't wait > > for > > > > the > > > > > >> > > > > > > metadata-refresh before a subsequent retry of the > > > produce > > > > or > > > > > >> fetch > > > > > >> > > > > > request > > > > > >> > > > > > > that failed due to stale metadata(i.e. going to an > old > > > > > >> leader). The > > > > > >> > > > > only > > > > > >> > > > > > > wait client has is the configured retry-delay. > > > > > >> > > > > > > > > > > > >> > > > > > > 2. Yes, in theory other APIs could benefit from this > > > too. > > > > > But > > > > > >> that > > > > > >> > > is > > > > > >> > > > > > > outside of the scope of the KIP. > > > > > >> > > > > > > > > > > > >> > > > > > > 3. Do you mean the response for the Metadata RPC? I > > > think > > > > > >> brokers > > > > > >> > > > > always > > > > > >> > > > > > > have a view of the cluster, although it can be > > stale,it > > > > > would > > > > > >> > > always > > > > > >> > > > > > return > > > > > >> > > > > > > a leader(whether old or new). > > > > > >> > > > > > > > > > > > >> > > > > > > Mayank > > > > > >> > > > > > > > > > > > >> > > > > > > On Fri, Jul 14, 2023 at 8:53 PM Philip Nee < > > > > ph...@gmail.com > > > > > > > > > > > >> > > > > wrote: > > > > > >> > > > > > > > > > > > >> > > > > > > > Hey Mayank, > > > > > >> > > > > > > > > > > > > >> > > > > > > > Thanks for the KIP. I think this is a great > > proposal, > > > > and > > > > > >> I'm in > > > > > >> > > > > favor > > > > > >> > > > > > > > of this idea. A few comments: > > > > > >> > > > > > > > > > > > > >> > > > > > > > 1. Claiming metadata refresh is done > asynchronously > > is > > > > > >> > > misleading. > > > > > >> > > > > The > > > > > >> > > > > > > > metadata refresh requires Network Client to be > > > > physically > > > > > >> polled, > > > > > >> > > > > which > > > > > >> > > > > > > is > > > > > >> > > > > > > > done in a separate thread in Producer and Admin > > Client > > > > > >> (IIRC!) > > > > > >> > > but > > > > > >> > > > > not > > > > > >> > > > > > > > Consumer. > > > > > >> > > > > > > > 2. There are other API calls that might result in > > > > > >> > > > > > NOT_LEADER_OR_FOLLOWER > > > > > >> > > > > > > > response, but it seems like this KIP only wants to > > > > update > > > > > on > > > > > >> > > fetch > > > > > >> > > > > and > > > > > >> > > > > > > > produce. Do we want to make the leader information > > > > > >> available for > > > > > >> > > > > other > > > > > >> > > > > > > API > > > > > >> > > > > > > > calls? > > > > > >> > > > > > > > 3. Do you know what would happen during a leader > > > > election? > > > > > >> I'm > > > > > >> > > not > > > > > >> > > > > sure > > > > > >> > > > > > > > about this process and I wonder if the current > > > metadata > > > > > >> response > > > > > >> > > > uses > > > > > >> > > > > > the > > > > > >> > > > > > > > old leader or null as the leader isn't readily > > > available > > > > > >> yet. > > > > > >> > > > > > > > > > > > > >> > > > > > > > Thanks, > > > > > >> > > > > > > > P > > > > > >> > > > > > > > > > > > > >> > > > > > > > On Fri, Jul 14, 2023 at 11:30 AM Kirk True < > > > > > >> ki...@kirktrue.pro> > > > > > >> > > > > wrote: > > > > > >> > > > > > > > > > > > > >> > > > > > > > > Hi Mayank, > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > On Jul 14, 2023, at 11:25 AM, Mayank Shekhar > > > Narula > > > > < > > > > > >> > > > > > > > > mayanks.nar...@gmail.com> wrote: > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > Kirk > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > >> Is the requested restructuring of the > response > > > > > >> “simply” to > > > > > >> > > > > > preserve > > > > > >> > > > > > > > > bytes, > > > > > >> > > > > > > > > >> or is it possible that the fetch response > > > > > >> could/should/would > > > > > >> > > > > > return > > > > > >> > > > > > > > > >> leadership changes for partitions that we’re > > > > > >> specifically > > > > > >> > > > > > requested? > > > > > >> > > > > > > > > >> > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > Moving endpoints to top-level fields would > > > preserve > > > > > >> bytes, > > > > > >> > > > > > otherwise > > > > > >> > > > > > > > the > > > > > >> > > > > > > > > > endpoint-information would be duplicated if > > > included > > > > > >> with the > > > > > >> > > > > > > > > > partition-data in the response. Right now, the > > > > > top-level > > > > > >> > > field > > > > > >> > > > > will > > > > > >> > > > > > > > only > > > > > >> > > > > > > > > be > > > > > >> > > > > > > > > > set in case leader changes for any requested > > > > > >> partitions. But > > > > > >> > > it > > > > > >> > > > > can > > > > > >> > > > > > > be > > > > > >> > > > > > > > > > re-used in the future, for which Jose has a > > > use-case > > > > > in > > > > > >> mind > > > > > >> > > > > shared > > > > > >> > > > > > > up > > > > > >> > > > > > > > in > > > > > >> > > > > > > > > > the thread. KIP is now upto date with endpoint > > > info > > > > > >> being at > > > > > >> > > > > > > top-level. > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > I didn’t catch before that there was a separate > > > > section > > > > > >> for the > > > > > >> > > > > full > > > > > >> > > > > > > node > > > > > >> > > > > > > > > information, not just the ID and epoch. > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > Thanks! > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > >>> 3. In the future, I may use this information > > in > > > > the > > > > > >> > > > > > KRaft/Metadata > > > > > >> > > > > > > > > >>> implementation of FETCH. In that > > implementation > > > > not > > > > > >> all of > > > > > >> > > > the > > > > > >> > > > > > > > > >>> replicas are brokers. > > > > > >> > > > > > > > > >> > > > > > >> > > > > > > > > >> Side point: any references to the change > you’re > > > > > >> referring > > > > > >> > > to? > > > > > >> > > > > The > > > > > >> > > > > > > idea > > > > > >> > > > > > > > > of > > > > > >> > > > > > > > > >> non-brokers serving as replicas is blowing my > > > mind > > > > a > > > > > >> bit :) > > > > > >> > > > > > > > > >> > > > > > >> > > > > > > > > >> > > > > > >> > > > > > > > > > Jose, I missed this as well, would love to > know > > > more > > > > > >> about > > > > > >> > > > > > non-broker > > > > > >> > > > > > > > > > serving as replica! > > > > > >> > > > > > > > > > -- > > > > > >> > > > > > > > > > Regards, > > > > > >> > > > > > > > > > Mayank Shekhar Narula > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > -- > > > > > >> > > > > > > Regards, > > > > > >> > > > > > > Mayank Shekhar Narula > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > -- > > > > > >> > > > Regards, > > > > > >> > > > Mayank Shekhar Narula > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > > >> > -- > > > > > >> > Regards, > > > > > >> > Mayank Shekhar Narula > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- Regards, Mayank Shekhar Narula