Hi Mayank, Thanks again for the KIP and thanks for adding the new analysis. Overall, I am fine with it. I have a few minor comments.
01. If I understand correctly, the client will still request a metadata update even when it gets the new leader if the produce response or the fetch response. Is this correct? I think that we need this in order to get the full metadata. Could we elaborate a bit more about this in the KIP? I mean that it would be great to explain why we are doing it. 02. I was debating whether we should return the rack in the NodeEndpoints. I think that returning it makes sense from a consistency point of view. It would be weird to only update the node and the port in the metadata cache. However, I was thinking about the future and I was wondering how we would handle future metadata that we would add to the nodes, say tags. If we follow the same pattern, we would have to return any new fields as well. I suppose that it would be fine. Anyway, I think that the current approach is OK. I just wanted to share my thoughts. 03. In the FetchResponse Message section, it is written "NodeEndpoints is a tagged field, which is a minor optimisation of saving bytes on the network, as it won’t be set always.". Are you saying this because it won't be set for version prior to version 16? Or are you saying this because it may not be provided even in version 16? In other words, I wonder if that new information is actually optional or not. It is not clear in the KIP. 04. In the ProduceResponse Message section, the validVersions field misses in the schema. 05. The documentation of NodeEndpoints fields says "Endpoints for all current-leaders enumerated in PartitionData.". I suppose that this is incorrect, right? We will only provide endpoints when NOT_LEADER_OR_FOLLOWER or FENCED_LEADER_EPOCH are returned. The same applies to the other schema. Best, David On Wed, Sep 27, 2023 at 4:39 AM Mayank Shekhar Narula < mayanks.nar...@gmail.com> wrote: > Adding to Crispin. The new micro-benchmark shows improvements of 88% in > p99.9 with the KIP changes Vs baseline(& rejected alternate). Its > hypothesised improvements are seen as KIP avoids a full metadata refresh(Vs > baseline/rejected alternate), and the full metadata refresh can be slow due > to metadata reconvergence delay at the server(post leadership-change of > partitions). Extending this logic, KIP changes would be beneficial in > scenarios where full metadata refresh can be slow. Potential example would > be, metadata RPC is slowed due to head-of-line blocking by another slow RPC > in front, say a produce RPC(possibly slow due to churn in the ISR). > > This new benchmark is in addition to the previously done benchmark of roll > simulation, where improvements upto 5% were seen. > > Please do a review, as the voting thread is live. > > Thanks! > > On Wed, Sep 20, 2023 at 4:43 PM Crispin Bernier > <cbern...@confluent.io.invalid> wrote: > > > Hi, > > > > I've updated the KIP with benchmark results focusing more directly on the > > redirect case, please review and +1 in the voting thread if convinced. > > > > Thank you, > > Crispin > > > > On Wed, Jul 26, 2023 at 11:13 PM Luke Chen <show...@gmail.com> wrote: > > > > > Thanks for adding the benchmark results, Crispin! > > > IMO, 2~5% performance improvement is small, but given the change is > > small, > > > cost is also small (only append endpoint info when > > NOT_LEADER_OR_FOLLOWER.. > > > etc error), I think it is worth doing it. > > > > > > Thank you. > > > Luke > > > > > > On Wed, Jul 26, 2023 at 12:33 AM Ismael Juma <m...@ismaeljuma.com> > wrote: > > > > > > > Thanks Crispin! > > > > > > > > Ismael > > > > > > > > On Mon, Jul 24, 2023 at 8:16 PM Crispin Bernier > > > > <cbern...@confluent.io.invalid> wrote: > > > > > > > > > I updated the wiki to include both results along with their > average. > > > > > > > > > > Thank you, > > > > > Crispin > > > > > > > > > > On Mon, Jul 24, 2023 at 10:58 AM Ismael Juma <m...@ismaeljuma.com> > > > wrote: > > > > > > > > > > > Hi Crispin, > > > > > > > > > > > > One additional question, the wiki says "The results are averaged > > > over 2 > > > > > > runs.". Can you please provide some measure of variance in the > > > > > > distribution, i.e. were both results similar to each other for > both > > > > > cases? > > > > > > > > > > > > Ismael > > > > > > > > > > > > On Fri, Jul 21, 2023 at 11:31 AM Ismael Juma <m...@ismaeljuma.com> > > > > wrote: > > > > > > > > > > > > > Thanks for the update Crispin - very helpful to have actual > > > > performance > > > > > > > data. 2-5% for the default configuration is a bit on the low > side > > > for > > > > > > this > > > > > > > kind of proposal. > > > > > > > > > > > > > > Ismael > > > > > > > > > > > > > > On Thu, Jul 20, 2023 at 11:33 PM Crispin Bernier > > > > > > > <cbern...@confluent.io.invalid> wrote: > > > > > > > > > > > > > >> Benchmark numbers have been posted on the KIP, please review. > > > > > > >> > > > > > > >> On 2023/07/20 13:03:00 Mayank Shekhar Narula wrote: > > > > > > >> > Jun > > > > > > >> > > > > > > > >> > Thanks for the feedback. > > > > > > >> > > > > > > > >> > Numbers to follow. > > > > > > >> > > > > > > > >> > If we don't plan to > > > > > > >> > > bump up the FetchResponse version, we could just remove > the > > > > > > reference > > > > > > >> to > > > > > > >> > > version 16. > > > > > > >> > > > > > > > >> > Fixed. > > > > > > >> > > > > > > > >> > On Thu, Jul 20, 2023 at 1:28 AM Jun Rao > > > > <ju...@confluent.io.invalid > > > > > > > > > > > > >> wrote: > > > > > > >> > > > > > > > >> > > Hi, Mayank, > > > > > > >> > > > > > > > > >> > > Thanks for the KIP. I agree with others that it would be > > > useful > > > > to > > > > > > >> see the > > > > > > >> > > performance results. Otherwise, just a minor comment. If > we > > > > don't > > > > > > >> plan to > > > > > > >> > > bump up the FetchResponse version, we could just remove > the > > > > > > reference > > > > > > >> to > > > > > > >> > > version 16. > > > > > > >> > > > > > > > > >> > > Jun > > > > > > >> > > > > > > > > >> > > On Wed, Jul 19, 2023 at 2:31 PM Mayank Shekhar Narula < > > > > > > >> > > mayanks.nar...@gmail.com> wrote: > > > > > > >> > > > > > > > > >> > > > Luke > > > > > > >> > > > > > > > > > >> > > > Thanks for the interest in the KIP. > > > > > > >> > > > > > > > > > >> > > > But what if the consumer was fetching from the follower? > > > > > > >> > > > > > > > > > >> > > > We already include `PreferredReadReplica` in the fetch > > > > response. > > > > > > >> > > > > Should we put the node info of PreferredReadReplica > > under > > > > this > > > > > > >> case, > > > > > > >> > > > > instead of the leader's info? > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > PreferredReadReplica is the decided on the leader. > Looking > > > at > > > > > the > > > > > > >> Java > > > > > > >> > > > client code, AbstractFetch::selectReadReplica, first > fetch > > > > > request > > > > > > >> goes > > > > > > >> > > to > > > > > > >> > > > Leader of the partition -> Sends back > PreferredReadReplica > > > -> > > > > > Next > > > > > > >> fetch > > > > > > >> > > > uses PreferredReadReplica. So as long as leader is > > > available, > > > > > > >> > > > PreferredReadReplica would be found in subsequent > fetches. > > > > > > >> > > > > > > > > > >> > > > Also, under this case, should we include the leader's > info > > > in > > > > > the > > > > > > >> > > response? > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > In this case, I think the follower would fail the fetch > if > > > it > > > > > > knows > > > > > > >> a > > > > > > >> > > > different leader. If the follower knows a newer leader, > it > > > > would > > > > > > >> return > > > > > > >> > > new > > > > > > >> > > > leader information in the response, for the client to > act > > > on. > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > Will we include the leader/node info in the response > when > > > > having > > > > > > >> > > > > `UNKNOWN_LEADER_EPOCH` error? > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > My understanding is UNKNOWN_LEADER_EPOCH when a request > > > from a > > > > > > >> client > > > > > > >> > > has a > > > > > > >> > > > newer epoch than the broker. So the client is already up > > to > > > > date > > > > > > on > > > > > > >> new > > > > > > >> > > > leader information, it's the broker that has the > catching > > up > > > > to > > > > > > do. > > > > > > >> I > > > > > > >> > > think > > > > > > >> > > > there might be some optimisations to make sure the > broker > > > > > > refreshes > > > > > > >> its > > > > > > >> > > > metadata quickly, so it can quickly recover to handle > > > requests > > > > > > that > > > > > > >> > > > previously returned UNKNOWN_LEADER_EPOCH. But this work > is > > > > > outside > > > > > > >> the > > > > > > >> > > > scope of this KIP, as for now this KIP focusses on > > > client-side > > > > > > >> > > > optimisations. > > > > > > >> > > > > > > > > > >> > > > Mayank > > > > > > >> > > > > > > > > > >> > > > On Tue, Jul 18, 2023 at 8:51 AM Luke Chen < > > sh...@gmail.com> > > > > > > wrote: > > > > > > >> > > > > > > > > > >> > > > > Hi Mayank, > > > > > > >> > > > > > > > > > > >> > > > > Thanks for the KIP! > > > > > > >> > > > > > > > > > > >> > > > > Some questions: > > > > > > >> > > > > 1. I can see most of the cases we only care about > > consumer > > > > > fetch > > > > > > >> from > > > > > > >> > > the > > > > > > >> > > > > leader. > > > > > > >> > > > > But what if the consumer was fetching from the > follower? > > > > > > >> > > > > We already include `PreferredReadReplica` in the fetch > > > > > response. > > > > > > >> > > > > Should we put the node info of PreferredReadReplica > > under > > > > this > > > > > > >> case, > > > > > > >> > > > > instead of the leader's info? > > > > > > >> > > > > Also, under this case, should we include the leader's > > info > > > > in > > > > > > the > > > > > > >> > > > response? > > > > > > >> > > > > > > > > > > >> > > > > 2. Will we include the leader/node info in the > response > > > when > > > > > > >> having > > > > > > >> > > > > `UNKNOWN_LEADER_EPOCH` error? > > > > > > >> > > > > I think it's fine we ignore the `UNKNOWN_LEADER_EPOCH` > > > error > > > > > > >> since when > > > > > > >> > > > > this happens, the node might have some error which > > should > > > > > > refresh > > > > > > >> the > > > > > > >> > > > > metadata. On the other hand, it might also be good if > we > > > can > > > > > > heal > > > > > > >> the > > > > > > >> > > > node > > > > > > >> > > > > soon to do produce/consume works. > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > Thank you. > > > > > > >> > > > > Luke > > > > > > >> > > > > > > > > > > >> > > > > On Tue, Jul 18, 2023 at 2:00 AM Philip Nee < > > > ph...@gmail.com > > > > > > > > > > > >> > > wrote: > > > > > > >> > > > > > > > > > > >> > > > > > Hey Mayank: > > > > > > >> > > > > > > > > > > > >> > > > > > For #1: I think fetch and produce behave a bit > > > differently > > > > > on > > > > > > >> > > metadata. > > > > > > >> > > > > > Maybe it is worth highlighting the changes for each > > > client > > > > > in > > > > > > >> detail. > > > > > > >> > > > In > > > > > > >> > > > > > producer did you mean by the metadata timeout before > > > > sending > > > > > > out > > > > > > >> > > > produce > > > > > > >> > > > > > requests? For consumer: I think for fetches it > > requires > > > > user > > > > > > to > > > > > > >> retry > > > > > > >> > > > if > > > > > > >> > > > > > the position does not exist on the leader. I don't > > have > > > > the > > > > > > >> detail on > > > > > > >> > > > top > > > > > > >> > > > > > of my head, but I think we should lay out these > > > behavioral > > > > > > >> changes. > > > > > > >> > > > > > > > > > > > >> > > > > > For #3: Thanks for the clarification. > > > > > > >> > > > > > > > > > > > >> > > > > > On Mon, Jul 17, 2023 at 10:39 AM Mayank Shekhar > > Narula < > > > > > > >> > > > > > mayanks.nar...@gmail.com> wrote: > > > > > > >> > > > > > > > > > > > >> > > > > > > Philip > > > > > > >> > > > > > > > > > > > > >> > > > > > > 1. Good call out about "poll" behaviour, my > > > > understanding > > > > > is > > > > > > >> the > > > > > > >> > > > same. > > > > > > >> > > > > I > > > > > > >> > > > > > am > > > > > > >> > > > > > > assuming it's about the motivation of the KIP. > There > > > > with > > > > > > >> async, my > > > > > > >> > > > > > > intention was to convey that the client doesn't > wait > > > for > > > > > the > > > > > > >> > > > > > > metadata-refresh before a subsequent retry of the > > > > produce > > > > > or > > > > > > >> fetch > > > > > > >> > > > > > request > > > > > > >> > > > > > > that failed due to stale metadata(i.e. going to an > > old > > > > > > >> leader). The > > > > > > >> > > > > only > > > > > > >> > > > > > > wait client has is the configured retry-delay. > > > > > > >> > > > > > > > > > > > > >> > > > > > > 2. Yes, in theory other APIs could benefit from > this > > > > too. > > > > > > But > > > > > > >> that > > > > > > >> > > is > > > > > > >> > > > > > > outside of the scope of the KIP. > > > > > > >> > > > > > > > > > > > > >> > > > > > > 3. Do you mean the response for the Metadata RPC? > I > > > > think > > > > > > >> brokers > > > > > > >> > > > > always > > > > > > >> > > > > > > have a view of the cluster, although it can be > > > stale,it > > > > > > would > > > > > > >> > > always > > > > > > >> > > > > > return > > > > > > >> > > > > > > a leader(whether old or new). > > > > > > >> > > > > > > > > > > > > >> > > > > > > Mayank > > > > > > >> > > > > > > > > > > > > >> > > > > > > On Fri, Jul 14, 2023 at 8:53 PM Philip Nee < > > > > > ph...@gmail.com > > > > > > > > > > > > > >> > > > > wrote: > > > > > > >> > > > > > > > > > > > > >> > > > > > > > Hey Mayank, > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > Thanks for the KIP. I think this is a great > > > proposal, > > > > > and > > > > > > >> I'm in > > > > > > >> > > > > favor > > > > > > >> > > > > > > > of this idea. A few comments: > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > 1. Claiming metadata refresh is done > > asynchronously > > > is > > > > > > >> > > misleading. > > > > > > >> > > > > The > > > > > > >> > > > > > > > metadata refresh requires Network Client to be > > > > > physically > > > > > > >> polled, > > > > > > >> > > > > which > > > > > > >> > > > > > > is > > > > > > >> > > > > > > > done in a separate thread in Producer and Admin > > > Client > > > > > > >> (IIRC!) > > > > > > >> > > but > > > > > > >> > > > > not > > > > > > >> > > > > > > > Consumer. > > > > > > >> > > > > > > > 2. There are other API calls that might result > in > > > > > > >> > > > > > NOT_LEADER_OR_FOLLOWER > > > > > > >> > > > > > > > response, but it seems like this KIP only wants > to > > > > > update > > > > > > on > > > > > > >> > > fetch > > > > > > >> > > > > and > > > > > > >> > > > > > > > produce. Do we want to make the leader > information > > > > > > >> available for > > > > > > >> > > > > other > > > > > > >> > > > > > > API > > > > > > >> > > > > > > > calls? > > > > > > >> > > > > > > > 3. Do you know what would happen during a leader > > > > > election? > > > > > > >> I'm > > > > > > >> > > not > > > > > > >> > > > > sure > > > > > > >> > > > > > > > about this process and I wonder if the current > > > > metadata > > > > > > >> response > > > > > > >> > > > uses > > > > > > >> > > > > > the > > > > > > >> > > > > > > > old leader or null as the leader isn't readily > > > > available > > > > > > >> yet. > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > Thanks, > > > > > > >> > > > > > > > P > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > On Fri, Jul 14, 2023 at 11:30 AM Kirk True < > > > > > > >> ki...@kirktrue.pro> > > > > > > >> > > > > wrote: > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > Hi Mayank, > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > On Jul 14, 2023, at 11:25 AM, Mayank Shekhar > > > > Narula > > > > > < > > > > > > >> > > > > > > > > mayanks.nar...@gmail.com> wrote: > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > Kirk > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > >> Is the requested restructuring of the > > response > > > > > > >> “simply” to > > > > > > >> > > > > > preserve > > > > > > >> > > > > > > > > bytes, > > > > > > >> > > > > > > > > >> or is it possible that the fetch response > > > > > > >> could/should/would > > > > > > >> > > > > > return > > > > > > >> > > > > > > > > >> leadership changes for partitions that > we’re > > > > > > >> specifically > > > > > > >> > > > > > requested? > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > Moving endpoints to top-level fields would > > > > preserve > > > > > > >> bytes, > > > > > > >> > > > > > otherwise > > > > > > >> > > > > > > > the > > > > > > >> > > > > > > > > > endpoint-information would be duplicated if > > > > included > > > > > > >> with the > > > > > > >> > > > > > > > > > partition-data in the response. Right now, > the > > > > > > top-level > > > > > > >> > > field > > > > > > >> > > > > will > > > > > > >> > > > > > > > only > > > > > > >> > > > > > > > > be > > > > > > >> > > > > > > > > > set in case leader changes for any requested > > > > > > >> partitions. But > > > > > > >> > > it > > > > > > >> > > > > can > > > > > > >> > > > > > > be > > > > > > >> > > > > > > > > > re-used in the future, for which Jose has a > > > > use-case > > > > > > in > > > > > > >> mind > > > > > > >> > > > > shared > > > > > > >> > > > > > > up > > > > > > >> > > > > > > > in > > > > > > >> > > > > > > > > > the thread. KIP is now upto date with > endpoint > > > > info > > > > > > >> being at > > > > > > >> > > > > > > top-level. > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > I didn’t catch before that there was a > separate > > > > > section > > > > > > >> for the > > > > > > >> > > > > full > > > > > > >> > > > > > > node > > > > > > >> > > > > > > > > information, not just the ID and epoch. > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > Thanks! > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > >>> 3. In the future, I may use this > information > > > in > > > > > the > > > > > > >> > > > > > KRaft/Metadata > > > > > > >> > > > > > > > > >>> implementation of FETCH. In that > > > implementation > > > > > not > > > > > > >> all of > > > > > > >> > > > the > > > > > > >> > > > > > > > > >>> replicas are brokers. > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > > > >> Side point: any references to the change > > you’re > > > > > > >> referring > > > > > > >> > > to? > > > > > > >> > > > > The > > > > > > >> > > > > > > idea > > > > > > >> > > > > > > > > of > > > > > > >> > > > > > > > > >> non-brokers serving as replicas is blowing > my > > > > mind > > > > > a > > > > > > >> bit :) > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > > > > Jose, I missed this as well, would love to > > know > > > > more > > > > > > >> about > > > > > > >> > > > > > non-broker > > > > > > >> > > > > > > > > > serving as replica! > > > > > > >> > > > > > > > > > -- > > > > > > >> > > > > > > > > > Regards, > > > > > > >> > > > > > > > > > Mayank Shekhar Narula > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > -- > > > > > > >> > > > > > > Regards, > > > > > > >> > > > > > > Mayank Shekhar Narula > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > -- > > > > > > >> > > > Regards, > > > > > > >> > > > Mayank Shekhar Narula > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > >> > -- > > > > > > >> > Regards, > > > > > > >> > Mayank Shekhar Narula > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > Regards, > Mayank Shekhar Narula >