Hi David 01 Same as Java Client's existing behaviour of requesting a full expedited metadata-refresh done asynchronously, for error NOT_LEADER OR FENCED_EPOCH, the KIP proposes to continue doing so. As this would help prevent similar errors on future requests to other partitions affected by leadership changes. Quot-ing from the KIP ->
> Even though client will receive the new leader information in the > ProduceResponse & FetchResponse when leader changes, but same as the > existing behaviour of the Kafka Java client, it will request expedited > metadata-refresh done asynchronously. Since leadership change will likely > affect many partitions, so future requests to such partitions will benefit > from the upto date leadership information, and reduce requests going to old > leaders. > Does this help? 02 Thanks for the feedback. 03 It's the latter, i.e. fields are optional even in version 16, as brokers would only return them in specific scenarios. This was mentioned in earlier sections of the KIP. But clarified now in the FetchResponse section too. 04 Fixed. 05 Updated. On Thu, Sep 28, 2023 at 6:38 PM David Jacot <dja...@confluent.io.invalid> wrote: > Hi Mayank, > > Thanks again for the KIP and thanks for adding the new analysis. Overall, I > am fine with it. I have a few minor comments. > > 01. If I understand correctly, the client will still request a metadata > update even when it gets the new leader if the produce response or the > fetch response. Is this correct? I think that we need this in order to get > the full metadata. Could we elaborate a bit more about this in the KIP? I > mean that it would be great to explain why we are doing it. > > 02. I was debating whether we should return the rack in the NodeEndpoints. > I think that returning it makes sense from a consistency point of view. It > would be weird to only update the node and the port in the metadata cache. > However, I was thinking about the future and I was wondering how we would > handle future metadata that we would add to the nodes, say tags. If we > follow the same pattern, we would have to return any new fields as well. I > suppose that it would be fine. Anyway, I think that the current approach is > OK. I just wanted to share my thoughts. > > 03. In the FetchResponse Message section, it is written "NodeEndpoints is a > tagged field, which is a minor optimisation of saving bytes on the network, > as it won’t be set always.". Are you saying this because it won't be set > for version prior to version 16? Or are you saying this because it may not > be provided even in version 16? In other words, I wonder if that new > information is actually optional or not. It is not clear in the KIP. > > 04. In the ProduceResponse Message section, the validVersions field misses > in the schema. > > 05. The documentation of NodeEndpoints fields says "Endpoints for all > current-leaders enumerated in PartitionData.". I suppose that this is > incorrect, right? We will only provide endpoints > when NOT_LEADER_OR_FOLLOWER or FENCED_LEADER_EPOCH are returned. The same > applies to the other schema. > > Best, > David > > > On Wed, Sep 27, 2023 at 4:39 AM Mayank Shekhar Narula < > mayanks.nar...@gmail.com> wrote: > > > Adding to Crispin. The new micro-benchmark shows improvements of 88% in > > p99.9 with the KIP changes Vs baseline(& rejected alternate). Its > > hypothesised improvements are seen as KIP avoids a full metadata > refresh(Vs > > baseline/rejected alternate), and the full metadata refresh can be slow > due > > to metadata reconvergence delay at the server(post leadership-change of > > partitions). Extending this logic, KIP changes would be beneficial in > > scenarios where full metadata refresh can be slow. Potential example > would > > be, metadata RPC is slowed due to head-of-line blocking by another slow > RPC > > in front, say a produce RPC(possibly slow due to churn in the ISR). > > > > This new benchmark is in addition to the previously done benchmark of > roll > > simulation, where improvements upto 5% were seen. > > > > Please do a review, as the voting thread is live. > > > > Thanks! > > > > On Wed, Sep 20, 2023 at 4:43 PM Crispin Bernier > > <cbern...@confluent.io.invalid> wrote: > > > > > Hi, > > > > > > I've updated the KIP with benchmark results focusing more directly on > the > > > redirect case, please review and +1 in the voting thread if convinced. > > > > > > Thank you, > > > Crispin > > > > > > On Wed, Jul 26, 2023 at 11:13 PM Luke Chen <show...@gmail.com> wrote: > > > > > > > Thanks for adding the benchmark results, Crispin! > > > > IMO, 2~5% performance improvement is small, but given the change is > > > small, > > > > cost is also small (only append endpoint info when > > > NOT_LEADER_OR_FOLLOWER.. > > > > etc error), I think it is worth doing it. > > > > > > > > Thank you. > > > > Luke > > > > > > > > On Wed, Jul 26, 2023 at 12:33 AM Ismael Juma <m...@ismaeljuma.com> > > wrote: > > > > > > > > > Thanks Crispin! > > > > > > > > > > Ismael > > > > > > > > > > On Mon, Jul 24, 2023 at 8:16 PM Crispin Bernier > > > > > <cbern...@confluent.io.invalid> wrote: > > > > > > > > > > > I updated the wiki to include both results along with their > > average. > > > > > > > > > > > > Thank you, > > > > > > Crispin > > > > > > > > > > > > On Mon, Jul 24, 2023 at 10:58 AM Ismael Juma <m...@ismaeljuma.com> > > > > wrote: > > > > > > > > > > > > > Hi Crispin, > > > > > > > > > > > > > > One additional question, the wiki says "The results are > averaged > > > > over 2 > > > > > > > runs.". Can you please provide some measure of variance in the > > > > > > > distribution, i.e. were both results similar to each other for > > both > > > > > > cases? > > > > > > > > > > > > > > Ismael > > > > > > > > > > > > > > On Fri, Jul 21, 2023 at 11:31 AM Ismael Juma < > m...@ismaeljuma.com> > > > > > wrote: > > > > > > > > > > > > > > > Thanks for the update Crispin - very helpful to have actual > > > > > performance > > > > > > > > data. 2-5% for the default configuration is a bit on the low > > side > > > > for > > > > > > > this > > > > > > > > kind of proposal. > > > > > > > > > > > > > > > > Ismael > > > > > > > > > > > > > > > > On Thu, Jul 20, 2023 at 11:33 PM Crispin Bernier > > > > > > > > <cbern...@confluent.io.invalid> wrote: > > > > > > > > > > > > > > > >> Benchmark numbers have been posted on the KIP, please > review. > > > > > > > >> > > > > > > > >> On 2023/07/20 13:03:00 Mayank Shekhar Narula wrote: > > > > > > > >> > Jun > > > > > > > >> > > > > > > > > >> > Thanks for the feedback. > > > > > > > >> > > > > > > > > >> > Numbers to follow. > > > > > > > >> > > > > > > > > >> > If we don't plan to > > > > > > > >> > > bump up the FetchResponse version, we could just remove > > the > > > > > > > reference > > > > > > > >> to > > > > > > > >> > > version 16. > > > > > > > >> > > > > > > > > >> > Fixed. > > > > > > > >> > > > > > > > > >> > On Thu, Jul 20, 2023 at 1:28 AM Jun Rao > > > > > <ju...@confluent.io.invalid > > > > > > > > > > > > > > >> wrote: > > > > > > > >> > > > > > > > > >> > > Hi, Mayank, > > > > > > > >> > > > > > > > > > >> > > Thanks for the KIP. I agree with others that it would be > > > > useful > > > > > to > > > > > > > >> see the > > > > > > > >> > > performance results. Otherwise, just a minor comment. If > > we > > > > > don't > > > > > > > >> plan to > > > > > > > >> > > bump up the FetchResponse version, we could just remove > > the > > > > > > > reference > > > > > > > >> to > > > > > > > >> > > version 16. > > > > > > > >> > > > > > > > > > >> > > Jun > > > > > > > >> > > > > > > > > > >> > > On Wed, Jul 19, 2023 at 2:31 PM Mayank Shekhar Narula < > > > > > > > >> > > mayanks.nar...@gmail.com> wrote: > > > > > > > >> > > > > > > > > > >> > > > Luke > > > > > > > >> > > > > > > > > > > >> > > > Thanks for the interest in the KIP. > > > > > > > >> > > > > > > > > > > >> > > > But what if the consumer was fetching from the > follower? > > > > > > > >> > > > > > > > > > > >> > > > We already include `PreferredReadReplica` in the fetch > > > > > response. > > > > > > > >> > > > > Should we put the node info of PreferredReadReplica > > > under > > > > > this > > > > > > > >> case, > > > > > > > >> > > > > instead of the leader's info? > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > PreferredReadReplica is the decided on the leader. > > Looking > > > > at > > > > > > the > > > > > > > >> Java > > > > > > > >> > > > client code, AbstractFetch::selectReadReplica, first > > fetch > > > > > > request > > > > > > > >> goes > > > > > > > >> > > to > > > > > > > >> > > > Leader of the partition -> Sends back > > PreferredReadReplica > > > > -> > > > > > > Next > > > > > > > >> fetch > > > > > > > >> > > > uses PreferredReadReplica. So as long as leader is > > > > available, > > > > > > > >> > > > PreferredReadReplica would be found in subsequent > > fetches. > > > > > > > >> > > > > > > > > > > >> > > > Also, under this case, should we include the leader's > > info > > > > in > > > > > > the > > > > > > > >> > > response? > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > In this case, I think the follower would fail the > fetch > > if > > > > it > > > > > > > knows > > > > > > > >> a > > > > > > > >> > > > different leader. If the follower knows a newer > leader, > > it > > > > > would > > > > > > > >> return > > > > > > > >> > > new > > > > > > > >> > > > leader information in the response, for the client to > > act > > > > on. > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > Will we include the leader/node info in the response > > when > > > > > having > > > > > > > >> > > > > `UNKNOWN_LEADER_EPOCH` error? > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > My understanding is UNKNOWN_LEADER_EPOCH when a > request > > > > from a > > > > > > > >> client > > > > > > > >> > > has a > > > > > > > >> > > > newer epoch than the broker. So the client is already > up > > > to > > > > > date > > > > > > > on > > > > > > > >> new > > > > > > > >> > > > leader information, it's the broker that has the > > catching > > > up > > > > > to > > > > > > > do. > > > > > > > >> I > > > > > > > >> > > think > > > > > > > >> > > > there might be some optimisations to make sure the > > broker > > > > > > > refreshes > > > > > > > >> its > > > > > > > >> > > > metadata quickly, so it can quickly recover to handle > > > > requests > > > > > > > that > > > > > > > >> > > > previously returned UNKNOWN_LEADER_EPOCH. But this > work > > is > > > > > > outside > > > > > > > >> the > > > > > > > >> > > > scope of this KIP, as for now this KIP focusses on > > > > client-side > > > > > > > >> > > > optimisations. > > > > > > > >> > > > > > > > > > > >> > > > Mayank > > > > > > > >> > > > > > > > > > > >> > > > On Tue, Jul 18, 2023 at 8:51 AM Luke Chen < > > > sh...@gmail.com> > > > > > > > wrote: > > > > > > > >> > > > > > > > > > > >> > > > > Hi Mayank, > > > > > > > >> > > > > > > > > > > > >> > > > > Thanks for the KIP! > > > > > > > >> > > > > > > > > > > > >> > > > > Some questions: > > > > > > > >> > > > > 1. I can see most of the cases we only care about > > > consumer > > > > > > fetch > > > > > > > >> from > > > > > > > >> > > the > > > > > > > >> > > > > leader. > > > > > > > >> > > > > But what if the consumer was fetching from the > > follower? > > > > > > > >> > > > > We already include `PreferredReadReplica` in the > fetch > > > > > > response. > > > > > > > >> > > > > Should we put the node info of PreferredReadReplica > > > under > > > > > this > > > > > > > >> case, > > > > > > > >> > > > > instead of the leader's info? > > > > > > > >> > > > > Also, under this case, should we include the > leader's > > > info > > > > > in > > > > > > > the > > > > > > > >> > > > response? > > > > > > > >> > > > > > > > > > > > >> > > > > 2. Will we include the leader/node info in the > > response > > > > when > > > > > > > >> having > > > > > > > >> > > > > `UNKNOWN_LEADER_EPOCH` error? > > > > > > > >> > > > > I think it's fine we ignore the > `UNKNOWN_LEADER_EPOCH` > > > > error > > > > > > > >> since when > > > > > > > >> > > > > this happens, the node might have some error which > > > should > > > > > > > refresh > > > > > > > >> the > > > > > > > >> > > > > metadata. On the other hand, it might also be good > if > > we > > > > can > > > > > > > heal > > > > > > > >> the > > > > > > > >> > > > node > > > > > > > >> > > > > soon to do produce/consume works. > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > Thank you. > > > > > > > >> > > > > Luke > > > > > > > >> > > > > > > > > > > > >> > > > > On Tue, Jul 18, 2023 at 2:00 AM Philip Nee < > > > > ph...@gmail.com > > > > > > > > > > > > > >> > > wrote: > > > > > > > >> > > > > > > > > > > > >> > > > > > Hey Mayank: > > > > > > > >> > > > > > > > > > > > > >> > > > > > For #1: I think fetch and produce behave a bit > > > > differently > > > > > > on > > > > > > > >> > > metadata. > > > > > > > >> > > > > > Maybe it is worth highlighting the changes for > each > > > > client > > > > > > in > > > > > > > >> detail. > > > > > > > >> > > > In > > > > > > > >> > > > > > producer did you mean by the metadata timeout > before > > > > > sending > > > > > > > out > > > > > > > >> > > > produce > > > > > > > >> > > > > > requests? For consumer: I think for fetches it > > > requires > > > > > user > > > > > > > to > > > > > > > >> retry > > > > > > > >> > > > if > > > > > > > >> > > > > > the position does not exist on the leader. I don't > > > have > > > > > the > > > > > > > >> detail on > > > > > > > >> > > > top > > > > > > > >> > > > > > of my head, but I think we should lay out these > > > > behavioral > > > > > > > >> changes. > > > > > > > >> > > > > > > > > > > > > >> > > > > > For #3: Thanks for the clarification. > > > > > > > >> > > > > > > > > > > > > >> > > > > > On Mon, Jul 17, 2023 at 10:39 AM Mayank Shekhar > > > Narula < > > > > > > > >> > > > > > mayanks.nar...@gmail.com> wrote: > > > > > > > >> > > > > > > > > > > > > >> > > > > > > Philip > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > 1. Good call out about "poll" behaviour, my > > > > > understanding > > > > > > is > > > > > > > >> the > > > > > > > >> > > > same. > > > > > > > >> > > > > I > > > > > > > >> > > > > > am > > > > > > > >> > > > > > > assuming it's about the motivation of the KIP. > > There > > > > > with > > > > > > > >> async, my > > > > > > > >> > > > > > > intention was to convey that the client doesn't > > wait > > > > for > > > > > > the > > > > > > > >> > > > > > > metadata-refresh before a subsequent retry of > the > > > > > produce > > > > > > or > > > > > > > >> fetch > > > > > > > >> > > > > > request > > > > > > > >> > > > > > > that failed due to stale metadata(i.e. going to > an > > > old > > > > > > > >> leader). The > > > > > > > >> > > > > only > > > > > > > >> > > > > > > wait client has is the configured retry-delay. > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > 2. Yes, in theory other APIs could benefit from > > this > > > > > too. > > > > > > > But > > > > > > > >> that > > > > > > > >> > > is > > > > > > > >> > > > > > > outside of the scope of the KIP. > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > 3. Do you mean the response for the Metadata > RPC? > > I > > > > > think > > > > > > > >> brokers > > > > > > > >> > > > > always > > > > > > > >> > > > > > > have a view of the cluster, although it can be > > > > stale,it > > > > > > > would > > > > > > > >> > > always > > > > > > > >> > > > > > return > > > > > > > >> > > > > > > a leader(whether old or new). > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > Mayank > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > On Fri, Jul 14, 2023 at 8:53 PM Philip Nee < > > > > > > ph...@gmail.com > > > > > > > > > > > > > > > >> > > > > wrote: > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > Hey Mayank, > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > Thanks for the KIP. I think this is a great > > > > proposal, > > > > > > and > > > > > > > >> I'm in > > > > > > > >> > > > > favor > > > > > > > >> > > > > > > > of this idea. A few comments: > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > 1. Claiming metadata refresh is done > > > asynchronously > > > > is > > > > > > > >> > > misleading. > > > > > > > >> > > > > The > > > > > > > >> > > > > > > > metadata refresh requires Network Client to be > > > > > > physically > > > > > > > >> polled, > > > > > > > >> > > > > which > > > > > > > >> > > > > > > is > > > > > > > >> > > > > > > > done in a separate thread in Producer and > Admin > > > > Client > > > > > > > >> (IIRC!) > > > > > > > >> > > but > > > > > > > >> > > > > not > > > > > > > >> > > > > > > > Consumer. > > > > > > > >> > > > > > > > 2. There are other API calls that might result > > in > > > > > > > >> > > > > > NOT_LEADER_OR_FOLLOWER > > > > > > > >> > > > > > > > response, but it seems like this KIP only > wants > > to > > > > > > update > > > > > > > on > > > > > > > >> > > fetch > > > > > > > >> > > > > and > > > > > > > >> > > > > > > > produce. Do we want to make the leader > > information > > > > > > > >> available for > > > > > > > >> > > > > other > > > > > > > >> > > > > > > API > > > > > > > >> > > > > > > > calls? > > > > > > > >> > > > > > > > 3. Do you know what would happen during a > leader > > > > > > election? > > > > > > > >> I'm > > > > > > > >> > > not > > > > > > > >> > > > > sure > > > > > > > >> > > > > > > > about this process and I wonder if the current > > > > > metadata > > > > > > > >> response > > > > > > > >> > > > uses > > > > > > > >> > > > > > the > > > > > > > >> > > > > > > > old leader or null as the leader isn't readily > > > > > available > > > > > > > >> yet. > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > Thanks, > > > > > > > >> > > > > > > > P > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > On Fri, Jul 14, 2023 at 11:30 AM Kirk True < > > > > > > > >> ki...@kirktrue.pro> > > > > > > > >> > > > > wrote: > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > Hi Mayank, > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > On Jul 14, 2023, at 11:25 AM, Mayank > Shekhar > > > > > Narula > > > > > > < > > > > > > > >> > > > > > > > > mayanks.nar...@gmail.com> wrote: > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > Kirk > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > >> Is the requested restructuring of the > > > response > > > > > > > >> “simply” to > > > > > > > >> > > > > > preserve > > > > > > > >> > > > > > > > > bytes, > > > > > > > >> > > > > > > > > >> or is it possible that the fetch response > > > > > > > >> could/should/would > > > > > > > >> > > > > > return > > > > > > > >> > > > > > > > > >> leadership changes for partitions that > > we’re > > > > > > > >> specifically > > > > > > > >> > > > > > requested? > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > Moving endpoints to top-level fields would > > > > > preserve > > > > > > > >> bytes, > > > > > > > >> > > > > > otherwise > > > > > > > >> > > > > > > > the > > > > > > > >> > > > > > > > > > endpoint-information would be duplicated > if > > > > > included > > > > > > > >> with the > > > > > > > >> > > > > > > > > > partition-data in the response. Right now, > > the > > > > > > > top-level > > > > > > > >> > > field > > > > > > > >> > > > > will > > > > > > > >> > > > > > > > only > > > > > > > >> > > > > > > > > be > > > > > > > >> > > > > > > > > > set in case leader changes for any > requested > > > > > > > >> partitions. But > > > > > > > >> > > it > > > > > > > >> > > > > can > > > > > > > >> > > > > > > be > > > > > > > >> > > > > > > > > > re-used in the future, for which Jose has > a > > > > > use-case > > > > > > > in > > > > > > > >> mind > > > > > > > >> > > > > shared > > > > > > > >> > > > > > > up > > > > > > > >> > > > > > > > in > > > > > > > >> > > > > > > > > > the thread. KIP is now upto date with > > endpoint > > > > > info > > > > > > > >> being at > > > > > > > >> > > > > > > top-level. > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > I didn’t catch before that there was a > > separate > > > > > > section > > > > > > > >> for the > > > > > > > >> > > > > full > > > > > > > >> > > > > > > node > > > > > > > >> > > > > > > > > information, not just the ID and epoch. > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > Thanks! > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > >>> 3. In the future, I may use this > > information > > > > in > > > > > > the > > > > > > > >> > > > > > KRaft/Metadata > > > > > > > >> > > > > > > > > >>> implementation of FETCH. In that > > > > implementation > > > > > > not > > > > > > > >> all of > > > > > > > >> > > > the > > > > > > > >> > > > > > > > > >>> replicas are brokers. > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > >> Side point: any references to the change > > > you’re > > > > > > > >> referring > > > > > > > >> > > to? > > > > > > > >> > > > > The > > > > > > > >> > > > > > > idea > > > > > > > >> > > > > > > > > of > > > > > > > >> > > > > > > > > >> non-brokers serving as replicas is > blowing > > my > > > > > mind > > > > > > a > > > > > > > >> bit :) > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > Jose, I missed this as well, would love to > > > know > > > > > more > > > > > > > >> about > > > > > > > >> > > > > > non-broker > > > > > > > >> > > > > > > > > > serving as replica! > > > > > > > >> > > > > > > > > > -- > > > > > > > >> > > > > > > > > > Regards, > > > > > > > >> > > > > > > > > > Mayank Shekhar Narula > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > -- > > > > > > > >> > > > > > > Regards, > > > > > > > >> > > > > > > Mayank Shekhar Narula > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > -- > > > > > > > >> > > > Regards, > > > > > > > >> > > > Mayank Shekhar Narula > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > -- > > > > > > > >> > Regards, > > > > > > > >> > Mayank Shekhar Narula > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > Regards, > > Mayank Shekhar Narula > > > -- Regards, Mayank Shekhar Narula