Following up on my previous comment: an alternative approach could be to have an empty follower start replication from the last-tiered-offset (already available as part of listOffsets), inclusive. On the leader, we change the logic (based on a configurable threshold) for when we return OffsetMovedToTieredException versus when we fetch from remote and return data to the follower.
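To make the leader-side decision concrete, here is a minimal sketch in Java. Every name in it (TieredFetchDecision, threshold, Action) is illustrative rather than actual Kafka broker code; the numbered example that follows walks through the same flow.

    // Minimal sketch of the proposed leader-side check. Names are
    // illustrative; the real logic would live in the leader's
    // fetch-handling path.
    class TieredFetchDecision {

        enum Action { RETURN_OFFSET_MOVED_TO_TIERED, SERVE_FROM_REMOTE }

        // Hypothetical config ("Z" in the example below): how far behind
        // last-tiered-offset a follower's fetch offset may fall before we
        // redirect it instead of serving a remote read.
        private final long threshold;

        TieredFetchDecision(long threshold) {
            this.threshold = threshold;
        }

        // fetchOffset is "Y": the offset the follower asked for.
        Action decide(long fetchOffset, long lastTieredOffset) {
            if (lastTieredOffset - fetchOffset > threshold) {
                // Follower is too deep inside the tiered portion of the log:
                // return OffsetMovedToTieredException so it retries from
                // last-tiered-offset.
                return Action.RETURN_OFFSET_MOVED_TO_TIERED;
            }
            // Within the threshold: serve the bounded remote read so the
            // follower catches up without being reset to last-tiered-offset.
            return Action.SERVE_FROM_REMOTE;
        }
    }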
As an example, the solution works as follows:

1. Follower asks the leader for fetch offset Y.
2. Leader checks whether (last-tiered-offset - Y) > Z, where Z is a configured threshold. If true, it returns OffsetMovedToTieredException and the follower asks again with fetch offset = last-tiered-offset. If false, the leader fetches offset Y from remote and returns the data to the follower.

The advantages of this approach over the proposed solution are:
1. We won't end up in the cyclic situation mentioned in my previous email.
2. It works with the existing protocol, which already returns last-tiered-offset, i.e. we won't have to change the protocol to add the new Earliest-Pending-Upload-Offset.

The disadvantage of this approach over the proposed solution is that, on the leader, we may have to fetch some data from remote to respond to the follower. The amount of this data can be controlled via the configured value Z, which can be set based on how aggressive the upload/archival process is.

--
Divij Vaidya

On Tue, Jul 2, 2024 at 12:25 PM Divij Vaidya <divijvaidy...@gmail.com> wrote:

> Hi folks.
>
> I am late to the party but I have a question on the proposal.
>
> How are we preventing a situation such as the following:
>
> 1. Empty follower asks leader for 0
> 2. Leader compares 0 with last-tiered-offset, and responds with 11 (where 10 is last-tiered-offset) and an OffsetMovedToTieredException
> 3. Follower builds aux state from [0-10] and sets the fetch offset to 11
> 4. But the leader has already uploaded more data and now the new last-tiered-offset is 15
> 5. Go back to 2
>
> This could cause a cycle where the replica will be stuck trying to reconcile with the leader.
>
> --
> Divij Vaidya
>
> On Fri, Apr 26, 2024 at 7:28 AM Abhijeet Kumar <abhijeet.cse....@gmail.com> wrote:
>
>> Thank you all for your comments. As all the comments in the thread are addressed, I am starting a Vote thread for the KIP. Please have a look.
>>
>> Regards,
>> Abhijeet.
>>
>> On Thu, Apr 25, 2024 at 6:08 PM Luke Chen <show...@gmail.com> wrote:
>>
>>> Hi, Abhijeet,
>>>
>>> Thanks for the update.
>>>
>>> I have no more comments.
>>>
>>> Luke
>>>
>>> On Thu, Apr 25, 2024 at 4:21 AM Jun Rao <j...@confluent.io.invalid> wrote:
>>>
>>>> Hi, Abhijeet,
>>>>
>>>> Thanks for the updated KIP. It looks good to me.
>>>>
>>>> Jun
>>>>
>>>> On Mon, Apr 22, 2024 at 12:08 PM Abhijeet Kumar <abhijeet.cse....@gmail.com> wrote:
>>>>
>>>>> Hi Jun,
>>>>>
>>>>> Please find my comments inline.
>>>>>
>>>>> On Thu, Apr 18, 2024 at 3:26 AM Jun Rao <j...@confluent.io.invalid> wrote:
>>>>>
>>>>>> Hi, Abhijeet,
>>>>>>
>>>>>> Thanks for the reply.
>>>>>>
>>>>>> 1. I am wondering if we could achieve the same result by just lowering local.retention.ms and local.retention.bytes. This also allows the newly started follower to build up the local data before serving the consumer traffic.
>>>>>
>>>>> I am not sure I fully followed this. Do you mean we could lower local.retention (by size and time) so that there is little data on the leader's local storage and the follower can quickly catch up with the leader?
>>>>>
>>>>> In that case, we will need to set a small local retention across brokers in the cluster. That will have the undesired effect of increased remote log fetches for serving consume requests, which can cause degradations. Also, this behaviour (of increased remote fetches) will happen on all brokers at all times, whereas in the KIP we restrict the behaviour to newly bootstrapped brokers only, and only until they fully build the local logs as per the retention defined at the cluster level. (Deprioritization of the broker could help reduce the impact even further.)
>>>>>
>>>>>> 2. Have you updated the KIP?
>>>>>
>>>>> The KIP has been updated now.
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jun
>>>>>>
>>>>>> On Tue, Apr 9, 2024 at 3:36 AM Satish Duggana <satish.dugg...@gmail.com> wrote:
>>>>>>
>>>>>>> +1 to Jun for adding the consumer-fetching-from-a-follower scenario to the existing section that talks about the drawback when a node built with last-tiered-offset has become a leader. As Abhijeet mentioned, we plan to have a follow-up KIP that will address these by deprioritizing those brokers. The deprioritization can be removed once they catch up to the local log retention.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Satish.
>>>>>>>
>>>>>>> On Tue, 9 Apr 2024 at 14:08, Luke Chen <show...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Abhijeet,
>>>>>>>>
>>>>>>>> Thanks for the KIP to improve the tiered storage feature!
>>>>>>>>
>>>>>>>> Questions:
>>>>>>>> 1. We could also get the "pending-upload-offset" and epoch via remote log metadata, instead of adding a new API to fetch them from the leader. Could you explain why you chose the latter approach instead of the former?
>>>>>>>> 2.
>>>>>>>>> We plan to have a follow-up KIP that will address both the deprioritization of these brokers from leadership, as well as for consumption (when fetching from followers is allowed).
>>>>>>>> I agree with Jun that we need to make clear all the possible drawbacks this could have. So, could we add the drawbacks Jun mentioned about the performance issue when consumers fetch from followers?
>>>>>>>>
>>>>>>>> 3. Could we add a "Rejected Alternatives" section to the end of the KIP to capture some of them? Like the "ListOffsetRequest" approach vs. the "Earliest-Pending-Upload-Offset" approach, or getting the "Earliest-Pending-Upload-Offset" from remote log metadata, etc.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>> Luke
>>>>>>>>
>>>>>>>> On Tue, Apr 9, 2024 at 2:25 PM Abhijeet Kumar <abhijeet.cse....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Christo,
>>>>>>>>>
>>>>>>>>> Please find my comments inline.
>>>>>>>>>
>>>>>>>>> On Fri, Apr 5, 2024 at 12:36 PM Christo Lolov <christolo...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Abhijeet and Jun,
>>>>>>>>>>
>>>>>>>>>> I have been mulling this KIP over a bit more in recent days!
>>>>>>>>>>
>>>>>>>>>> re: Jun
>>>>>>>>>>
>>>>>>>>>> I wasn't aware we apply 2.1 and 2.2 for reserving new timestamps - in retrospect it should have been fairly obvious. I would need to go and update KIP-1005 myself then, thank you for giving the useful reference!
>>>>>>>>>>
>>>>>>>>>> 4. I think Abhijeet wants to rebuild state from latest-tiered-offset and fetch from latest-tiered-offset + 1 only for new replicas (or replicas which experienced a disk failure) to decrease the time a partition spends in an under-replicated state. In other words, a follower which has just fallen out of ISR, but has local data, will continue using today's Tiered Storage replication protocol (i.e. fetching from earliest-local). I further believe he has taken this approach so that the local state of replicas which have just fallen out of ISR isn't forcefully wiped, thus leading to situation 1. Abhijeet, have I understood (and summarised) what you are proposing correctly?
>>>>>>>>>
>>>>>>>>> Yes, your understanding is correct. We want to limit the behavior changes only to new replicas.
>>>>>>>>>
>>>>>>>>>> 5. I think in today's Tiered Storage we know the leader's log-start-offset from the FetchResponse and we can learn its local-log-start-offset from ListOffsets by asking for the earliest-local timestamp (-4). But granted, this ought to be added as an additional API call in the KIP.
>>>>>>>>>
>>>>>>>>> Yes, I clarified this in my reply to Jun. I will add this missing detail in the KIP.
>>>>>>>>>
>>>>>>>>>> re: Abhijeet
>>>>>>>>>>
>>>>>>>>>> 101. I am still a bit confused as to why you want to include a new offset (i.e. pending-upload-offset) when you yourself mention that you could use an already existing offset (i.e. last-tiered-offset + 1). In essence, you end your Motivation with "In this KIP, we will focus only on the follower fetch protocol using the *last-tiered-offset*" and then in the following sections you talk about pending-upload-offset. I understand this might be classified as an implementation detail, but if you introduce a new offset (i.e. pending-upload-offset) you have to make a change to the ListOffsets API (i.e. introduce -6) and thus document it in this KIP as such. However, the last-tiered-offset ought to already be exposed as part of KIP-1005 (under implementation). Am I misunderstanding something here?
>>>>>>>>>
>>>>>>>>> I have tried to clarify this in my reply to Jun:
>>>>>>>>>
>>>>>>>>>> The follower needs to build the local data starting from the offset Earliest-Pending-Upload-Offset. Hence it needs the offset and the corresponding leader-epoch. There are two ways to do this:
>>>>>>>>>> 1. We add support in ListOffsetRequest to be able to fetch this offset (and leader epoch) from the leader.
>>>>>>>>>> 2. Or, fetch the tiered-offset (which is already supported). From this offset, we can get the Earliest-Pending-Upload-Offset by just adding 1 to the tiered-offset. However, we still need the leader epoch for the offset, since there is no guarantee that the leader epoch for Earliest-Pending-Upload-Offset will be the same as the leader epoch for tiered-offset. We may need another API call to the leader for this.
>>>>>>>>>> I prefer the first approach. The only problem with the first approach is that it introduces one more offset. The second approach avoids this problem but is a little complicated.
>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Christo
>>>>>>>>>>
>>>>>>>>>> On Thu, 4 Apr 2024 at 19:37, Jun Rao <j...@confluent.io.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, Abhijeet,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the KIP. Left a few comments.
>>>>>>>>>>>
>>>>>>>>>>> 1. "A drawback of using the last-tiered-offset is that this new follower would possess only a limited number of locally stored segments. Should it ascend to the role of leader, there is a risk of needing to fetch these segments from the remote storage, potentially impacting broker performance."
>>>>>>>>>>> Since we support consumers fetching from followers, this is a potential issue on the follower side too. In theory, it's possible for a segment to be tiered immediately after rolling. In that case, there could be very little data after last-tiered-offset. It would be useful to think through how to address this issue.
>>>>>>>>>>>
>>>>>>>>>>> 2. ListOffsetsRequest:
>>>>>>>>>>> 2.1 Typically, we need to bump up the version of the request if we add a new value for timestamp. See https://github.com/apache/kafka/pull/10760/files#diff-fac7080d67da905a80126d58fc1745c9a1409de7ef7d093c2ac66a888b134633.
>>>>>>>>>>> 2.2 Since this changes the inter-broker request protocol, it would be useful to have a section on upgrade (e.g. new IBP/metadata.version).
>>>>>>>>>>>
>>>>>>>>>>> 3. "Instead of fetching Earliest-Pending-Upload-Offset, it could fetch the last-tiered-offset from the leader, and make a separate leader call to fetch the leader epoch for the following offset."
>>>>>>>>>>> Why do we need to make a separate call for the leader epoch? ListOffsetsResponse includes both the offset and the corresponding epoch.
>>>>>>>>>>>
>>>>>>>>>>> 4. "Check if the follower replica is empty and if the feature to use last-tiered-offset is enabled."
>>>>>>>>>>> Why do we need to check if the follower replica is empty?
>>>>>>>>>>>
>>>>>>>>>>> 5. "It can be confirmed by checking if the leader's Log-Start-Offset is the same as the leader's Local-Log-Start-Offset."
>>>>>>>>>>> How does the follower know the Local-Log-Start-Offset?
>>>>>>>>>>>
>>>>>>>>>>> Jun
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Mar 30, 2024 at 5:51 AM Abhijeet Kumar <abhijeet.cse....@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Christo,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for reviewing the KIP.
>>>>>>>>>>>>
>>>>>>>>>>>> The follower needs the earliest-pending-upload-offset (and the corresponding leader epoch) from the leader. This is the first offset the follower will have locally.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Abhijeet.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Mar 29, 2024 at 1:14 PM Christo Lolov <christolo...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Heya!
>>>>>>>>>>>>>
>>>>>>>>>>>>> First of all, thank you very much for the proposal - you have explained the problem you want solved very well. I think a faster bootstrap of an empty replica is definitely an improvement!
>>>>>>>>>>>>>
>>>>>>>>>>>>> For my understanding, which concrete offset do you want the leader to give back to a follower - the earliest-pending-upload-offset or the latest-tiered-offset? If it is the second, then I believe KIP-1005 ought to already expose that offset as part of the ListOffsets API, no?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Christo
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, 27 Mar 2024 at 18:23, Abhijeet Kumar <abhijeet.cse....@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have created KIP-1023 to introduce follower fetch from tiered offset. This feature will be helpful in significantly reducing Kafka rebalance/rebuild times when the cluster is enabled with tiered storage.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1023%3A+Follower+fetch+from+tiered+offset
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Feedback and suggestions are welcome.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Abhijeet.
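
For concreteness, here is a rough sketch of approach 1 from the quoted discussion - extending ListOffsets so that an empty follower learns the Earliest-Pending-Upload-Offset and its leader epoch in a single round trip to the leader. The -6 sentinel and all names below are assumptions drawn from this thread, not a finalized Kafka API.

    // Illustrative sketch only; the -6 sentinel and every name here come
    // from the discussion above, not from a finalized Kafka API.
    class EarliestPendingUploadLookup {

        // Proposed new ListOffsets timestamp sentinel. The thread mentions
        // -4 (earliest-local) as an existing sentinel and -6 as the value
        // the new offset would introduce.
        static final long EARLIEST_PENDING_UPLOAD_TIMESTAMP = -6L;

        record OffsetAndEpoch(long offset, int leaderEpoch) {}

        // Hypothetical transport to the partition leader. Because the
        // response carries both the offset and its leader epoch (Jun's
        // point 3 above), no second call is needed for the epoch.
        interface LeaderClient {
            OffsetAndEpoch listOffset(String topic, int partition, long timestamp);
        }

        // The empty follower would start building local state at the
        // returned offset, seeding its leader-epoch cache with the epoch.
        static OffsetAndEpoch fetchStartState(LeaderClient leader,
                                              String topic, int partition) {
            return leader.listOffset(topic, partition,
                                     EARLIEST_PENDING_UPLOAD_TIMESTAMP);
        }
    }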