Thank you for your response, Abhijeet. You have understood the scenario correctly. For the purpose of discussion, please consider the latter case, where offset 11 is no longer available on the leader (it was cleaned locally since the last tiered offset is 15). In such a case, you mentioned, the follower will eventually be able to catch up with the leader by resetting its fetch offset until the offset is available on the leader's local log. Correct me if I am wrong, but it is not guaranteed that it will eventually catch up, because theoretically, every time it asks for a newer fetch offset, the leader may have already deleted it locally. I understand that this is an edge case which will only happen with configurations for small segment sizes and aggressive cleaning, but nevertheless, it is a possible scenario.
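
To make the race concrete, here is a rough sketch of the loop I am describing (all names are hypothetical, this is not the actual replica fetcher code). Termination depends on the leader still holding the reset fetch offset locally by the time the next fetch arrives, which is exactly what aggressive local cleaning can defeat:

    // Hypothetical leader/follower interfaces, just enough to show the loop.
    interface Leader {
        FetchResult fetch(long offset);
        long earliestPendingUploadOffset();
    }
    interface Follower {
        void buildAuxState(long fromInclusive, long toInclusive);
        void append(Object records);
    }
    interface FetchResult {
        boolean offsetMovedToTieredStorage();
        Object records();
        long nextFetchOffset();
    }

    class CatchUpLoop {
        static void run(Leader leader, Follower follower) {
            long fetchOffset = 0L;
            while (true) {
                FetchResult result = leader.fetch(fetchOffset);
                if (result.offsetMovedToTieredStorage()) {
                    // The leader has already tiered and locally deleted fetchOffset:
                    // reset to just past the last tiered offset and try again.
                    long pendingUpload = leader.earliestPendingUploadOffset();
                    follower.buildAuxState(fetchOffset, pendingUpload - 1);
                    fetchOffset = pendingUpload;
                    continue; // in theory, this branch can be taken on every iteration
                }
                follower.append(result.records());
                fetchOffset = result.nextFetchOffset();
            }
        }
    }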
Do you agree that theoretically it is possible for the follower to loop such that it is never able to catch up? We can proceed with the KIP with the understanding that this scenario is rare and that we are willing to accept the risk of it. In that case, we should add a detection mechanism for this scenario to the KIP, so that if we do encounter it, the user has a way to detect (and mitigate) it. Alternatively, we can change the KIP design to ensure that we never encounter this scenario. Given the rarity of the scenario, I am OK with having a detection mechanism (a metric?) in place and having this scenario documented as an acceptable risk in the current design.
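
For the record, the detection could be as simple as counting consecutive fetch-offset resets per partition and alerting past a threshold. A minimal sketch (again with hypothetical names; this is not an existing Kafka metric):

    import java.util.concurrent.atomic.AtomicInteger;

    public class TieredResetDetector {
        private final AtomicInteger consecutiveResets = new AtomicInteger();
        private final int alertThreshold;

        public TieredResetDetector(int alertThreshold) {
            this.alertThreshold = alertThreshold;
        }

        // Called whenever the follower must reset its fetch offset because the
        // leader has already tiered and locally deleted the requested offset.
        public void onFetchOffsetReset(String topicPartition) {
            if (consecutiveResets.incrementAndGet() >= alertThreshold) {
                // In a real implementation this would feed a broker metric
                // (e.g. a "stuck catch-up" gauge) rather than a log line.
                System.err.printf("Follower of %s reset its fetch offset %d times in a row; "
                        + "it may be unable to catch up with the leader%n",
                        topicPartition, consecutiveResets.get());
            }
        }

        // Called on the first successful fetch after a reset.
        public void onSuccessfulFetch() {
            consecutiveResets.set(0);
        }
    }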
--
Divij Vaidya

On Tue, Jul 23, 2024 at 11:55 AM Abhijeet Kumar <abhijeet.cse....@gmail.com> wrote:

> Hi Divij,
>
> It seems like there is some confusion about the new protocol for fetching from the tiered offset. The scenario you are highlighting is where:
>
> Leader's Log Start Offset = 0
> Last Tiered Offset = 10
>
> The following is the sequence of events that will happen:
>
> 1. Follower requests offset 0 from the leader.
> 2. Assuming offset 0 is not available locally (to arrive at your scenario), the leader returns OffsetMovedToTieredStorageException.
> 3. Follower fetches the earliest pending upload offset and receives 11.
> 4. Follower builds aux state from [0-10] and sets the fetch offset to 11. (This step corresponds to step 3 in your email.)
>
> At this stage, even if the leader has uploaded more data and the last-tiered offset has changed (say to 15), it will not matter, because offset 11 should still be available on the leader, and when the follower requests data with fetch offset 11, the leader will return a valid partition data response which the follower can consume and proceed further. Even if offset 11 is not available anymore, the follower will eventually be able to catch up with the leader by resetting its fetch offset until the offset is available on the leader's local log. Once it catches up, replication on the follower can proceed.
>
> Regards,
> Abhijeet.
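
(Connecting this to the sketch earlier in my mail - reusing those hypothetical Leader/Follower interfaces, the four steps above are roughly:)

    static void bootstrapEmptyFollower(Leader leader, Follower follower) {
        long fetchOffset = 0L;                                   // step 1: empty follower asks for 0
        FetchResult result = leader.fetch(fetchOffset);
        if (result.offsetMovedToTieredStorage()) {               // step 2: offset 0 already tiered
            long pendingUpload = leader.earliestPendingUploadOffset(); // step 3: returns 11 here
            follower.buildAuxState(0L, pendingUpload - 1);       // step 4: aux state for [0-10]
            fetchOffset = pendingUpload;                         // subsequent fetches start at 11
        }
        // fall through to the normal fetch loop with fetchOffset
    }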
> On Tue, Jul 2, 2024 at 3:55 PM Divij Vaidya <divijvaidy...@gmail.com> wrote:
>
> > Hi folks.
> >
> > I am late to the party but I have a question on the proposal.
> >
> > How are we preventing a situation such as the following:
> >
> > 1. Empty follower asks leader for 0
> > 2. Leader compares 0 with last-tiered-offset, and responds with 11 (where 10 is last-tiered-offset) and an OffsetMovedToTieredException
> > 3. Follower builds aux state from [0-10] and sets the fetch offset to 11
> > 4. But the leader has already uploaded more data and now the new last-tiered-offset is 15
> > 5. Go back to 2
> >
> > This could cause a cycle where the replica will be stuck trying to reconcile with the leader.
> >
> > --
> > Divij Vaidya
> >
> > On Fri, Apr 26, 2024 at 7:28 AM Abhijeet Kumar <abhijeet.cse....@gmail.com> wrote:
> >
> > > Thank you all for your comments. As all the comments in the thread are addressed, I am starting a Vote thread for the KIP. Please have a look.
> > >
> > > Regards.
> > > Abhijeet.
> > >
> > > On Thu, Apr 25, 2024 at 6:08 PM Luke Chen <show...@gmail.com> wrote:
> > >
> > > > Hi, Abhijeet,
> > > >
> > > > Thanks for the update.
> > > >
> > > > I have no more comments.
> > > >
> > > > Luke
> > > >
> > > > On Thu, Apr 25, 2024 at 4:21 AM Jun Rao <j...@confluent.io.invalid> wrote:
> > > >
> > > > > Hi, Abhijeet,
> > > > >
> > > > > Thanks for the updated KIP. It looks good to me.
> > > > >
> > > > > Jun
> > > > >
> > > > > On Mon, Apr 22, 2024 at 12:08 PM Abhijeet Kumar <abhijeet.cse....@gmail.com> wrote:
> > > > >
> > > > > > Hi Jun,
> > > > > >
> > > > > > Please find my comments inline.
> > > > > >
> > > > > > On Thu, Apr 18, 2024 at 3:26 AM Jun Rao <j...@confluent.io.invalid> wrote:
> > > > > >
> > > > > > > Hi, Abhijeet,
> > > > > > >
> > > > > > > Thanks for the reply.
> > > > > > >
> > > > > > > 1. I am wondering if we could achieve the same result by just lowering local.retention.ms and local.retention.bytes. This also allows the newly started follower to build up the local data before serving the consumer traffic.
> > > > > >
> > > > > > I am not sure I fully followed this. Do you mean we could lower the local.retention (by size and time) so that there is little data on the leader's local storage and the follower can quickly catch up with the leader?
> > > > > >
> > > > > > In that case, we will need to set small local retention across the brokers in the cluster. It will have the undesired effect of increased remote log fetches for serving consume requests, and this can cause degradations. Also, this behaviour (of increased remote fetches) will happen on all brokers at all times, whereas in the KIP we are restricting the behavior only to the newly bootstrapped brokers and only until the time they fully build the local logs as per the retention defined at the cluster level. (Deprioritization of the broker could help reduce the impact even further.)
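
(For concreteness, Jun's alternative amounts to shrinking the local retention configs, e.g. with the standard AdminClient. The config keys are the tiered-storage ones from KIP-405; the topic name and values are arbitrary examples:)

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class ShrinkLocalRetention {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
                List<AlterConfigOp> ops = List.of(
                        new AlterConfigOp(new ConfigEntry("local.retention.ms", "600000"),
                                AlterConfigOp.OpType.SET),
                        new AlterConfigOp(new ConfigEntry("local.retention.bytes", "1073741824"),
                                AlterConfigOp.OpType.SET));
                // Keep only ~10 minutes / 1 GiB locally; older data is served from remote storage.
                admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
            }
        }
    }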
> > > > > > >
> > > > > > > 2. Have you updated the KIP?
> > > > > >
> > > > > > The KIP has been updated now.
> > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Tue, Apr 9, 2024 at 3:36 AM Satish Duggana <satish.dugg...@gmail.com> wrote:
> > > > > > >
> > > > > > > > +1 to Jun for adding the consumer-fetching-from-a-follower scenario to the existing section that talks about the drawback when a node built with last-tiered-offset has become a leader. As Abhijeet mentioned, we plan to have a follow-up KIP that will address these by having a deprioritization of these brokers. The deprioritization of those brokers can be removed once they catch up until the local log retention.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Satish.
> > > > > > > >
> > > > > > > > On Tue, 9 Apr 2024 at 14:08, Luke Chen <show...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Abhijeet,
> > > > > > > > >
> > > > > > > > > Thanks for the KIP to improve the tiered storage feature!
> > > > > > > > >
> > > > > > > > > Questions:
> > > > > > > > > 1. We could also get the "pending-upload-offset" and epoch via remote log metadata, instead of adding a new API to fetch them from the leader. Could you explain why you chose the latter approach instead of the former?
> > > > > > > > > 2.
> > > > > > > > > > We plan to have a follow-up KIP that will address both the deprioritization of these brokers from leadership, as well as for consumption (when fetching from followers is allowed).
> > > > > > > > >
> > > > > > > > > I agree with Jun that we might need to make clear all the possible drawbacks this could have. So, could we add the drawbacks that Jun mentioned about the performance issue when consumers fetch from followers?
> > > > > > > > >
> > > > > > > > > 3. Could we add a "Rejected Alternatives" section to the end of the KIP to record some of them? Like the "ListOffsetRequest" approach vs. the "Earliest-Pending-Upload-Offset" approach, or getting the "Earliest-Pending-Upload-Offset" from remote log metadata... etc.
> > > > > > > > >
> > > > > > > > > Thanks.
> > > > > > > > > Luke
> > > > > > > > >
> > > > > > > > > On Tue, Apr 9, 2024 at 2:25 PM Abhijeet Kumar <abhijeet.cse....@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Christo,
> > > > > > > > > >
> > > > > > > > > > Please find my comments inline.
> > > > > > > > > >
> > > > > > > > > > On Fri, Apr 5, 2024 at 12:36 PM Christo Lolov <christolo...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hello Abhijeet and Jun,
> > > > > > > > > > >
> > > > > > > > > > > I have been mulling this KIP over a bit more in recent days!
> > > > > > > > > > >
> > > > > > > > > > > re: Jun
> > > > > > > > > > >
> > > > > > > > > > > I wasn't aware we apply 2.1 and 2.2 for reserving new timestamps - in retrospect it should have been fairly obvious. I would need to go and update KIP-1005 myself then, thank you for giving the useful reference!
> > > > > > > > > > >
> > > > > > > > > > > 4. I think Abhijeet wants to rebuild state from latest-tiered-offset and fetch from latest-tiered-offset + 1 only for new replicas (or replicas which experienced a disk failure) to decrease the time a partition spends in an under-replicated state. In other words, a follower which has just fallen out of ISR, but has local data, will continue using today's Tiered Storage replication protocol (i.e. fetching from earliest-local). I further believe he has taken this approach so that the local state of replicas which have just fallen out of ISR isn't forcefully wiped, thus leading to situation 1. Abhijeet, have I understood (and summarised) what you are proposing correctly?
> > > > > > > > > >
> > > > > > > > > > Yes, your understanding is correct. We want to limit the behavior changes only to new replicas.
> > > > > > > > > >
> > > > > > > > > > > 5. I think in today's Tiered Storage we know the leader's log-start-offset from the FetchResponse and we can learn its local-log-start-offset from ListOffsets by asking for the earliest-local timestamp (-4). But granted, this ought to be added as an additional API call in the KIP.
> > > > > > > > > >
> > > > > > > > > > Yes, I clarified this in my reply to Jun. I will add this missing detail in the KIP.
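
(As a concrete sketch of that additional API call - reading the leader's local-log-start-offset via the earliest-local timestamp. I am assuming the typed OffsetSpec proposed in KIP-1005, which is still under implementation, so the method name may change:)

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
    import org.apache.kafka.clients.admin.OffsetSpec;
    import org.apache.kafka.common.TopicPartition;
    import java.util.Map;

    public class EarliestLocalLookup {
        // Returns the leader's local-log-start-offset (and leader epoch) for tp.
        static ListOffsetsResultInfo earliestLocal(Admin admin, TopicPartition tp) throws Exception {
            // OffsetSpec.earliestLocal() corresponds to the earliest-local timestamp (-4).
            return admin.listOffsets(Map.of(tp, OffsetSpec.earliestLocal()))
                        .partitionResult(tp)
                        .get();
        }
    }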
> > > > > > > > > > >
> > > > > > > > > > > re: Abhijeet
> > > > > > > > > > >
> > > > > > > > > > > 101. I am still a bit confused as to why you want to include a new offset (i.e. pending-upload-offset) when you yourself mention that you could use an already existing offset (i.e. last-tiered-offset + 1). In essence, you end your Motivation with "In this KIP, we will focus only on the follower fetch protocol using the *last-tiered-offset*" and then in the following sections you talk about pending-upload-offset. I understand this might be classified as an implementation detail, but if you introduce a new offset (i.e. pending-upload-offset) you have to make a change to the ListOffsets API (i.e. introduce -6) and thus document it in this KIP as such. However, the last-tiered-offset ought to already be exposed as part of KIP-1005 (under implementation). Am I misunderstanding something here?
> > > > > > > > > >
> > > > > > > > > > I have tried to clarify this in my reply to Jun:
> > > > > > > > > >
> > > > > > > > > > > The follower needs to build the local data starting from the offset Earliest-Pending-Upload-Offset. Hence it needs the offset and the corresponding leader epoch. There are two ways to do this:
> > > > > > > > > > > 1. We add support in ListOffsetRequest to be able to fetch this offset (and leader epoch) from the leader.
> > > > > > > > > > > 2. Or, fetch the tiered-offset (which is already supported). From this offset, we can get the Earliest-Pending-Upload-Offset: we can just add 1 to the tiered-offset. However, we still need the leader epoch for that offset, since there is no guarantee that the leader epoch for the Earliest-Pending-Upload-Offset will be the same as the leader epoch for the tiered-offset. We may need another API call to the leader for this.
> > > > > > > > > > >
> > > > > > > > > > > I prefer the first approach. The only problem with the first approach is that it introduces one more offset. The second approach avoids this problem but is a little complicated.
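
(The epoch subtlety in the second approach is easy to see with a toy leader-epoch lineage: an epoch change can fall exactly between tiered-offset and tiered-offset + 1, so the epoch must be looked up rather than reused:)

    import java.util.NavigableMap;
    import java.util.TreeMap;

    public class EpochLookup {
        public static void main(String[] args) {
            // start offset of each epoch -> epoch, as in a leader-epoch checkpoint
            NavigableMap<Long, Integer> epochByStartOffset = new TreeMap<>();
            epochByStartOffset.put(0L, 0);   // epoch 0 starts at offset 0
            epochByStartOffset.put(11L, 1);  // epoch 1 starts at offset 11

            long tieredOffset = 10L;
            int epochOfTiered = epochByStartOffset.floorEntry(tieredOffset).getValue();      // 0
            int epochOfPending = epochByStartOffset.floorEntry(tieredOffset + 1).getValue(); // 1

            // The follower cannot reuse the tiered-offset's epoch (0) for the
            // Earliest-Pending-Upload-Offset (epoch 1): it must look it up.
            System.out.println(epochOfTiered + " vs " + epochOfPending);
        }
    }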
"Check if the follower replica is empty and if the > > > > feature > > > > > > to > > > > > > > > use > > > > > > > > > > > > last-tiered-offset is enabled." > > > > > > > > > > > > Why do we need to check if the follower replica is > > empty? > > > > > > > > > > > > > > > > > > > > > > > > 5. It can be confirmed by checking if the leader's > > > > > > > > Log-Start-Offset is > > > > > > > > > > > the > > > > > > > > > > > > same as the Leader's Local-Log-Start-Offset. > > > > > > > > > > > > How does the follower know Local-Log-Start-Offset? > > > > > > > > > > > > > > > > > > > > > > > > Jun > > > > > > > > > > > > > > > > > > > > > > > > On Sat, Mar 30, 2024 at 5:51 AM Abhijeet Kumar < > > > > > > > > > > > abhijeet.cse....@gmail.com > > > > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > Hi Christo, > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for reviewing the KIP. > > > > > > > > > > > > > > > > > > > > > > > > > > The follower needs the > earliest-pending-upload-offset > > > > (and > > > > > > the > > > > > > > > > > > > > corresponding leader epoch) from the leader. > > > > > > > > > > > > > This is the first offset the follower will have > > > locally. > > > > > > > > > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > > Abhijeet. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Mar 29, 2024 at 1:14 PM Christo Lolov < > > > > > > > > > > christolo...@gmail.com> > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > Heya! > > > > > > > > > > > > > > > > > > > > > > > > > > > > First of all, thank you very much for the > proposal, > > > you > > > > > > have > > > > > > > > > > > explained > > > > > > > > > > > > > the > > > > > > > > > > > > > > problem you want solved very well - I think a > > faster > > > > > > > bootstrap > > > > > > > > of > > > > > > > > > > an > > > > > > > > > > > > > empty > > > > > > > > > > > > > > replica is definitely an improvement! > > > > > > > > > > > > > > > > > > > > > > > > > > > > For my understanding, which concrete offset do > you > > > want > > > > > the > > > > > > > > leader > > > > > > > > > > to > > > > > > > > > > > > > give > > > > > > > > > > > > > > back to a follower - > earliest-pending-upload-offset > > > or > > > > > the > > > > > > > > > > > > > > latest-tiered-offset? If it is the second, then I > > > > believe > > > > > > > > KIP-1005 > > > > > > > > > > > > ought > > > > > > > > > > > > > to > > > > > > > > > > > > > > already be exposing that offset as part of the > > > > > ListOffsets > > > > > > > > API, no? > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > > > > Christo > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 27 Mar 2024 at 18:23, Abhijeet Kumar < > > > > > > > > > > > > abhijeet.cse....@gmail.com > > > > > > > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi All, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I have created KIP-1023 to introduce follower > > fetch > > > > > from > > > > > > > > tiered > > > > > > > > > > > > offset. > > > > > > > > > > > > > > > This feature will be helpful in significantly > > > > reducing > > > > > > > Kafka > > > > > > > > > > > > > > > rebalance/rebuild times when the cluster is > > enabled > > > > > with > > > > > > > > tiered > > > > > > > > > > > > > storage. 
> > > > > > > > > > > >
> > > > > > > > > > > > On Sat, Mar 30, 2024 at 5:51 AM Abhijeet Kumar <abhijeet.cse....@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Christo,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for reviewing the KIP.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The follower needs the earliest-pending-upload-offset (and the corresponding leader epoch) from the leader. This is the first offset the follower will have locally.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > Abhijeet.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Mar 29, 2024 at 1:14 PM Christo Lolov <christolo...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Heya!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > First of all, thank you very much for the proposal, you have explained the problem you want solved very well - I think a faster bootstrap of an empty replica is definitely an improvement!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > For my understanding, which concrete offset do you want the leader to give back to a follower - the earliest-pending-upload-offset or the latest-tiered-offset? If it is the second, then I believe KIP-1005 ought to already be exposing that offset as part of the ListOffsets API, no?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > Christo
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, 27 Mar 2024 at 18:23, Abhijeet Kumar <abhijeet.cse....@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I have created KIP-1023 to introduce follower fetch from tiered offset. This feature will be helpful in significantly reducing Kafka rebalance/rebuild times when the cluster is enabled with tiered storage.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1023%3A+Follower+fetch+from+tiered+offset
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Feedback and suggestions are welcome.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > > Abhijeet.