Following up on my previous comment:

An alternative approach would be to have an empty follower start
replication from last-tiered-offset (already available as part of
ListOffsets), inclusive. On the leader, we change the logic (based on a
configurable threshold) for when we return OffsetMovedToTieredException
vs. when we fetch from remote and return data to the follower.

As an example, the solution works as follows:
1. The follower asks the leader for fetch offset Y.
2. The leader checks whether (last-tiered-offset - Y > Z), where Z is a
configured threshold. If true, it returns OffsetMovedToTieredException
and the follower retries with fetch offset = last-tiered-offset. If
false, the leader fetches the data at offset Y from remote and returns
it to the follower. (A sketch of this check follows below.)
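
To make this concrete, here is a rough sketch in Java of the leader-side
check. All class and method names below are made up for illustration;
this is not the actual broker code.

// Hedged sketch of the leader-side check in steps 1-2 above. The names
// (OffsetMovedToTieredException, readFromRemote, readFromLocal) are
// illustrative stand-ins, not actual Kafka internals.
public class FollowerFetchSketch {

    /** Signals the follower to restart from last-tiered-offset. */
    static class OffsetMovedToTieredException extends RuntimeException { }

    /**
     * Decide how to answer a follower fetch at fetchOffset (Y).
     *
     * @param lastTieredOffset    highest offset already uploaded to remote storage
     * @param localLogStartOffset first offset still present on local disk
     * @param threshold           the configured value Z
     */
    static byte[] handleFollowerFetch(long fetchOffset,
                                      long lastTieredOffset,
                                      long localLogStartOffset,
                                      long threshold) {
        if (lastTieredOffset - fetchOffset > threshold) {
            // Too far behind the tiered data: make the follower retry
            // from last-tiered-offset instead of serving this fetch.
            throw new OffsetMovedToTieredException();
        }
        // Close enough: serve the data, reading from remote storage only
        // if offset Y has already been removed from the local log.
        return fetchOffset < localLogStartOffset
                ? readFromRemote(fetchOffset)
                : readFromLocal(fetchOffset);
    }

    // Illustrative stubs; a real broker would read actual log segments.
    static byte[] readFromRemote(long offset) { return new byte[0]; }
    static byte[] readFromLocal(long offset) { return new byte[0]; }
}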

The advantages of this approach over the proposed solution are:
1. We avoid the cyclic situation mentioned in my previous email.
2. It works with the existing protocol, which already returns
last-tiered-offset, i.e. we won't have to change the protocol to add the
new Earliest-Pending-Upload-Offset.

The disadvantages of this approach over the proposed solution are:
1. On the leader, we may have to fetch some data from remote storage to
respond to the follower. The amount of such data can be bounded via the
configured value Z, which can be set based on how aggressive the
upload/archival process is (see the config sketch below).
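
For illustration, here is a rough sketch of how Z could be exposed as a
broker config using Kafka's ConfigDef. The config name and default are
made up for this example; the KIP does not define such a config.

import org.apache.kafka.common.config.ConfigDef;

// Hedged sketch of exposing Z as a broker config; the name and default
// below are hypothetical, chosen only for illustration.
public class ThresholdConfigSketch {
    public static final String FOLLOWER_FETCH_REMOTE_THRESHOLD_CONFIG =
            "follower.fetch.remote.threshold.offsets"; // hypothetical name

    public static final ConfigDef CONFIG_DEF = new ConfigDef()
            .define(FOLLOWER_FETCH_REMOTE_THRESHOLD_CONFIG,
                    ConfigDef.Type.LONG,
                    10_000L, // hypothetical default for Z
                    ConfigDef.Importance.MEDIUM,
                    "Maximum offset lag behind last-tiered-offset for which "
                            + "the leader serves a follower fetch from remote "
                            + "storage instead of returning "
                            + "OffsetMovedToTieredException.");
}

An operator could then tune this value down for clusters with aggressive
tiering, and up where remote reads on the leader are cheap.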

--
Divij Vaidya



On Tue, Jul 2, 2024 at 12:25 PM Divij Vaidya <divijvaidy...@gmail.com>
wrote:

> Hi folks.
>
> I am late to the party but I have a question on the proposal.
>
> How are we preventing a situation such as the following:
>
> 1. Empty follower asks the leader for offset 0.
> 2. Leader compares 0 with last-tiered-offset and responds with 11
> (where 10 is the last-tiered-offset) and an OffsetMovedToTieredException.
> 3. Follower builds aux state from [0-10] and sets the fetch offset to 11.
> 4. But the leader has already uploaded more data and now the new
> last-tiered-offset is 15.
> 5. Go back to 2.
>
> This could cause a cycle where the replica will be stuck trying to
> reconcile with the leader.
>
> --
> Divij Vaidya
>
>
>
> On Fri, Apr 26, 2024 at 7:28 AM Abhijeet Kumar <abhijeet.cse....@gmail.com>
> wrote:
>
>> Thank you all for your comments. As all the comments in the thread have
>> been addressed, I am starting a Vote thread for the KIP. Please have a look.
>>
>> Regards.
>> Abhijeet.
>>
>>
>>
>> On Thu, Apr 25, 2024 at 6:08 PM Luke Chen <show...@gmail.com> wrote:
>>
>> > Hi, Abhijeet,
>> >
>> > Thanks for the update.
>> >
>> > I have no more comments.
>> >
>> > Luke
>> >
>> > On Thu, Apr 25, 2024 at 4:21 AM Jun Rao <j...@confluent.io.invalid>
>> wrote:
>> >
>> > > Hi, Abhijeet,
>> > >
>> > > Thanks for the updated KIP. It looks good to me.
>> > >
>> > > Jun
>> > >
>> > > On Mon, Apr 22, 2024 at 12:08 PM Abhijeet Kumar <
>> > > abhijeet.cse....@gmail.com>
>> > > wrote:
>> > >
>> > > > Hi Jun,
>> > > >
>> > > > Please find my comments inline.
>> > > >
>> > > >
>> > > > On Thu, Apr 18, 2024 at 3:26 AM Jun Rao <j...@confluent.io.invalid>
>> > > wrote:
>> > > >
>> > > > > Hi, Abhijeet,
>> > > > >
>> > > > > Thanks for the reply.
>> > > > >
>> > > > > 1. I am wondering if we could achieve the same result by just
>> > > > > lowering local.retention.ms and local.retention.bytes. This also
>> > > > > allows the newly started follower to build up the local data
>> > > > > before serving the consumer traffic.
>> > > > >
>> > > >
>> > > > I am not sure I fully followed this. Do you mean we could lower
>> > > > the local.retention (by size and time) so that there is little
>> > > > data in the leader's local storage and the follower can quickly
>> > > > catch up with the leader?
>> > > >
>> > > > In that case, we would need to set a small local retention across
>> > > > brokers in the cluster. That would have the undesired effect of
>> > > > increased remote log fetches for serving consume requests, which
>> > > > can cause degradation. Also, this behaviour (of increased remote
>> > > > fetches) would happen on all brokers at all times, whereas in the
>> > > > KIP we restrict the behaviour to newly bootstrapped brokers, and
>> > > > only until they fully build the local logs per the retention
>> > > > defined at the cluster level. (Deprioritization of the broker
>> > > > could help reduce the impact even further.)
>> > > >
>> > > >
>> > > > >
>> > > > > 2. Have you updated the KIP?
>> > > > >
>> > > >
>> > > > The KIP has been updated now.
>> > > >
>> > > >
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Jun
>> > > > >
>> > > > > On Tue, Apr 9, 2024 at 3:36 AM Satish Duggana <
>> > > satish.dugg...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > +1 to Jun on also adding the consumer-fetching-from-a-follower
>> > > > > > scenario to the existing section that talks about the drawback
>> > > > > > when a node built with last-tiered-offset has become a leader.
>> > > > > > As Abhijeet mentioned, we plan to have a follow-up KIP that
>> > > > > > will address these by deprioritizing these brokers. The
>> > > > > > deprioritization of those brokers can be removed once they
>> > > > > > catch up to the local log retention.
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Satish.
>> > > > > >
>> > > > > > On Tue, 9 Apr 2024 at 14:08, Luke Chen <show...@gmail.com>
>> wrote:
>> > > > > > >
>> > > > > > > Hi Abhijeet,
>> > > > > > >
>> > > > > > > Thanks for the KIP to improve the tiered storage feature!
>> > > > > > >
>> > > > > > > Questions:
>> > > > > > > 1. We could also get the "pending-upload-offset" and epoch
>> > > > > > > via remote log metadata, instead of adding a new API to
>> > > > > > > fetch them from the leader. Could you explain why you chose
>> > > > > > > the latter approach instead of the former?
>> > > > > > > 2.
>> > > > > > > > We plan to have a follow-up KIP that will address both the
>> > > > > > > > deprioritization of these brokers from leadership, as well
>> > > > > > > > as for consumption (when fetching from followers is allowed).
>> > > > > > >
>> > > > > > > I agree with Jun that we need to make clear all the possible
>> > > > > > > drawbacks this could have. So, could we add the drawbacks
>> > > > > > > that Jun mentioned about the performance issue when
>> > > > > > > consumers fetch from followers?
>> > > > > > >
>> > > > > > > 3. Could we add a "Rejected Alternatives" section at the end
>> > > > > > > of the KIP to capture some of them? E.g. the
>> > > > > > > "ListOffsetRequest" approach vs. the
>> > > > > > > "Earliest-Pending-Upload-Offset" approach, or getting the
>> > > > > > > "Earliest-Pending-Upload-Offset" from remote log
>> > > > > > > metadata, etc.
>> > > > > > >
>> > > > > > > Thanks.
>> > > > > > > Luke
>> > > > > > >
>> > > > > > >
>> > > > > > > On Tue, Apr 9, 2024 at 2:25 PM Abhijeet Kumar <
>> > > > > > abhijeet.cse....@gmail.com>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi Christo,
>> > > > > > > >
>> > > > > > > > Please find my comments inline.
>> > > > > > > >
>> > > > > > > > On Fri, Apr 5, 2024 at 12:36 PM Christo Lolov <
>> > > > > christolo...@gmail.com>
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hello Abhijeet and Jun,
>> > > > > > > > >
>> > > > > > > > > I have been mulling this KIP over a bit more in recent
>> days!
>> > > > > > > > >
>> > > > > > > > > re: Jun
>> > > > > > > > >
>> > > > > > > > > I wasn't aware we apply 2.1 and 2.2 for reserving new
>> > > > > > > > > timestamps - in retrospect it should have been fairly
>> > > > > > > > > obvious. I will need to go and update KIP-1005 myself
>> > > > > > > > > then, thank you for giving the useful reference!
>> > > reference!
>> > > > > > > > >
>> > > > > > > > > 4. I think Abhijeet wants to rebuild state from
>> > > > > > > > > latest-tiered-offset and fetch from latest-tiered-offset
>> > > > > > > > > + 1 only for new replicas (or replicas which experienced
>> > > > > > > > > a disk failure), to decrease the time a partition spends
>> > > > > > > > > in the under-replicated state. In other words, a
>> > > > > > > > > follower which has just fallen out of the ISR, but still
>> > > > > > > > > has local data, will continue using today's Tiered
>> > > > > > > > > Storage replication protocol (i.e. fetching from
>> > > > > > > > > earliest-local). I further believe he has taken this
>> > > > > > > > > approach so that the local state of replicas which have
>> > > > > > > > > just fallen out of the ISR isn't forcefully wiped, thus
>> > > > > > > > > leading to situation 1. Abhijeet, have I understood (and
>> > > > > > > > > summarised) what you are proposing correctly?
>> > > > > > > > >
>> > > > > > > > Yes, your understanding is correct. We want to limit the
>> > > > > > > > behavior changes only to new replicas.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > > 5. I think in today's Tiered Storage we know the
>> > > > > > > > > leader's log-start-offset from the FetchResponse, and we
>> > > > > > > > > can learn its local-log-start-offset from ListOffsets by
>> > > > > > > > > asking for the earliest-local timestamp (-4). But
>> > > > > > > > > granted, this ought to be added as an additional API
>> > > > > > > > > call in the KIP.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > Yes, I clarified this in my reply to Jun. I will add this
>> > > > > > > > missing detail in the KIP.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > > re: Abhijeet
>> > > > > > > > >
>> > > > > > > > > 101. I am still a bit confused as to why you want to
>> > > > > > > > > include a new offset (i.e. pending-upload-offset) when
>> > > > > > > > > you yourself mention that you could use an already
>> > > > > > > > > existing offset (i.e. last-tiered-offset + 1). In
>> > > > > > > > > essence, you end your Motivation with "In this KIP, we
>> > > > > > > > > will focus only on the follower fetch protocol using the
>> > > > > > > > > *last-tiered-offset*" and then in the following sections
>> > > > > > > > > you talk about pending-upload-offset. I understand this
>> > > > > > > > > might be classified as an implementation detail, but if
>> > > > > > > > > you introduce a new offset (i.e. pending-upload-offset)
>> > > > > > > > > you have to make a change to the ListOffsets API (i.e.
>> > > > > > > > > introduce -6) and thus document it in this KIP as such.
>> > > > > > > > > However, the last-tiered-offset ought to already be
>> > > > > > > > > exposed as part of KIP-1005 (under implementation). Am I
>> > > > > > > > > misunderstanding something here?
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > > I have tried to clarify this in my reply to Jun.
>> > > > > > > >
>> > > > > > > > > The follower needs to build the local data starting
>> > > > > > > > > from the offset Earliest-Pending-Upload-Offset. Hence it
>> > > > > > > > > needs the offset and the corresponding leader epoch.
>> > > > > > > > > There are two ways to do this:
>> > > > > > > > >    1. We add support in ListOffsetRequest to be able to
>> > > > > > > > > fetch this offset (and leader epoch) from the leader.
>> > > > > > > > >    2. Or, fetch the tiered-offset (which is already
>> > > > > > > > > supported). From this offset, we can get the
>> > > > > > > > > Earliest-Pending-Upload-Offset by just adding 1 to the
>> > > > > > > > > tiered-offset. However, we still need the leader epoch
>> > > > > > > > > for that offset, since there is no guarantee that the
>> > > > > > > > > leader epoch for Earliest-Pending-Upload-Offset will be
>> > > > > > > > > the same as the leader epoch for tiered-offset. We may
>> > > > > > > > > need another API call to the leader for this.
>> > > > > > > > > I prefer the first approach. The only problem with the
>> > > > > > > > > first approach is that it introduces one more offset.
>> > > > > > > > > The second approach avoids this problem but is a little
>> > > > > > > > > complicated.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Best,
>> > > > > > > > > Christo
>> > > > > > > > >
>> > > > > > > > > On Thu, 4 Apr 2024 at 19:37, Jun Rao <j...@confluent.io.invalid>
>> > > > > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > > > Hi, Abhijeet,
>> > > > > > > > > >
>> > > > > > > > > > Thanks for the KIP. Left a few comments.
>> > > > > > > > > >
>> > > > > > > > > > 1. "A drawback of using the last-tiered-offset is that
>> this
>> > > new
>> > > > > > > > follower
>> > > > > > > > > > would possess only a limited number of locally stored
>> > > segments.
>> > > > > > Should
>> > > > > > > > it
>> > > > > > > > > > ascend to the role of leader, there is a risk of
>> needing to
>> > > > fetch
>> > > > > > these
>> > > > > > > > > > segments from the remote storage, potentially impacting
>> > > broker
>> > > > > > > > > > performance."
>> > > > > > > > > > Since we support consumers fetching from followers, this
>> > is a
>> > > > > > potential
>> > > > > > > > > > issue on the follower side too. In theory, it's possible
>> > for
>> > > a
>> > > > > > segment
>> > > > > > > > to
>> > > > > > > > > > be tiered immediately after rolling. In that case, there
>> > > could
>> > > > be
>> > > > > > very
>> > > > > > > > > > little data after last-tiered-offset. It would be
>> useful to
>> > > > think
>> > > > > > > > through
>> > > > > > > > > > how to address this issue.
>> > > > > > > > > >
>> > > > > > > > > > 2. ListOffsetsRequest:
>> > > > > > > > > > 2.1 Typically, we need to bump up the version of the
>> > > > > > > > > > request if we add a new value for timestamp. See
>> https://github.com/apache/kafka/pull/10760/files#diff-fac7080d67da905a80126d58fc1745c9a1409de7ef7d093c2ac66a888b134633
>> > > > > > > > > > 2.2 Since this changes the inter-broker request
>> > > > > > > > > > protocol, it would be useful to have a section on
>> > > > > > > > > > upgrade (e.g. new IBP/metadata.version).
>> > > > > > > > > >
>> > > > > > > > > > 3. "Instead of fetching Earliest-Pending-Upload-Offset,
>> it
>> > > > could
>> > > > > > fetch
>> > > > > > > > > the
>> > > > > > > > > > last-tiered-offset from the leader, and make a separate
>> > > leader
>> > > > > > call to
>> > > > > > > > > > fetch leader epoch for the following offset."
>> > > > > > > > > > Why do we need to make a separate call for the leader
>> > epoch?
>> > > > > > > > > > ListOffsetsResponse include both the offset and the
>> > > > corresponding
>> > > > > > > > epoch.
>> > > > > > > > > >
>> > > > > > > > > > 4. "Check if the follower replica is empty and if the
>> > feature
>> > > > to
>> > > > > > use
>> > > > > > > > > > last-tiered-offset is enabled."
>> > > > > > > > > > Why do we need to check if the follower replica is
>> empty?
>> > > > > > > > > >
>> > > > > > > > > > 5. "It can be confirmed by checking if the leader's
>> > > > > > > > > > Log-Start-Offset is the same as the leader's
>> > > > > > > > > > Local-Log-Start-Offset."
>> > > > > > > > > > How does the follower know the Local-Log-Start-Offset?
>> > > > > > > > > >
>> > > > > > > > > > Jun
>> > > > > > > > > >
>> > > > > > > > > > On Sat, Mar 30, 2024 at 5:51 AM Abhijeet Kumar <
>> > > > > > > > > abhijeet.cse....@gmail.com
>> > > > > > > > > > >
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Hi Christo,
>> > > > > > > > > > >
>> > > > > > > > > > > Thanks for reviewing the KIP.
>> > > > > > > > > > >
>> > > > > > > > > > > The follower needs the earliest-pending-upload-offset
>> > > > > > > > > > > (and the corresponding leader epoch) from the leader.
>> > > > > > > > > > > This is the first offset the follower will have
>> > > > > > > > > > > locally.
>> > > > > > > > > > >
>> > > > > > > > > > > Regards,
>> > > > > > > > > > > Abhijeet.
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > On Fri, Mar 29, 2024 at 1:14 PM Christo Lolov <
>> > > > > > > > christolo...@gmail.com>
>> > > > > > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Heya!
>> > > > > > > > > > > >
>> > > > > > > > > > > > First of all, thank you very much for the
>> > > > > > > > > > > > proposal; you have explained the problem you want
>> > > > > > > > > > > > solved very well - I think a faster bootstrap of
>> > > > > > > > > > > > an empty replica is definitely an improvement!
>> > > > > > > > > > > >
>> > > > > > > > > > > > For my understanding, which concrete offset do you
>> > > > > > > > > > > > want the leader to give back to a follower -
>> > > > > > > > > > > > the earliest-pending-upload-offset or the
>> > > > > > > > > > > > latest-tiered-offset? If it is the second, then I
>> > > > > > > > > > > > believe KIP-1005 ought to already be exposing that
>> > > > > > > > > > > > offset as part of the ListOffsets API, no?
>> > > > > > > > > > > >
>> > > > > > > > > > > > Best,
>> > > > > > > > > > > > Christo
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Wed, 27 Mar 2024 at 18:23, Abhijeet Kumar <
>> > > > > > > > > > abhijeet.cse....@gmail.com
>> > > > > > > > > > > >
>> > > > > > > > > > > > wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > Hi All,
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > I have created KIP-1023 to introduce follower
>> > > > > > > > > > > > > fetch from tiered offset. This feature will be
>> > > > > > > > > > > > > helpful in significantly reducing Kafka
>> > > > > > > > > > > > > rebalance/rebuild times when the cluster is
>> > > > > > > > > > > > > enabled with tiered storage.
>> > > > > > > > > > > > >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1023%3A+Follower+fetch+from+tiered+offset
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Feedback and suggestions are welcome.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Regards,
>> > > > > > > > > > > > > Abhijeet.
>> > > > > > > > > > > > >
