Hi Divij,

Thanks for the KIP! Sorry for the late reply.

Can you explain the rejected alternative-3?
Store the cumulative size of remote tier log in-memory at RemoteLogManager
"*Cons*: Every time a broker starts-up, it will scan through all the
segments in the remote tier to initialise the in-memory value. This would
increase the broker start-up time."

Keeping the source of truth for the remote-log-size in the leader would be
consistent across different implementations of the plugin. The concern posted
in the KIP is that we are recalculating the remote-log-size on each iteration
of the cleaner thread (say, every 5 mins). If we calculate it only once, during
broker startup or on leadership reassignment, do we still need the cache?

The broker startup time won't be affected by the remote log manager
initialisation. The broker can continue to accept new produce/fetch requests,
while the RLM thread in the background determines the remote-log-size once and
then starts copying/deleting the segments.
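
To make this concrete, here is a minimal sketch of the kind of bookkeeping I
have in mind. The helper class and its method names are illustrative only (not
actual RLM code); the RemoteLogMetadataManager / RemoteLogSegmentMetadata types
and listRemoteLogSegments() are the KIP-405 interfaces, with import paths as I
recall them:

import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.kafka.common.TopicIdPartition;
import org.apache.kafka.server.log.remote.storage.RemoteLogMetadataManager;
import org.apache.kafka.server.log.remote.storage.RemoteLogSegmentMetadata;
import org.apache.kafka.server.log.remote.storage.RemoteStorageException;

// Hypothetical helper owned by the RLM: one running total per partition.
final class RemoteLogSizeTracker {
    private final RemoteLogMetadataManager rlmm;
    private final Map<TopicIdPartition, AtomicLong> sizes = new ConcurrentHashMap<>();

    RemoteLogSizeTracker(RemoteLogMetadataManager rlmm) {
        this.rlmm = rlmm;
    }

    // Runs on the RLM background thread once leadership is assigned. This is the
    // only full listing; produce/fetch handling is never blocked on it.
    long initialize(TopicIdPartition tp, Set<Integer> leaderEpochs) throws RemoteStorageException {
        long total = 0;
        for (int epoch : leaderEpochs) {
            Iterator<RemoteLogSegmentMetadata> it = rlmm.listRemoteLogSegments(tp, epoch);
            while (it.hasNext())
                total += it.next().segmentSizeInBytes();
        }
        sizes.put(tp, new AtomicLong(total));
        return total;
    }

    // Called after a segment is successfully copied to remote storage.
    void onCopy(TopicIdPartition tp, RemoteLogSegmentMetadata metadata) {
        sizes.computeIfAbsent(tp, k -> new AtomicLong()).addAndGet(metadata.segmentSizeInBytes());
    }

    // Called after a segment is deleted from remote storage by retention.
    void onDelete(TopicIdPartition tp, RemoteLogSegmentMetadata metadata) {
        sizes.computeIfAbsent(tp, k -> new AtomicLong()).addAndGet(-metadata.segmentSizeInBytes());
    }

    // Read by the size-based retention check instead of re-scanning all segments.
    long size(TopicIdPartition tp) {
        AtomicLong s = sizes.get(tp);
        return s == null ? 0L : s.get();
    }
}

With something like this, every subsequent retention check reads an in-memory
value, and the open question reduces to whether one listing per leadership
change is acceptable.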

Thanks,
Kamal

On Thu, Jun 1, 2023 at 2:08 PM Divij Vaidya <divijvaidy...@gmail.com> wrote:

> Satish / Jun
>
> Do you have any thoughts on this?
>
> --
> Divij Vaidya
>
>
>
> On Tue, Feb 14, 2023 at 4:15 PM Divij Vaidya <divijvaidy...@gmail.com>
> wrote:
>
> > Hey Jun
> >
> > It has been a while since this KIP got some attention. While we wait for
> > Satish to chime in here, perhaps I can answer your question.
> >
> > > Could you explain how you exposed the log size in your KIP-405
> > implementation?
> >
> > The APIs available in RLMM as per KIP405
> > are, addRemoteLogSegmentMetadata(), updateRemoteLogSegmentMetadata(),
> remoteLogSegmentMetadata(), highestOffsetForEpoch(),
> putRemotePartitionDeleteMetadata(), listRemoteLogSegments(),
> onPartitionLeadershipChanges()
> > and onStopPartitions(). None of these APIs allow us to expose the log
> size,
> > hence, the only option that remains is to list all segments using
> > listRemoteLogSegments() and aggregate them every time we require to
> > calculate the size. Based on our prior discussion, this requires reading
> > all segment metadata which won't work for non-local RLMM implementations.
> > Satish's implementation also performs a full scan and calculates the
> > aggregate. see:
> >
> https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/remote/RemoteLogManager.scala#L619
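> >
> > That aggregation is essentially the following (a sketch only; the helper
> > class is hypothetical, while the RLMM and metadata types are the KIP-405
> > interfaces, with import paths as I recall them):
> >
> > import java.util.Iterator;
> > import org.apache.kafka.common.TopicIdPartition;
> > import org.apache.kafka.server.log.remote.storage.RemoteLogMetadataManager;
> > import org.apache.kafka.server.log.remote.storage.RemoteLogSegmentMetadata;
> > import org.apache.kafka.server.log.remote.storage.RemoteStorageException;
> >
> > final class RemoteLogSizeScan {
> >     // Reads the metadata of every remote segment of the partition just to
> >     // produce a single long, on every size-based retention check.
> >     static long remoteLogSize(RemoteLogMetadataManager rlmm, TopicIdPartition tp)
> >             throws RemoteStorageException {
> >         long size = 0;
> >         Iterator<RemoteLogSegmentMetadata> it = rlmm.listRemoteLogSegments(tp);
> >         while (it.hasNext())
> >             size += it.next().segmentSizeInBytes();
> >         return size;
> >     }
> > }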
> >
> >
> > Does this answer your question?
> >
> > --
> > Divij Vaidya
> >
> >
> >
> > On Tue, Dec 20, 2022 at 8:40 PM Jun Rao <j...@confluent.io.invalid>
> wrote:
> >
> >> Hi, Divij,
> >>
> >> Thanks for the explanation.
> >>
> >> Good question.
> >>
> >> Hi, Satish,
> >>
> >> Could you explain how you exposed the log size in your KIP-405
> >> implementation?
> >>
> >> Thanks,
> >>
> >> Jun
> >>
> >> On Tue, Dec 20, 2022 at 4:59 AM Divij Vaidya <divijvaidy...@gmail.com>
> >> wrote:
> >>
> >> > Hey Jun
> >> >
> >> > Yes, it is possible to maintain the log size in the cache (see
> rejected
> >> > alternative#3 in the KIP) but I did not understand how it is possible
> to
> >> > retrieve it without the new API. The log size could be calculated on
> >> > startup by scanning through the segments (though I would disagree that
> >> this
> >> > is the right approach since scanning itself takes on the order of minutes
> >> > and hence delays the start of the archival process), and incrementally
> >> > maintained afterwards. Even then, we would need an API in
> RemoteLogMetadataManager
> >> so
> >> > that RLM could fetch the cached size!
> >> >
> >> > If we wish to cache the size without adding a new API, then we need to
> >> > cache the size in RLM itself (instead of RLMM implementation) and
> >> > incrementally manage it. The downside of longer archive time at
> startup
> >> > (due to the initial scan) still remains valid in this situation.
> >> >
> >> > --
> >> > Divij Vaidya
> >> >
> >> >
> >> >
> >> > On Fri, Dec 16, 2022 at 12:43 AM Jun Rao <j...@confluent.io.invalid>
> >> wrote:
> >> >
> >> > > Hi, Divij,
> >> > >
> >> > > Thanks for the explanation.
> >> > >
> >> > > If there is in-memory cache, could we maintain the log size in the
> >> cache
> >> > > with the existing API? For example, a replica could make a
> >> > > listRemoteLogSegments(TopicIdPartition topicIdPartition) call on
> >> startup
> >> > to
> >> > > get the remote segment size before the current leaderEpoch. The
> leader
> >> > > could then maintain the size incrementally afterwards. On leader
> >> change,
> >> > > other replicas can make a listRemoteLogSegments(TopicIdPartition
> >> > > topicIdPartition, int leaderEpoch) call to get the size of newly
> >> > generated
> >> > > segments.
> >> > >
> >> > > Thanks,
> >> > >
> >> > > Jun
> >> > >
> >> > >
> >> > > On Wed, Dec 14, 2022 at 3:27 AM Divij Vaidya <
> divijvaidy...@gmail.com
> >> >
> >> > > wrote:
> >> > >
> >> > > > > Is the new method enough for doing size-based retention?
> >> > > >
> >> > > > Yes. You are right in assuming that this API only provides the
> >> Remote
> >> > > > storage size (for current epoch chain). We would use this API for
> >> size
> >> > > > based retention along with a value of localOnlyLogSegmentSize
> which
> >> is
> >> > > > computed as Log.sizeInBytes(logSegments.filter(_.baseOffset >
> >> > > > highestOffsetWithRemoteIndex)). Hence, (total_log_size =
> >> > > > remoteLogSizeBytes + log.localOnlyLogSegmentSize). I have updated
> >> the
> >> > KIP
> >> > > > with this information. You can also check an example
> implementation
> >> at
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/Log.scala#L2077
> >> > > >
> >> > > >
> >> > > > > Do you imagine all accesses to remote metadata will be across
> the
> >> > > network
> >> > > > or will there be some local in-memory cache?
> >> > > >
> >> > > > I would expect a disk-less implementation to maintain a finite
> >> > in-memory
> >> > > > cache for segment metadata to optimize the number of network calls
> >> made
> >> > > to
> >> > > > fetch the data. In future, we can think about bringing this finite
> >> size
> >> > > > cache into RLM itself but that's probably a conversation for a
> >> > different
> >> > > > KIP. There are many other things we would like to do to optimize
> the
> >> > > Tiered
> >> > > > storage interface such as introducing a circular buffer /
> streaming
> >> > > > interface from RSM (so that we don't have to wait to fetch the
> >> entire
> >> > > > segment before starting to send records to the consumer), caching
> >> the
> >> > > > segments fetched from RSM locally (I would assume all RSM plugin
> >> > > > implementations to do this, might as well add it to RLM) etc.
> >> > > >
> >> > > > --
> >> > > > Divij Vaidya
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Mon, Dec 12, 2022 at 7:35 PM Jun Rao <j...@confluent.io.invalid
> >
> >> > > wrote:
> >> > > >
> >> > > > > Hi, Divij,
> >> > > > >
> >> > > > > Thanks for the reply.
> >> > > > >
> >> > > > > Is the new method enough for doing size-based retention? It
> gives
> >> the
> >> > > > total
> >> > > > > size of the remote segments, but it seems that we still don't
> know
> >> > the
> >> > > > > exact total size for a log since there could be overlapping
> >> segments
> >> > > > > between the remote and the local segments.
> >> > > > >
> >> > > > > You mentioned a disk-less implementation. Do you imagine all
> >> accesses
> >> > > to
> >> > > > > remote metadata will be across the network or will there be some
> >> > local
> >> > > > > in-memory cache?
> >> > > > >
> >> > > > > Thanks,
> >> > > > >
> >> > > > > Jun
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Wed, Dec 7, 2022 at 3:10 AM Divij Vaidya <
> >> divijvaidy...@gmail.com
> >> > >
> >> > > > > wrote:
> >> > > > >
> >> > > > > > The method is needed for RLMM implementations which fetch the
> >> > > > information
> >> > > > > > over the network and not for the disk based implementations
> >> (such
> >> > as
> >> > > > the
> >> > > > > > default topic based RLMM).
> >> > > > > >
> >> > > > > > I would argue that adding this API makes the interface more
> >> generic
> >> > > > than
> >> > > > > > what it is today. This is because, with the current APIs an
> >> > > implementor
> >> > > > > is
> >> > > > > > restricted to use disk based RLMM solutions only (i.e. the
> >> default
> >> > > > > > solution) whereas if we add this new API, we unblock usage of
> >> > network
> >> > > > > based
> >> > > > > > RLMM implementations such as databases.
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > On Wed 30. Nov 2022 at 20:40, Jun Rao
> <j...@confluent.io.invalid
> >> >
> >> > > > wrote:
> >> > > > > >
> >> > > > > > > Hi, Divij,
> >> > > > > > >
> >> > > > > > > Thanks for the reply.
> >> > > > > > >
> >> > > > > > > Point#2. My high level question is that is the new method
> >> needed
> >> > > for
> >> > > > > > every
> >> > > > > > > implementation of remote storage or just for a specific
> >> > > > implementation.
> >> > > > > > The
> >> > > > > > > issues that you pointed out exist for the default
> >> implementation
> >> > of
> >> > > > > RLMM
> >> > > > > > as
> >> > > > > > > well and so far, the default implementation hasn't found a
> >> need
> >> > > for a
> >> > > > > > > similar new method. For public interface, ideally we want to
> >> make
> >> > > it
> >> > > > > more
> >> > > > > > > general.
> >> > > > > > >
> >> > > > > > > Thanks,
> >> > > > > > >
> >> > > > > > > Jun
> >> > > > > > >
> >> > > > > > > On Mon, Nov 21, 2022 at 7:11 AM Divij Vaidya <
> >> > > > divijvaidy...@gmail.com>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Thank you Jun and Alex for your comments.
> >> > > > > > > >
> >> > > > > > > > Point#1: You are right Jun. As Alex mentioned, the
> "derived
> >> > > > metadata"
> >> > > > > > can
> >> > > > > > > > increase the size of cached metadata by a factor of 10 but
> >> it
> >> > > > should
> >> > > > > be
> >> > > > > > > ok
> >> > > > > > > > to cache just the actual metadata. My point about size
> >> being a
> >> > > > > > limitation
> >> > > > > > > > for using cache is not valid anymore.
> >> > > > > > > >
> >> > > > > > > > Point#2: For a new replica, it would still have to fetch
> the
> >> > > > metadata
> >> > > > > > > over
> >> > > > > > > > the network to initiate the warm up of the cache and
> hence,
> >> > > > increase
> >> > > > > > the
> >> > > > > > > > start time of the archival process. Please also note the
> >> > > > > repercussions
> >> > > > > > of
> >> > > > > > > > the warm up scan that Alex mentioned in this thread as
> part
> >> of
> >> > > > > #102.2.
> >> > > > > > > >
> >> > > > > > > > 100#: Agreed Alex. Thanks for clarifying that. My point
> >> about
> >> > > size
> >> > > > > > being
> >> > > > > > > a
> >> > > > > > > > limitation for using cache is not valid anymore.
> >> > > > > > > >
> >> > > > > > > > 101#: Alex, if I understand correctly, you are suggesting
> to
> >> > > cache
> >> > > > > the
> >> > > > > > > > total size at the leader and update it on archival. This
> >> > wouldn't
> >> > > > > work
> >> > > > > > > for
> >> > > > > > > > cases when the leader restarts where we would have to
> make a
> >> > full
> >> > > > > scan
> >> > > > > > > > to update the total size entry on startup. We expect users
> >> to
> >> > > store
> >> > > > > > data
> >> > > > > > > > over longer duration in remote storage which increases the
> >> > > > likelihood
> >> > > > > > of
> >> > > > > > > > leader restarts / failovers.
> >> > > > > > > >
> >> > > > > > > > 102#.1: I don't think that the current design accommodates
> >> the
> >> > > fact
> >> > > > > > that
> >> > > > > > > > data corruption could happen at the RLMM plugin (we don't
> >> have
> >> > > > > checksum
> >> > > > > > > as
> >> > > > > > > > a field in metadata as part of KIP405). If data corruption
> >> > > occurs,
> >> > > > w/
> >> > > > > > or
> >> > > > > > > > w/o the cache, it would be a different problem to solve. I
> >> > would
> >> > > > like
> >> > > > > > to
> >> > > > > > > > keep this outside the scope of this KIP.
> >> > > > > > > >
> >> > > > > > > > 102#.2: Agree. This remains as the main concern for using
> >> the
> >> > > cache
> >> > > > > to
> >> > > > > > > > fetch total size.
> >> > > > > > > >
> >> > > > > > > > Regards,
> >> > > > > > > > Divij Vaidya
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Fri, Nov 18, 2022 at 12:59 PM Alexandre Dupriez <
> >> > > > > > > > alexandre.dupr...@gmail.com> wrote:
> >> > > > > > > >
> >> > > > > > > > > Hi Divij,
> >> > > > > > > > >
> >> > > > > > > > > Thanks for the KIP. Please find some comments based on
> >> what I
> >> > > > read
> >> > > > > on
> >> > > > > > > > > this thread so far - apologies for the repeats and the
> >> late
> >> > > > reply.
> >> > > > > > > > >
> >> > > > > > > > > If I understand correctly, one of the main elements of
> >> > > discussion
> >> > > > > is
> >> > > > > > > > > about caching in Kafka versus delegation of providing
> the
> >> > > remote
> >> > > > > size
> >> > > > > > > > > of a topic-partition to the plugin.
> >> > > > > > > > >
> >> > > > > > > > > A few comments:
> >> > > > > > > > >
> >> > > > > > > > > 100. The size of the “derived metadata” which is managed
> >> by
> >> > the
> >> > > > > > plugin
> >> > > > > > > > > to represent an rlmMetadata can indeed be close to 1 kB
> on
> >> > > > average
> >> > > > > > > > > depending on its own internal structure, e.g. the
> >> redundancy
> >> > it
> >> > > > > > > > > enforces (unfortunately resulting to duplication),
> >> additional
> >> > > > > > > > > information such as checksums and primary and secondary
> >> > > indexable
> >> > > > > > > > > keys. But indeed, the rlmMetadata is itself a lighter
> data
> >> > > > > structure
> >> > > > > > > > > by a factor of 10. And indeed, instead of caching the
> >> > “derived
> >> > > > > > > > > metadata”, only the rlmMetadata could be, which should
> >> > address
> >> > > > the
> >> > > > > > > > > concern regarding the memory occupancy of the cache.
> >> > > > > > > > >
> >> > > > > > > > > 101. I am not sure I fully understand why we would need
> to
> >> > > cache
> >> > > > > the
> >> > > > > > > > > list of rlmMetadata to retain the remote size of a
> >> > > > topic-partition.
> >> > > > > > > > > Since the leader of a topic-partition is, in
> >> non-degenerated
> >> > > > cases,
> >> > > > > > > > > the only actor which can mutate the remote part of the
> >> > > > > > > > > topic-partition, hence its size, it could in theory only
> >> > cache
> >> > > > the
> >> > > > > > > > > size of the remote log once it has calculated it? In
> which
> >> > case
> >> > > > > there
> >> > > > > > > > > would not be any problem regarding the size of the
> caching
> >> > > > > strategy.
> >> > > > > > > > > Did I miss something there?
> >> > > > > > > > >
> >> > > > > > > > > 102. There may be a few challenges to consider with
> >> caching:
> >> > > > > > > > >
> >> > > > > > > > > 102.1) As mentioned above, the caching strategy assumes
> no
> >> > > > mutation
> >> > > > > > > > > outside the lifetime of a leader. While this is true in
> >> the
> >> > > > normal
> >> > > > > > > > > course of operation, there could be accidental mutation
> >> > outside
> >> > > > of
> >> > > > > > the
> >> > > > > > > > > leader and a loss of consistency between the cached
> state
> >> and
> >> > > the
> >> > > > > > > > > actual remote representation of the log. E.g.
> split-brain
> >> > > > > scenarios,
> >> > > > > > > > > bugs in the plugins, bugs in external systems with
> >> mutating
> >> > > > access
> >> > > > > on
> >> > > > > > > > > the derived metadata. In the worst case, a drift between
> >> the
> >> > > > cached
> >> > > > > > > > > size and the actual size could lead to over-deleting
> >> remote
> >> > > data
> >> > > > > > which
> >> > > > > > > > > is a durability risk.
> >> > > > > > > > >
> >> > > > > > > > > The alternative you propose, by making the plugin the
> >> source
> >> > of
> >> > > > > truth
> >> > > > > > > > > w.r.t. to the size of the remote log, can make it easier
> >> to
> >> > > avoid
> >> > > > > > > > > inconsistencies between plugin-managed metadata and the
> >> > remote
> >> > > > log
> >> > > > > > > > > from the perspective of Kafka. On the other hand, plugin
> >> > > vendors
> >> > > > > > would
> >> > > > > > > > > have to implement it with the expected efficiency to
> have
> >> it
> >> > > > yield
> >> > > > > > > > > benefits.
> >> > > > > > > > >
> >> > > > > > > > > 102.2) As you mentioned, the caching strategy in Kafka
> >> would
> >> > > > still
> >> > > > > > > > > require one iteration over the list of rlmMetadata when
> >> the
> >> > > > > > leadership
> >> > > > > > > > > of a topic-partition is assigned to a broker, while the
> >> > plugin
> >> > > > can
> >> > > > > > > > > offer alternative constant-time approaches. This
> >> calculation
> >> > > > cannot
> >> > > > > > be
> >> > > > > > > > > put on the LeaderAndIsr path and would be performed in
> the
> >> > > > > > background.
> >> > > > > > > > > In case of bulk leadership migration, listing the
> >> rlmMetadata
> >> > > > could
> >> > > > > > a)
> >> > > > > > > > > result in request bursts to any backend system the
> plugin
> >> may
> >> > > use
> >> > > > > > > > > [which shouldn’t be a problem for high-throughput data
> >> stores
> >> > > but
> >> > > > > > > > > could have cost implications] b) increase utilisation
> >> > timespan
> >> > > of
> >> > > > > the
> >> > > > > > > > > RLM threads for these calculations potentially leading
> to
> >> > > > transient
> >> > > > > > > > > starvation of tasks queued for, typically, offloading
> >> > > operations
> >> > > > c)
> >> > > > > > > > > could have a non-marginal CPU footprint on hardware with
> >> > strict
> >> > > > > > > > > resource constraints. All these elements could have an
> >> impact
> >> > > to
> >> > > > > some
> >> > > > > > > > > degree depending on the operational environment.
> >> > > > > > > > >
> >> > > > > > > > > From a design perspective, one question is where we want
> >> the
> >> > > > source
> >> > > > > > of
> >> > > > > > > > > truth w.r.t. remote log size to be during the lifetime
> of
> >> a
> >> > > > leader.
> >> > > > > > > > > The responsibility of maintaining a consistent
> >> representation
> >> > > of
> >> > > > > the
> >> > > > > > > > > remote log is shared by Kafka and the plugin. Which
> >> system is
> >> > > > best
> >> > > > > > > > > placed to maintain such a state while providing the
> >> highest
> >> > > > > > > > > consistency guarantees is something both Kafka and
> plugin
> >> > > > designers
> >> > > > > > > > > could help understand better.
> >> > > > > > > > >
> >> > > > > > > > > Many thanks,
> >> > > > > > > > > Alexandre
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > Le jeu. 17 nov. 2022 à 19:27, Jun Rao
> >> > <j...@confluent.io.invalid
> >> > > >
> >> > > > a
> >> > > > > > > > écrit :
> >> > > > > > > > > >
> >> > > > > > > > > > Hi, Divij,
> >> > > > > > > > > >
> >> > > > > > > > > > Thanks for the reply.
> >> > > > > > > > > >
> >> > > > > > > > > > Point #1. Is the average remote segment metadata
> really
> >> > 1KB?
> >> > > > > What's
> >> > > > > > > > > listed
> >> > > > > > > > > > in the public interface is probably well below 100
> >> bytes.
> >> > > > > > > > > >
> >> > > > > > > > > > Point #2. I guess you are assuming that each broker
> only
> >> > > caches
> >> > > > > the
> >> > > > > > > > > remote
> >> > > > > > > > > > segment metadata in memory. An alternative approach is
> >> to
> >> > > cache
> >> > > > > > them
> >> > > > > > > in
> >> > > > > > > > > > both memory and local disk. That way, on broker
> restart,
> >> > you
> >> > > > just
> >> > > > > > > need
> >> > > > > > > > to
> >> > > > > > > > > > fetch the new remote segments' metadata using the
> >> > > > > > > > > > listRemoteLogSegments(TopicIdPartition
> topicIdPartition,
> >> > int
> >> > > > > > > > leaderEpoch)
> >> > > > > > > > > > api. Will that work?
> >> > > > > > > > > >
> >> > > > > > > > > > Point #3. Thanks for the explanation and it sounds
> good.
> >> > > > > > > > > >
> >> > > > > > > > > > Thanks,
> >> > > > > > > > > >
> >> > > > > > > > > > Jun
> >> > > > > > > > > >
> >> > > > > > > > > > On Thu, Nov 17, 2022 at 7:31 AM Divij Vaidya <
> >> > > > > > > divijvaidy...@gmail.com>
> >> > > > > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > Hi Jun
> >> > > > > > > > > > >
> >> > > > > > > > > > > There are three points that I would like to present
> >> here:
> >> > > > > > > > > > >
> >> > > > > > > > > > > 1. We would require a large cache size to
> efficiently
> >> > cache
> >> > > > all
> >> > > > > > > > segment
> >> > > > > > > > > > > metadata.
> >> > > > > > > > > > > 2. Linear scan of all metadata at broker startup to
> >> > > populate
> >> > > > > the
> >> > > > > > > > cache
> >> > > > > > > > > will
> >> > > > > > > > > > > be slow and will impact the archival process.
> >> > > > > > > > > > > 3. There is no other use case where a full scan of
> >> > segment
> >> > > > > > metadata
> >> > > > > > > > is
> >> > > > > > > > > > > required.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Let's start by quantifying 1. Here's my estimate for
> >> the
> >> > > size
> >> > > > > of
> >> > > > > > > the
> >> > > > > > > > > cache.
> >> > > > > > > > > > > Average size of segment metadata = 1KB. This could
> be
> >> > more
> >> > > if
> >> > > > > we
> >> > > > > > > have
> >> > > > > > > > > > > frequent leader failover with a large number of
> leader
> >> > > epochs
> >> > > > > > being
> >> > > > > > > > > stored
> >> > > > > > > > > > > per segment.
> >> > > > > > > > > > > Segment size = 100MB. Users will prefer to reduce
> the
> >> > > segment
> >> > > > > > size
> >> > > > > > > > > from the
> >> > > > > > > > > > > default value of 1GB to ensure timely archival of
> data
> >> > > since
> >> > > > > data
> >> > > > > > > > from
> >> > > > > > > > > > > active segment is not archived.
> >> > > > > > > > > > > Cache size = num segments * avg. segment metadata
> >> size =
> >> > > > > > > > > (100TB/100MB)*1KB
> >> > > > > > > > > > > = 1GB.
> >> > > > > > > > > > > While 1GB for cache may not sound like a large
> number
> >> for
> >> > > > > larger
> >> > > > > > > > > machines,
> >> > > > > > > > > > > it does eat into the memory as an additional cache
> and
> >> > > makes
> >> > > > > use
> >> > > > > > > > cases
> >> > > > > > > > > with
> >> > > > > > > > > > > large data retention with low throughput expensive
> >> (where
> >> > > > such
> >> > > > > > use
> >> > > > > > > > case
> >> > > > > > > > > > > could use smaller machines).
> >> > > > > > > > > > >
> >> > > > > > > > > > > About point#2:
> >> > > > > > > > > > > Even if we say that all segment metadata can fit
> into
> >> the
> >> > > > > cache,
> >> > > > > > we
> >> > > > > > > > > will
> >> > > > > > > > > > > need to populate the cache on broker startup. It
> would
> >> > not
> >> > > be
> >> > > > > in
> >> > > > > > > the
> >> > > > > > > > > > > critical patch of broker startup and hence won't
> >> impact
> >> > the
> >> > > > > > startup
> >> > > > > > > > > time.
> >> > > > > > > > > > > But it will impact the time when we could start the
> >> > > archival
> >> > > > > > > process
> >> > > > > > > > > since
> >> > > > > > > > > > > the RLM thread pool will be blocked on the first
> call
> >> to
> >> > > > > > > > > > > listRemoteLogSegments(). To scan metadata for 1MM
> >> > segments
> >> > > > > > > (computed
> >> > > > > > > > > above)
> >> > > > > > > > > > > and transfer 1GB data over the network from a RLMM
> >> such
> >> > as
> >> > > a
> >> > > > > > remote
> >> > > > > > > > > > > database would be in the order of minutes (depending
> >> on
> >> > how
> >> > > > > > > efficient
> >> > > > > > > > > the
> >> > > > > > > > > > > scan is with the RLMM implementation). Although, I
> >> would
> >> > > > > concede
> >> > > > > > > that
> >> > > > > > > > > > > having RLM threads blocked for a few minutes is
> >> perhaps
> >> > OK
> >> > > > but
> >> > > > > if
> >> > > > > > > we
> >> > > > > > > > > > > introduce the new API proposed in the KIP, we would
> >> have
> >> > a
> >> > > > > > > > > > > deterministic startup time for RLM. Adding the API
> >> comes
> >> > > at a
> >> > > > > low
> >> > > > > > > > cost
> >> > > > > > > > > and
> >> > > > > > > > > > > I believe the trade off is worth it.
> >> > > > > > > > > > >
> >> > > > > > > > > > > About point#3:
> >> > > > > > > > > > > We can use listRemoteLogSegments(TopicIdPartition
> >> > > > > > topicIdPartition,
> >> > > > > > > > int
> >> > > > > > > > > > > leaderEpoch) to calculate the segments eligible for
> >> > > deletion
> >> > > > > > (based
> >> > > > > > > > on
> >> > > > > > > > > size
> >> > > > > > > > > > > retention) where leader epoch(s) belong to the
> current
> >> > > leader
> >> > > > > > epoch
> >> > > > > > > > > chain.
> >> > > > > > > > > > > I understand that it may lead to segments belonging
> to
> >> > > other
> >> > > > > > epoch
> >> > > > > > > > > lineage
> >> > > > > > > > > > > not getting deleted and would require a separate
> >> > mechanism
> >> > > to
> >> > > > > > > delete
> >> > > > > > > > > them.
> >> > > > > > > > > > > The separate mechanism would anyways be required to
> >> > delete
> >> > > > > these
> >> > > > > > > > > "leaked"
> >> > > > > > > > > > > segments as there are other cases which could lead
> to
> >> > leaks
> >> > > > > such
> >> > > > > > as
> >> > > > > > > > > network
> >> > > > > > > > > > > problems with the RSM midway through writing a segment,
> >> > > > > > > > > > > etc.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Thank you for the replies so far. They have made me
> >> > > re-think
> >> > > > my
> >> > > > > > > > > assumptions
> >> > > > > > > > > > > and this dialogue has been very constructive for me.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Regards,
> >> > > > > > > > > > > Divij Vaidya
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Thu, Nov 10, 2022 at 10:49 PM Jun Rao
> >> > > > > > <j...@confluent.io.invalid
> >> > > > > > > >
> >> > > > > > > > > wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > > Hi, Divij,
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Thanks for the reply.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > It's true that the data in Kafka could be kept
> >> longer
> >> > > with
> >> > > > > > > KIP-405.
> >> > > > > > > > > How
> >> > > > > > > > > > > > much data do you envision to have per broker? For
> >> 100TB
> >> > > > data
> >> > > > > > per
> >> > > > > > > > > broker,
> >> > > > > > > > > > > > with 1GB segment and segment metadata of 100
> bytes,
> >> it
> >> > > > > requires
> >> > > > > > > > > > > > 100TB/1GB*100 = 10MB, which should fit in memory.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > RemoteLogMetadataManager has two
> >> > listRemoteLogSegments()
> >> > > > > > methods.
> >> > > > > > > > > The one
> >> > > > > > > > > > > > you listed listRemoteLogSegments(TopicIdPartition
> >> > > > > > > topicIdPartition,
> >> > > > > > > > > int
> >> > > > > > > > > > > > leaderEpoch) does return data in offset order.
> >> However,
> >> > > the
> >> > > > > > other
> >> > > > > > > > > > > > one listRemoteLogSegments(TopicIdPartition
> >> > > > topicIdPartition)
> >> > > > > > > > doesn't
> >> > > > > > > > > > > > specify the return order. I assume that you need
> the
> >> > > latter
> >> > > > > to
> >> > > > > > > > > calculate
> >> > > > > > > > > > > > the segment size?
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Thanks,
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Jun
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > On Thu, Nov 10, 2022 at 10:25 AM Divij Vaidya <
> >> > > > > > > > > divijvaidy...@gmail.com>
> >> > > > > > > > > > > > wrote:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > *Jun,*
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > *"the default implementation of RLMM does local
> >> > > caching,
> >> > > > > > > right?"*
> >> > > > > > > > > > > > > Yes, Jun. The default implementation of RLMM
> does
> >> > > indeed
> >> > > > > > cache
> >> > > > > > > > the
> >> > > > > > > > > > > > segment
> >> > > > > > > > > > > > > metadata today, hence, it won't work for use
> cases
> >> > when
> >> > > > the
> >> > > > > > > > number
> >> > > > > > > > > of
> >> > > > > > > > > > > > > segments in remote storage is large enough to
> >> exceed
> >> > > the
> >> > > > > size
> >> > > > > > > of
> >> > > > > > > > > cache.
> >> > > > > > > > > > > > As
> >> > > > > > > > > > > > > part of this KIP, I will implement the new
> >> proposed
> >> > API
> >> > > > in
> >> > > > > > the
> >> > > > > > > > > default
> >> > > > > > > > > > > > > implementation of RLMM but the underlying
> >> > > implementation
> >> > > > > will
> >> > > > > > > > > still be
> >> > > > > > > > > > > a
> >> > > > > > > > > > > > > scan. I will pick up optimizing that in a
> separate
> >> > PR.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > *"we also cache all segment metadata in the
> >> brokers
> >> > > > without
> >> > > > > > > > > KIP-405. Do
> >> > > > > > > > > > > > you
> >> > > > > > > > > > > > > see a need to change that?"*
> >> > > > > > > > > > > > > Please correct me if I am wrong here but we
> cache
> >> > > > metadata
> >> > > > > > for
> >> > > > > > > > > segments
> >> > > > > > > > > > > > > "residing in local storage". The size of the
> >> current
> >> > > > cache
> >> > > > > > > works
> >> > > > > > > > > fine
> >> > > > > > > > > > > for
> >> > > > > > > > > > > > > the scale of the number of segments that we
> >> expect to
> >> > > > store
> >> > > > > > in
> >> > > > > > > > > local
> >> > > > > > > > > > > > > storage. After KIP-405, that cache will continue
> >> to
> >> > > store
> >> > > > > > > > metadata
> >> > > > > > > > > for
> >> > > > > > > > > > > > > segments which are residing in local storage and
> >> > hence,
> >> > > > we
> >> > > > > > > don't
> >> > > > > > > > > need
> >> > > > > > > > > > > to
> >> > > > > > > > > > > > > change that. For segments which have been
> >> offloaded
> >> > to
> >> > > > > remote
> >> > > > > > > > > storage,
> >> > > > > > > > > > > it
> >> > > > > > > > > > > > > would rely on RLMM. Note that the scale of data
> >> > stored
> >> > > in
> >> > > > > > RLMM
> >> > > > > > > is
> >> > > > > > > > > > > > different
> >> > > > > > > > > > > > > from local cache because the number of segments
> is
> >> > > > expected
> >> > > > > > to
> >> > > > > > > be
> >> > > > > > > > > much
> >> > > > > > > > > > > > > larger than what current implementation stores
> in
> >> > local
> >> > > > > > > storage.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > 2,3,4:
> >> > RemoteLogMetadataManager.listRemoteLogSegments()
> >> > > > > does
> >> > > > > > > > > specify
> >> > > > > > > > > > > the
> >> > > > > > > > > > > > > order i.e. it returns the segments sorted by
> first
> >> > > offset
> >> > > > > in
> >> > > > > > > > > ascending
> >> > > > > > > > > > > > > order. I am copying the API docs for KIP-405
> here
> >> for
> >> > > > your
> >> > > > > > > > > reference
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > *Returns iterator of remote log segment metadata,
> >> > > > > > > > > > > > > sorted by {@link RemoteLogSegmentMetadata#startOffset()}
> >> > > > > > > > > > > > > in ascending order which contains the given leader
> >> > > > > > > > > > > > > epoch. This is used by remote log retention management
> >> > > > > > > > > > > > > subsystem to fetch the segment metadata for a given
> >> > > > > > > > > > > > > leader epoch.
> >> > > > > > > > > > > > > @param topicIdPartition topic partition
> >> > > > > > > > > > > > > @param leaderEpoch leader epoch
> >> > > > > > > > > > > > > @return Iterator of remote segments, sorted by start
> >> > > > > > > > > > > > > offset in ascending order.*
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > *Luke,*
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > 5. Note that we are trying to optimize the
> >> efficiency
> >> > > of
> >> > > > > size
> >> > > > > > > > based
> >> > > > > > > > > > > > > retention for remote storage. KIP-405 does not
> >> > > introduce
> >> > > > a
> >> > > > > > new
> >> > > > > > > > > config
> >> > > > > > > > > > > for
> >> > > > > > > > > > > > > periodically checking remote similar to
> >> > > > > > > > > > > log.retention.check.interval.ms
> >> > > > > > > > > > > > > which is applicable for remote storage. Hence,
> the
> >> > > metric
> >> > > > > > will
> >> > > > > > > be
> >> > > > > > > > > > > updated
> >> > > > > > > > > > > > > at the time of invoking log retention check for
> >> > remote
> >> > > > tier
> >> > > > > > > which
> >> > > > > > > > > is
> >> > > > > > > > > > > > > pending implementation today. We can perhaps
> come
> >> > back
> >> > > > and
> >> > > > > > > update
> >> > > > > > > > > the
> >> > > > > > > > > > > > > metric description after the implementation of
> log
> >> > > > > retention
> >> > > > > > > > check
> >> > > > > > > > > in
> >> > > > > > > > > > > > > RemoteLogManager.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > --
> >> > > > > > > > > > > > > Divij Vaidya
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > On Thu, Nov 10, 2022 at 6:16 AM Luke Chen <
> >> > > > > show...@gmail.com
> >> > > > > > >
> >> > > > > > > > > wrote:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Hi Divij,
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > One more question about the metric:
> >> > > > > > > > > > > > > > I think the metric will be updated when
> >> > > > > > > > > > > > > > (1) each time we run the log retention check
> >> (that
> >> > > is,
> >> > > > > > > > > > > > > > log.retention.check.interval.ms)
> >> > > > > > > > > > > > > > (2) When user explicitly call getRemoteLogSize
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Is that correct?
> >> > > > > > > > > > > > > > Maybe we should add a note in metric
> >> description,
> >> > > > > > otherwise,
> >> > > > > > > > when
> >> > > > > > > > > > > user
> >> > > > > > > > > > > > > got,
> >> > > > > > > > > > > > > > let's say 0 of RemoteLogSizeBytes, will be
> >> > surprised.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Otherwise, LGTM
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Thank you for the KIP
> >> > > > > > > > > > > > > > Luke
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > On Thu, Nov 10, 2022 at 2:55 AM Jun Rao
> >> > > > > > > > <j...@confluent.io.invalid
> >> > > > > > > > > >
> >> > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Hi, Divij,
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Thanks for the explanation.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > 1. Hmm, the default implementation of RLMM
> >> does
> >> > > local
> >> > > > > > > > caching,
> >> > > > > > > > > > > right?
> >> > > > > > > > > > > > > > > Currently, we also cache all segment
> metadata
> >> in
> >> > > the
> >> > > > > > > brokers
> >> > > > > > > > > > > without
> >> > > > > > > > > > > > > > > KIP-405. Do you see a need to change that?
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > 2,3,4: Yes, your explanation makes sense.
> >> > However,
> >> > > > > > > > > > > > > > > currently,
> >> > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> >> > > > > > > > > doesn't
> >> > > > > > > > > > > > > > specify
> >> > > > > > > > > > > > > > > a particular order of the iterator. Do you
> >> intend
> >> > > to
> >> > > > > > change
> >> > > > > > > > > that?
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Thanks,
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Jun
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 3:31 AM Divij Vaidya
> <
> >> > > > > > > > > > > divijvaidy...@gmail.com
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Hey Jun
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Thank you for your comments.
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > *1. "RLMM implementor could ensure that
> >> > > > > > > > > listRemoteLogSegments()
> >> > > > > > > > > > > is
> >> > > > > > > > > > > > > > fast"*
> >> > > > > > > > > > > > > > > > This would be ideal but pragmatically, it
> is
> >> > > > > difficult
> >> > > > > > to
> >> > > > > > > > > ensure
> >> > > > > > > > > > > > that
> >> > > > > > > > > > > > > > > > listRemoteLogSegments() is fast. This is
> >> > because
> >> > > of
> >> > > > > the
> >> > > > > > > > > > > possibility
> >> > > > > > > > > > > > > of
> >> > > > > > > > > > > > > > a
> >> > > > > > > > > > > > > > > > large number of segments (much larger than
> >> what
> >> > > > Kafka
> >> > > > > > > > > currently
> >> > > > > > > > > > > > > handles
> >> > > > > > > > > > > > > > > > with local storage today) would make it
> >> > > infeasible
> >> > > > to
> >> > > > > > > adopt
> >> > > > > > > > > > > > > strategies
> >> > > > > > > > > > > > > > > such
> >> > > > > > > > > > > > > > > > as local caching to improve the
> performance
> >> of
> >> > > > > > > > > > > > listRemoteLogSegments.
> >> > > > > > > > > > > > > > > Apart
> >> > > > > > > > > > > > > > > > from caching (which won't work due to size
> >> > > > > > limitations) I
> >> > > > > > > > > can't
> >> > > > > > > > > > > > think
> >> > > > > > > > > > > > > > of
> >> > > > > > > > > > > > > > > > other strategies which may eliminate the
> >> need
> >> > for
> >> > > > IO
> >> > > > > > > > > > > > > > > > operations proportional to the number of
> >> total
> >> > > > > > segments.
> >> > > > > > > > > Please
> >> > > > > > > > > > > > > advise
> >> > > > > > > > > > > > > > if
> >> > > > > > > > > > > > > > > > you have something in mind.
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > 2.  "*If the size exceeds the retention
> >> size,
> >> > we
> >> > > > need
> >> > > > > > to
> >> > > > > > > > > > > determine
> >> > > > > > > > > > > > > the
> >> > > > > > > > > > > > > > > > subset of segments to delete to bring the
> >> size
> >> > > > within
> >> > > > > > the
> >> > > > > > > > > > > retention
> >> > > > > > > > > > > > > > > limit.
> >> > > > > > > > > > > > > > > > Do we need to call
> >> > > > > > > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> >> > > > > > > > > > > > > to
> >> > > > > > > > > > > > > > > > determine that?"*
> >> > > > > > > > > > > > > > > > Yes, we need to call
> >> listRemoteLogSegments() to
> >> > > > > > determine
> >> > > > > > > > > which
> >> > > > > > > > > > > > > > segments
> >> > > > > > > > > > > > > > > > should be deleted. But there is a
> difference
> >> > with
> >> > > > the
> >> > > > > > use
> >> > > > > > > > > case we
> >> > > > > > > > > > > > are
> >> > > > > > > > > > > > > > > > trying to optimize with this KIP. To
> >> determine
> >> > > the
> >> > > > > > subset
> >> > > > > > > > of
> >> > > > > > > > > > > > segments
> >> > > > > > > > > > > > > > > which
> >> > > > > > > > > > > > > > > > would be deleted, we only read metadata
> for
> >> > > > segments
> >> > > > > > > which
> >> > > > > > > > > would
> >> > > > > > > > > > > be
> >> > > > > > > > > > > > > > > deleted
> >> > > > > > > > > > > > > > > > via the listRemoteLogSegments(). But to
> >> > determine
> >> > > > the
> >> > > > > > > > > > > totalLogSize,
> >> > > > > > > > > > > > > > which
> >> > > > > > > > > > > > > > > > is required every time retention logic
> >> based on
> >> > > > size
> >> > > > > > > > > executes, we
> >> > > > > > > > > > > > > read
> >> > > > > > > > > > > > > > > > metadata of *all* the segments in remote
> >> > storage.
> >> > > > > > Hence,
> >> > > > > > > > the
> >> > > > > > > > > > > number
> >> > > > > > > > > > > > > of
> >> > > > > > > > > > > > > > > > results returned by
> >> > > > > > > > > > > > *RemoteLogMetadataManager.listRemoteLogSegments()
> >> > > > > > > > > > > > > > *is
> >> > > > > > > > > > > > > > > > different when we are calculating
> >> totalLogSize
> >> > > vs.
> >> > > > > when
> >> > > > > > > we
> >> > > > > > > > > are
> >> > > > > > > > > > > > > > > determining
> >> > > > > > > > > > > > > > > > the subset of segments to delete.
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > 3.
> >> > > > > > > > > > > > > > > > *"Also, what about time-based retention?
> To
> >> > make
> >> > > > that
> >> > > > > > > > > efficient,
> >> > > > > > > > > > > do
> >> > > > > > > > > > > > > we
> >> > > > > > > > > > > > > > > need
> >> > > > > > > > > > > > > > > > to make some additional interface
> >> changes?"*No.
> >> > > > Note
> >> > > > > > that
> >> > > > > > > > > time
> >> > > > > > > > > > > > > > complexity
> >> > > > > > > > > > > > > > > > to determine the segments for retention is
> >> > > > different
> >> > > > > > for
> >> > > > > > > > time
> >> > > > > > > > > > > based
> >> > > > > > > > > > > > > vs.
> >> > > > > > > > > > > > > > > > size based. For time based, the time
> >> complexity
> >> > > is
> >> > > > a
> >> > > > > > > > > function of
> >> > > > > > > > > > > > the
> >> > > > > > > > > > > > > > > number
> >> > > > > > > > > > > > > > > > of segments which are "eligible for
> >> deletion"
> >> > > > (since
> >> > > > > we
> >> > > > > > > > only
> >> > > > > > > > > read
> >> > > > > > > > > > > > > > > metadata
> >> > > > > > > > > > > > > > > > for segments which would be deleted)
> >> whereas in
> >> > > > size
> >> > > > > > > based
> >> > > > > > > > > > > > retention,
> >> > > > > > > > > > > > > > the
> >> > > > > > > > > > > > > > > > time complexity is a function of "all
> >> segments"
> >> > > > > > available
> >> > > > > > > > in
> >> > > > > > > > > > > remote
> >> > > > > > > > > > > > > > > storage
> >> > > > > > > > > > > > > > > > (metadata of all segments needs to be read
> >> to
> >> > > > > calculate
> >> > > > > > > the
> >> > > > > > > > > total
> >> > > > > > > > > > > > > > size).
> >> > > > > > > > > > > > > > > As
> >> > > > > > > > > > > > > > > > you may observe, this KIP will bring the
> >> time
> >> > > > > > complexity
> >> > > > > > > > for
> >> > > > > > > > > both
> >> > > > > > > > > > > > > time
> >> > > > > > > > > > > > > > > > based retention & size based retention to
> >> the
> >> > > same
> >> > > > > > > > function.
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > 4. Also, please note that this new API
> >> > introduced
> >> > > > in
> >> > > > > > this
> >> > > > > > > > KIP
> >> > > > > > > > > > > also
> >> > > > > > > > > > > > > > > enables
> >> > > > > > > > > > > > > > > > us to provide a metric for total size of
> >> data
> >> > > > stored
> >> > > > > in
> >> > > > > > > > > remote
> >> > > > > > > > > > > > > storage.
> >> > > > > > > > > > > > > > > > Without the API, calculation of this
> metric
> >> > will
> >> > > > > become
> >> > > > > > > > very
> >> > > > > > > > > > > > > expensive
> >> > > > > > > > > > > > > > > with
> >> > > > > > > > > > > > > > > > *listRemoteLogSegments().*
> >> > > > > > > > > > > > > > > > I understand that your motivation here is
> to
> >> > > avoid
> >> > > > > > > > polluting
> >> > > > > > > > > the
> >> > > > > > > > > > > > > > > interface
> >> > > > > > > > > > > > > > > > with optimization specific APIs and I will
> >> > agree
> >> > > > with
> >> > > > > > > that
> >> > > > > > > > > goal.
> >> > > > > > > > > > > > But
> >> > > > > > > > > > > > > I
> >> > > > > > > > > > > > > > > > believe that this new API proposed in the
> >> KIP
> >> > > > brings
> >> > > > > in
> >> > > > > > > > > > > significant
> >> > > > > > > > > > > > > > > > improvement and there is no other work
> >> around
> >> > > > > available
> >> > > > > > > to
> >> > > > > > > > > > > achieve
> >> > > > > > > > > > > > > the
> >> > > > > > > > > > > > > > > same
> >> > > > > > > > > > > > > > > > performance.
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Regards,
> >> > > > > > > > > > > > > > > > Divij Vaidya
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 12:12 AM Jun Rao
> >> > > > > > > > > <j...@confluent.io.invalid
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > Hi, Divij,
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > Thanks for the KIP. Sorry for the late
> >> reply.
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > The motivation of the KIP is to improve
> >> the
> >> > > > > > efficiency
> >> > > > > > > of
> >> > > > > > > > > size
> >> > > > > > > > > > > > > based
> >> > > > > > > > > > > > > > > > > retention. I am not sure the proposed
> >> changes
> >> > > are
> >> > > > > > > enough.
> >> > > > > > > > > For
> >> > > > > > > > > > > > > > example,
> >> > > > > > > > > > > > > > > if
> >> > > > > > > > > > > > > > > > > the size exceeds the retention size, we
> >> need
> >> > to
> >> > > > > > > determine
> >> > > > > > > > > the
> >> > > > > > > > > > > > > subset
> >> > > > > > > > > > > > > > of
> >> > > > > > > > > > > > > > > > > segments to delete to bring the size
> >> within
> >> > the
> >> > > > > > > retention
> >> > > > > > > > > > > limit.
> >> > > > > > > > > > > > Do
> >> > > > > > > > > > > > > > we
> >> > > > > > > > > > > > > > > > need
> >> > > > > > > > > > > > > > > > > to call
> >> > > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> >> > > > > > > > to
> >> > > > > > > > > > > > > determine
> >> > > > > > > > > > > > > > > > that?
> >> > > > > > > > > > > > > > > > > Also, what about time-based retention?
> To
> >> > make
> >> > > > that
> >> > > > > > > > > efficient,
> >> > > > > > > > > > > do
> >> > > > > > > > > > > > > we
> >> > > > > > > > > > > > > > > need
> >> > > > > > > > > > > > > > > > > to make some additional interface
> changes?
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > An alternative approach is for the RLMM
> >> > > > implementor
> >> > > > > > to
> >> > > > > > > > make
> >> > > > > > > > > > > sure
> >> > > > > > > > > > > > > > > > > that
> >> > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> >> > > > > > > is
> >> > > > > > > > > fast
> >> > > > > > > > > > > > > (e.g.,
> >> > > > > > > > > > > > > > > with
> >> > > > > > > > > > > > > > > > > local caching). This way, we could keep
> >> the
> >> > > > > interface
> >> > > > > > > > > simple.
> >> > > > > > > > > > > > Have
> >> > > > > > > > > > > > > we
> >> > > > > > > > > > > > > > > > > considered that?
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > Thanks,
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > Jun
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > On Wed, Sep 28, 2022 at 6:28 AM Divij
> >> Vaidya
> >> > <
> >> > > > > > > > > > > > > > divijvaidy...@gmail.com>
> >> > > > > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > Hey folks
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > Does anyone else have any thoughts on
> >> this
> >> > > > > before I
> >> > > > > > > > > propose
> >> > > > > > > > > > > > this
> >> > > > > > > > > > > > > > for
> >> > > > > > > > > > > > > > > a
> >> > > > > > > > > > > > > > > > > > vote?
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > > > > > Divij Vaidya
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > On Mon, Sep 5, 2022 at 12:57 PM Satish
> >> > > Duggana
> >> > > > <
> >> > > > > > > > > > > > > > > > satish.dugg...@gmail.com
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > Thanks for the KIP Divij!
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > This is a nice improvement to avoid
> >> > > > > recalculation
> >> > > > > > > of
> >> > > > > > > > > size.
> >> > > > > > > > > > > > > > > Customized
> >> > > > > > > > > > > > > > > > > > RLMMs
> >> > > > > > > > > > > > > > > > > > > can implement the best possible
> >> approach
> >> > by
> >> > > > > > caching
> >> > > > > > > > or
> >> > > > > > > > > > > > > > maintaining
> >> > > > > > > > > > > > > > > > the
> >> > > > > > > > > > > > > > > > > > size
> >> > > > > > > > > > > > > > > > > > > in an efficient way. But this is
> not a
> >> > big
> >> > > > > > concern
> >> > > > > > > > for
> >> > > > > > > > > the
> >> > > > > > > > > > > > > > default
> >> > > > > > > > > > > > > > > > > topic
> >> > > > > > > > > > > > > > > > > > > based RLMM as mentioned in the KIP.
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > ~Satish.
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > On Wed, 13 Jul 2022 at 18:48, Divij
> >> > Vaidya
> >> > > <
> >> > > > > > > > > > > > > > > divijvaidy...@gmail.com>
> >> > > > > > > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > Thank you for your review Luke.
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > Reg: is that would the new
> >> > > > > > `RemoteLogSizeBytes`
> >> > > > > > > > > metric
> >> > > > > > > > > > > > be a
> >> > > > > > > > > > > > > > > > > > performance
> >> > > > > > > > > > > > > > > > > > > > overhead? Although we move the
> >> > > calculation
> >> > > > > to a
> >> > > > > > > > > seperate
> >> > > > > > > > > > > > API,
> >> > > > > > > > > > > > > > we
> >> > > > > > > > > > > > > > > > > still
> >> > > > > > > > > > > > > > > > > > > > can't assume users will implement
> a
> >> > > > > > light-weight
> >> > > > > > > > > method,
> >> > > > > > > > > > > > > right?
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > This metric would be logged using
> >> the
> >> > > > > > information
> >> > > > > > > > > that is
> >> > > > > > > > > > > > > > already
> >> > > > > > > > > > > > > > > > > being
> >> > > > > > > > > > > > > > > > > > > > calculated for handling remote
> >> > retention
> >> > > > > logic,
> >> > > > > > > > > hence, no
> >> > > > > > > > > > > > > > > > additional
> >> > > > > > > > > > > > > > > > > > work
> >> > > > > > > > > > > > > > > > > > > > is required to calculate this
> >> metric.
> >> > > More
> >> > > > > > > > > specifically,
> >> > > > > > > > > > > > > > whenever
> >> > > > > > > > > > > > > > > > > > > > RemoteLogManager calls
> >> getRemoteLogSize
> >> > > > API,
> >> > > > > > this
> >> > > > > > > > > metric
> >> > > > > > > > > > > > > would
> >> > > > > > > > > > > > > > be
> >> > > > > > > > > > > > > > > > > > > captured.
> >> > > > > > > > > > > > > > > > > > > > This API call is made every time
> >> > > > > > RemoteLogManager
> >> > > > > > > > > wants
> >> > > > > > > > > > > to
> >> > > > > > > > > > > > > > handle
> >> > > > > > > > > > > > > > > > > > expired
> >> > > > > > > > > > > > > > > > > > > > remote log segments (which should
> be
> >> > > > > periodic).
> >> > > > > > > > Does
> >> > > > > > > > > that
> >> > > > > > > > > > > > > > address
> >> > > > > > > > > > > > > > > > > your
> >> > > > > > > > > > > > > > > > > > > > concern?
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > Divij Vaidya
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > On Tue, Jul 12, 2022 at 11:01 AM
> >> Luke
> >> > > Chen
> >> > > > <
> >> > > > > > > > > > > > > show...@gmail.com>
> >> > > > > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > Hi Divij,
> >> > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > Thanks for the KIP!
> >> > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > I think it makes sense to
> delegate
> >> > the
> >> > > > > > > > > responsibility
> >> > > > > > > > > > > of
> >> > > > > > > > > > > > > > > > > calculation
> >> > > > > > > > > > > > > > > > > > to
> >> > > > > > > > > > > > > > > > > > > > the
> >> > > > > > > > > > > > > > > > > > > > > specific
> RemoteLogMetadataManager
> >> > > > > > > implementation.
> >> > > > > > > > > > > > > > > > > > > > > But one thing I'm not quite
> sure,
> >> is
> >> > > that
> >> > > > > > would
> >> > > > > > > > > the new
> >> > > > > > > > > > > > > > > > > > > > > `RemoteLogSizeBytes` metric be a
> >> > > > > performance
> >> > > > > > > > > overhead?
> >> > > > > > > > > > > > > > > > > > > > > Although we move the calculation
> >> to a
> >> > > > > > seperate
> >> > > > > > > > > API, we
> >> > > > > > > > > > > > > still
> >> > > > > > > > > > > > > > > > can't
> >> > > > > > > > > > > > > > > > > > > assume
> >> > > > > > > > > > > > > > > > > > > > > users will implement a
> >> light-weight
> >> > > > method,
> >> > > > > > > > right?
> >> > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > Thank you.
> >> > > > > > > > > > > > > > > > > > > > > Luke
> >> > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > On Fri, Jul 1, 2022 at 5:47 PM
> >> Divij
> >> > > > > Vaidya <
> >> > > > > > > > > > > > > > > > > divijvaidy...@gmail.com
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-852%3A+Optimize+calculation+of+size+for+log+in+remote+tier
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > > Hey folks
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > > Please take a look at this KIP
> >> > which
> >> > > > > > proposes
> >> > > > > > > > an
> >> > > > > > > > > > > > > extension
> >> > > > > > > > > > > > > > to
> >> > > > > > > > > > > > > > > > > > > KIP-405.
> >> > > > > > > > > > > > > > > > > > > > > This
> >> > > > > > > > > > > > > > > > > > > > > > is my first KIP with Apache
> >> Kafka
> >> > > > > community
> >> > > > > > > so
> >> > > > > > > > > any
> >> > > > > > > > > > > > > feedback
> >> > > > > > > > > > > > > > > > would
> >> > > > > > > > > > > > > > > > > > be
> >> > > > > > > > > > > > > > > > > > > > > highly
> >> > > > > > > > > > > > > > > > > > > > > > appreciated.
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > > Cheers!
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > > > > > > > > > Divij Vaidya
> >> > > > > > > > > > > > > > > > > > > > > > Sr. Software Engineer
> >> > > > > > > > > > > > > > > > > > > > > > Amazon
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>
