Hi, Divij,

Thanks for the reply.

Is the new method enough for doing size-based retention? It gives the total
size of the remote segments, but it seems that we still don't know the
exact total size for a log since there could be overlapping segments
between the remote and the local segments.

You mentioned a disk-less implementation. Do you imagine all accesses to
remote metadata will be across the network or will there be some local
in-memory cache?

Thanks,

Jun



On Wed, Dec 7, 2022 at 3:10 AM Divij Vaidya <divijvaidy...@gmail.com> wrote:

> The method is needed for RLMM implementations which fetch the information
> over the network and not for the disk based implementations (such as the
> default topic based RLMM).
>
> I would argue that adding this API makes the interface more generic than
> what it is today. This is because, with the current APIs an implementor is
> restricted to use disk based RLMM solutions only (i.e. the default
> solution) whereas if we add this new API, we unblock usage of network based
> RLMM implementations such as databases.
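
[Editor's note: a rough sketch of the shape of such an API. All names and
signatures below are illustrative assumptions based on this thread, not the
KIP's exact definitions. The point is that a network/database-backed RLMM can
answer the aggregate in one call, instead of Kafka pulling every segment's
metadata over the network.]

```java
import java.util.List;

public class RlmmSizeApiSketch {
    // Illustrative stand-in for KIP-405's RemoteLogSegmentMetadata.
    record SegmentMetadata(long sizeInBytes) {}

    // Illustrative stand-in for the RLMM interface: the existing style of
    // access (iterate all segment metadata) next to the proposed aggregate.
    interface Rlmm {
        List<SegmentMetadata> listRemoteLogSegments();
        long getRemoteLogSize();
    }

    // A database-backed RLMM could answer getRemoteLogSize() server-side
    // (e.g. a SUM over a size column), so Kafka never has to iterate every
    // segment's metadata. This toy version just sums in memory.
    static class InMemoryRlmm implements Rlmm {
        private final List<SegmentMetadata> segments;

        InMemoryRlmm(List<SegmentMetadata> segments) {
            this.segments = segments;
        }

        @Override
        public List<SegmentMetadata> listRemoteLogSegments() {
            return segments;
        }

        @Override
        public long getRemoteLogSize() {
            return segments.stream().mapToLong(SegmentMetadata::sizeInBytes).sum();
        }
    }

    public static void main(String[] args) {
        Rlmm rlmm = new InMemoryRlmm(
                List.of(new SegmentMetadata(100), new SegmentMetadata(250)));
        System.out.println(rlmm.getRemoteLogSize()); // prints 350
    }
}
```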
>
>
>
> On Wed 30. Nov 2022 at 20:40, Jun Rao <j...@confluent.io.invalid> wrote:
>
> > Hi, Divij,
> >
> > Thanks for the reply.
> >
> > Point#2. My high-level question is whether the new method is needed for
> > every implementation of remote storage or just for a specific
> > implementation. The issues that you pointed out exist for the default
> > implementation of RLMM as well and, so far, the default implementation
> > hasn't found a need for a similar new method. For a public interface,
> > ideally we want to make it more general.
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Nov 21, 2022 at 7:11 AM Divij Vaidya <divijvaidy...@gmail.com>
> > wrote:
> >
> > > Thank you Jun and Alex for your comments.
> > >
> > > Point#1: You are right Jun. As Alex mentioned, the "derived metadata"
> can
> > > increase the size of cached metadata by a factor of 10 but it should be
> > ok
> > > to cache just the actual metadata. My point about size being a
> limitation
> > > for using cache is not valid anymore.
> > >
> > > Point#2: For a new replica, it would still have to fetch the metadata
> > > over the network to initiate the warm-up of the cache and hence delay
> > > the start of the archival process. Please also note the repercussions
> > > of the warm-up scan that Alex mentioned in this thread as part of #102.2.
> > >
> > > 100#: Agreed Alex. Thanks for clarifying that. My point about size
> being
> > a
> > > limitation for using cache is not valid anymore.
> > >
> > > 101#: Alex, if I understand correctly, you are suggesting to cache the
> > > total size at the leader and update it on archival. This wouldn't work
> > for
> > > cases when the leader restarts where we would have to make a full scan
> > > to update the total size entry on startup. We expect users to store
> data
> > > over longer duration in remote storage which increases the likelihood
> of
> > > leader restarts / failovers.
> > >
> > > 102#.1: I don't think that the current design accommodates the fact
> that
> > > data corruption could happen at the RLMM plugin (we don't have checksum
> > as
> > > a field in metadata as part of KIP405). If data corruption occurs, w/
> or
> > > w/o the cache, it would be a different problem to solve. I would like
> to
> > > keep this outside the scope of this KIP.
> > >
> > > 102#.2: Agree. This remains as the main concern for using the cache to
> > > fetch total size.
> > >
> > > Regards,
> > > Divij Vaidya
> > >
> > >
> > >
> > > On Fri, Nov 18, 2022 at 12:59 PM Alexandre Dupriez <
> > > alexandre.dupr...@gmail.com> wrote:
> > >
> > > > Hi Divij,
> > > >
> > > > Thanks for the KIP. Please find some comments based on what I read on
> > > > this thread so far - apologies for the repeats and the late reply.
> > > >
> > > > If I understand correctly, one of the main elements of discussion is
> > > > about caching in Kafka versus delegation of providing the remote size
> > > > of a topic-partition to the plugin.
> > > >
> > > > A few comments:
> > > >
> > > > 100. The size of the “derived metadata” which is managed by the
> plugin
> > > > to represent an rlmMetadata can indeed be close to 1 kB on average
> > > > depending on its own internal structure, e.g. the redundancy it
> > > > enforces (unfortunately resulting in duplication), additional
> > > > information such as checksums and primary and secondary indexable
> > > > keys. But indeed, the rlmMetadata is itself a lighter data structure
> > > > by a factor of 10. And indeed, instead of caching the “derived
> > > > metadata”, only the rlmMetadata could be, which should address the
> > > > concern regarding the memory occupancy of the cache.
> > > >
> > > > 101. I am not sure I fully understand why we would need to cache the
> > > > list of rlmMetadata to retain the remote size of a topic-partition.
> > > > Since the leader of a topic-partition is, in non-degenerated cases,
> > > > the only actor which can mutate the remote part of the
> > > > topic-partition, hence its size, it could in theory only cache the
> > > > size of the remote log once it has calculated it? In which case there
> > > > would not be any problem regarding the size of the caching strategy.
> > > > Did I miss something there?
> > > >
> > > > 102. There may be a few challenges to consider with caching:
> > > >
> > > > 102.1) As mentioned above, the caching strategy assumes no mutation
> > > > outside the lifetime of a leader. While this is true in the normal
> > > > course of operation, there could be accidental mutation outside of
> the
> > > > leader and a loss of consistency between the cached state and the
> > > > actual remote representation of the log. E.g. split-brain scenarios,
> > > > bugs in the plugins, bugs in external systems with mutating access on
> > > > the derived metadata. In the worst case, a drift between the cached
> > > > size and the actual size could lead to over-deleting remote data
> which
> > > > is a durability risk.
> > > >
> > > > The alternative you propose, by making the plugin the source of truth
> > > > w.r.t. the size of the remote log, can make it easier to avoid
> > > > inconsistencies between plugin-managed metadata and the remote log
> > > > from the perspective of Kafka. On the other hand, plugin vendors
> would
> > > > have to implement it with the expected efficiency to have it yield
> > > > benefits.
> > > >
> > > > 102.2) As you mentioned, the caching strategy in Kafka would still
> > > > require one iteration over the list of rlmMetadata when the
> leadership
> > > > of a topic-partition is assigned to a broker, while the plugin can
> > > > offer alternative constant-time approaches. This calculation cannot
> be
> > > > put on the LeaderAndIsr path and would be performed in the
> background.
> > > > In case of bulk leadership migration, listing the rlmMetadata could
> a)
> > > > result in request bursts to any backend system the plugin may use
> > > > [which shouldn’t be a problem for high-throughput data stores but
> > > > could have cost implications] b) increase utilisation timespan of the
> > > > RLM threads for these calculations potentially leading to transient
> > > > starvation of tasks queued for, typically, offloading operations c)
> > > > could have a non-marginal CPU footprint on hardware with strict
> > > > resource constraints. All these elements could have an impact to some
> > > > degree depending on the operational environment.
> > > >
> > > > From a design perspective, one question is where we want the source
> of
> > > > truth w.r.t. remote log size to be during the lifetime of a leader.
> > > > The responsibility of maintaining a consistent representation of the
> > > > remote log is shared by Kafka and the plugin. Which system is best
> > > > placed to maintain such a state while providing the highest
> > > > consistency guarantees is something both Kafka and plugin designers
> > > > could help understand better.
> > > >
> > > > Many thanks,
> > > > Alexandre
> > > >
> > > >
> > > > Le jeu. 17 nov. 2022 à 19:27, Jun Rao <j...@confluent.io.invalid> a
> > > écrit :
> > > > >
> > > > > Hi, Divij,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > Point #1. Is the average remote segment metadata really 1KB? What's
> > > > listed
> > > > > in the public interface is probably well below 100 bytes.
> > > > >
> > > > > Point #2. I guess you are assuming that each broker only caches the
> > > > remote
> > > > > segment metadata in memory. An alternative approach is to cache
> them
> > in
> > > > > both memory and local disk. That way, on broker restart, you just
> > need
> > > to
> > > > > fetch the new remote segments' metadata using the
> > > > > listRemoteLogSegments(TopicIdPartition topicIdPartition, int
> > > leaderEpoch)
> > > > > api. Will that work?
> > > > >
> > > > > Point #3. Thanks for the explanation and it sounds good.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Thu, Nov 17, 2022 at 7:31 AM Divij Vaidya <
> > divijvaidy...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Jun
> > > > > >
> > > > > > There are three points that I would like to present here:
> > > > > >
> > > > > > 1. We would require a large cache size to efficiently cache all
> > > segment
> > > > > > metadata.
> > > > > > 2. Linear scan of all metadata at broker startup to populate the
> > > cache
> > > > will
> > > > > > be slow and will impact the archival process.
> > > > > > 3. There is no other use case where a full scan of segment
> metadata
> > > is
> > > > > > required.
> > > > > >
> > > > > > Let's start by quantifying 1. Here's my estimate for the size of
> > the
> > > > cache.
> > > > > > Average size of segment metadata = 1KB. This could be more if we
> > have
> > > > > > frequent leader failover with a large number of leader epochs
> being
> > > > stored
> > > > > > per segment.
> > > > > > Segment size = 100MB. Users will prefer to reduce the segment
> > > > > > size from the default value of 1GB to ensure timely archival of
> > > > > > data, since data from the active segment is not archived.
> > > > > > Cache size = num segments * avg. segment metadata size =
> > > > (100TB/100MB)*1KB
> > > > > > = 1GB.
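
[Editor's note: the back-of-envelope arithmetic above, sketched directly.
Every figure is an assumption stated in this thread, not a measurement;
binary units are used for round numbers.]

```java
public class CacheSizeEstimate {
    public static void main(String[] args) {
        long totalRemoteBytes = 100L << 40;      // 100 TB remote data per broker (assumed)
        long segmentBytes = 100L << 20;          // 100 MB segment size (assumed)
        long metadataBytesPerSegment = 1L << 10; // ~1 KB metadata per segment (assumed)

        long numSegments = totalRemoteBytes / segmentBytes;      // ~1 million segments
        long cacheBytes = numSegments * metadataBytesPerSegment; // ~1 GB of cache

        System.out.println("segments = " + numSegments);   // 1048576
        System.out.println("cache bytes = " + cacheBytes); // 1073741824 (~1 GB)
    }
}
```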
> > > > > > While 1GB for cache may not sound like a large number for larger
> > > > > > machines, it does eat into the memory as an additional cache and
> > > > > > makes use cases with large data retention and low throughput
> > > > > > expensive (where such a use case could otherwise use smaller
> > > > > > machines).
> > > > > >
> > > > > > About point#2:
> > > > > > Even if we say that all segment metadata can fit into the cache,
> > > > > > we will need to populate the cache on broker startup. It would not
> > > > > > be in the critical path of broker startup and hence won't impact
> > > > > > the startup time.
> > > > > > But it will impact the time when we could start the archival
> > process
> > > > since
> > > > > > the RLM thread pool will be blocked on the first call to
> > > > > > listRemoteLogSegments(). To scan metadata for 1MM segments
> > (computed
> > > > above)
> > > > > > and transfer 1GB data over the network from a RLMM such as a
> remote
> > > > > > database would be in the order of minutes (depending on how
> > efficient
> > > > the
> > > > > > scan is with the RLMM implementation). I would concede that
> > > > > > having RLM threads blocked for a few minutes is perhaps OK, but if
> > > > > > we introduce the new API proposed in the KIP, we would have a
> > > > > > deterministic startup time for RLM. Adding the API comes at a low
> > > > > > cost and I believe the trade-off is worth it.
> > > > > >
> > > > > > About point#3:
> > > > > > We can use listRemoteLogSegments(TopicIdPartition
> topicIdPartition,
> > > int
> > > > > > leaderEpoch) to calculate the segments eligible for deletion
> (based
> > > on
> > > > size
> > > > > > retention) where leader epoch(s) belong to the current leader
> epoch
> > > > chain.
> > > > > > I understand that it may lead to segments belonging to other
> epoch
> > > > lineage
> > > > > > not getting deleted and would require a separate mechanism to
> > delete
> > > > them.
> > > > > > The separate mechanism would anyway be required to delete these
> > > > > > "leaked" segments, as there are other cases which could lead to
> > > > > > leaks, such as network problems with the RSM midway through
> > > > > > writing a segment.
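
[Editor's note: to illustrate the difference discussed above between
computing the total log size and choosing the deletion subset, a minimal
self-contained sketch. The types are simplified stand-ins for the KIP-405
interfaces, not the real API.]

```java
import java.util.ArrayList;
import java.util.List;

public class SizeRetentionSketch {
    // Simplified stand-in for RemoteLogSegmentMetadata.
    record Segment(long startOffset, long sizeInBytes) {}

    // Total size: must touch *every* segment's metadata. This is the scan
    // the KIP wants to avoid by asking the RLMM for the aggregate directly.
    static long totalSize(List<Segment> sortedByStartOffset) {
        long total = 0;
        for (Segment s : sortedByStartOffset) total += s.sizeInBytes();
        return total;
    }

    // Deletion subset: walk from the oldest segment until the remaining
    // size fits within the retention limit; only deleted segments' metadata
    // is read.
    static List<Segment> eligibleForDeletion(List<Segment> sortedByStartOffset,
                                             long totalSize, long retentionBytes) {
        List<Segment> toDelete = new ArrayList<>();
        long remaining = totalSize;
        for (Segment s : sortedByStartOffset) {
            if (remaining <= retentionBytes) break;
            toDelete.add(s);
            remaining -= s.sizeInBytes();
        }
        return toDelete;
    }

    public static void main(String[] args) {
        List<Segment> segments = List.of(
                new Segment(0, 100), new Segment(10, 100), new Segment(20, 100));
        long total = totalSize(segments);
        List<Segment> del = eligibleForDeletion(segments, total, 150);
        System.out.println(total + " " + del.size()); // prints "300 2"
    }
}
```

With the proposed API, only the deletion walk would remain in Kafka; the total would come from the RLMM implementation.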
> > > > > >
> > > > > > Thank you for the replies so far. They have made me re-think my
> > > > assumptions
> > > > > > and this dialogue has been very constructive for me.
> > > > > >
> > > > > > Regards,
> > > > > > Divij Vaidya
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, Nov 10, 2022 at 10:49 PM Jun Rao
> <j...@confluent.io.invalid
> > >
> > > > wrote:
> > > > > >
> > > > > > > Hi, Divij,
> > > > > > >
> > > > > > > Thanks for the reply.
> > > > > > >
> > > > > > > It's true that the data in Kafka could be kept longer with
> > KIP-405.
> > > > How
> > > > > > > much data do you envision to have per broker? For 100TB data
> per
> > > > broker,
> > > > > > > with 1GB segment and segment metadata of 100 bytes, it requires
> > > > > > > 100TB/1GB*100 = 10MB, which should fit in memory.
> > > > > > >
> > > > > > > RemoteLogMetadataManager has two listRemoteLogSegments()
> methods.
> > > > The one
> > > > > > > you listed listRemoteLogSegments(TopicIdPartition
> > topicIdPartition,
> > > > int
> > > > > > > leaderEpoch) does return data in offset order. However, the
> other
> > > > > > > one listRemoteLogSegments(TopicIdPartition topicIdPartition)
> > > doesn't
> > > > > > > specify the return order. I assume that you need the latter to
> > > > calculate
> > > > > > > the segment size?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Thu, Nov 10, 2022 at 10:25 AM Divij Vaidya <
> > > > divijvaidy...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > *Jun,*
> > > > > > > >
> > > > > > > > *"the default implementation of RLMM does local caching,
> > right?"*
> > > > > > > > Yes, Jun. The default implementation of RLMM does indeed
> cache
> > > the
> > > > > > > segment
> > > > > > > > metadata today, hence, it won't work for use cases when the
> > > number
> > > > of
> > > > > > > > segments in remote storage is large enough to exceed the size
> > of
> > > > > > > > the cache.
> > > > > > > As
> > > > > > > > part of this KIP, I will implement the new proposed API in
> the
> > > > default
> > > > > > > > implementation of RLMM but the underlying implementation will
> > > > still be
> > > > > > a
> > > > > > > > scan. I will pick up optimizing that in a separate PR.
> > > > > > > >
> > > > > > > > *"we also cache all segment metadata in the brokers without
> > > > KIP-405. Do
> > > > > > > you
> > > > > > > > see a need to change that?"*
> > > > > > > > Please correct me if I am wrong here but we cache metadata
> for
> > > > segments
> > > > > > > > "residing in local storage". The size of the current cache
> > works
> > > > fine
> > > > > > for
> > > > > > > > the scale of the number of segments that we expect to store
> in
> > > > local
> > > > > > > > storage. After KIP-405, that cache will continue to store
> > > metadata
> > > > for
> > > > > > > > segments which are residing in local storage and hence, we
> > don't
> > > > need
> > > > > > to
> > > > > > > > change that. For segments which have been offloaded to remote
> > > > storage,
> > > > > > it
> > > > > > > > would rely on RLMM. Note that the scale of data stored in
> RLMM
> > is
> > > > > > > different
> > > > > > > > from local cache because the number of segments is expected
> to
> > be
> > > > much
> > > > > > > > larger than what current implementation stores in local
> > storage.
> > > > > > > >
> > > > > > > > 2,3,4: RemoteLogMetadataManager.listRemoteLogSegments() does
> > > > specify
> > > > > > the
> > > > > > > > order i.e. it returns the segments sorted by first offset in
> > > > ascending
> > > > > > > > order. I am copying the API docs for KIP-405 here for your
> > > > reference
> > > > > > > >
> > > > > > > > Returns iterator of remote log segment metadata, sorted by
> > > > > > > > {@link RemoteLogSegmentMetadata#startOffset()} in ascending
> > > > > > > > order, which contains the given leader epoch. This is used by
> > > > > > > > the remote log retention management subsystem to fetch the
> > > > > > > > segment metadata for a given leader epoch.
> > > > > > > > @param topicIdPartition topic partition
> > > > > > > > @param leaderEpoch leader epoch
> > > > > > > > @return Iterator of remote segments, sorted by start offset
> > > > > > > > in ascending order.
> > > > > > > >
> > > > > > > > *Luke,*
> > > > > > > >
> > > > > > > > 5. Note that we are trying to optimize the efficiency of size
> > > based
> > > > > > > > retention for remote storage. KIP-405 does not introduce a
> > > > > > > > new config, similar to log.retention.check.interval.ms, for
> > > > > > > > periodically checking retention of the remote tier. Hence,
> > > > > > > > the metric will be updated
> > > > > > > > at the time of invoking log retention check for remote tier
> > which
> > > > is
> > > > > > > > pending implementation today. We can perhaps come back and
> > update
> > > > the
> > > > > > > > metric description after the implementation of log retention
> > > check
> > > > in
> > > > > > > > RemoteLogManager.
> > > > > > > >
> > > > > > > > --
> > > > > > > > Divij Vaidya
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Nov 10, 2022 at 6:16 AM Luke Chen <show...@gmail.com
> >
> > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Divij,
> > > > > > > > >
> > > > > > > > > One more question about the metric:
> > > > > > > > > I think the metric will be updated when
> > > > > > > > > (1) each time we run the log retention check (that is,
> > > > > > > > > log.retention.check.interval.ms)
> > > > > > > > > (2) When user explicitly call getRemoteLogSize
> > > > > > > > >
> > > > > > > > > Is that correct?
> > > > > > > > > Maybe we should add a note in the metric description;
> > > > > > > > > otherwise, a user who gets, let's say, 0 for
> > > > > > > > > RemoteLogSizeBytes will be surprised.
> > > > > > > > >
> > > > > > > > > Otherwise, LGTM
> > > > > > > > >
> > > > > > > > > Thank you for the KIP
> > > > > > > > > Luke
> > > > > > > > >
> > > > > > > > > On Thu, Nov 10, 2022 at 2:55 AM Jun Rao
> > > <j...@confluent.io.invalid
> > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi, Divij,
> > > > > > > > > >
> > > > > > > > > > Thanks for the explanation.
> > > > > > > > > >
> > > > > > > > > > 1. Hmm, the default implementation of RLMM does local
> > > caching,
> > > > > > right?
> > > > > > > > > > Currently, we also cache all segment metadata in the
> > brokers
> > > > > > without
> > > > > > > > > > KIP-405. Do you see a need to change that?
> > > > > > > > > >
> > > > > > > > > > 2,3,4: Yes, your explanation makes sense. However,
> > > > > > > > > > currently,
> RemoteLogMetadataManager.listRemoteLogSegments()
> > > > doesn't
> > > > > > > > > specify
> > > > > > > > > > a particular order of the iterator. Do you intend to
> change
> > > > that?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > Jun
> > > > > > > > > >
> > > > > > > > > > On Tue, Nov 8, 2022 at 3:31 AM Divij Vaidya <
> > > > > > divijvaidy...@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey Jun
> > > > > > > > > > >
> > > > > > > > > > > Thank you for your comments.
> > > > > > > > > > >
> > > > > > > > > > > *1. "RLMM implementor could ensure that
> > > > listRemoteLogSegments()
> > > > > > is
> > > > > > > > > fast"*
> > > > > > > > > > > This would be ideal but pragmatically, it is difficult
> to
> > > > ensure
> > > > > > > that
> > > > > > > > > > > listRemoteLogSegments() is fast. This is because the
> > > > > > > > > > > possibility of a large number of segments (much larger
> > > > > > > > > > > than what Kafka currently handles with local storage
> > > > > > > > > > > today) would make it infeasible to adopt strategies such
> > > > > > > > > > > as local caching to improve the performance of
> > > > > > > > > > > listRemoteLogSegments.
> > > > > > > > > > Apart
> > > > > > > > > > > from caching (which won't work due to size
> limitations) I
> > > > can't
> > > > > > > think
> > > > > > > > > of
> > > > > > > > > > > other strategies which may eliminate the need for IO
> > > > > > > > > > > operations proportional to the number of total
> segments.
> > > > Please
> > > > > > > > advise
> > > > > > > > > if
> > > > > > > > > > > you have something in mind.
> > > > > > > > > > >
> > > > > > > > > > > 2.  "*If the size exceeds the retention size, we need
> to
> > > > > > determine
> > > > > > > > the
> > > > > > > > > > > subset of segments to delete to bring the size within
> the
> > > > > > retention
> > > > > > > > > > limit.
> > > > > > > > > > > Do we need to call
> > > > > > RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > to
> > > > > > > > > > > determine that?"*
> > > > > > > > > > > Yes, we need to call listRemoteLogSegments() to
> determine
> > > > which
> > > > > > > > > segments
> > > > > > > > > > > should be deleted. But there is a difference with the
> use
> > > > case we
> > > > > > > are
> > > > > > > > > > > trying to optimize with this KIP. To determine the
> subset
> > > of
> > > > > > > segments
> > > > > > > > > > which
> > > > > > > > > > > would be deleted, we only read metadata for segments
> > which
> > > > would
> > > > > > be
> > > > > > > > > > deleted
> > > > > > > > > > > via the listRemoteLogSegments(). But to determine the
> > > > > > totalLogSize,
> > > > > > > > > which
> > > > > > > > > > > is required every time retention logic based on size
> > > > executes, we
> > > > > > > > read
> > > > > > > > > > > metadata of *all* the segments in remote storage.
> Hence,
> > > the
> > > > > > number
> > > > > > > > of
> > > > > > > > > > > results returned by
> > > > > > > *RemoteLogMetadataManager.listRemoteLogSegments()
> > > > > > > > > *is
> > > > > > > > > > > different when we are calculating totalLogSize vs. when
> > we
> > > > are
> > > > > > > > > > determining
> > > > > > > > > > > the subset of segments to delete.
> > > > > > > > > > >
> > > > > > > > > > > 3.
> > > > > > > > > > > *"Also, what about time-based retention? To make that
> > > > efficient,
> > > > > > do
> > > > > > > > we
> > > > > > > > > > need
> > > > > > > > > > > to make some additional interface changes?"*
> > > > > > > > > > > No. Note that the time complexity
> > > > > > > > > > > to determine the segments for retention is different
> for
> > > time
> > > > > > based
> > > > > > > > vs.
> > > > > > > > > > > size based. For time based, the time complexity is a
> > > > function of
> > > > > > > the
> > > > > > > > > > number
> > > > > > > > > > > of segments which are "eligible for deletion" (since we
> > > only
> > > > read
> > > > > > > > > > metadata
> > > > > > > > > > > for segments which would be deleted) whereas in size
> > based
> > > > > > > retention,
> > > > > > > > > the
> > > > > > > > > > > time complexity is a function of "all segments"
> available
> > > in
> > > > > > remote
> > > > > > > > > > storage
> > > > > > > > > > > (metadata of all segments needs to be read to calculate
> > the
> > > > total
> > > > > > > > > size).
> > > > > > > > > > As
> > > > > > > > > > > you may observe, this KIP will bring the time
> complexity
> > > for
> > > > both
> > > > > > > > time
> > > > > > > > > > > based retention & size based retention to the same
> > > function.
> > > > > > > > > > >
> > > > > > > > > > > 4. Also, please note that this new API introduced in
> this
> > > KIP
> > > > > > also
> > > > > > > > > > enables
> > > > > > > > > > > us to provide a metric for total size of data stored in
> > > > remote
> > > > > > > > storage.
> > > > > > > > > > > Without the API, calculation of this metric will become
> > > very
> > > > > > > > expensive
> > > > > > > > > > with
> > > > > > > > > > > *listRemoteLogSegments().*
> > > > > > > > > > > I understand that your motivation here is to avoid
> > > polluting
> > > > the
> > > > > > > > > > interface
> > > > > > > > > > > with optimization specific APIs and I will agree with
> > that
> > > > goal.
> > > > > > > But
> > > > > > > > I
> > > > > > > > > > > believe that this new API proposed in the KIP brings in
> > > > > > significant
> > > > > > > > > > > improvement and there is no other work around available
> > to
> > > > > > achieve
> > > > > > > > the
> > > > > > > > > > same
> > > > > > > > > > > performance.
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > > Divij Vaidya
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Nov 8, 2022 at 12:12 AM Jun Rao
> > > > <j...@confluent.io.invalid
> > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi, Divij,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the KIP. Sorry for the late reply.
> > > > > > > > > > > >
> > > > > > > > > > > > The motivation of the KIP is to improve the
> efficiency
> > of
> > > > size
> > > > > > > > based
> > > > > > > > > > > > retention. I am not sure the proposed changes are
> > enough.
> > > > For
> > > > > > > > > example,
> > > > > > > > > > if
> > > > > > > > > > > > the size exceeds the retention size, we need to
> > determine
> > > > the
> > > > > > > > subset
> > > > > > > > > of
> > > > > > > > > > > > segments to delete to bring the size within the
> > retention
> > > > > > limit.
> > > > > > > Do
> > > > > > > > > we
> > > > > > > > > > > need
> > > > > > > > > > > > to call
> > RemoteLogMetadataManager.listRemoteLogSegments()
> > > to
> > > > > > > > determine
> > > > > > > > > > > that?
> > > > > > > > > > > > Also, what about time-based retention? To make that
> > > > efficient,
> > > > > > do
> > > > > > > > we
> > > > > > > > > > need
> > > > > > > > > > > > to make some additional interface changes?
> > > > > > > > > > > >
> > > > > > > > > > > > An alternative approach is for the RLMM implementor
> to
> > > make
> > > > > > sure
> > > > > > > > > > > > that RemoteLogMetadataManager.listRemoteLogSegments()
> > is
> > > > fast
> > > > > > > > (e.g.,
> > > > > > > > > > with
> > > > > > > > > > > > local caching). This way, we could keep the interface
> > > > simple.
> > > > > > > Have
> > > > > > > > we
> > > > > > > > > > > > considered that?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > >
> > > > > > > > > > > > Jun
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Sep 28, 2022 at 6:28 AM Divij Vaidya <
> > > > > > > > > divijvaidy...@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hey folks
> > > > > > > > > > > > >
> > > > > > > > > > > > > Does anyone else have any thoughts on this before I
> > > > propose
> > > > > > > this
> > > > > > > > > for
> > > > > > > > > > a
> > > > > > > > > > > > > vote?
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Sep 5, 2022 at 12:57 PM Satish Duggana <
> > > > > > > > > > > satish.dugg...@gmail.com
> > > > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the KIP Divij!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This is a nice improvement to avoid recalculation
> > of
> > > > size.
> > > > > > > > > > Customized
> > > > > > > > > > > > > RLMMs
> > > > > > > > > > > > > > can implement the best possible approach by
> caching
> > > or
> > > > > > > > > maintaining
> > > > > > > > > > > the
> > > > > > > > > > > > > size
> > > > > > > > > > > > > > in an efficient way. But this is not a big
> concern
> > > for
> > > > the
> > > > > > > > > default
> > > > > > > > > > > > topic
> > > > > > > > > > > > > > based RLMM as mentioned in the KIP.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ~Satish.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, 13 Jul 2022 at 18:48, Divij Vaidya <
> > > > > > > > > > divijvaidy...@gmail.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thank you for your review Luke.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Reg: is that would the new
> `RemoteLogSizeBytes`
> > > > metric
> > > > > > > be a
> > > > > > > > > > > > > performance
> > > > > > > > > > > > > > > overhead? Although we move the calculation to a
> > > > separate
> > > > > > > API,
> > > > > > > > > we
> > > > > > > > > > > > still
> > > > > > > > > > > > > > > can't assume users will implement a
> light-weight
> > > > method,
> > > > > > > > right?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This metric would be logged using the
> information
> > > > that is
> > > > > > > > > already
> > > > > > > > > > > > being
> > > > > > > > > > > > > > > calculated for handling remote retention logic,
> > > > hence, no
> > > > > > > > > > > additional
> > > > > > > > > > > > > work
> > > > > > > > > > > > > > > is required to calculate this metric. More
> > > > specifically,
> > > > > > > > > whenever
> > > > > > > > > > > > > > > RemoteLogManager calls getRemoteLogSize API,
> this
> > > > metric
> > > > > > > > would
> > > > > > > > > be
> > > > > > > > > > > > > > captured.
> > > > > > > > > > > > > > > This API call is made every time
> RemoteLogManager
> > > > wants
> > > > > > to
> > > > > > > > > handle
> > > > > > > > > > > > > expired
> > > > > > > > > > > > > > > remote log segments (which should be periodic).
> > > Does
> > > > that
> > > > > > > > > address
> > > > > > > > > > > > your
> > > > > > > > > > > > > > > concern?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Jul 12, 2022 at 11:01 AM Luke Chen <
> > > > > > > > show...@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Divij,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the KIP!
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I think it makes sense to delegate the
> > > > responsibility
> > > > > > of
> > > > > > > > > > > > calculation
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > specific RemoteLogMetadataManager
> > implementation.
> > > > > > > > > > > > > > > > But one thing I'm not quite sure, is that
> would
> > > > the new
> > > > > > > > > > > > > > > > `RemoteLogSizeBytes` metric be a performance
> > > > overhead?
> > > > > > > > > > > > > > > > Although we move the calculation to a
> separate
> > > > API, we
> > > > > > > > still
> > > > > > > > > > > can't
> > > > > > > > > > > > > > assume
> > > > > > > > > > > > > > > > users will implement a light-weight method,
> > > right?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thank you.
> > > > > > > > > > > > > > > > Luke
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Fri, Jul 1, 2022 at 5:47 PM Divij Vaidya <
> > > > > > > > > > > > divijvaidy...@gmail.com
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-852%3A+Optimize+calculation+of+size+for+log+in+remote+tier
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hey folks
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Please take a look at this KIP which
> proposes
> > > an
> > > > > > > > extension
> > > > > > > > > to
> > > > > > > > > > > > > > KIP-405.
> > > > > > > > > > > > > > > > This
> > > > > > > > > > > > > > > > > is my first KIP with Apache Kafka community
> > so
> > > > any
> > > > > > > > feedback
> > > > > > > > > > > would
> > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > highly
> > > > > > > > > > > > > > > > > appreciated.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Cheers!
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Divij Vaidya
> > > > > > > > > > > > > > > > > Sr. Software Engineer
> > > > > > > > > > > > > > > > > Amazon
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
>
