Thanks Divij for taking the feedback and updating the motivation section in the KIP.
One more comment on Alternative solution-3, The con is not valid as that will not affect the broker restart times as discussed in the earlier email in this thread. You may want to update that. ~Satish. On Sun, 2 Jul 2023 at 01:03, Divij Vaidya <divijvaidy...@gmail.com> wrote: > > Thank you folks for reviewing this KIP. > > Satish, I have modified the motivation to make it more clear. Now it says, > "Since the main feature of tiered storage is storing a large amount of > data, we expect num_remote_segments to be large. A frequent linear scan > (i.e. listing all segment metadata) could be expensive/slower because of > the underlying storage used by RemoteLogMetadataManager. This slowness to > list all segment metadata could result in the loss of availability...." > > Jun, Kamal, Satish, if you don't have any further concerns, I would > appreciate a vote for this KIP in the voting thread - > https://lists.apache.org/thread/soz00990gvzodv7oyqj4ysvktrqy6xfk > > -- > Divij Vaidya > > > > On Sat, Jul 1, 2023 at 6:16 AM Kamal Chandraprakash < > kamal.chandraprak...@gmail.com> wrote: > > > Hi Divij, > > > > Thanks for the explanation. LGTM. > > > > -- > > Kamal > > > > On Sat, Jul 1, 2023 at 7:28 AM Satish Duggana <satish.dugg...@gmail.com> > > wrote: > > > > > Hi Divij, > > > I am fine with having an API to compute the size as I mentioned in my > > > earlier reply in this mail thread. But I have the below comment for > > > the motivation for this KIP. > > > > > > As you discussed offline, the main issue here is listing calls for > > > remote log segment metadata is slower because of the storage used for > > > RLMM. These can be avoided with this new API. > > > > > > Please add this in the motivation section as it is one of the main > > > motivations for the KIP. > > > > > > Thanks, > > > Satish. > > > > > > On Sat, 1 Jul 2023 at 01:43, Jun Rao <j...@confluent.io.invalid> wrote: > > > > > > > > Hi, Divij, > > > > > > > > Sorry for the late reply. > > > > > > > > Given your explanation, the new API sounds reasonable to me. Is that > > > enough > > > > to build the external metadata layer for the remote segments or do you > > > need > > > > some additional API changes? > > > > > > > > Thanks, > > > > > > > > Jun > > > > > > > > On Fri, Jun 9, 2023 at 7:08 AM Divij Vaidya <divijvaidy...@gmail.com> > > > wrote: > > > > > > > > > Thank you for looking into this Kamal. > > > > > > > > > > You are right in saying that a cold start (i.e. leadership failover > > or > > > > > broker startup) does not impact the broker startup duration. But it > > > does > > > > > have the following impact: > > > > > 1. It leads to a burst of full-scan requests to RLMM in case multiple > > > > > leadership failovers occur at the same time. Even if the RLMM > > > > > implementation has the capability to serve the total size from an > > index > > > > > (and hence handle this burst), we wouldn't be able to use it since > > the > > > > > current API necessarily calls for a full scan. > > > > > 2. The archival (copying of data to tiered storage) process will > > have a > > > > > delayed start. The delayed start of archival could lead to local > > build > > > up > > > > > of data which may lead to disk full. > > > > > > > > > > The disadvantage of adding this new API is that every provider will > > > have to > > > > > implement it, agreed. But I believe that this tradeoff is worthwhile > > > since > > > > > the default implementation could be the same as you mentioned, i.e. > > > keeping > > > > > cumulative in-memory count. > > > > > > > > > > -- > > > > > Divij Vaidya > > > > > > > > > > > > > > > > > > > > On Sun, Jun 4, 2023 at 5:48 PM Kamal Chandraprakash < > > > > > kamal.chandraprak...@gmail.com> wrote: > > > > > > > > > > > Hi Divij, > > > > > > > > > > > > Thanks for the KIP! Sorry for the late reply. > > > > > > > > > > > > Can you explain the rejected alternative-3? > > > > > > Store the cumulative size of remote tier log in-memory at > > > > > RemoteLogManager > > > > > > "*Cons*: Every time a broker starts-up, it will scan through all > > the > > > > > > segments in the remote tier to initialise the in-memory value. This > > > would > > > > > > increase the broker start-up time." > > > > > > > > > > > > Keeping the source of truth to determine the remote-log-size in the > > > > > leader > > > > > > would be consistent across different implementations of the plugin. > > > The > > > > > > concern posted in the KIP is that we are calculating the > > > remote-log-size > > > > > on > > > > > > each iteration of the cleaner thread (say 5 mins). If we calculate > > > only > > > > > > once during broker startup or during the leadership reassignment, > > do > > > we > > > > > > still need the cache? > > > > > > > > > > > > The broker startup-time won't be affected by the remote log manager > > > > > > initialisation. The broker continue to start accepting the new > > > > > > produce/fetch requests, while the RLM thread in the background can > > > > > > determine the remote-log-size once and start copying/deleting the > > > > > segments. > > > > > > > > > > > > Thanks, > > > > > > Kamal > > > > > > > > > > > > On Thu, Jun 1, 2023 at 2:08 PM Divij Vaidya < > > divijvaidy...@gmail.com > > > > > > > > > > wrote: > > > > > > > > > > > > > Satish / Jun > > > > > > > > > > > > > > Do you have any thoughts on this? > > > > > > > > > > > > > > -- > > > > > > > Divij Vaidya > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Feb 14, 2023 at 4:15 PM Divij Vaidya < > > > divijvaidy...@gmail.com> > > > > > > > wrote: > > > > > > > > > > > > > > > Hey Jun > > > > > > > > > > > > > > > > It has been a while since this KIP got some attention. While we > > > wait > > > > > > for > > > > > > > > Satish to chime in here, perhaps I can answer your question. > > > > > > > > > > > > > > > > > Could you explain how you exposed the log size in your > > KIP-405 > > > > > > > > implementation? > > > > > > > > > > > > > > > > The APIs available in RLMM as per KIP405 > > > > > > > > are, addRemoteLogSegmentMetadata(), > > > updateRemoteLogSegmentMetadata(), > > > > > > > remoteLogSegmentMetadata(), highestOffsetForEpoch(), > > > > > > > putRemotePartitionDeleteMetadata(), listRemoteLogSegments(), > > > > > > > onPartitionLeadershipChanges() > > > > > > > > and onStopPartitions(). None of these APIs allow us to expose > > > the log > > > > > > > size, > > > > > > > > hence, the only option that remains is to list all segments > > using > > > > > > > > listRemoteLogSegments() and aggregate them every time we > > require > > > to > > > > > > > > calculate the size. Based on our prior discussion, this > > requires > > > > > > reading > > > > > > > > all segment metadata which won't work for non-local RLMM > > > > > > implementations. > > > > > > > > Satish's implementation also performs a full scan and > > calculates > > > the > > > > > > > > aggregate. see: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/remote/RemoteLogManager.scala#L619 > > > > > > > > > > > > > > > > > > > > > > > > Does this answer your question? > > > > > > > > > > > > > > > > -- > > > > > > > > Divij Vaidya > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Dec 20, 2022 at 8:40 PM Jun Rao > > <j...@confluent.io.invalid > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > >> Hi, Divij, > > > > > > > >> > > > > > > > >> Thanks for the explanation. > > > > > > > >> > > > > > > > >> Good question. > > > > > > > >> > > > > > > > >> Hi, Satish, > > > > > > > >> > > > > > > > >> Could you explain how you exposed the log size in your KIP-405 > > > > > > > >> implementation? > > > > > > > >> > > > > > > > >> Thanks, > > > > > > > >> > > > > > > > >> Jun > > > > > > > >> > > > > > > > >> On Tue, Dec 20, 2022 at 4:59 AM Divij Vaidya < > > > > > divijvaidy...@gmail.com > > > > > > > > > > > > > > >> wrote: > > > > > > > >> > > > > > > > >> > Hey Jun > > > > > > > >> > > > > > > > > >> > Yes, it is possible to maintain the log size in the cache > > (see > > > > > > > rejected > > > > > > > >> > alternative#3 in the KIP) but I did not understand how it is > > > > > > possible > > > > > > > to > > > > > > > >> > retrieve it without the new API. The log size could be > > > calculated > > > > > on > > > > > > > >> > startup by scanning through the segments (though I would > > > disagree > > > > > > that > > > > > > > >> this > > > > > > > >> > is the right approach since scanning itself takes order of > > > minutes > > > > > > and > > > > > > > >> > hence delay the start of archive process), and incrementally > > > > > > > maintained > > > > > > > >> > afterwards, even then, we would need an API in > > > > > > > RemoteLogMetadataManager > > > > > > > >> so > > > > > > > >> > that RLM could fetch the cached size! > > > > > > > >> > > > > > > > > >> > If we wish to cache the size without adding a new API, then > > we > > > > > need > > > > > > to > > > > > > > >> > cache the size in RLM itself (instead of RLMM > > implementation) > > > and > > > > > > > >> > incrementally manage it. The downside of longer archive time > > > at > > > > > > > startup > > > > > > > >> > (due to initial scale) still remains valid in this > > situation. > > > > > > > >> > > > > > > > > >> > -- > > > > > > > >> > Divij Vaidya > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > On Fri, Dec 16, 2022 at 12:43 AM Jun Rao > > > <j...@confluent.io.invalid > > > > > > > > > > > > > >> wrote: > > > > > > > >> > > > > > > > > >> > > Hi, Divij, > > > > > > > >> > > > > > > > > > >> > > Thanks for the explanation. > > > > > > > >> > > > > > > > > > >> > > If there is in-memory cache, could we maintain the log > > size > > > in > > > > > the > > > > > > > >> cache > > > > > > > >> > > with the existing API? For example, a replica could make a > > > > > > > >> > > listRemoteLogSegments(TopicIdPartition topicIdPartition) > > > call on > > > > > > > >> startup > > > > > > > >> > to > > > > > > > >> > > get the remote segment size before the current > > leaderEpoch. > > > The > > > > > > > leader > > > > > > > >> > > could then maintain the size incrementally afterwards. On > > > leader > > > > > > > >> change, > > > > > > > >> > > other replicas can make a > > > listRemoteLogSegments(TopicIdPartition > > > > > > > >> > > topicIdPartition, int leaderEpoch) call to get the size of > > > newly > > > > > > > >> > generated > > > > > > > >> > > segments. > > > > > > > >> > > > > > > > > > >> > > Thanks, > > > > > > > >> > > > > > > > > > >> > > Jun > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > On Wed, Dec 14, 2022 at 3:27 AM Divij Vaidya < > > > > > > > divijvaidy...@gmail.com > > > > > > > >> > > > > > > > > >> > > wrote: > > > > > > > >> > > > > > > > > > >> > > > > Is the new method enough for doing size-based > > retention? > > > > > > > >> > > > > > > > > > > >> > > > Yes. You are right in assuming that this API only > > > provides the > > > > > > > >> Remote > > > > > > > >> > > > storage size (for current epoch chain). We would use > > this > > > API > > > > > > for > > > > > > > >> size > > > > > > > >> > > > based retention along with a value of > > > localOnlyLogSegmentSize > > > > > > > which > > > > > > > >> is > > > > > > > >> > > > computed as > > > Log.sizeInBytes(logSegments.filter(_.baseOffset > > > > > > > > >> > > > highestOffsetWithRemoteIndex)). Hence, (total_log_size = > > > > > > > >> > > > remoteLogSizeBytes + log.localOnlyLogSegmentSize). I > > have > > > > > > updated > > > > > > > >> the > > > > > > > >> > KIP > > > > > > > >> > > > with this information. You can also check an example > > > > > > > implementation > > > > > > > >> at > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/Log.scala#L2077 > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > Do you imagine all accesses to remote metadata will be > > > > > across > > > > > > > the > > > > > > > >> > > network > > > > > > > >> > > > or will there be some local in-memory cache? > > > > > > > >> > > > > > > > > > > >> > > > I would expect a disk-less implementation to maintain a > > > finite > > > > > > > >> > in-memory > > > > > > > >> > > > cache for segment metadata to optimize the number of > > > network > > > > > > calls > > > > > > > >> made > > > > > > > >> > > to > > > > > > > >> > > > fetch the data. In future, we can think about bringing > > > this > > > > > > finite > > > > > > > >> size > > > > > > > >> > > > cache into RLM itself but that's probably a conversation > > > for a > > > > > > > >> > different > > > > > > > >> > > > KIP. There are many other things we would like to do to > > > > > optimize > > > > > > > the > > > > > > > >> > > Tiered > > > > > > > >> > > > storage interface such as introducing a circular buffer > > / > > > > > > > streaming > > > > > > > >> > > > interface from RSM (so that we don't have to wait to > > > fetch the > > > > > > > >> entire > > > > > > > >> > > > segment before starting to send records to the > > consumer), > > > > > > caching > > > > > > > >> the > > > > > > > >> > > > segments fetched from RSM locally (I would assume all > > RSM > > > > > plugin > > > > > > > >> > > > implementations to do this, might as well add it to RLM) > > > etc. > > > > > > > >> > > > > > > > > > > >> > > > -- > > > > > > > >> > > > Divij Vaidya > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > On Mon, Dec 12, 2022 at 7:35 PM Jun Rao > > > > > > <j...@confluent.io.invalid > > > > > > > > > > > > > > > >> > > wrote: > > > > > > > >> > > > > > > > > > > >> > > > > Hi, Divij, > > > > > > > >> > > > > > > > > > > > >> > > > > Thanks for the reply. > > > > > > > >> > > > > > > > > > > > >> > > > > Is the new method enough for doing size-based > > > retention? It > > > > > > > gives > > > > > > > >> the > > > > > > > >> > > > total > > > > > > > >> > > > > size of the remote segments, but it seems that we > > still > > > > > don't > > > > > > > know > > > > > > > >> > the > > > > > > > >> > > > > exact total size for a log since there could be > > > overlapping > > > > > > > >> segments > > > > > > > >> > > > > between the remote and the local segments. > > > > > > > >> > > > > > > > > > > > >> > > > > You mentioned a disk-less implementation. Do you > > > imagine all > > > > > > > >> accesses > > > > > > > >> > > to > > > > > > > >> > > > > remote metadata will be across the network or will > > > there be > > > > > > some > > > > > > > >> > local > > > > > > > >> > > > > in-memory cache? > > > > > > > >> > > > > > > > > > > > >> > > > > Thanks, > > > > > > > >> > > > > > > > > > > > >> > > > > Jun > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > On Wed, Dec 7, 2022 at 3:10 AM Divij Vaidya < > > > > > > > >> divijvaidy...@gmail.com > > > > > > > >> > > > > > > > > > >> > > > > wrote: > > > > > > > >> > > > > > > > > > > > >> > > > > > The method is needed for RLMM implementations which > > > fetch > > > > > > the > > > > > > > >> > > > information > > > > > > > >> > > > > > over the network and not for the disk based > > > > > implementations > > > > > > > >> (such > > > > > > > >> > as > > > > > > > >> > > > the > > > > > > > >> > > > > > default topic based RLMM). > > > > > > > >> > > > > > > > > > > > > >> > > > > > I would argue that adding this API makes the > > interface > > > > > more > > > > > > > >> generic > > > > > > > >> > > > than > > > > > > > >> > > > > > what it is today. This is because, with the current > > > APIs > > > > > an > > > > > > > >> > > implementor > > > > > > > >> > > > > is > > > > > > > >> > > > > > restricted to use disk based RLMM solutions only > > > (i.e. the > > > > > > > >> default > > > > > > > >> > > > > > solution) whereas if we add this new API, we unblock > > > usage > > > > > > of > > > > > > > >> > network > > > > > > > >> > > > > based > > > > > > > >> > > > > > RLMM implementations such as databases. > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > On Wed 30. Nov 2022 at 20:40, Jun Rao > > > > > > > <j...@confluent.io.invalid > > > > > > > >> > > > > > > > > >> > > > wrote: > > > > > > > >> > > > > > > > > > > > > >> > > > > > > Hi, Divij, > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > Thanks for the reply. > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > Point#2. My high level question is that is the new > > > > > method > > > > > > > >> needed > > > > > > > >> > > for > > > > > > > >> > > > > > every > > > > > > > >> > > > > > > implementation of remote storage or just for a > > > specific > > > > > > > >> > > > implementation. > > > > > > > >> > > > > > The > > > > > > > >> > > > > > > issues that you pointed out exist for the default > > > > > > > >> implementation > > > > > > > >> > of > > > > > > > >> > > > > RLMM > > > > > > > >> > > > > > as > > > > > > > >> > > > > > > well and so far, the default implementation hasn't > > > > > found a > > > > > > > >> need > > > > > > > >> > > for a > > > > > > > >> > > > > > > similar new method. For public interface, ideally > > we > > > > > want > > > > > > to > > > > > > > >> make > > > > > > > >> > > it > > > > > > > >> > > > > more > > > > > > > >> > > > > > > general. > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > Thanks, > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > Jun > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > On Mon, Nov 21, 2022 at 7:11 AM Divij Vaidya < > > > > > > > >> > > > divijvaidy...@gmail.com> > > > > > > > >> > > > > > > wrote: > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > Thank you Jun and Alex for your comments. > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > Point#1: You are right Jun. As Alex mentioned, > > the > > > > > > > "derived > > > > > > > >> > > > metadata" > > > > > > > >> > > > > > can > > > > > > > >> > > > > > > > increase the size of cached metadata by a factor > > > of 10 > > > > > > but > > > > > > > >> it > > > > > > > >> > > > should > > > > > > > >> > > > > be > > > > > > > >> > > > > > > ok > > > > > > > >> > > > > > > > to cache just the actual metadata. My point > > about > > > size > > > > > > > >> being a > > > > > > > >> > > > > > limitation > > > > > > > >> > > > > > > > for using cache is not valid anymore. > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > Point#2: For a new replica, it would still have > > to > > > > > fetch > > > > > > > the > > > > > > > >> > > > metadata > > > > > > > >> > > > > > > over > > > > > > > >> > > > > > > > the network to initiate the warm up of the cache > > > and > > > > > > > hence, > > > > > > > >> > > > increase > > > > > > > >> > > > > > the > > > > > > > >> > > > > > > > start time of the archival process. Please also > > > note > > > > > the > > > > > > > >> > > > > repercussions > > > > > > > >> > > > > > of > > > > > > > >> > > > > > > > the warm up scan that Alex mentioned in this > > > thread as > > > > > > > part > > > > > > > >> of > > > > > > > >> > > > > #102.2. > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > 100#: Agreed Alex. Thanks for clarifying that. > > My > > > > > point > > > > > > > >> about > > > > > > > >> > > size > > > > > > > >> > > > > > being > > > > > > > >> > > > > > > a > > > > > > > >> > > > > > > > limitation for using cache is not valid anymore. > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > 101#: Alex, if I understand correctly, you are > > > > > > suggesting > > > > > > > to > > > > > > > >> > > cache > > > > > > > >> > > > > the > > > > > > > >> > > > > > > > total size at the leader and update it on > > > archival. > > > > > This > > > > > > > >> > wouldn't > > > > > > > >> > > > > work > > > > > > > >> > > > > > > for > > > > > > > >> > > > > > > > cases when the leader restarts where we would > > > have to > > > > > > > make a > > > > > > > >> > full > > > > > > > >> > > > > scan > > > > > > > >> > > > > > > > to update the total size entry on startup. We > > > expect > > > > > > users > > > > > > > >> to > > > > > > > >> > > store > > > > > > > >> > > > > > data > > > > > > > >> > > > > > > > over longer duration in remote storage which > > > increases > > > > > > the > > > > > > > >> > > > likelihood > > > > > > > >> > > > > > of > > > > > > > >> > > > > > > > leader restarts / failovers. > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > 102#.1: I don't think that the current design > > > > > > accommodates > > > > > > > >> the > > > > > > > >> > > fact > > > > > > > >> > > > > > that > > > > > > > >> > > > > > > > data corruption could happen at the RLMM plugin > > > (we > > > > > > don't > > > > > > > >> have > > > > > > > >> > > > > checksum > > > > > > > >> > > > > > > as > > > > > > > >> > > > > > > > a field in metadata as part of KIP405). If data > > > > > > corruption > > > > > > > >> > > occurs, > > > > > > > >> > > > w/ > > > > > > > >> > > > > > or > > > > > > > >> > > > > > > > w/o the cache, it would be a different problem > > to > > > > > > solve. I > > > > > > > >> > would > > > > > > > >> > > > like > > > > > > > >> > > > > > to > > > > > > > >> > > > > > > > keep this outside the scope of this KIP. > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > 102#.2: Agree. This remains as the main concern > > > for > > > > > > using > > > > > > > >> the > > > > > > > >> > > cache > > > > > > > >> > > > > to > > > > > > > >> > > > > > > > fetch total size. > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > Regards, > > > > > > > >> > > > > > > > Divij Vaidya > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > On Fri, Nov 18, 2022 at 12:59 PM Alexandre > > > Dupriez < > > > > > > > >> > > > > > > > alexandre.dupr...@gmail.com> wrote: > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > Hi Divij, > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > Thanks for the KIP. Please find some comments > > > based > > > > > on > > > > > > > >> what I > > > > > > > >> > > > read > > > > > > > >> > > > > on > > > > > > > >> > > > > > > > > this thread so far - apologies for the repeats > > > and > > > > > the > > > > > > > >> late > > > > > > > >> > > > reply. > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > If I understand correctly, one of the main > > > elements > > > > > of > > > > > > > >> > > discussion > > > > > > > >> > > > > is > > > > > > > >> > > > > > > > > about caching in Kafka versus delegation of > > > > > providing > > > > > > > the > > > > > > > >> > > remote > > > > > > > >> > > > > size > > > > > > > >> > > > > > > > > of a topic-partition to the plugin. > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > A few comments: > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > 100. The size of the “derived metadata” which > > is > > > > > > managed > > > > > > > >> by > > > > > > > >> > the > > > > > > > >> > > > > > plugin > > > > > > > >> > > > > > > > > to represent an rlmMetadata can indeed be > > close > > > to 1 > > > > > > kB > > > > > > > on > > > > > > > >> > > > average > > > > > > > >> > > > > > > > > depending on its own internal structure, e.g. > > > the > > > > > > > >> redundancy > > > > > > > >> > it > > > > > > > >> > > > > > > > > enforces (unfortunately resulting to > > > duplication), > > > > > > > >> additional > > > > > > > >> > > > > > > > > information such as checksums and primary and > > > > > > secondary > > > > > > > >> > > indexable > > > > > > > >> > > > > > > > > keys. But indeed, the rlmMetadata is itself a > > > > > lighter > > > > > > > data > > > > > > > >> > > > > structure > > > > > > > >> > > > > > > > > by a factor of 10. And indeed, instead of > > > caching > > > > > the > > > > > > > >> > “derived > > > > > > > >> > > > > > > > > metadata”, only the rlmMetadata could be, > > which > > > > > should > > > > > > > >> > address > > > > > > > >> > > > the > > > > > > > >> > > > > > > > > concern regarding the memory occupancy of the > > > cache. > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > 101. I am not sure I fully understand why we > > > would > > > > > > need > > > > > > > to > > > > > > > >> > > cache > > > > > > > >> > > > > the > > > > > > > >> > > > > > > > > list of rlmMetadata to retain the remote size > > > of a > > > > > > > >> > > > topic-partition. > > > > > > > >> > > > > > > > > Since the leader of a topic-partition is, in > > > > > > > >> non-degenerated > > > > > > > >> > > > cases, > > > > > > > >> > > > > > > > > the only actor which can mutate the remote > > part > > > of > > > > > the > > > > > > > >> > > > > > > > > topic-partition, hence its size, it could in > > > theory > > > > > > only > > > > > > > >> > cache > > > > > > > >> > > > the > > > > > > > >> > > > > > > > > size of the remote log once it has calculated > > > it? In > > > > > > > which > > > > > > > >> > case > > > > > > > >> > > > > there > > > > > > > >> > > > > > > > > would not be any problem regarding the size of > > > the > > > > > > > caching > > > > > > > >> > > > > strategy. > > > > > > > >> > > > > > > > > Did I miss something there? > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > 102. There may be a few challenges to consider > > > with > > > > > > > >> caching: > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > 102.1) As mentioned above, the caching > > strategy > > > > > > assumes > > > > > > > no > > > > > > > >> > > > mutation > > > > > > > >> > > > > > > > > outside the lifetime of a leader. While this > > is > > > true > > > > > > in > > > > > > > >> the > > > > > > > >> > > > normal > > > > > > > >> > > > > > > > > course of operation, there could be accidental > > > > > > mutation > > > > > > > >> > outside > > > > > > > >> > > > of > > > > > > > >> > > > > > the > > > > > > > >> > > > > > > > > leader and a loss of consistency between the > > > cached > > > > > > > state > > > > > > > >> and > > > > > > > >> > > the > > > > > > > >> > > > > > > > > actual remote representation of the log. E.g. > > > > > > > split-brain > > > > > > > >> > > > > scenarios, > > > > > > > >> > > > > > > > > bugs in the plugins, bugs in external systems > > > with > > > > > > > >> mutating > > > > > > > >> > > > access > > > > > > > >> > > > > on > > > > > > > >> > > > > > > > > the derived metadata. In the worst case, a > > drift > > > > > > between > > > > > > > >> the > > > > > > > >> > > > cached > > > > > > > >> > > > > > > > > size and the actual size could lead to > > > over-deleting > > > > > > > >> remote > > > > > > > >> > > data > > > > > > > >> > > > > > which > > > > > > > >> > > > > > > > > is a durability risk. > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > The alternative you propose, by making the > > > plugin > > > > > the > > > > > > > >> source > > > > > > > >> > of > > > > > > > >> > > > > truth > > > > > > > >> > > > > > > > > w.r.t. to the size of the remote log, can make > > > it > > > > > > easier > > > > > > > >> to > > > > > > > >> > > avoid > > > > > > > >> > > > > > > > > inconsistencies between plugin-managed > > metadata > > > and > > > > > > the > > > > > > > >> > remote > > > > > > > >> > > > log > > > > > > > >> > > > > > > > > from the perspective of Kafka. On the other > > > hand, > > > > > > plugin > > > > > > > >> > > vendors > > > > > > > >> > > > > > would > > > > > > > >> > > > > > > > > have to implement it with the expected > > > efficiency to > > > > > > > have > > > > > > > >> it > > > > > > > >> > > > yield > > > > > > > >> > > > > > > > > benefits. > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > 102.2) As you mentioned, the caching strategy > > in > > > > > Kafka > > > > > > > >> would > > > > > > > >> > > > still > > > > > > > >> > > > > > > > > require one iteration over the list of > > > rlmMetadata > > > > > > when > > > > > > > >> the > > > > > > > >> > > > > > leadership > > > > > > > >> > > > > > > > > of a topic-partition is assigned to a broker, > > > while > > > > > > the > > > > > > > >> > plugin > > > > > > > >> > > > can > > > > > > > >> > > > > > > > > offer alternative constant-time approaches. > > This > > > > > > > >> calculation > > > > > > > >> > > > cannot > > > > > > > >> > > > > > be > > > > > > > >> > > > > > > > > put on the LeaderAndIsr path and would be > > > performed > > > > > in > > > > > > > the > > > > > > > >> > > > > > background. > > > > > > > >> > > > > > > > > In case of bulk leadership migration, listing > > > the > > > > > > > >> rlmMetadata > > > > > > > >> > > > could > > > > > > > >> > > > > > a) > > > > > > > >> > > > > > > > > result in request bursts to any backend system > > > the > > > > > > > plugin > > > > > > > >> may > > > > > > > >> > > use > > > > > > > >> > > > > > > > > [which shouldn’t be a problem for > > > high-throughput > > > > > data > > > > > > > >> stores > > > > > > > >> > > but > > > > > > > >> > > > > > > > > could have cost implications] b) increase > > > > > utilisation > > > > > > > >> > timespan > > > > > > > >> > > of > > > > > > > >> > > > > the > > > > > > > >> > > > > > > > > RLM threads for these calculations potentially > > > > > leading > > > > > > > to > > > > > > > >> > > > transient > > > > > > > >> > > > > > > > > starvation of tasks queued for, typically, > > > > > offloading > > > > > > > >> > > operations > > > > > > > >> > > > c) > > > > > > > >> > > > > > > > > could have a non-marginal CPU footprint on > > > hardware > > > > > > with > > > > > > > >> > strict > > > > > > > >> > > > > > > > > resource constraints. All these elements could > > > have > > > > > an > > > > > > > >> impact > > > > > > > >> > > to > > > > > > > >> > > > > some > > > > > > > >> > > > > > > > > degree depending on the operational > > environment. > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > From a design perspective, one question is > > > where we > > > > > > want > > > > > > > >> the > > > > > > > >> > > > source > > > > > > > >> > > > > > of > > > > > > > >> > > > > > > > > truth w.r.t. remote log size to be during the > > > > > lifetime > > > > > > > of > > > > > > > >> a > > > > > > > >> > > > leader. > > > > > > > >> > > > > > > > > The responsibility of maintaining a consistent > > > > > > > >> representation > > > > > > > >> > > of > > > > > > > >> > > > > the > > > > > > > >> > > > > > > > > remote log is shared by Kafka and the plugin. > > > Which > > > > > > > >> system is > > > > > > > >> > > > best > > > > > > > >> > > > > > > > > placed to maintain such a state while > > providing > > > the > > > > > > > >> highest > > > > > > > >> > > > > > > > > consistency guarantees is something both Kafka > > > and > > > > > > > plugin > > > > > > > >> > > > designers > > > > > > > >> > > > > > > > > could help understand better. > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > Many thanks, > > > > > > > >> > > > > > > > > Alexandre > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > Le jeu. 17 nov. 2022 à 19:27, Jun Rao > > > > > > > >> > <j...@confluent.io.invalid > > > > > > > >> > > > > > > > > > > >> > > > a > > > > > > > >> > > > > > > > écrit : > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > Hi, Divij, > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > Thanks for the reply. > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > Point #1. Is the average remote segment > > > metadata > > > > > > > really > > > > > > > >> > 1KB? > > > > > > > >> > > > > What's > > > > > > > >> > > > > > > > > listed > > > > > > > >> > > > > > > > > > in the public interface is probably well > > > below 100 > > > > > > > >> bytes. > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > Point #2. I guess you are assuming that each > > > > > broker > > > > > > > only > > > > > > > >> > > caches > > > > > > > >> > > > > the > > > > > > > >> > > > > > > > > remote > > > > > > > >> > > > > > > > > > segment metadata in memory. An alternative > > > > > approach > > > > > > is > > > > > > > >> to > > > > > > > >> > > cache > > > > > > > >> > > > > > them > > > > > > > >> > > > > > > in > > > > > > > >> > > > > > > > > > both memory and local disk. That way, on > > > broker > > > > > > > restart, > > > > > > > >> > you > > > > > > > >> > > > just > > > > > > > >> > > > > > > need > > > > > > > >> > > > > > > > to > > > > > > > >> > > > > > > > > > fetch the new remote segments' metadata > > using > > > the > > > > > > > >> > > > > > > > > > listRemoteLogSegments(TopicIdPartition > > > > > > > topicIdPartition, > > > > > > > >> > int > > > > > > > >> > > > > > > > leaderEpoch) > > > > > > > >> > > > > > > > > > api. Will that work? > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > Point #3. Thanks for the explanation and it > > > sounds > > > > > > > good. > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > Thanks, > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > Jun > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > On Thu, Nov 17, 2022 at 7:31 AM Divij > > Vaidya < > > > > > > > >> > > > > > > divijvaidy...@gmail.com> > > > > > > > >> > > > > > > > > > wrote: > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > Hi Jun > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > There are three points that I would like > > to > > > > > > present > > > > > > > >> here: > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > 1. We would require a large cache size to > > > > > > > efficiently > > > > > > > >> > cache > > > > > > > >> > > > all > > > > > > > >> > > > > > > > segment > > > > > > > >> > > > > > > > > > > metadata. > > > > > > > >> > > > > > > > > > > 2. Linear scan of all metadata at broker > > > startup > > > > > > to > > > > > > > >> > > populate > > > > > > > >> > > > > the > > > > > > > >> > > > > > > > cache > > > > > > > >> > > > > > > > > will > > > > > > > >> > > > > > > > > > > be slow and will impact the archival > > > process. > > > > > > > >> > > > > > > > > > > 3. There is no other use case where a full > > > scan > > > > > of > > > > > > > >> > segment > > > > > > > >> > > > > > metadata > > > > > > > >> > > > > > > > is > > > > > > > >> > > > > > > > > > > required. > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > Let's start by quantifying 1. Here's my > > > estimate > > > > > > for > > > > > > > >> the > > > > > > > >> > > size > > > > > > > >> > > > > of > > > > > > > >> > > > > > > the > > > > > > > >> > > > > > > > > cache. > > > > > > > >> > > > > > > > > > > Average size of segment metadata = 1KB. > > This > > > > > could > > > > > > > be > > > > > > > >> > more > > > > > > > >> > > if > > > > > > > >> > > > > we > > > > > > > >> > > > > > > have > > > > > > > >> > > > > > > > > > > frequent leader failover with a large > > > number of > > > > > > > leader > > > > > > > >> > > epochs > > > > > > > >> > > > > > being > > > > > > > >> > > > > > > > > stored > > > > > > > >> > > > > > > > > > > per segment. > > > > > > > >> > > > > > > > > > > Segment size = 100MB. Users will prefer to > > > > > reduce > > > > > > > the > > > > > > > >> > > segment > > > > > > > >> > > > > > size > > > > > > > >> > > > > > > > > from the > > > > > > > >> > > > > > > > > > > default value of 1GB to ensure timely > > > archival > > > > > of > > > > > > > data > > > > > > > >> > > since > > > > > > > >> > > > > data > > > > > > > >> > > > > > > > from > > > > > > > >> > > > > > > > > > > active segment is not archived. > > > > > > > >> > > > > > > > > > > Cache size = num segments * avg. segment > > > > > metadata > > > > > > > >> size = > > > > > > > >> > > > > > > > > (100TB/100MB)*1KB > > > > > > > >> > > > > > > > > > > = 1GB. > > > > > > > >> > > > > > > > > > > While 1GB for cache may not sound like a > > > large > > > > > > > number > > > > > > > >> for > > > > > > > >> > > > > larger > > > > > > > >> > > > > > > > > machines, > > > > > > > >> > > > > > > > > > > it does eat into the memory as an > > additional > > > > > cache > > > > > > > and > > > > > > > >> > > makes > > > > > > > >> > > > > use > > > > > > > >> > > > > > > > cases > > > > > > > >> > > > > > > > > with > > > > > > > >> > > > > > > > > > > large data retention with low throughout > > > > > expensive > > > > > > > >> (where > > > > > > > >> > > > such > > > > > > > >> > > > > > use > > > > > > > >> > > > > > > > case > > > > > > > >> > > > > > > > > > > would could use smaller machines). > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > About point#2: > > > > > > > >> > > > > > > > > > > Even if we say that all segment metadata > > > can fit > > > > > > > into > > > > > > > >> the > > > > > > > >> > > > > cache, > > > > > > > >> > > > > > we > > > > > > > >> > > > > > > > > will > > > > > > > >> > > > > > > > > > > need to populate the cache on broker > > > startup. It > > > > > > > would > > > > > > > >> > not > > > > > > > >> > > be > > > > > > > >> > > > > in > > > > > > > >> > > > > > > the > > > > > > > >> > > > > > > > > > > critical patch of broker startup and hence > > > won't > > > > > > > >> impact > > > > > > > >> > the > > > > > > > >> > > > > > startup > > > > > > > >> > > > > > > > > time. > > > > > > > >> > > > > > > > > > > But it will impact the time when we could > > > start > > > > > > the > > > > > > > >> > > archival > > > > > > > >> > > > > > > process > > > > > > > >> > > > > > > > > since > > > > > > > >> > > > > > > > > > > the RLM thread pool will be blocked on the > > > first > > > > > > > call > > > > > > > >> to > > > > > > > >> > > > > > > > > > > listRemoteLogSegments(). To scan metadata > > > for > > > > > 1MM > > > > > > > >> > segments > > > > > > > >> > > > > > > (computed > > > > > > > >> > > > > > > > > above) > > > > > > > >> > > > > > > > > > > and transfer 1GB data over the network > > from > > > a > > > > > RLMM > > > > > > > >> such > > > > > > > >> > as > > > > > > > >> > > a > > > > > > > >> > > > > > remote > > > > > > > >> > > > > > > > > > > database would be in the order of minutes > > > > > > (depending > > > > > > > >> on > > > > > > > >> > how > > > > > > > >> > > > > > > efficient > > > > > > > >> > > > > > > > > the > > > > > > > >> > > > > > > > > > > scan is with the RLMM implementation). > > > > > Although, I > > > > > > > >> would > > > > > > > >> > > > > concede > > > > > > > >> > > > > > > that > > > > > > > >> > > > > > > > > > > having RLM threads blocked for a few > > > minutes is > > > > > > > >> perhaps > > > > > > > >> > OK > > > > > > > >> > > > but > > > > > > > >> > > > > if > > > > > > > >> > > > > > > we > > > > > > > >> > > > > > > > > > > introduce the new API proposed in the KIP, > > > we > > > > > > would > > > > > > > >> have > > > > > > > >> > a > > > > > > > >> > > > > > > > > > > deterministic startup time for RLM. Adding > > > the > > > > > API > > > > > > > >> comes > > > > > > > >> > > at a > > > > > > > >> > > > > low > > > > > > > >> > > > > > > > cost > > > > > > > >> > > > > > > > > and > > > > > > > >> > > > > > > > > > > I believe the trade off is worth it. > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > About point#3: > > > > > > > >> > > > > > > > > > > We can use > > > > > listRemoteLogSegments(TopicIdPartition > > > > > > > >> > > > > > topicIdPartition, > > > > > > > >> > > > > > > > int > > > > > > > >> > > > > > > > > > > leaderEpoch) to calculate the segments > > > eligible > > > > > > for > > > > > > > >> > > deletion > > > > > > > >> > > > > > (based > > > > > > > >> > > > > > > > on > > > > > > > >> > > > > > > > > size > > > > > > > >> > > > > > > > > > > retention) where leader epoch(s) belong to > > > the > > > > > > > current > > > > > > > >> > > leader > > > > > > > >> > > > > > epoch > > > > > > > >> > > > > > > > > chain. > > > > > > > >> > > > > > > > > > > I understand that it may lead to segments > > > > > > belonging > > > > > > > to > > > > > > > >> > > other > > > > > > > >> > > > > > epoch > > > > > > > >> > > > > > > > > lineage > > > > > > > >> > > > > > > > > > > not getting deleted and would require a > > > separate > > > > > > > >> > mechanism > > > > > > > >> > > to > > > > > > > >> > > > > > > delete > > > > > > > >> > > > > > > > > them. > > > > > > > >> > > > > > > > > > > The separate mechanism would anyways be > > > required > > > > > > to > > > > > > > >> > delete > > > > > > > >> > > > > these > > > > > > > >> > > > > > > > > "leaked" > > > > > > > >> > > > > > > > > > > segments as there are other cases which > > > could > > > > > lead > > > > > > > to > > > > > > > >> > leaks > > > > > > > >> > > > > such > > > > > > > >> > > > > > as > > > > > > > >> > > > > > > > > network > > > > > > > >> > > > > > > > > > > problems with RSM mid way writing through. > > > > > segment > > > > > > > >> etc. > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > Thank you for the replies so far. They > > have > > > made > > > > > > me > > > > > > > >> > > re-think > > > > > > > >> > > > my > > > > > > > >> > > > > > > > > assumptions > > > > > > > >> > > > > > > > > > > and this dialogue has been very > > > constructive for > > > > > > me. > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > Regards, > > > > > > > >> > > > > > > > > > > Divij Vaidya > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > On Thu, Nov 10, 2022 at 10:49 PM Jun Rao > > > > > > > >> > > > > > <j...@confluent.io.invalid > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > wrote: > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > Hi, Divij, > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > Thanks for the reply. > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > It's true that the data in Kafka could > > be > > > kept > > > > > > > >> longer > > > > > > > >> > > with > > > > > > > >> > > > > > > KIP-405. > > > > > > > >> > > > > > > > > How > > > > > > > >> > > > > > > > > > > > much data do you envision to have per > > > broker? > > > > > > For > > > > > > > >> 100TB > > > > > > > >> > > > data > > > > > > > >> > > > > > per > > > > > > > >> > > > > > > > > broker, > > > > > > > >> > > > > > > > > > > > with 1GB segment and segment metadata of > > > 100 > > > > > > > bytes, > > > > > > > >> it > > > > > > > >> > > > > requires > > > > > > > >> > > > > > > > > > > > 100TB/1GB*100 = 10MB, which should fit > > in > > > > > > memory. > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > RemoteLogMetadataManager has two > > > > > > > >> > listRemoteLogSegments() > > > > > > > >> > > > > > methods. > > > > > > > >> > > > > > > > > The one > > > > > > > >> > > > > > > > > > > > you listed > > > > > > listRemoteLogSegments(TopicIdPartition > > > > > > > >> > > > > > > topicIdPartition, > > > > > > > >> > > > > > > > > int > > > > > > > >> > > > > > > > > > > > leaderEpoch) does return data in offset > > > order. > > > > > > > >> However, > > > > > > > >> > > the > > > > > > > >> > > > > > other > > > > > > > >> > > > > > > > > > > > one > > listRemoteLogSegments(TopicIdPartition > > > > > > > >> > > > topicIdPartition) > > > > > > > >> > > > > > > > doesn't > > > > > > > >> > > > > > > > > > > > specify the return order. I assume that > > > you > > > > > need > > > > > > > the > > > > > > > >> > > latter > > > > > > > >> > > > > to > > > > > > > >> > > > > > > > > calculate > > > > > > > >> > > > > > > > > > > > the segment size? > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > Thanks, > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > Jun > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > On Thu, Nov 10, 2022 at 10:25 AM Divij > > > Vaidya > > > > > < > > > > > > > >> > > > > > > > > divijvaidy...@gmail.com> > > > > > > > >> > > > > > > > > > > > wrote: > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > *Jun,* > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > *"the default implementation of RLMM > > > does > > > > > > local > > > > > > > >> > > caching, > > > > > > > >> > > > > > > right?"* > > > > > > > >> > > > > > > > > > > > > Yes, Jun. The default implementation > > of > > > RLMM > > > > > > > does > > > > > > > >> > > indeed > > > > > > > >> > > > > > cache > > > > > > > >> > > > > > > > the > > > > > > > >> > > > > > > > > > > > segment > > > > > > > >> > > > > > > > > > > > > metadata today, hence, it won't work > > > for use > > > > > > > cases > > > > > > > >> > when > > > > > > > >> > > > the > > > > > > > >> > > > > > > > number > > > > > > > >> > > > > > > > > of > > > > > > > >> > > > > > > > > > > > > segments in remote storage is large > > > enough > > > > > to > > > > > > > >> exceed > > > > > > > >> > > the > > > > > > > >> > > > > size > > > > > > > >> > > > > > > of > > > > > > > >> > > > > > > > > cache. > > > > > > > >> > > > > > > > > > > > As > > > > > > > >> > > > > > > > > > > > > part of this KIP, I will implement the > > > new > > > > > > > >> proposed > > > > > > > >> > API > > > > > > > >> > > > in > > > > > > > >> > > > > > the > > > > > > > >> > > > > > > > > default > > > > > > > >> > > > > > > > > > > > > implementation of RLMM but the > > > underlying > > > > > > > >> > > implementation > > > > > > > >> > > > > will > > > > > > > >> > > > > > > > > still be > > > > > > > >> > > > > > > > > > > a > > > > > > > >> > > > > > > > > > > > > scan. I will pick up optimizing that > > in > > > a > > > > > > > separate > > > > > > > >> > PR. > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > *"we also cache all segment metadata > > in > > > the > > > > > > > >> brokers > > > > > > > >> > > > without > > > > > > > >> > > > > > > > > KIP-405. Do > > > > > > > >> > > > > > > > > > > > you > > > > > > > >> > > > > > > > > > > > > see a need to change that?"* > > > > > > > >> > > > > > > > > > > > > Please correct me if I am wrong here > > > but we > > > > > > > cache > > > > > > > >> > > > metadata > > > > > > > >> > > > > > for > > > > > > > >> > > > > > > > > segments > > > > > > > >> > > > > > > > > > > > > "residing in local storage". The size > > > of the > > > > > > > >> current > > > > > > > >> > > > cache > > > > > > > >> > > > > > > works > > > > > > > >> > > > > > > > > fine > > > > > > > >> > > > > > > > > > > for > > > > > > > >> > > > > > > > > > > > > the scale of the number of segments > > > that we > > > > > > > >> expect to > > > > > > > >> > > > store > > > > > > > >> > > > > > in > > > > > > > >> > > > > > > > > local > > > > > > > >> > > > > > > > > > > > > storage. After KIP-405, that cache > > will > > > > > > continue > > > > > > > >> to > > > > > > > >> > > store > > > > > > > >> > > > > > > > metadata > > > > > > > >> > > > > > > > > for > > > > > > > >> > > > > > > > > > > > > segments which are residing in local > > > storage > > > > > > and > > > > > > > >> > hence, > > > > > > > >> > > > we > > > > > > > >> > > > > > > don't > > > > > > > >> > > > > > > > > need > > > > > > > >> > > > > > > > > > > to > > > > > > > >> > > > > > > > > > > > > change that. For segments which have > > > been > > > > > > > >> offloaded > > > > > > > >> > to > > > > > > > >> > > > > remote > > > > > > > >> > > > > > > > > storage, > > > > > > > >> > > > > > > > > > > it > > > > > > > >> > > > > > > > > > > > > would rely on RLMM. Note that the > > scale > > > of > > > > > > data > > > > > > > >> > stored > > > > > > > >> > > in > > > > > > > >> > > > > > RLMM > > > > > > > >> > > > > > > is > > > > > > > >> > > > > > > > > > > > different > > > > > > > >> > > > > > > > > > > > > from local cache because the number of > > > > > > segments > > > > > > > is > > > > > > > >> > > > expected > > > > > > > >> > > > > > to > > > > > > > >> > > > > > > be > > > > > > > >> > > > > > > > > much > > > > > > > >> > > > > > > > > > > > > larger than what current > > implementation > > > > > stores > > > > > > > in > > > > > > > >> > local > > > > > > > >> > > > > > > storage. > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > 2,3,4: > > > > > > > >> > RemoteLogMetadataManager.listRemoteLogSegments() > > > > > > > >> > > > > does > > > > > > > >> > > > > > > > > specify > > > > > > > >> > > > > > > > > > > the > > > > > > > >> > > > > > > > > > > > > order i.e. it returns the segments > > > sorted by > > > > > > > first > > > > > > > >> > > offset > > > > > > > >> > > > > in > > > > > > > >> > > > > > > > > ascending > > > > > > > >> > > > > > > > > > > > > order. I am copying the API docs for > > > KIP-405 > > > > > > > here > > > > > > > >> for > > > > > > > >> > > > your > > > > > > > >> > > > > > > > > reference > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > *Returns iterator of remote log > > segment > > > > > > > metadata, > > > > > > > >> > > sorted > > > > > > > >> > > > by > > > > > > > >> > > > > > > > {@link > > > > > > > >> > > > > > > > > > > > > > > RemoteLogSegmentMetadata#startOffset()} > > > > > > > >> inascending > > > > > > > >> > > order > > > > > > > >> > > > > > which > > > > > > > >> > > > > > > > > > > contains > > > > > > > >> > > > > > > > > > > > > the given leader epoch. This is used > > by > > > > > remote > > > > > > > log > > > > > > > >> > > > > retention > > > > > > > >> > > > > > > > > management > > > > > > > >> > > > > > > > > > > > > subsystemto fetch the segment metadata > > > for a > > > > > > > given > > > > > > > >> > > leader > > > > > > > >> > > > > > > > > epoch.@param > > > > > > > >> > > > > > > > > > > > > topicIdPartition topic partition@param > > > > > > > >> leaderEpoch > > > > > > > >> > > > > > leader > > > > > > > >> > > > > > > > > > > > > epoch@return > > > > > > > >> > > > > > > > > > > > > Iterator of remote segments, sorted by > > > start > > > > > > > >> offset > > > > > > > >> > in > > > > > > > >> > > > > > > ascending > > > > > > > >> > > > > > > > > > > order. * > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > *Luke,* > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > 5. Note that we are trying to optimize > > > the > > > > > > > >> efficiency > > > > > > > >> > > of > > > > > > > >> > > > > size > > > > > > > >> > > > > > > > based > > > > > > > >> > > > > > > > > > > > > retention for remote storage. KIP-405 > > > does > > > > > not > > > > > > > >> > > introduce > > > > > > > >> > > > a > > > > > > > >> > > > > > new > > > > > > > >> > > > > > > > > config > > > > > > > >> > > > > > > > > > > for > > > > > > > >> > > > > > > > > > > > > periodically checking remote similar > > to > > > > > > > >> > > > > > > > > > > log.retention.check.interval.ms > > > > > > > >> > > > > > > > > > > > > which is applicable for remote > > storage. > > > > > Hence, > > > > > > > the > > > > > > > >> > > metric > > > > > > > >> > > > > > will > > > > > > > >> > > > > > > be > > > > > > > >> > > > > > > > > > > updated > > > > > > > >> > > > > > > > > > > > > at the time of invoking log retention > > > check > > > > > > for > > > > > > > >> > remote > > > > > > > >> > > > tier > > > > > > > >> > > > > > > which > > > > > > > >> > > > > > > > > is > > > > > > > >> > > > > > > > > > > > > pending implementation today. We can > > > perhaps > > > > > > > come > > > > > > > >> > back > > > > > > > >> > > > and > > > > > > > >> > > > > > > update > > > > > > > >> > > > > > > > > the > > > > > > > >> > > > > > > > > > > > > metric description after the > > > implementation > > > > > of > > > > > > > log > > > > > > > >> > > > > retention > > > > > > > >> > > > > > > > check > > > > > > > >> > > > > > > > > in > > > > > > > >> > > > > > > > > > > > > RemoteLogManager. > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > -- > > > > > > > >> > > > > > > > > > > > > Divij Vaidya > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > On Thu, Nov 10, 2022 at 6:16 AM Luke > > > Chen < > > > > > > > >> > > > > show...@gmail.com > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > wrote: > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > Hi Divij, > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > One more question about the metric: > > > > > > > >> > > > > > > > > > > > > > I think the metric will be updated > > > when > > > > > > > >> > > > > > > > > > > > > > (1) each time we run the log > > retention > > > > > check > > > > > > > >> (that > > > > > > > >> > > is, > > > > > > > >> > > > > > > > > > > > > > log.retention.check.interval.ms) > > > > > > > >> > > > > > > > > > > > > > (2) When user explicitly call > > > > > > getRemoteLogSize > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > Is that correct? > > > > > > > >> > > > > > > > > > > > > > Maybe we should add a note in metric > > > > > > > >> description, > > > > > > > >> > > > > > otherwise, > > > > > > > >> > > > > > > > when > > > > > > > >> > > > > > > > > > > user > > > > > > > >> > > > > > > > > > > > > got, > > > > > > > >> > > > > > > > > > > > > > let's say 0 of RemoteLogSizeBytes, > > > will be > > > > > > > >> > surprised. > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > Otherwise, LGTM > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > Thank you for the KIP > > > > > > > >> > > > > > > > > > > > > > Luke > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > On Thu, Nov 10, 2022 at 2:55 AM Jun > > > Rao > > > > > > > >> > > > > > > > <j...@confluent.io.invalid > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > wrote: > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Hi, Divij, > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Thanks for the explanation. > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > 1. Hmm, the default implementation > > > of > > > > > RLMM > > > > > > > >> does > > > > > > > >> > > local > > > > > > > >> > > > > > > > caching, > > > > > > > >> > > > > > > > > > > right? > > > > > > > >> > > > > > > > > > > > > > > Currently, we also cache all > > segment > > > > > > > metadata > > > > > > > >> in > > > > > > > >> > > the > > > > > > > >> > > > > > > brokers > > > > > > > >> > > > > > > > > > > without > > > > > > > >> > > > > > > > > > > > > > > KIP-405. Do you see a need to > > change > > > > > that? > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > 2,3,4: Yes, your explanation makes > > > > > sense. > > > > > > > >> > However, > > > > > > > >> > > > > > > > > > > > > > > currently, > > > > > > > >> > > > > > RemoteLogMetadataManager.listRemoteLogSegments() > > > > > > > >> > > > > > > > > doesn't > > > > > > > >> > > > > > > > > > > > > > specify > > > > > > > >> > > > > > > > > > > > > > > a particular order of the > > iterator. > > > Do > > > > > you > > > > > > > >> intend > > > > > > > >> > > to > > > > > > > >> > > > > > change > > > > > > > >> > > > > > > > > that? > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Thanks, > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Jun > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 3:31 AM > > Divij > > > > > > Vaidya > > > > > > > < > > > > > > > >> > > > > > > > > > > divijvaidy...@gmail.com > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > wrote: > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > Hey Jun > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > Thank you for your comments. > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > *1. "RLMM implementor could > > ensure > > > > > that > > > > > > > >> > > > > > > > > listRemoteLogSegments() > > > > > > > >> > > > > > > > > > > is > > > > > > > >> > > > > > > > > > > > > > fast"* > > > > > > > >> > > > > > > > > > > > > > > > This would be ideal but > > > pragmatically, > > > > > > it > > > > > > > is > > > > > > > >> > > > > difficult > > > > > > > >> > > > > > to > > > > > > > >> > > > > > > > > ensure > > > > > > > >> > > > > > > > > > > > that > > > > > > > >> > > > > > > > > > > > > > > > listRemoteLogSegments() is fast. > > > This > > > > > is > > > > > > > >> > because > > > > > > > >> > > of > > > > > > > >> > > > > the > > > > > > > >> > > > > > > > > > > possibility > > > > > > > >> > > > > > > > > > > > > of > > > > > > > >> > > > > > > > > > > > > > a > > > > > > > >> > > > > > > > > > > > > > > > large number of segments (much > > > larger > > > > > > than > > > > > > > >> what > > > > > > > >> > > > Kafka > > > > > > > >> > > > > > > > > currently > > > > > > > >> > > > > > > > > > > > > handles > > > > > > > >> > > > > > > > > > > > > > > > with local storage today) would > > > make > > > > > it > > > > > > > >> > > infeasible > > > > > > > >> > > > to > > > > > > > >> > > > > > > adopt > > > > > > > >> > > > > > > > > > > > > strategies > > > > > > > >> > > > > > > > > > > > > > > such > > > > > > > >> > > > > > > > > > > > > > > > as local caching to improve the > > > > > > > performance > > > > > > > >> of > > > > > > > >> > > > > > > > > > > > listRemoteLogSegments. > > > > > > > >> > > > > > > > > > > > > > > Apart > > > > > > > >> > > > > > > > > > > > > > > > from caching (which won't work > > > due to > > > > > > size > > > > > > > >> > > > > > limitations) I > > > > > > > >> > > > > > > > > can't > > > > > > > >> > > > > > > > > > > > think > > > > > > > >> > > > > > > > > > > > > > of > > > > > > > >> > > > > > > > > > > > > > > > other strategies which may > > > eliminate > > > > > the > > > > > > > >> need > > > > > > > >> > for > > > > > > > >> > > > IO > > > > > > > >> > > > > > > > > > > > > > > > operations proportional to the > > > number > > > > > of > > > > > > > >> total > > > > > > > >> > > > > > segments. > > > > > > > >> > > > > > > > > Please > > > > > > > >> > > > > > > > > > > > > advise > > > > > > > >> > > > > > > > > > > > > > if > > > > > > > >> > > > > > > > > > > > > > > > you have something in mind. > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > 2. "*If the size exceeds the > > > > > retention > > > > > > > >> size, > > > > > > > >> > we > > > > > > > >> > > > need > > > > > > > >> > > > > > to > > > > > > > >> > > > > > > > > > > determine > > > > > > > >> > > > > > > > > > > > > the > > > > > > > >> > > > > > > > > > > > > > > > subset of segments to delete to > > > bring > > > > > > the > > > > > > > >> size > > > > > > > >> > > > within > > > > > > > >> > > > > > the > > > > > > > >> > > > > > > > > > > retention > > > > > > > >> > > > > > > > > > > > > > > limit. > > > > > > > >> > > > > > > > > > > > > > > > Do we need to call > > > > > > > >> > > > > > > > > > > > > > RemoteLogMetadataManager.listRemoteLogSegments() > > > > > > > >> > > > > > > > > > > > > to > > > > > > > >> > > > > > > > > > > > > > > > determine that?"* > > > > > > > >> > > > > > > > > > > > > > > > Yes, we need to call > > > > > > > >> listRemoteLogSegments() to > > > > > > > >> > > > > > determine > > > > > > > >> > > > > > > > > which > > > > > > > >> > > > > > > > > > > > > > segments > > > > > > > >> > > > > > > > > > > > > > > > should be deleted. But there is > > a > > > > > > > difference > > > > > > > >> > with > > > > > > > >> > > > the > > > > > > > >> > > > > > use > > > > > > > >> > > > > > > > > case we > > > > > > > >> > > > > > > > > > > > are > > > > > > > >> > > > > > > > > > > > > > > > trying to optimize with this > > KIP. > > > To > > > > > > > >> determine > > > > > > > >> > > the > > > > > > > >> > > > > > subset > > > > > > > >> > > > > > > > of > > > > > > > >> > > > > > > > > > > > segments > > > > > > > >> > > > > > > > > > > > > > > which > > > > > > > >> > > > > > > > > > > > > > > > would be deleted, we only read > > > > > metadata > > > > > > > for > > > > > > > >> > > > segments > > > > > > > >> > > > > > > which > > > > > > > >> > > > > > > > > would > > > > > > > >> > > > > > > > > > > be > > > > > > > >> > > > > > > > > > > > > > > deleted > > > > > > > >> > > > > > > > > > > > > > > > via the listRemoteLogSegments(). > > > But > > > > > to > > > > > > > >> > determine > > > > > > > >> > > > the > > > > > > > >> > > > > > > > > > > totalLogSize, > > > > > > > >> > > > > > > > > > > > > > which > > > > > > > >> > > > > > > > > > > > > > > > is required every time retention > > > logic > > > > > > > >> based on > > > > > > > >> > > > size > > > > > > > >> > > > > > > > > executes, we > > > > > > > >> > > > > > > > > > > > > read > > > > > > > >> > > > > > > > > > > > > > > > metadata of *all* the segments > > in > > > > > remote > > > > > > > >> > storage. > > > > > > > >> > > > > > Hence, > > > > > > > >> > > > > > > > the > > > > > > > >> > > > > > > > > > > number > > > > > > > >> > > > > > > > > > > > > of > > > > > > > >> > > > > > > > > > > > > > > > results returned by > > > > > > > >> > > > > > > > > > > > > > > > > > *RemoteLogMetadataManager.listRemoteLogSegments() > > > > > > > >> > > > > > > > > > > > > > *is > > > > > > > >> > > > > > > > > > > > > > > > different when we are > > calculating > > > > > > > >> totalLogSize > > > > > > > >> > > vs. > > > > > > > >> > > > > when > > > > > > > >> > > > > > > we > > > > > > > >> > > > > > > > > are > > > > > > > >> > > > > > > > > > > > > > > determining > > > > > > > >> > > > > > > > > > > > > > > > the subset of segments to > > delete. > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > 3. > > > > > > > >> > > > > > > > > > > > > > > > *"Also, what about time-based > > > > > retention? > > > > > > > To > > > > > > > >> > make > > > > > > > >> > > > that > > > > > > > >> > > > > > > > > efficient, > > > > > > > >> > > > > > > > > > > do > > > > > > > >> > > > > > > > > > > > > we > > > > > > > >> > > > > > > > > > > > > > > need > > > > > > > >> > > > > > > > > > > > > > > > to make some additional > > interface > > > > > > > >> changes?"*No. > > > > > > > >> > > > Note > > > > > > > >> > > > > > that > > > > > > > >> > > > > > > > > time > > > > > > > >> > > > > > > > > > > > > > complexity > > > > > > > >> > > > > > > > > > > > > > > > to determine the segments for > > > > > retention > > > > > > is > > > > > > > >> > > > different > > > > > > > >> > > > > > for > > > > > > > >> > > > > > > > time > > > > > > > >> > > > > > > > > > > based > > > > > > > >> > > > > > > > > > > > > vs. > > > > > > > >> > > > > > > > > > > > > > > > size based. For time based, the > > > time > > > > > > > >> complexity > > > > > > > >> > > is > > > > > > > >> > > > a > > > > > > > >> > > > > > > > > function of > > > > > > > >> > > > > > > > > > > > the > > > > > > > >> > > > > > > > > > > > > > > number > > > > > > > >> > > > > > > > > > > > > > > > of segments which are "eligible > > > for > > > > > > > >> deletion" > > > > > > > >> > > > (since > > > > > > > >> > > > > we > > > > > > > >> > > > > > > > only > > > > > > > >> > > > > > > > > read > > > > > > > >> > > > > > > > > > > > > > > metadata > > > > > > > >> > > > > > > > > > > > > > > > for segments which would be > > > deleted) > > > > > > > >> whereas in > > > > > > > >> > > > size > > > > > > > >> > > > > > > based > > > > > > > >> > > > > > > > > > > > retention, > > > > > > > >> > > > > > > > > > > > > > the > > > > > > > >> > > > > > > > > > > > > > > > time complexity is a function of > > > "all > > > > > > > >> segments" > > > > > > > >> > > > > > available > > > > > > > >> > > > > > > > in > > > > > > > >> > > > > > > > > > > remote > > > > > > > >> > > > > > > > > > > > > > > storage > > > > > > > >> > > > > > > > > > > > > > > > (metadata of all segments needs > > > to be > > > > > > read > > > > > > > >> to > > > > > > > >> > > > > calculate > > > > > > > >> > > > > > > the > > > > > > > >> > > > > > > > > total > > > > > > > >> > > > > > > > > > > > > > size). > > > > > > > >> > > > > > > > > > > > > > > As > > > > > > > >> > > > > > > > > > > > > > > > you may observe, this KIP will > > > bring > > > > > the > > > > > > > >> time > > > > > > > >> > > > > > complexity > > > > > > > >> > > > > > > > for > > > > > > > >> > > > > > > > > both > > > > > > > >> > > > > > > > > > > > > time > > > > > > > >> > > > > > > > > > > > > > > > based retention & size based > > > retention > > > > > > to > > > > > > > >> the > > > > > > > >> > > same > > > > > > > >> > > > > > > > function. > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > 4. Also, please note that this > > > new API > > > > > > > >> > introduced > > > > > > > >> > > > in > > > > > > > >> > > > > > this > > > > > > > >> > > > > > > > KIP > > > > > > > >> > > > > > > > > > > also > > > > > > > >> > > > > > > > > > > > > > > enables > > > > > > > >> > > > > > > > > > > > > > > > us to provide a metric for total > > > size > > > > > of > > > > > > > >> data > > > > > > > >> > > > stored > > > > > > > >> > > > > in > > > > > > > >> > > > > > > > > remote > > > > > > > >> > > > > > > > > > > > > storage. > > > > > > > >> > > > > > > > > > > > > > > > Without the API, calculation of > > > this > > > > > > > metric > > > > > > > >> > will > > > > > > > >> > > > > become > > > > > > > >> > > > > > > > very > > > > > > > >> > > > > > > > > > > > > expensive > > > > > > > >> > > > > > > > > > > > > > > with > > > > > > > >> > > > > > > > > > > > > > > > *listRemoteLogSegments().* > > > > > > > >> > > > > > > > > > > > > > > > I understand that your > > motivation > > > here > > > > > > is > > > > > > > to > > > > > > > >> > > avoid > > > > > > > >> > > > > > > > polluting > > > > > > > >> > > > > > > > > the > > > > > > > >> > > > > > > > > > > > > > > interface > > > > > > > >> > > > > > > > > > > > > > > > with optimization specific APIs > > > and I > > > > > > will > > > > > > > >> > agree > > > > > > > >> > > > with > > > > > > > >> > > > > > > that > > > > > > > >> > > > > > > > > goal. > > > > > > > >> > > > > > > > > > > > But > > > > > > > >> > > > > > > > > > > > > I > > > > > > > >> > > > > > > > > > > > > > > > believe that this new API > > > proposed in > > > > > > the > > > > > > > >> KIP > > > > > > > >> > > > brings > > > > > > > >> > > > > in > > > > > > > >> > > > > > > > > > > significant > > > > > > > >> > > > > > > > > > > > > > > > improvement and there is no > > other > > > work > > > > > > > >> around > > > > > > > >> > > > > available > > > > > > > >> > > > > > > to > > > > > > > >> > > > > > > > > > > achieve > > > > > > > >> > > > > > > > > > > > > the > > > > > > > >> > > > > > > > > > > > > > > same > > > > > > > >> > > > > > > > > > > > > > > > performance. > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > Regards, > > > > > > > >> > > > > > > > > > > > > > > > Divij Vaidya > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > On Tue, Nov 8, 2022 at 12:12 AM > > > Jun > > > > > Rao > > > > > > > >> > > > > > > > > <j...@confluent.io.invalid > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > wrote: > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > Hi, Divij, > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > Thanks for the KIP. Sorry for > > > the > > > > > late > > > > > > > >> reply. > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > The motivation of the KIP is > > to > > > > > > improve > > > > > > > >> the > > > > > > > >> > > > > > efficiency > > > > > > > >> > > > > > > of > > > > > > > >> > > > > > > > > size > > > > > > > >> > > > > > > > > > > > > based > > > > > > > >> > > > > > > > > > > > > > > > > retention. I am not sure the > > > > > proposed > > > > > > > >> changes > > > > > > > >> > > are > > > > > > > >> > > > > > > enough. > > > > > > > >> > > > > > > > > For > > > > > > > >> > > > > > > > > > > > > > example, > > > > > > > >> > > > > > > > > > > > > > > if > > > > > > > >> > > > > > > > > > > > > > > > > the size exceeds the retention > > > size, > > > > > > we > > > > > > > >> need > > > > > > > >> > to > > > > > > > >> > > > > > > determine > > > > > > > >> > > > > > > > > the > > > > > > > >> > > > > > > > > > > > > subset > > > > > > > >> > > > > > > > > > > > > > of > > > > > > > >> > > > > > > > > > > > > > > > > segments to delete to bring > > the > > > size > > > > > > > >> within > > > > > > > >> > the > > > > > > > >> > > > > > > retention > > > > > > > >> > > > > > > > > > > limit. > > > > > > > >> > > > > > > > > > > > Do > > > > > > > >> > > > > > > > > > > > > > we > > > > > > > >> > > > > > > > > > > > > > > > need > > > > > > > >> > > > > > > > > > > > > > > > > to call > > > > > > > >> > > > > > > RemoteLogMetadataManager.listRemoteLogSegments() > > > > > > > >> > > > > > > > to > > > > > > > >> > > > > > > > > > > > > determine > > > > > > > >> > > > > > > > > > > > > > > > that? > > > > > > > >> > > > > > > > > > > > > > > > > Also, what about time-based > > > > > retention? > > > > > > > To > > > > > > > >> > make > > > > > > > >> > > > that > > > > > > > >> > > > > > > > > efficient, > > > > > > > >> > > > > > > > > > > do > > > > > > > >> > > > > > > > > > > > > we > > > > > > > >> > > > > > > > > > > > > > > need > > > > > > > >> > > > > > > > > > > > > > > > > to make some additional > > > interface > > > > > > > changes? > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > An alternative approach is for > > > the > > > > > > RLMM > > > > > > > >> > > > implementor > > > > > > > >> > > > > > to > > > > > > > >> > > > > > > > make > > > > > > > >> > > > > > > > > > > sure > > > > > > > >> > > > > > > > > > > > > > > > > that > > > > > > > >> > > > > RemoteLogMetadataManager.listRemoteLogSegments() > > > > > > > >> > > > > > > is > > > > > > > >> > > > > > > > > fast > > > > > > > >> > > > > > > > > > > > > (e.g., > > > > > > > >> > > > > > > > > > > > > > > with > > > > > > > >> > > > > > > > > > > > > > > > > local caching). This way, we > > > could > > > > > > keep > > > > > > > >> the > > > > > > > >> > > > > interface > > > > > > > >> > > > > > > > > simple. > > > > > > > >> > > > > > > > > > > > Have > > > > > > > >> > > > > > > > > > > > > we > > > > > > > >> > > > > > > > > > > > > > > > > considered that? > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > Thanks, > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > Jun > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > On Wed, Sep 28, 2022 at 6:28 > > AM > > > > > Divij > > > > > > > >> Vaidya > > > > > > > >> > < > > > > > > > >> > > > > > > > > > > > > > divijvaidy...@gmail.com> > > > > > > > >> > > > > > > > > > > > > > > > > wrote: > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > Hey folks > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > Does anyone else have any > > > thoughts > > > > > > on > > > > > > > >> this > > > > > > > >> > > > > before I > > > > > > > >> > > > > > > > > propose > > > > > > > >> > > > > > > > > > > > this > > > > > > > >> > > > > > > > > > > > > > for > > > > > > > >> > > > > > > > > > > > > > > a > > > > > > > >> > > > > > > > > > > > > > > > > > vote? > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > -- > > > > > > > >> > > > > > > > > > > > > > > > > > Divij Vaidya > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > On Mon, Sep 5, 2022 at 12:57 > > > PM > > > > > > Satish > > > > > > > >> > > Duggana > > > > > > > >> > > > < > > > > > > > >> > > > > > > > > > > > > > > > satish.dugg...@gmail.com > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > wrote: > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > Thanks for the KIP Divij! > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > This is a nice improvement > > > to > > > > > > avoid > > > > > > > >> > > > > recalculation > > > > > > > >> > > > > > > of > > > > > > > >> > > > > > > > > size. > > > > > > > >> > > > > > > > > > > > > > > Customized > > > > > > > >> > > > > > > > > > > > > > > > > > RLMMs > > > > > > > >> > > > > > > > > > > > > > > > > > > can implement the best > > > possible > > > > > > > >> approach > > > > > > > >> > by > > > > > > > >> > > > > > caching > > > > > > > >> > > > > > > > or > > > > > > > >> > > > > > > > > > > > > > maintaining > > > > > > > >> > > > > > > > > > > > > > > > the > > > > > > > >> > > > > > > > > > > > > > > > > > size > > > > > > > >> > > > > > > > > > > > > > > > > > > in an efficient way. But > > > this is > > > > > > > not a > > > > > > > >> > big > > > > > > > >> > > > > > concern > > > > > > > >> > > > > > > > for > > > > > > > >> > > > > > > > > the > > > > > > > >> > > > > > > > > > > > > > default > > > > > > > >> > > > > > > > > > > > > > > > > topic > > > > > > > >> > > > > > > > > > > > > > > > > > > based RLMM as mentioned in > > > the > > > > > > KIP. > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > ~Satish. > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > On Wed, 13 Jul 2022 at > > > 18:48, > > > > > > Divij > > > > > > > >> > Vaidya > > > > > > > >> > > < > > > > > > > >> > > > > > > > > > > > > > > divijvaidy...@gmail.com> > > > > > > > >> > > > > > > > > > > > > > > > > > > wrote: > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Thank you for your > > review > > > > > Luke. > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Reg: is that would the > > > new > > > > > > > >> > > > > > `RemoteLogSizeBytes` > > > > > > > >> > > > > > > > > metric > > > > > > > >> > > > > > > > > > > > be a > > > > > > > >> > > > > > > > > > > > > > > > > > performance > > > > > > > >> > > > > > > > > > > > > > > > > > > > overhead? Although we > > > move the > > > > > > > >> > > calculation > > > > > > > >> > > > > to a > > > > > > > >> > > > > > > > > seperate > > > > > > > >> > > > > > > > > > > > API, > > > > > > > >> > > > > > > > > > > > > > we > > > > > > > >> > > > > > > > > > > > > > > > > still > > > > > > > >> > > > > > > > > > > > > > > > > > > > can't assume users will > > > > > > implement > > > > > > > a > > > > > > > >> > > > > > light-weight > > > > > > > >> > > > > > > > > method, > > > > > > > >> > > > > > > > > > > > > right? > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > This metric would be > > > logged > > > > > > using > > > > > > > >> the > > > > > > > >> > > > > > information > > > > > > > >> > > > > > > > > that is > > > > > > > >> > > > > > > > > > > > > > already > > > > > > > >> > > > > > > > > > > > > > > > > being > > > > > > > >> > > > > > > > > > > > > > > > > > > > calculated for handling > > > remote > > > > > > > >> > retention > > > > > > > >> > > > > logic, > > > > > > > >> > > > > > > > > hence, no > > > > > > > >> > > > > > > > > > > > > > > > additional > > > > > > > >> > > > > > > > > > > > > > > > > > work > > > > > > > >> > > > > > > > > > > > > > > > > > > > is required to calculate > > > this > > > > > > > >> metric. > > > > > > > >> > > More > > > > > > > >> > > > > > > > > specifically, > > > > > > > >> > > > > > > > > > > > > > whenever > > > > > > > >> > > > > > > > > > > > > > > > > > > > RemoteLogManager calls > > > > > > > >> getRemoteLogSize > > > > > > > >> > > > API, > > > > > > > >> > > > > > this > > > > > > > >> > > > > > > > > metric > > > > > > > >> > > > > > > > > > > > > would > > > > > > > >> > > > > > > > > > > > > > be > > > > > > > >> > > > > > > > > > > > > > > > > > > captured. > > > > > > > >> > > > > > > > > > > > > > > > > > > > This API call is made > > > every > > > > > time > > > > > > > >> > > > > > RemoteLogManager > > > > > > > >> > > > > > > > > wants > > > > > > > >> > > > > > > > > > > to > > > > > > > >> > > > > > > > > > > > > > handle > > > > > > > >> > > > > > > > > > > > > > > > > > expired > > > > > > > >> > > > > > > > > > > > > > > > > > > > remote log segments > > (which > > > > > > should > > > > > > > be > > > > > > > >> > > > > periodic). > > > > > > > >> > > > > > > > Does > > > > > > > >> > > > > > > > > that > > > > > > > >> > > > > > > > > > > > > > address > > > > > > > >> > > > > > > > > > > > > > > > > your > > > > > > > >> > > > > > > > > > > > > > > > > > > > concern? > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Divij Vaidya > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > On Tue, Jul 12, 2022 at > > > 11:01 > > > > > AM > > > > > > > >> Luke > > > > > > > >> > > Chen > > > > > > > >> > > > < > > > > > > > >> > > > > > > > > > > > > show...@gmail.com> > > > > > > > >> > > > > > > > > > > > > > > > > wrote: > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Hi Divij, > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Thanks for the KIP! > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > I think it makes sense > > > to > > > > > > > delegate > > > > > > > >> > the > > > > > > > >> > > > > > > > > responsibility > > > > > > > >> > > > > > > > > > > of > > > > > > > >> > > > > > > > > > > > > > > > > calculation > > > > > > > >> > > > > > > > > > > > > > > > > > to > > > > > > > >> > > > > > > > > > > > > > > > > > > > the > > > > > > > >> > > > > > > > > > > > > > > > > > > > > specific > > > > > > > RemoteLogMetadataManager > > > > > > > >> > > > > > > implementation. > > > > > > > >> > > > > > > > > > > > > > > > > > > > > But one thing I'm not > > > quite > > > > > > > sure, > > > > > > > >> is > > > > > > > >> > > that > > > > > > > >> > > > > > would > > > > > > > >> > > > > > > > > the new > > > > > > > >> > > > > > > > > > > > > > > > > > > > > `RemoteLogSizeBytes` > > > metric > > > > > > be a > > > > > > > >> > > > > performance > > > > > > > >> > > > > > > > > overhead? > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Although we move the > > > > > > calculation > > > > > > > >> to a > > > > > > > >> > > > > > seperate > > > > > > > >> > > > > > > > > API, we > > > > > > > >> > > > > > > > > > > > > still > > > > > > > >> > > > > > > > > > > > > > > > can't > > > > > > > >> > > > > > > > > > > > > > > > > > > assume > > > > > > > >> > > > > > > > > > > > > > > > > > > > > users will implement a > > > > > > > >> light-weight > > > > > > > >> > > > method, > > > > > > > >> > > > > > > > right? > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Thank you. > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Luke > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > On Fri, Jul 1, 2022 at > > > 5:47 > > > > > PM > > > > > > > >> Divij > > > > > > > >> > > > > Vaidya < > > > > > > > >> > > > > > > > > > > > > > > > > divijvaidy...@gmail.com > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-852%3A+Optimize+calculation+of+size+for+log+in+remote+tier > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Hey folks > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Please take a look > > at > > > this > > > > > > KIP > > > > > > > >> > which > > > > > > > >> > > > > > proposes > > > > > > > >> > > > > > > > an > > > > > > > >> > > > > > > > > > > > > extension > > > > > > > >> > > > > > > > > > > > > > to > > > > > > > >> > > > > > > > > > > > > > > > > > > KIP-405. > > > > > > > >> > > > > > > > > > > > > > > > > > > > > This > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > is my first KIP with > > > > > Apache > > > > > > > >> Kafka > > > > > > > >> > > > > community > > > > > > > >> > > > > > > so > > > > > > > >> > > > > > > > > any > > > > > > > >> > > > > > > > > > > > > feedback > > > > > > > >> > > > > > > > > > > > > > > > would > > > > > > > >> > > > > > > > > > > > > > > > > > be > > > > > > > >> > > > > > > > > > > > > > > > > > > > > highly > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > appreciated. > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Cheers! > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Divij Vaidya > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Sr. Software > > Engineer > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > Amazon > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >