Hi Harsha/Satish,

Hope you are doing well. Could you please update the meeting notes section for the two most recent meetings (from 10/13 and 11/10)? It would be useful to share that context with the community. https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-MeetingNotes
Cheers,
Kowshik

On Tue, Nov 10, 2020 at 11:39 PM Kowshik Prakasam <kpraka...@confluent.io> wrote:

Hi Harsha,

The goal we discussed is to aim for a preview in AK 3.0. In order to get us there, it will be useful to think about the order in which the code changes will be implemented, reviewed and merged. Since you are driving the development, do you want to lay out the order of things? For example, do you eventually want to break up the PR into multiple smaller ones? If so, you could list the milestones there. Another perspective is that this can be helpful to budget time suitably and to understand the progress. Let us know how we can help.

Cheers,
Kowshik

On Tue, Nov 10, 2020 at 3:26 PM Harsha Chintalapani <ka...@harsha.io> wrote:

Thanks Kowshik for the link. Seems reasonable. As we discussed on the call, the code and completion of this KIP will be taken up by us. Regarding Milestone 2, what do you think needs to be clarified there? I believe what we are promising in the KIP, along with unit tests and system tests, will be delivered, and we can call that the preview. We will be running this in our production and will continue to provide the data and metrics to push this feature to GA.

On Tue, Nov 10, 2020 at 10:07 AM, Kowshik Prakasam <kpraka...@confluent.io> wrote:

Hi Harsha/Satish,

Thanks for the discussion today. Here is a link to the KIP-405 development milestones google doc we discussed in the meeting today: https://docs.google.com/document/d/1B5_jaZvWWb2DUpgbgImq0k_IPZ4DWrR8Ru7YpuJrXdc/edit . I have shared it with you. Please have a look and share your feedback/improvements. As we discussed, things are clear until milestone 1. Beyond that, we can discuss it again (perhaps in the next sync or later), once you have thought through the implementation plan/milestones and the release into preview in 3.0.

Cheers,
Kowshik

On Tue, Nov 10, 2020 at 6:56 AM Satish Duggana <satish.dugg...@gmail.com> wrote:

Hi Jun,
Thanks for your comments. Please find the inline replies below.

605.2 "Build the local leader epoch cache by cutting the leader epoch sequence received from remote storage to [LSO, ELO]." I mentioned an issue earlier. Suppose the leader's local start offset is 100. The follower finds a remote segment covering offset range [80, 120). The producerState with this remote segment is up to offset 120. To trim the producerState to offset 100 requires more work since one needs to download the previous producerState up to offset 80 and then replay the messages from 80 to 100. It seems that it's simpler in this case for the follower just to take the remote segment as it is and start fetching from offset 120.

We chose that approach to avoid any edge cases here. The remote log segment that is received may not have the same leader epoch sequence for 100-120 as the leader has (this can happen due to an unclean leader election). It is safe to start from what the leader returns here. Another way is to find the remote log segment

5016. Just to echo what Kowshik was saying. It seems that RLMM.onPartitionLeadershipChanges() is only called on the replicas for a partition, not on the replicas for the __remote_log_segment_metadata partition.
It's not clear how the leader of __remote_log_segment_metadata obtains the metadata for remote segments for deletion.

RLMM will always receive the callback for the remote log metadata topic partitions hosted on the local broker, and these will be subscribed. I will make this clear in the KIP.

5100. KIP-516 has been accepted and is being implemented now. Could you update the KIP based on topicID?

We mentioned KIP-516 and how it helps. We will update this KIP with all the changes it brings with KIP-516.

5101. RLMM: It would be useful to clarify how the following two APIs are used. According to the wiki, the former is used for topic deletion and the latter is used for retention. It seems that retention should use the former since remote segments without a matching epoch in the leader (potentially due to unclean leader election) also need to be garbage collected. The latter seems to be used for the new leader to determine the last tiered segment.
default Iterator<RemoteLogSegmentMetadata> listRemoteLogSegments(TopicPartition topicPartition)
Iterator<RemoteLogSegmentMetadata> listRemoteLogSegments(TopicPartition topicPartition, long leaderEpoch);

Right, that is what we are currently doing. We will update the javadocs and wiki with that. Earlier, we did not want to remove the segments which are not matched with leader epochs from the leader partition, as they may be used later by a replica which can become a leader (unclean leader election) and refer to those segments. But that may leak these segments in remote storage for the lifetime of the topic. We decided to clean up the segments with the oldest epochs in the case of size-based retention as well.

5102. RSM:
5102.1 For methods like fetchLogSegmentData(), it seems that they can use RemoteLogSegmentId instead of RemoteLogSegmentMetadata.

It is useful for the RSM to have the metadata when fetching a log segment. It may create the location/path using the id together with other metadata.

5102.2 In fetchLogSegmentData(), should we use long instead of Long?

We wanted to keep endPosition optional, to allow reading till the end of the segment and to avoid sentinel values.

5102.3 Why do only some of the methods have a default implementation and others don't?

Actually, RSM will not have any default implementations. Those 3 methods were made default earlier for tests etc. Updated the wiki.

5102.4 Could we define RemoteLogSegmentMetadataUpdate and DeletePartitionUpdate?

Sure, they will be added.

5102.5 LogSegmentData: It seems that it's easier to pass in leaderEpochIndex as a ByteBuffer or byte array than a file since it will be generated in memory.

Right, this is in the plan.

5102.6 RemoteLogSegmentMetadata: It seems that it needs both baseOffset and startOffset. For example, deleteRecords() could move the startOffset to the middle of a segment. If we copy the full segment to remote storage, the baseOffset and the startOffset will be different.

Good point. startOffset is baseOffset by default, if not set explicitly.
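To make the shapes discussed in 5102.1, 5102.2 and 5102.6 concrete, here is a minimal illustrative Java sketch. It is not the KIP's final API: the type and method names and the use of Optional are assumptions for illustration only. It shows fetchLogSegmentData() taking the full metadata (so the plugin can derive the remote location from it), an optional end position meaning "read to the end of the segment", and startOffset falling back to baseOffset when not set explicitly.

// Illustrative sketch only -- not the KIP's final interfaces.
import java.io.InputStream;
import java.util.Optional;

interface RemoteStorageManagerSketch {
    // The full metadata is passed so the plugin can derive the remote location/path
    // from the id plus the other fields. An empty endPosition means "read to the
    // end of the segment", avoiding a sentinel value.
    InputStream fetchLogSegmentData(RemoteLogSegmentMetadataSketch metadata,
                                    long startPosition,
                                    Optional<Long> endPosition);
}

final class RemoteLogSegmentMetadataSketch {
    private final long baseOffset;   // first offset in the copied segment file
    private final Long startOffset;  // may move past baseOffset after deleteRecords()

    RemoteLogSegmentMetadataSketch(long baseOffset, Long startOffset) {
        this.baseOffset = baseOffset;
        this.startOffset = startOffset;
    }

    long baseOffset() {
        return baseOffset;
    }

    // As noted above: startOffset defaults to baseOffset unless set explicitly.
    long startOffset() {
        return startOffset != null ? startOffset : baseOffset;
    }
}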
5102.7 Could we define all the public methods for RemoteLogSegmentMetadata and LogSegmentData?

Sure, updated the wiki.

5102.8 Could we document whether endOffset in RemoteLogSegmentMetadata is inclusive/exclusive?

It is inclusive, will update.

5103. configs:
5103.1 Could we define the default value of non-required configs (e.g. the size of new thread pools)?

Sure, that makes sense.

5103.2 It seems that local.log.retention.ms should default to retention.ms, instead of remote.log.retention.minutes. Similarly, it seems that local.log.retention.bytes should default to segment.bytes.

Right, we do not have remote.log.retention as we discussed earlier. Thanks for catching the typo.

5103.3 remote.log.manager.thread.pool.size: The description says "used in scheduling tasks to copy segments, fetch remote log indexes and clean up remote log segments". However, there is a separate config remote.log.reader.threads for fetching remote data. It's weird to fetch remote index and log in different thread pools since both are used for serving fetch requests.

Right, remote.log.manager.thread.pool is mainly used for copy/cleanup activities. The fetch path always goes through remote.log.reader.threads.

5103.4 remote.log.manager.task.interval.ms: Is that the amount of time to back off when there is no work to do? If so, perhaps it can be renamed as backoff.ms.

This is the delay interval for each iteration. It may be renamed to remote.log.manager.task.delay.ms.

5103.5 Are rlm_process_interval_ms and rlm_retry_interval_ms configs? If so, they need to be listed in this section.

remote.log.manager.task.interval.ms is the process interval; the retry interval is missing in the configs and will be added in the KIP.
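To show how the settings discussed in 5103 fit together, here is a hypothetical broker-side snippet. The property names follow this thread; the values are placeholders chosen for the example, not the KIP's defaults.

// Hypothetical example only: the values below are placeholders, not the KIP's defaults.
import java.util.Properties;

public class TieredStorageConfigExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("remote.log.storage.enable", "true");
        // Local retention limits; per the discussion above, these default to the
        // topic-level retention.ms / segment.bytes when not set.
        props.put("local.log.retention.ms", "3600000");
        props.put("local.log.retention.bytes", "1073741824");
        // Copy/cleanup tasks and remote fetches run in separate thread pools.
        props.put("remote.log.manager.thread.pool.size", "4");
        props.put("remote.log.reader.threads", "4");
        // Delay between RLM task iterations (the process interval mentioned above).
        props.put("remote.log.manager.task.interval.ms", "30000");
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}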
5104. "RLM maintains a bounded cache (possibly LRU) of the index files of remote log segments to avoid multiple index fetches from the remote storage." Is the RLM in memory or on disk? If on disk, where is it stored? Do we need a configuration to bound the size?

It is stored on disk, in a directory `remote-log-index-cache` under the log dir. We will have a configuration to bound its size rather than a hard-coded default.

5105. The KIP uses local-log-start-offset and Earliest Local Offset in different places. It would be useful to standardize the terminology.

Sure.

5106. The section on "In BuildingRemoteLogAux state". It listed two options without saying which option is chosen.

We already mentioned in the KIP that we chose option-2.

5107. Follower to leader transition: It has step 2, but not step 1.

Step 1 is there but it is not explicitly highlighted; it is the table preceding step 2.

5108. If a consumer fetches from the remote data and the remote storage is not available, what error code is used in the fetch response?

Good point. We have not yet defined the error for this case. We need to define an error code and send it in the fetch response.

5109. "ListOffsets: For timestamps >= 0, it returns the first message offset whose timestamp is >= to the given timestamp in the request. That means it checks in remote log time indexes first, after which local log time indexes are checked." Could you document which method in RLMM is used for this?

Okay.

5110. StopReplica: "it sets all the remote log segment metadata of that partition with a delete marker and publishes them to RLMM." This seems outdated given the new topic deletion logic.

Will update with the KIP-516 related points.

5111. "RLM follower fetches the earliest offset for the earliest leader epoch by calling RLMM.earliestLogOffset(TopicPartition topicPartition, int leaderEpoch) and updates that as the log start offset." Do we need that since replication propagates logStartOffset already?

Good point. Right, the existing replication protocol takes care of updating the followers' log start offset received from the leader.

5112. Is the default maxWaitMs of 500ms enough for fetching from remote storage?

Remote reads may fail within the current default wait time, but subsequent fetches would be served as that data is stored in the local cache. This cache is currently implemented in RSMs, but we plan to pull this into the remote log messaging layer in the future.

5113. "Committed offsets can be stored in a local file to avoid reading the messages again when a broker is restarted." Could you describe the format and the location of the file? Also, could the same message be processed by RLMM again after broker restart? If so, how do we handle that?

Sure, we will update the KIP.

5114. Message format
5114.1 There are two records named RemoteLogSegmentMetadataRecord with apiKey 0 and 1.

Nice catch, that was a typo. Fixed in the wiki.

5114.2 RemoteLogSegmentMetadataRecord: Could we document whether endOffset is inclusive/exclusive?

It is inclusive, will update.

5114.3 RemoteLogSegmentMetadataRecord: Could you explain LeaderEpoch a bit more? Is that the epoch of the leader when it copies the segment to remote storage? Also, how will this field be used?

Right, this is the leader epoch of the broker which copied this segment. This is helpful in reasoning about which broker copied the segment to remote storage.

5114.4 EventTimestamp: Could you explain this a bit more? Each record in Kafka already has a timestamp field. Could we just use that?

This is the timestamp at which the respective event occurred. We added this to RemoteLogSegmentMetadata as RLMM can be any other implementation. We thought about that, but it looked cleaner to keep it at the message structure level instead of getting it from the consumer record and using that to build the respective event.

5114.5 SegmentSizeInBytes: Could this just be int32?

Right, it looks like the config allows only int values >= 14.

5115. RemoteLogCleaner (RLC): This could be confused with the log cleaner for compaction. Perhaps it can be renamed to something like RemotePartitionRemover.

I am fine with RemotePartitionRemover or RemoteLogDeletionManager (we have other manager classes like RLM, RLMM).

5116. "RLC receives the delete_partition_marked and processes it if it is not yet processed earlier."
How does it know whether >> > delete_partition_marked has been processed earlier? >> > >> > This is to handle duplicate delete_partition_marked events. RLC >> internally >> > maintains a state for the delete_partition events and if it already has >> an >> > existing event then it ignores if it is already being processed. >> > >> > 5117. Should we add a new MessageFormatter to read the tier metadata >> > topic? >> > >> > Right, this is in plan but did not mention it in the KIP. This will be >> > useful for debugging purposes too. >> > >> > 5118. "Maximum remote log reader thread pool task queue size. If the >> task >> > queue is full, broker will stop reading remote log segments." What do we >> > return to the fetch request in this case? >> > >> > We return an error response for that partition. >> > >> > 5119. It would be useful to list all things not supported in the first >> > version in a Future work or Limitations section. For example, compacted >> > topic, JBOD, changing remote.log.storage.enable from true to false, etc. >> > >> > We already have a non-goals section which is filled with some of these >> > details. Do we need another limitations section? >> > >> > Thanks, >> > Satish. >> > >> > On Wed, Nov 4, 2020 at 11:27 PM Jun Rao <j...@confluent.io> wrote: >> > >> > Hi, Satish, >> > >> > Thanks for the updated KIP. A few more comments below. >> > >> > 605.2 "Build the local leader epoch cache by cutting the leader epoch >> > sequence received from remote storage to [LSO, ELO]." I mentioned an >> > >> > issue >> > >> > earlier. Suppose the leader's local start offset is 100. The follower >> > >> > finds >> > >> > a remote segment covering offset range [80, 120). The producerState with >> > this remote segment is up to offset 120. To trim the producerState to >> > offset 100 requires more work since one needs to download the previous >> > producerState up to offset 80 and then replay the messages from 80 to >> > >> > 100. >> > >> > It seems that it's simpler in this case for the follower just to take >> the >> > remote segment as it is and start fetching from offset 120. >> > >> > 5016. Just to echo what Kowshik was saying. It seems that >> > RLMM.onPartitionLeadershipChanges() is only called on the replicas for a >> > partition, not on the replicas for the __remote_log_segment_metadata >> > partition. It's not clear how the leader of >> __remote_log_segment_metadata >> > obtains the metadata for remote segments for deletion. >> > >> > 5100. KIP-516 <https://issues.apache.org/jira/browse/KIP-516> has been >> accepted and is being implemented now. Could you >> > update the KIP based on topicID? >> > >> > 5101. RLMM: It would be useful to clarify how the following two APIs are >> > used. According to the wiki, the former is used for topic deletion and >> > >> > the >> > >> > latter is used for retention. It seems that retention should use the >> > >> > former >> > >> > since remote segments without a matching epoch in the leader >> (potentially >> > due to unclean leader election) also need to be garbage collected. The >> > latter seems to be used for the new leader to determine the last tiered >> > segment. >> > default Iterator<RemoteLogSegmentMetadata> >> > listRemoteLogSegments(TopicPartition topicPartition) >> > Iterator<RemoteLogSegmentMetadata> >> > >> > listRemoteLogSegments(TopicPartition >> > >> > topicPartition, long leaderEpoch); >> > >> > 5102. 
RSM: >> > 5102.1 For methods like fetchLogSegmentData(), it seems that they can >> use >> > RemoteLogSegmentId instead of RemoteLogSegmentMetadata. 5102.2 In >> > fetchLogSegmentData(), should we use long instead of Long? 5102.3 Why >> only >> > some of the methods have default implementation and >> > >> > others >> > >> > don't? >> > 5102.4. Could we define RemoteLogSegmentMetadataUpdate and >> > DeletePartitionUpdate? >> > 5102.5 LogSegmentData: It seems that it's easier to pass in >> > leaderEpochIndex as a ByteBuffer or byte array than a file since it >> > >> > will >> > >> > be generated in memory. >> > 5102.6 RemoteLogSegmentMetadata: It seems that it needs both baseOffset >> > >> > and >> > >> > startOffset. For example, deleteRecords() could move the startOffset to >> > >> > the >> > >> > middle of a segment. If we copy the full segment to remote storage, the >> > baseOffset and the startOffset will be different. >> > 5102.7 Could we define all the public methods for >> > >> > RemoteLogSegmentMetadata >> > >> > and LogSegmentData? >> > 5102.8 Could we document whether endOffset in RemoteLogSegmentMetadata >> is >> > inclusive/exclusive? >> > >> > 5103. configs: >> > 5103.1 Could we define the default value of non-required configs (e.g >> the >> > size of new thread pools)? >> > 5103.2 It seems that local.log.retention.ms should default to >> > >> > retention.ms, >> > >> > instead of remote.log.retention.minutes. Similarly, it seems that >> > local.log.retention.bytes should default to segment.bytes. 5103.3 >> > remote.log.manager.thread.pool.size: The description says "used in >> > scheduling tasks to copy segments, fetch remote log indexes and clean up >> > remote log segments". However, there is a separate config >> > remote.log.reader.threads for fetching remote data. It's weird to fetch >> > remote index and log in different thread pools since both are used for >> > serving fetch requests. >> > 5103.4 remote.log.manager.task.interval.ms: Is that the amount of time >> > >> > to >> > >> > back off when there is no work to do? If so, perhaps it can be renamed >> as >> > backoff.ms. >> > 5103.5 Are rlm_process_interval_ms and rlm_retry_interval_ms configs? If >> > so, they need to be listed in this section. >> > >> > 5104. "RLM maintains a bounded cache(possibly LRU) of the index files of >> > remote log segments to avoid multiple index fetches from the remote >> > storage." Is the RLM in memory or on disk? If on disk, where is it >> > >> > stored? >> > >> > Do we need a configuration to bound the size? >> > >> > 5105. The KIP uses local-log-start-offset and Earliest Local Offset in >> > different places. It would be useful to standardize the terminology. >> > >> > 5106. The section on "In BuildingRemoteLogAux state". It listed two >> > >> > options >> > >> > without saying which option is chosen. >> > >> > 5107. Follower to leader transition: It has step 2, but not step 1. >> > >> > 5108. If a consumer fetches from the remote data and the remote storage >> > >> > is >> > >> > not available, what error code is used in the fetch response? >> > >> > 5109. "ListOffsets: For timestamps >= 0, it returns the first message >> > offset whose timestamp is >= to the given timestamp in the request. That >> > means it checks in remote log time indexes first, after which local log >> > time indexes are checked." Could you document which method in RLMM is >> > >> > used >> > >> > for this? >> > >> > 5110. 
Stopreplica: "it sets all the remote log segment metadata of that >> > partition with a delete marker and publishes them to RLMM." This seems >> > outdated given the new topic deletion logic. >> > >> > 5111. "RLM follower fetches the earliest offset for the earliest leader >> > epoch by calling RLMM.earliestLogOffset(TopicPartition topicPartition, >> > >> > int >> > >> > leaderEpoch) and updates that as the log start offset." Do we need that >> > since replication propagates logStartOffset already? >> > >> > 5112. Is the default maxWaitMs of 500ms enough for fetching from remote >> > storage? >> > >> > 5113. "Committed offsets can be stored in a local file to avoid reading >> > >> > the >> > >> > messages again when a broker is restarted." Could you describe the >> format >> > and the location of the file? Also, could the same message be processed >> > >> > by >> > >> > RLMM again after broker restart? If so, how do we handle that? >> > >> > 5114. Message format >> > 5114.1 There are two records named RemoteLogSegmentMetadataRecord with >> > apiKey 0 and 1. >> > 5114.2 RemoteLogSegmentMetadataRecord: Could we document whether >> > >> > endOffset >> > >> > is inclusive/exclusive? >> > 5114.3 RemoteLogSegmentMetadataRecord: Could you explain LeaderEpoch a >> > >> > bit >> > >> > more? Is that the epoch of the leader when it copies the segment to >> > >> > remote >> > >> > storage? Also, how will this field be used? >> > 5114.4 EventTimestamp: Could you explain this a bit more? Each record in >> > Kafka already has a timestamp field. Could we just use that? 5114.5 >> > SegmentSizeInBytes: Could this just be int32? >> > >> > 5115. RemoteLogCleaner(RLC): This could be confused with the log cleaner >> > for compaction. Perhaps it can be renamed to sth like >> > RemotePartitionRemover. >> > >> > 5116. "RLC receives the delete_partition_marked and processes it if it >> is >> > not yet processed earlier." How does it know whether >> > delete_partition_marked has been processed earlier? >> > >> > 5117. Should we add a new MessageFormatter to read the tier metadata >> > >> > topic? >> > >> > 5118. "Maximum remote log reader thread pool task queue size. If the >> task >> > queue is full, broker will stop reading remote log segments." What do we >> > return to the fetch request in this case? >> > >> > 5119. It would be useful to list all things not supported in the first >> > version in a Future work or Limitations section. For example, compacted >> > topic, JBOD, changing remote.log.storage.enable from true to false, etc. >> > >> > Thanks, >> > >> > Jun >> > >> > On Tue, Oct 27, 2020 at 5:57 PM Kowshik Prakasam < >> kpraka...@confluent.io >> > >> > wrote: >> > >> > Hi Satish, >> > >> > Thanks for the updates to the KIP. Here are my first batch of >> > comments/suggestions on the latest version of the KIP. >> > >> > 5012. In the RemoteStorageManager interface, there is an API defined >> > >> > for >> > >> > each file type. For example, fetchOffsetIndex, fetchTimestampIndex >> > >> > etc. To >> > >> > avoid the duplication, I'd suggest we can instead have a FileType enum >> > >> > and >> > >> > a common get API based on the FileType. >> > >> > 5013. There are some references to the Google doc in the KIP. I wasn't >> > >> > sure >> > >> > if the Google doc is expected to be in sync with the contents of the >> > >> > wiki. >> > >> > Going forward, it seems easier if just the KIP is maintained as the >> > >> > source >> > >> > of truth. 
In this regard, could you please move all the references to >> > >> > the >> > >> > Google doc, maybe to a separate References section at the bottom of the >> > KIP? >> > >> > 5014. There are some TODO sections in the KIP. Would these be filled >> > >> > up in >> > >> > future iterations? >> > >> > 5015. Under "Topic deletion lifecycle", I'm trying to understand why >> > >> > do we >> > >> > need delete_partition_marked as well as the delete_partition_started >> > messages. I couldn't spot a drawback if supposing we simplified the >> > >> > design >> > >> > such that the controller would only write delete_partition_started >> > >> > message, >> > >> > and RemoteLogCleaner (RLC) instance picks it up for processing. What >> > >> > am I >> > >> > missing? >> > >> > 5016. Under "Topic deletion lifecycle", step (4) is mentioned as "RLC >> > >> > gets >> > >> > all the remote log segments for the partition and each of these remote >> > >> > log >> > >> > segments is deleted with the next steps.". Since the RLC instance runs >> > >> > on >> > >> > each tier topic partition leader, how does the RLC then get the list of >> > remote log segments to be deleted? It will be useful to add that >> > >> > detail to >> > >> > the KIP. >> > >> > 5017. Under "Public Interfaces -> Configs", there is a line mentioning >> > >> > "We >> > >> > will support flipping remote.log.storage.enable in next versions." It >> > >> > will >> > >> > be useful to mention this in the "Future Work" section of the KIP too. >> > >> > 5018. The KIP introduces a number of configuration parameters. It will >> > >> > be >> > >> > useful to mention in the KIP if the user should assume these as static >> > configuration in the server.properties file, or dynamic configuration >> > >> > which >> > >> > can be modified without restarting the broker. >> > >> > 5019. Maybe this is planned as a future update to the KIP, but I >> > >> > thought >> > >> > I'd mention it here. Could you please add details to the KIP on why >> > >> > RocksDB >> > >> > was chosen as the default cache implementation of RLMM, and how it is >> > >> > going >> > >> > to be used? Were alternatives compared/considered? For example, it >> > >> > would be >> > >> > useful to explain/evaluate the following: 1) debuggability of the >> > >> > RocksDB >> > >> > JNI interface, 2) performance, 3) portability across platforms and 4) >> > interface parity of RocksDB’s JNI api with it's underlying C/C++ api. >> > >> > 5020. Following up on (5019), for the RocksDB cache, it will be useful >> > >> > to >> > >> > explain the relationship/mapping between the following in the KIP: 1) >> > >> > # of >> > >> > tiered partitions, 2) # of partitions of metadata topic >> > __remote_log_metadata and 3) # of RocksDB instances. i.e. is the plan >> > >> > to >> > >> > have a RocksDB instance per tiered partition, or per metadata topic >> > partition, or just 1 for per broker? >> > >> > 5021. I was looking at the implementation prototype (PR link: https:// >> > github.com/apache/kafka/pull/7561). It seems that a boolean attribute >> is >> > being introduced into the Log layer to check if remote log capability is >> > enabled. While the boolean footprint is small at the >> > >> > moment, >> > >> > this can easily grow in the future and become harder to test/maintain, >> > considering that the Log layer is already pretty >> > >> > complex. 
We >> > >> > should start thinking about how to manage such changes to the Log layer >> > (for the purpose of improved testability, better separation of >> > >> > concerns and >> > >> > readability). One proposal I have is to take a step back and define a >> > higher level Log interface. Then, the Broker code can be changed to use >> > this interface. It can be changed such that only a handle to the >> > >> > interface >> > >> > is exposed to other components (such as LogCleaner, ReplicaManager >> > >> > etc.) >> > >> > and not the underlying Log object. This approach keeps the user of the >> > >> > Log >> > >> > layer agnostic of the whereabouts of the data. Underneath the >> > >> > interface, >> > >> > the implementing classes can completely separate local log capabilities >> > from the remote log. For example, the Log class can be simplified to >> > >> > only >> > >> > manage logic surrounding local log segments and metadata. >> > >> > Additionally, a >> > >> > wrapper class can be provided (implementing the higher level Log >> > >> > interface) >> > >> > which will contain any/all logic surrounding tiered data. The wrapper >> > class will wrap around an instance of the Log class delegating the >> > >> > local >> > >> > log logic to it. Finally, a handle to the wrapper class can be exposed >> > >> > to >> > >> > the other components wherever they need a handle to the higher level >> > >> > Log >> > >> > interface. >> > >> > Cheers, >> > Kowshik >> > >> > On Mon, Oct 26, 2020 at 9:52 PM Satish Duggana < >> > >> > satish.dugg...@gmail.com> >> > >> > wrote: >> > >> > Hi, >> > KIP is updated with 1) topic deletion lifecycle and its related items >> > 2) Protocol changes(mainly related to ListOffsets) and other minor >> > changes. >> > Please go through them and let us know your comments. >> > >> > Thanks, >> > Satish. >> > >> > On Mon, Sep 28, 2020 at 9:10 PM Satish Duggana < >> > >> > satish.dugg...@gmail.com >> > >> > wrote: >> > >> > Hi Dhruvil, >> > Thanks for looking into the KIP and sending your comments. Sorry >> > >> > for >> > >> > the late reply, missed it in the mail thread. >> > >> > 1. Could you describe how retention would work with this KIP and >> > >> > which >> > >> > threads are responsible for driving this work? I believe there are >> > >> > 3 >> > >> > kinds >> > >> > of retention processes we are looking at: >> > (a) Regular retention for data in tiered storage as per >> > >> > configured ` >> > >> > retention.ms` / `retention.bytes`. >> > (b) Local retention for data in local storage as per configured ` local. >> > log.retention.ms` / `local.log.retention.bytes` >> > (c) Possibly regular retention for data in local storage, if the >> > >> > tiering >> > >> > task is lagging or for data that is below the log start offset. >> > >> > Local log retention is done by the existing log cleanup tasks. >> > >> > These >> > >> > are not done for segments that are not yet copied to remote >> > >> > storage. >> > >> > Remote log cleanup is done by the leader partition’s RLMTask. >> > >> > 2. When does a segment become eligible to be tiered? Is it as soon >> > >> > as >> > >> > the >> > >> > segment is rolled and the end offset is less than the last stable >> > >> > offset >> > >> > as >> > >> > mentioned in the KIP? I wonder if we need to consider other >> > >> > parameters >> > >> > too, >> > >> > like the highwatermark so that we are guaranteed that what we are >> > >> > tiering >> > >> > has been committed to the log and accepted by the ISR. 
>> > >> > AFAIK, last stable offset is always <= highwatermark. This will >> > >> > make >> > >> > sure we are always tiering the message segments which have been accepted >> > by ISR and transactionally completed. >> > >> > 3. The section on "Follower Fetch Scenarios" is useful but is a bit >> > difficult to parse at the moment. It would be useful to summarize >> > >> > the >> > >> > changes we need in the ReplicaFetcher. >> > >> > It may become difficult for users to read/follow if we add code >> > >> > changes >> > >> > here. >> > >> > 4. Related to the above, it's a bit unclear how we are planning on >> > restoring the producer state for a new replica. Could you expand on >> > >> > that? >> > >> > It is mentioned in the KIP BuildingRemoteLogAuxState is introduced >> > >> > to >> > >> > build the state like leader epoch sequence and producer snapshots before >> > it starts fetching the data from the leader. We will make it clear in >> the >> > KIP. >> > >> > 5. Similarly, it would be worth summarizing the behavior on unclean >> > >> > leader >> > >> > election. There are several scenarios to consider here: data loss >> > >> > from >> > >> > local log, data loss from remote log, data loss from metadata >> > >> > topic, >> > >> > etc. >> > >> > It's worth describing these in detail. >> > >> > We mentioned the cases about unclean leader election in the >> > >> > follower >> > >> > fetch scenarios. >> > If there are errors while fetching data from remote store or >> > >> > metadata >> > >> > store, it will work the same way as it works with local log. It returns >> > the error back to the caller. Please let us know if I am missing your >> point >> > here. >> > >> > 7. For a READ_COMMITTED FetchRequest, how do we retrieve and >> > >> > return the >> > >> > aborted transaction metadata? >> > >> > When a fetch for a remote log is accessed, we will fetch aborted >> > transactions along with the segment if it is not found in the local >> index >> > cache. This includes the case of transaction index not >> > >> > existing >> > >> > in the remote log segment. That means, the cache entry can be >> > >> > empty or >> > >> > have a list of aborted transactions. >> > >> > 8. The `LogSegmentData` class assumes that we have a log segment, >> > >> > offset >> > >> > index, time index, transaction index, producer snapshot and leader >> > >> > epoch >> > >> > index. How do we deal with cases where we do not have one or more >> > >> > of >> > >> > these? >> > >> > For example, we may not have a transaction index or producer >> > >> > snapshot >> > >> > for a >> > >> > particular segment. The former is optional, and the latter is only >> > >> > kept >> > >> > for >> > >> > up to the 3 latest segments. >> > >> > This is a good point, we discussed this in the last meeting. Transaction >> > index is optional and we will copy them only if it >> > >> > exists. >> > >> > We want to keep all the producer snapshots at each log segment >> > >> > rolling >> > >> > and they can be removed if the log copying is successful and it >> > >> > still >> > >> > maintains the existing latest 3 segments, We only delete the >> > >> > producer >> > >> > snapshots which have been copied to remote log segments on leader. >> > Follower will keep the log segments beyond the segments which have >> > >> > not >> > >> > been copied to remote storage. We will update the KIP with these >> details. >> > >> > Thanks, >> > Satish. 
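As a small illustration of points 7 and 8 above (the transaction index being optional and the leader epoch index being generated in memory), here is a hedged Java sketch of what a LogSegmentData-style holder could look like. The class shape, field names and accessor are assumptions for illustration, not the KIP's final definition.

// Illustrative sketch only -- names and types are assumptions, not the KIP's final API.
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.util.Optional;

final class LogSegmentDataSketch {
    private final Path logSegment;
    private final Path offsetIndex;
    private final Path timeIndex;
    private final Optional<Path> transactionIndex; // optional: copied only if the segment has one
    private final Path producerSnapshot;           // snapshot retained for each rolled segment
    private final ByteBuffer leaderEpochIndex;     // generated in memory, so passed as bytes

    LogSegmentDataSketch(Path logSegment, Path offsetIndex, Path timeIndex,
                         Optional<Path> transactionIndex, Path producerSnapshot,
                         ByteBuffer leaderEpochIndex) {
        this.logSegment = logSegment;
        this.offsetIndex = offsetIndex;
        this.timeIndex = timeIndex;
        this.transactionIndex = transactionIndex;
        this.producerSnapshot = producerSnapshot;
        this.leaderEpochIndex = leaderEpochIndex;
    }

    // Absent when the segment had no transaction index to copy, i.e. there are
    // no aborted transactions to filter for a READ_COMMITTED fetch of this segment.
    Optional<Path> transactionIndex() {
        return transactionIndex;
    }
}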
>> > >> > On Thu, Sep 17, 2020 at 1:47 AM Dhruvil Shah <dhru...@confluent.io >> > >> > wrote: >> > >> > Hi Satish, Harsha, >> > >> > Thanks for the KIP. Few questions below: >> > >> > 1. Could you describe how retention would work with this KIP and >> > >> > which >> > >> > threads are responsible for driving this work? I believe there >> > >> > are 3 >> > >> > kinds >> > >> > of retention processes we are looking at: >> > (a) Regular retention for data in tiered storage as per >> > >> > configured >> > >> > ` >> > >> > retention.ms` / `retention.bytes`. >> > (b) Local retention for data in local storage as per >> > >> > configured ` >> > >> > local.log.retention.ms` / `local.log.retention.bytes` >> > (c) Possibly regular retention for data in local storage, if >> > >> > the >> > >> > tiering >> > >> > task is lagging or for data that is below the log start offset. >> > >> > 2. When does a segment become eligible to be tiered? Is it as >> > >> > soon as >> > >> > the >> > >> > segment is rolled and the end offset is less than the last stable >> > >> > offset as >> > >> > mentioned in the KIP? I wonder if we need to consider other >> > >> > parameters >> > >> > too, >> > >> > like the highwatermark so that we are guaranteed that what we are >> > >> > tiering >> > >> > has been committed to the log and accepted by the ISR. >> > >> > 3. The section on "Follower Fetch Scenarios" is useful but is a >> > >> > bit >> > >> > difficult to parse at the moment. It would be useful to >> > >> > summarize the >> > >> > changes we need in the ReplicaFetcher. >> > >> > 4. Related to the above, it's a bit unclear how we are planning >> > >> > on >> > >> > restoring the producer state for a new replica. Could you expand >> > >> > on >> > >> > that? >> > >> > 5. Similarly, it would be worth summarizing the behavior on >> > >> > unclean >> > >> > leader >> > >> > election. There are several scenarios to consider here: data loss >> > >> > from >> > >> > local log, data loss from remote log, data loss from metadata >> > >> > topic, >> > >> > etc. >> > >> > It's worth describing these in detail. >> > >> > 6. It would be useful to add details about how we plan on using >> > >> > RocksDB in >> > >> > the default implementation of `RemoteLogMetadataManager`. >> > >> > 7. For a READ_COMMITTED FetchRequest, how do we retrieve and >> > >> > return >> > >> > the >> > >> > aborted transaction metadata? >> > >> > 8. The `LogSegmentData` class assumes that we have a log segment, >> > >> > offset >> > >> > index, time index, transaction index, producer snapshot and >> > >> > leader >> > >> > epoch >> > >> > index. How do we deal with cases where we do not have one or >> > >> > more of >> > >> > these? >> > >> > For example, we may not have a transaction index or producer >> > >> > snapshot >> > >> > for a >> > >> > particular segment. The former is optional, and the latter is >> > >> > only >> > >> > kept for >> > >> > up to the 3 latest segments. >> > >> > Thanks, >> > Dhruvil >> > >> > On Mon, Sep 7, 2020 at 6:54 PM Harsha Ch <harsha...@gmail.com> >> > >> > wrote: >> > >> > Hi All, >> > >> > We are all working through the last meeting feedback. I'll >> > >> > cancel >> > >> > the >> > >> > tomorrow 's meeting and we can meanwhile continue our >> > >> > discussion in >> > >> > mailing >> > >> > list. We can start the regular meeting from next week onwards. 
>> > >> > Thanks, >> > >> > Harsha >> > >> > On Fri, Sep 04, 2020 at 8:41 AM, Satish Duggana < >> > >> > satish.dugg...@gmail.com >> > >> > wrote: >> > >> > Hi Jun, >> > Thanks for your thorough review and comments. Please find the >> > >> > inline >> > >> > replies below. >> > >> > 600. The topic deletion logic needs more details. >> > 600.1 The KIP mentions "The controller considers the topic >> > >> > partition is >> > >> > deleted only when it determines that there are no log >> > >> > segments >> > >> > for >> > >> > that >> > >> > topic partition by using RLMM". How is this done? >> > >> > It uses RLMM#listSegments() returns all the segments for the >> > >> > given >> > >> > topic >> > >> > partition. >> > >> > 600.2 "If the delete option is enabled then the leader will >> > >> > stop >> > >> > RLM task >> > >> > and stop processing and it sets all the remote log segment >> > >> > metadata of >> > >> > that partition with a delete marker and publishes them to >> > >> > RLMM." >> > >> > We >> > >> > discussed this earlier. When a topic is being deleted, there >> > >> > may >> > >> > not be a >> > >> > leader for the deleted partition. >> > >> > This is a good point. As suggested in the meeting, we will >> > >> > add a >> > >> > separate >> > >> > section for topic/partition deletion lifecycle and this >> > >> > scenario >> > >> > will be >> > >> > addressed. >> > >> > 601. Unclean leader election >> > 601.1 Scenario 1: new empty follower >> > After step 1, the follower restores up to offset 3. So why >> > >> > does >> > >> > it >> > >> > have >> > >> > LE-2 <https://issues.apache.org/jira/browse/LE-2> < >> https://issues.apache.org/jira/browse/LE-2> at offset >> > >> > 5? >> > >> > Nice catch. It was showing the leader epoch fetched from the >> > >> > remote >> > >> > storage. It should be shown with the truncated till offset 3. >> > >> > Updated the >> > >> > KIP. >> > >> > 601.2 senario 5: After Step 3, leader A has inconsistent data >> > >> > between its >> > >> > local and the tiered data. For example. offset 3 has msg 3 >> > >> > LE-0 <https://issues.apache.org/jira/browse/LE-0> >> > >> > <https://issues.apache.org/jira/browse/LE-0> locally, >> > >> > but msg 5 LE-1 <https://issues.apache.org/jira/browse/LE-1> < >> https://issues.apache.org/jira/browse/LE-1> >> > >> > in >> > >> > the remote store. While it's ok for the unclean leader >> > >> > to lose data, it should still return consistent data, whether >> > >> > it's >> > >> > from >> > >> > the local or the remote store. >> > >> > There is no inconsistency here as LE-0 >> <https://issues.apache.org/jira/browse/LE-0> >> > >> > <https://issues.apache.org/jira/browse/LE-0> offsets are [0, 4] and >> > >> > LE-2 <https://issues.apache.org/jira/browse/LE-2> >> > >> > <https://issues.apache.org/jira/browse/LE-2>: >> > >> > [5, ]. It will always get the right records for the given >> > >> > offset >> > >> > and >> > >> > leader epoch. In case of remote, RSM is invoked to get the >> > >> > remote >> > >> > log >> > >> > segment that contains the given offset with the leader epoch. >> > >> > 601.4 It seems that retention is based on >> > listRemoteLogSegments(TopicPartition topicPartition, long >> > >> > leaderEpoch). >> > >> > When there is an unclean leader election, it's possible for >> > >> > the >> > >> > new >> > >> > leader >> > >> > to not to include certain epochs in its epoch cache. How are >> > >> > remote >> > >> > segments associated with those epochs being cleaned? >> > >> > That is a good point. 
This leader will also cleanup the >> > >> > epochs >> > >> > earlier to >> > >> > its start leader epoch and delete those segments. It gets the >> > >> > earliest >> > >> > epoch for a partition and starts deleting segments from that >> > >> > leader >> > >> > epoch. >> > >> > We need one more API in RLMM to get the earliest leader >> > >> > epoch. >> > >> > 601.5 The KIP discusses the handling of unclean leader >> > >> > elections >> > >> > for user >> > >> > topics. What about unclean leader elections on >> > __remote_log_segment_metadata? >> > This is the same as other system topics like >> > >> > consumer_offsets, >> > >> > __transaction_state topics. As discussed in the meeting, we >> > >> > will >> > >> > add the >> > >> > behavior of __remote_log_segment_metadata topic’s unclean >> > >> > leader >> > >> > truncation. >> > >> > 602. It would be useful to clarify the limitations in the >> > >> > initial >> > >> > release. >> > >> > The KIP mentions not supporting compacted topics. What about >> > >> > JBOD >> > >> > and >> > >> > changing the configuration of a topic from delete to compact >> > >> > after >> > >> > remote. >> > >> > log. storage. enable ( http://remote.log.storage.enable/ ) >> > >> > is >> > >> > enabled? >> > >> > This was updated in the KIP earlier. >> > >> > 603. RLM leader tasks: >> > 603.1"It checks for rolled over LogSegments (which have the >> > >> > last >> > >> > message >> > >> > offset less than last stable offset of that topic partition) >> > >> > and >> > >> > copies >> > >> > them along with their offset/time/transaction indexes and >> > >> > leader >> > >> > epoch >> > >> > cache to the remote tier." It needs to copy the producer >> > >> > snapshot >> > >> > too. >> > >> > Right. It copies producer snapshots too as mentioned in >> > >> > LogSegmentData. >> > >> > 603.2 "Local logs are not cleaned up till those segments are >> > >> > copied >> > >> > successfully to remote even though their retention time/size >> > >> > is >> > >> > reached" >> > >> > This seems weird. If the tiering stops because the remote >> > >> > store >> > >> > is >> > >> > not >> > >> > available, we don't want the local data to grow forever. >> > >> > It was clarified in the discussion that the comment was more >> > >> > about >> > >> > the >> > >> > local storage goes beyond the log.retention. The above >> > >> > statement >> > >> > is about >> > >> > local.log.retention but not for the complete log.retention. >> > >> > When >> > >> > it >> > >> > reaches the log.retention then it will delete the local logs >> > >> > even >> > >> > though >> > >> > those are not copied to remote storage. >> > >> > 604. "RLM maintains a bounded cache(possibly LRU) of the >> > >> > index >> > >> > files of >> > >> > remote log segments to avoid multiple index fetches from the >> > >> > remote >> > >> > storage. These indexes can be used in the same way as local >> > >> > segment >> > >> > indexes are used." Could you provide more details on this? >> > >> > Are >> > >> > the >> > >> > indexes >> > >> > cached in memory or on disk? If on disk, where are they >> > >> > stored? >> > >> > Are the >> > >> > cached indexes bound by a certain size? >> > >> > These are cached on disk and stored in log.dir with a name >> > “__remote_log_index_cache”. They are bound by the total size. >> > >> > This >> > >> > will >> > >> > be >> > >> > exposed as a user configuration, >> > >> > 605. BuildingRemoteLogAux >> > 605.1 In this section, two options are listed. Which one is >> > >> > chosen? 
>> > >> > Option-2, updated the KIP. >> > >> > 605.2 In option 2, it says "Build the local leader epoch >> > >> > cache by >> > >> > cutting >> > >> > the leader epoch sequence received from remote storage to >> > >> > [LSO, >> > >> > ELO]. >> > >> > (LSO >> > >> > = log start offset)." We need to do the same thing for the >> > >> > producer >> > >> > snapshot. However, it's hard to cut the producer snapshot to >> > >> > an >> > >> > earlier >> > >> > offset. Another option is to simply take the lastOffset from >> > >> > the >> > >> > remote >> > >> > segment and use that as the starting fetch offset in the >> > >> > follower. >> > >> > This >> > >> > avoids the need for cutting. >> > >> > Right, this was mentioned in the “transactional support” >> > >> > section >> > >> > about >> > >> > adding these details. >> > >> > 606. ListOffsets: Since we need a version bump, could you >> > >> > document >> > >> > it >> > >> > under a protocol change section? >> > >> > Sure, we will update the KIP. >> > >> > 607. "LogStartOffset of a topic can point to either of local >> > >> > segment or >> > >> > remote segment but it is initialised and maintained in the >> > >> > Log >> > >> > class like >> > >> > now. This is already maintained in `Log` class while loading >> > >> > the >> > >> > logs and >> > >> > it can also be fetched from RemoteLogMetadataManager." What >> > >> > will >> > >> > happen >> > >> > to >> > >> > the existing logic (e.g. log recovery) that currently >> > >> > depends on >> > >> > logStartOffset but assumes it's local? >> > >> > They use a field called localLogStartOffset which is the >> > >> > local >> > >> > log >> > >> > start >> > >> > offset.. >> > >> > 608. Handle expired remote segment: How does it pick up new >> > >> > logStartOffset >> > >> > from deleteRecords? >> > >> > Good point. This was not addressed in the KIP. Will update >> > >> > the >> > >> > KIP >> > >> > on how >> > >> > the RLM task handles this scenario. >> > >> > 609. RLMM message format: >> > 609.1 It includes both MaxTimestamp and EventTimestamp. Where >> > >> > does >> > >> > it get >> > >> > both since the message in the log only contains one >> > >> > timestamp? >> > >> > `EventTimeStamp` is the timestamp at which that segment >> > >> > metadata >> > >> > event is >> > >> > generated. This is more for audits. >> > >> > 609.2 If we change just the state (e.g. to DELETE_STARTED), >> > >> > it >> > >> > seems it's >> > >> > wasteful to have to include all other fields not changed. >> > >> > This is a good point. We thought about incremental updates. >> > >> > But >> > >> > we >> > >> > want >> > >> > to >> > >> > make sure all the events are in the expected order and take >> > >> > action >> > >> > based >> > >> > on the latest event. Will think through the approaches in >> > >> > detail >> > >> > and >> > >> > update here. >> > >> > 609.3 Could you document which process makes the following >> > >> > transitions >> > >> > DELETE_MARKED, DELETE_STARTED, DELETE_FINISHED? >> > >> > Okay, will document more details. >> > >> > 610. remote.log.reader.max.pending.tasks: "Maximum remote log >> > >> > reader >> > >> > thread pool task queue size. If the task queue is full, >> > >> > broker >> > >> > will stop >> > >> > reading remote log segments." What does the broker do if the >> > >> > queue >> > >> > is >> > >> > full? >> > >> > It returns an error for this topic partition. >> > >> > 611. What do we return if the request offset/epoch doesn't >> > >> > exist >> > >> > in the >> > >> > following API? 
>> > RemoteLogSegmentMetadata >> > >> > remoteLogSegmentMetadata(TopicPartition >> > >> > topicPartition, long offset, int epochForOffset) >> > >> > This returns null. But we prefer to update the return type as >> > >> > Optional >> > >> > and >> > >> > return Empty if that does not exist. >> > >> > Thanks, >> > Satish. >> > >> > On Tue, Sep 1, 2020 at 9:45 AM Jun Rao < jun@ confluent. io >> > >> > ( >> > >> > j...@confluent.io ) > wrote: >> > >> > Hi, Satish, >> > >> > Thanks for the updated KIP. Made another pass. A few more >> > >> > comments >> > >> > below. >> > >> > 600. The topic deletion logic needs more details. >> > 600.1 The KIP mentions "The controller considers the topic >> > >> > partition is >> > >> > deleted only when it determines that there are no log >> > >> > segments >> > >> > for that >> > >> > topic partition by using RLMM". How is this done? 600.2 "If >> > >> > the >> > >> > delete >> > >> > option is enabled then the leader will stop RLM task and >> > >> > stop >> > >> > processing >> > >> > and it sets all the remote log segment metadata of that >> > >> > partition >> > >> > with a >> > >> > delete marker and publishes them to RLMM." We discussed this >> > >> > earlier. >> > >> > When >> > >> > a topic is being deleted, there may not be a leader for the >> > >> > deleted >> > >> > partition. >> > >> > 601. Unclean leader election >> > 601.1 Scenario 1: new empty follower >> > After step 1, the follower restores up to offset 3. So why >> > >> > does >> > >> > it have >> > >> > LE-2 <https://issues.apache.org/jira/browse/LE-2> < >> https://issues.apache.org/jira/browse/LE-2> at >> > >> > offset 5? >> > >> > 601.2 senario 5: After Step 3, leader A has inconsistent >> > >> > data >> > >> > between >> > >> > its >> > >> > local and the tiered data. For example. offset 3 has msg 3 >> > >> > LE-0 <https://issues.apache.org/jira/browse/LE-0> >> > >> > <https://issues.apache.org/jira/browse/LE-0> locally, >> > >> > but msg 5 LE-1 <https://issues.apache.org/jira/browse/LE-1> < >> https://issues.apache.org/jira/browse/LE-1> >> > >> > in >> > >> > the remote store. While it's ok for the unclean leader >> > >> > to lose data, it should still return consistent data, >> > >> > whether >> > >> > it's from >> > >> > the local or the remote store. >> > 601.3 The follower picks up log start offset using the >> > >> > following >> > >> > api. >> > >> > Suppose that we have 3 remote segments (LE, >> > >> > SegmentStartOffset) >> > >> > as (2, >> > >> > 10), >> > (3, 20) and (7, 15) due to an unclean leader election. >> > >> > Using the >> > >> > following >> > >> > api will cause logStartOffset to go backward from 20 to 15. >> > >> > How >> > >> > do we >> > >> > prevent that? >> > earliestLogOffset(TopicPartition topicPartition, int >> > >> > leaderEpoch) >> > >> > 601.4 >> > >> > It >> > >> > seems that retention is based on >> > listRemoteLogSegments(TopicPartition topicPartition, long >> > >> > leaderEpoch). >> > >> > When there is an unclean leader election, it's possible for >> > >> > the >> > >> > new >> > >> > leader >> > >> > to not to include certain epochs in its epoch cache. How are >> > >> > remote >> > >> > segments associated with those epochs being cleaned? 601.5 >> > >> > The >> > >> > KIP >> > >> > discusses the handling of unclean leader elections for user >> > >> > topics. What >> > >> > about unclean leader elections on >> > __remote_log_segment_metadata? >> > >> > 602. It would be useful to clarify the limitations in the >> > >> > initial >> > >> > release. 
>> > >> > The KIP mentions not supporting compacted topics. What about >> > >> > JBOD >> > >> > and >> > >> > changing the configuration of a topic from delete to compact >> > >> > after >> > >> > remote. >> > >> > log. storage. enable ( http://remote.log.storage.enable/ ) >> > >> > is >> > >> > enabled? >> > >> > 603. RLM leader tasks: >> > 603.1"It checks for rolled over LogSegments (which have the >> > >> > last >> > >> > message >> > >> > offset less than last stable offset of that topic >> > >> > partition) and >> > >> > copies >> > >> > them along with their offset/time/transaction indexes and >> > >> > leader >> > >> > epoch >> > >> > cache to the remote tier." It needs to copy the producer >> > >> > snapshot >> > >> > too. >> > >> > 603.2 "Local logs are not cleaned up till those segments are >> > >> > copied >> > >> > successfully to remote even though their retention >> > >> > time/size is >> > >> > reached" >> > >> > This seems weird. If the tiering stops because the remote >> > >> > store >> > >> > is not >> > >> > available, we don't want the local data to grow forever. >> > >> > 604. "RLM maintains a bounded cache(possibly LRU) of the >> > >> > index >> > >> > files of >> > >> > remote log segments to avoid multiple index fetches from the >> > >> > remote >> > >> > storage. These indexes can be used in the same way as local >> > >> > segment >> > >> > indexes are used." Could you provide more details on this? >> > >> > Are >> > >> > the >> > >> > indexes >> > >> > cached in memory or on disk? If on disk, where are they >> > >> > stored? >> > >> > Are the >> > >> > cached indexes bound by a certain size? >> > >> > 605. BuildingRemoteLogAux >> > 605.1 In this section, two options are listed. Which one is >> > >> > chosen? >> > >> > 605.2 >> > >> > In option 2, it says "Build the local leader epoch cache by >> > >> > cutting the >> > >> > leader epoch sequence received from remote storage to [LSO, >> > >> > ELO]. >> > >> > (LSO >> > >> > = log start offset)." We need to do the same thing for the >> > >> > producer >> > >> > snapshot. However, it's hard to cut the producer snapshot >> > >> > to an >> > >> > earlier >> > >> > offset. Another option is to simply take the lastOffset >> > >> > from the >> > >> > remote >> > >> > segment and use that as the starting fetch offset in the >> > >> > follower. This >> > >> > avoids the need for cutting. >> > >> > 606. ListOffsets: Since we need a version bump, could you >> > >> > document it >> > >> > under a protocol change section? >> > >> > 607. "LogStartOffset of a topic can point to either of local >> > >> > segment or >> > >> > remote segment but it is initialised and maintained in the >> > >> > Log >> > >> > class >> > >> > like >> > >> > now. This is already maintained in `Log` class while >> > >> > loading the >> > >> > logs >> > >> > and >> > >> > it can also be fetched from RemoteLogMetadataManager." What >> > >> > will >> > >> > happen >> > >> > to >> > >> > the existing logic (e.g. log recovery) that currently >> > >> > depends on >> > >> > logStartOffset but assumes it's local? >> > >> > 608. Handle expired remote segment: How does it pick up new >> > >> > logStartOffset >> > >> > from deleteRecords? >> > >> > 609. RLMM message format: >> > 609.1 It includes both MaxTimestamp and EventTimestamp. >> > >> > Where >> > >> > does it >> > >> > get >> > >> > both since the message in the log only contains one >> > >> > timestamp? >> > >> > 609.2 If >> > >> > we >> > >> > change just the state (e.g. 
to DELETE_STARTED), it seems >> > >> > it's >> > >> > wasteful >> > >> > to >> > >> > have to include all other fields not changed. 609.3 Could >> > >> > you >> > >> > document >> > >> > which process makes the following transitions DELETE_MARKED, >> > DELETE_STARTED, DELETE_FINISHED? >> > >> > 610. remote.log.reader.max.pending.tasks: "Maximum remote >> > >> > log >> > >> > reader >> > >> > thread pool task queue size. If the task queue is full, >> > >> > broker >> > >> > will stop >> > >> > reading remote log segments." What does the broker do if the >> > >> > queue is >> > >> > full? >> > >> > 611. What do we return if the request offset/epoch doesn't >> > >> > exist >> > >> > in the >> > >> > following API? >> > RemoteLogSegmentMetadata >> > >> > remoteLogSegmentMetadata(TopicPartition >> > >> > topicPartition, long offset, int epochForOffset) >> > >> > Jun >> > >> > On Mon, Aug 31, 2020 at 11:19 AM Satish Duggana < satish. >> > >> > duggana@ >> > >> > gmail. com >> > >> > ( satish.dugg...@gmail.com ) > wrote: >> > >> > KIP is updated with >> > - Remote log segment metadata topic message format/schema. >> > - Added remote log segment metadata state transitions and >> > >> > explained how >> > >> > the deletion of segments is handled, including the case of >> > >> > partition >> > >> > deletions. >> > - Added a few more limitations in the "Non goals" section. >> > >> > Thanks, >> > Satish. >> > >> > On Thu, Aug 27, 2020 at 12:42 AM Harsha Ch < harsha. ch@ >> > >> > gmail. >> > >> > com ( >> > >> > harsha...@gmail.com ) > wrote: >> > >> > Updated the KIP with Meeting Notes section >> > >> > https:/ / cwiki. apache. org/ confluence/ display/ KAFKA/ >> > >> > KIP-405 <https://issues.apache.org/jira/browse/KIP-405> < >> https://issues.apache.org/jira/browse/KIP-405> >> > >> > %3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-MeetingNotes >> > >> > ( >> > >> > https://cwiki.apache.org/confluence/display/KAFKA/ >> > KIP-405 <https://issues.apache.org/jira/browse/KIP-405> >> %3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-MeetingNotes >> > >> > ) >> > >> > On Tue, Aug 25, 2020 at 1:03 PM Jun Rao < jun@ >> > >> > confluent. io >> > >> > ( >> > >> > j...@confluent.io ) > wrote: >> > >> > Hi, Harsha, >> > >> > Thanks for the summary. Could you add the summary and the >> > >> > recording >> > >> > link to >> > >> > the last section of >> > >> > https:/ / cwiki. apache. org/ confluence/ display/ KAFKA/ >> > >> > Kafka+Improvement+Proposals >> > >> > ( >> > >> > https://cwiki.apache.org/confluence/display/KAFKA/ >> > Kafka+Improvement+Proposals >> > >> > ) >> > >> > ? >> > >> > Jun >> > >> > On Tue, Aug 25, 2020 at 11:12 AM Harsha Chintalapani < >> > >> > kafka@ >> > >> > harsha. io ( >> > >> > ka...@harsha.io ) > wrote: >> > >> > Thanks everyone for attending the meeting today. >> > Here is the recording >> > >> > https:/ / drive. google. com/ file/ d/ >> > >> > 14PRM7U0OopOOrJR197VlqvRX5SXNtmKj/ view?usp=sharing >> > >> > ( >> > >> > https://drive.google.com/file/d/14PRM7U0OopOOrJR197VlqvRX5SXNtmKj/ >> > view?usp=sharing >> > >> > ) >> > >> > Notes: >> > >> > 1. KIP is updated with follower fetch protocol and >> > >> > ready to >> > >> > reviewed >> > >> > 2. Satish to capture schema of internal metadata topic >> > >> > in >> > >> > the >> > >> > KIP >> > >> > 3. We will update the KIP with details of different >> > >> > cases >> > >> > 4. Test plan will be captured in a doc and will add to >> > >> > the >> > >> > KIP >> > >> > 5. 
>> > 5. Add a section "Limitations" to capture the capabilities that will be
>> > introduced with this KIP and what will not be covered in this KIP.
>> >
>> > Please add to it if I missed anything. Will produce formal meeting notes
>> > from the next meeting onwards.
>> >
>> > Thanks,
>> > Harsha
>> >
>> > On Mon, Aug 24, 2020 at 9:42 PM, Ying Zheng < yi...@uber.com.invalid > wrote:
>> >
>> > We did some basic feature tests at Uber. The test cases and results are
>> > shared in this google doc:
>> > https://docs.google.com/spreadsheets/d/1XhNJqjzwXvMCcAOhEH0sSXU6RTvyoSf93DHF-YMfGLk/edit?usp=sharing
>> >
>> > The performance test results were already shared in the KIP last month.
>> >
>> > On Mon, Aug 24, 2020 at 11:10 AM Harsha Ch < harsha...@gmail.com > wrote:
>> >
>> > "Understand commitments towards driving design & implementation of the
>> > KIP further and how it aligns with participant interests in contributing
>> > to the efforts (ex: in the context of Uber's Q3/Q4 roadmap)." What is
>> > that about?
>> >
>> > On Mon, Aug 24, 2020 at 11:05 AM Kowshik Prakasam < kpraka...@confluent.io > wrote:
>> >
>> > Hi Harsha,
>> >
>> > The following google doc contains a proposal for a temporary agenda for
>> > the KIP-405 sync meeting tomorrow:
>> > https://docs.google.com/document/d/1pqo8X5LU8TpwfC_iqSuVPezhfCfhGkbGN2TqiPA3LBU/edit
>> > Please could you add it to the Google calendar invite?
>> >
>> > Thank you.
>> >
>> > Cheers,
>> > Kowshik
>> >
>> > On Thu, Aug 20, 2020 at 10:58 AM Harsha Ch < harsha...@gmail.com > wrote:
>> >
>> > Hi All,
>> >
>> > Scheduled a meeting for Tuesday 9am - 10am. I can record and upload it
>> > for the community to be able to follow the discussion.
>> >
>> > Jun, please add the required folks on the Confluent side.
>> >
>> > Thanks,
>> > Harsha
>> >
>> > On Thu, Aug 20, 2020 at 12:33 AM, Alexandre Dupriez < alexandre.dupriez@gmail.com > wrote:
>> >
>> > Hi Jun,
>> >
>> > Many thanks for your initiative.
>> >
>> > If you like, I am happy to attend at the time you suggested.
>> >
>> > Many thanks,
>> > Alexandre
>> >
>> > On Wed, Aug 19, 2020 at 22:00, Harsha Ch < harsha...@gmail.com > wrote:
>> >
>> > Hi Jun,
>> > Thanks. This will help a lot. Tuesday will work for us.
>> > -Harsha
>> > On Wed, Aug 19, 2020 at 1:24 PM Jun Rao < j...@confluent.io > wrote:
>> >
>> > Hi, Satish, Ying, Harsha,
>> >
>> > Do you think it would be useful to have a regular virtual meeting to
>> > discuss this KIP? The goal of the meeting will be sharing
>> > design/development progress and discussing any open issues to accelerate
>> > this KIP. If so, will every Tuesday (from next week) 9am-10am PT work for
>> > you? I can help set up a Zoom meeting, invite everyone who might be
>> > interested, have it recorded and shared, etc.
>> >
>> > Thanks,
>> >
>> > Jun
>> >
>> > On Tue, Aug 18, 2020 at 11:01 AM Satish Duggana < satish.dugg...@gmail.com > wrote:
>> >
>> > Hi Kowshik,
>> >
>> > Thanks for looking into the KIP and sending your comments.
>> >
>> > 5001. Under the section "Follower fetch protocol in detail", the
>> > next-local-offset is the offset up to which the segments are copied
>> >
>>