The KIP is updated with:
- Remote log segment metadata topic message format/schema.
- Remote log segment metadata state transitions, with an explanation of how the deletion of segments is handled, including the case of partition deletions.
- A few more limitations added in the "Non-goals" section.
Thanks,
Satish.

On Thu, Aug 27, 2020 at 12:42 AM Harsha Ch <harsha...@gmail.com> wrote:

Updated the KIP with a Meeting Notes section:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-MeetingNotes

On Tue, Aug 25, 2020 at 1:03 PM Jun Rao <j...@confluent.io> wrote:

Hi, Harsha,

Thanks for the summary. Could you add the summary and the recording link to the last section of
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals ?

Jun

On Tue, Aug 25, 2020 at 11:12 AM Harsha Chintalapani <ka...@harsha.io> wrote:

Thanks everyone for attending the meeting today. Here is the recording:
https://drive.google.com/file/d/14PRM7U0OopOOrJR197VlqvRX5SXNtmKj/view?usp=sharing

Notes:
1. KIP is updated with the follower fetch protocol and is ready to be reviewed.
2. Satish to capture the schema of the internal metadata topic in the KIP.
3. We will update the KIP with details of different cases.
4. The test plan will be captured in a doc and added to the KIP.
5. Add a "Limitations" section to capture the capabilities that will be introduced with this KIP and what will not be covered in this KIP.

Please add to it if I missed anything. Will produce formal meeting notes from the next meeting onwards.

Thanks,
Harsha

On Mon, Aug 24, 2020 at 9:42 PM, Ying Zheng <yi...@uber.com.invalid> wrote:

We did some basic feature tests at Uber. The test cases and results are shared in this Google doc:
https://docs.google.com/spreadsheets/d/1XhNJqjzwXvMCcAOhEH0sSXU6RTvyoSf93DHF-YMfGLk/edit?usp=sharing

The performance test results were already shared in the KIP last month.

On Mon, Aug 24, 2020 at 11:10 AM Harsha Ch <harsha...@gmail.com> wrote:

"Understand commitments towards driving design & implementation of the KIP further and how it aligns with participant interests in contributing to the efforts (ex: in the context of Uber's Q3/Q4 roadmap)." What is that about?

On Mon, Aug 24, 2020 at 11:05 AM Kowshik Prakasam <kpraka...@confluent.io> wrote:

Hi Harsha,

The following Google doc contains a proposal for a temporary agenda for the KIP-405 sync meeting tomorrow:
https://docs.google.com/document/d/1pqo8X5LU8TpwfC_iqSuVPezhfCfhGkbGN2TqiPA3LBU/edit

Could you please add it to the Google calendar invite?

Thank you.

Cheers,
Kowshik

On Thu, Aug 20, 2020 at 10:58 AM Harsha Ch <harsha...@gmail.com> wrote:

Hi All,

Scheduled a meeting for Tuesday 9am - 10am. I can record and upload it for the community to be able to follow the discussion.

Jun, please add the required folks on the Confluent side.

Thanks,

Harsha

On Thu, Aug 20, 2020 at 12:33 AM, Alexandre Dupriez <alexandre.dupriez@gmail.com> wrote:

Hi Jun,

Many thanks for your initiative.

If you like, I am happy to attend at the time you suggested.

Many thanks,
Alexandre

On Wed, Aug 19, 2020 at 22:00, Harsha Ch <harsha.ch@gmail.com> wrote:
Hi Jun,
Thanks. This will help a lot. Tuesday will work for us.
-Harsha

On Wed, Aug 19, 2020 at 1:24 PM Jun Rao <jun@confluent.io> wrote:

Hi, Satish, Ying, Harsha,

Do you think it would be useful to have a regular virtual meeting to discuss this KIP? The goal of the meeting will be sharing design/development progress and discussing any open issues to accelerate this KIP. If so, will every Tuesday (from next week) 9am-10am PT work for you? I can help set up a Zoom meeting, invite everyone who might be interested, have it recorded and shared, etc.

Thanks,

Jun

On Tue, Aug 18, 2020 at 11:01 AM Satish Duggana <satish.duggana@gmail.com> wrote:

Hi Kowshik,

Thanks for looking into the KIP and sending your comments.

> 5001. Under the section "Follower fetch protocol in detail", the next-local-offset is the offset up to which the segments are copied to remote storage. Instead, would last-tiered-offset be a better name than next-local-offset? last-tiered-offset seems to naturally align well with the definition provided in the KIP.

Both next-local-offset and local-log-start-offset were introduced to talk about offsets related to the local log. We are fine with last-tiered-offset too, as you suggested.

> 5002. After leadership is established for a partition, the leader would begin uploading a segment to remote storage. If successful, the leader would write the updated RemoteLogSegmentMetadata to the metadata topic (via RLMM.putRemoteLogSegmentData). However, for defensive reasons, it seems useful that before the first time the segment is uploaded by the leader for a partition, the leader should ensure to catch up to all the metadata events written so far in the metadata topic for that partition (ex: by the previous leader). To achieve this, the leader could start a lease (using an establish_leader metadata event) before commencing tiering, and wait until the event is read back. For example, this seems useful to avoid cases where zombie leaders can be active for the same partition. This can also prove useful to help avoid making decisions on which segments are to be uploaded for a partition, until the current leader has caught up to a complete view of all segments uploaded for the partition so far (otherwise this may cause the same segment to be uploaded twice -- once by the previous leader and then by the new leader).

We allow copying segments to remote storage which may have common offsets. Please go through the KIP to understand the follower fetch protocol (1) and the follower to leader transition (2).

(1) https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-FollowerReplication
(2) https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition

> 5003. There is a natural interleaving between uploading a segment to the remote store, and writing a metadata event for the same (via RLMM.putRemoteLogSegmentData). There can be cases where a remote segment is uploaded, then the leader fails and a corresponding metadata event never gets written. In such cases, the orphaned remote segment has to be eventually deleted (since there is no confirmation of the upload). To handle this, we could use 2 separate metadata events, viz. copy_initiated and copy_completed, so that copy_initiated events that don't have a corresponding copy_completed event can be treated as garbage and deleted from the remote object store by the broker.

We are already updating RLMM with RemoteLogSegmentMetadata pre and post copying of log segments. We had a flag in RemoteLogSegmentMetadata indicating whether it is copied or not. But we are making changes in RemoteLogSegmentMetadata to introduce a state field which will have the respective started and finished states. This includes other operations like delete too.
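As an illustration of that state field, a minimal sketch (the enum and constant names below are assumptions, not the KIP's final schema) could look like this:

// Hypothetical sketch of the per-segment state carried in RemoteLogSegmentMetadata.
// Names are illustrative only; the KIP defines the actual schema.
public enum RemoteLogSegmentState {
    COPY_SEGMENT_STARTED,    // metadata event written before the segment upload begins
    COPY_SEGMENT_FINISHED,   // written once the upload to remote storage succeeds
    DELETE_SEGMENT_STARTED,  // written before the remote segment is deleted
    DELETE_SEGMENT_FINISHED  // written once the deletion succeeds
}

A segment whose latest state remains a "started" state with no matching "finished" event can then be treated as an orphaned copy or a failed deletion and be cleaned up or retried.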
> 5004. In the default implementation of RLMM (using the internal topic __remote_log_metadata), a separate topic called __remote_segments_to_be_deleted is going to be used just to track failures in removing remote log segments. A separate topic (effectively another metadata stream) introduces some maintenance overhead and design complexity. It seems to me that the same can be achieved just by using the __remote_log_metadata topic with the following steps: 1) the leader writes a delete_initiated metadata event, 2) the leader deletes the segment and 3) the leader writes a delete_completed metadata event. Tiered segments that have a delete_initiated message and no delete_completed message can be considered a failure and retried.

Jun suggested in an earlier mail to keep this simple. We decided not to have this topic, as mentioned in our earlier replies, and updated the KIP. As I mentioned in an earlier comment, we are adding state entries for delete operations too.

> 5005. When a Kafka cluster is provisioned for the first time with KIP-405 tiered storage enabled, could you explain in the KIP how the bootstrap for the __remote_log_metadata topic will be performed in the default RLMM implementation?

The __remote_log_segment_metadata topic is created by default with the respective topic properties like partitions/replication-factor etc. Can you be more specific on what you are looking for?

> 5008. The system-wide configuration 'remote.log.storage.enable' is used to enable tiered storage. Can this be made a topic-level configuration, so that the user can enable/disable tiered storage at a topic level rather than a system-wide default for an entire Kafka cluster?

Yes, we mentioned in an earlier mail thread that it will be supported at the topic level too, updated the KIP.
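If such a topic-level switch is added, enabling it on an existing topic could be an ordinary incremental config update. A sketch under the assumption that the topic-level property is named "remote.storage.enable" (a placeholder, not a confirmed config key):

import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class EnableTieredStorageForTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            // "remote.storage.enable" is a placeholder name for the proposed topic-level switch.
            AlterConfigOp enableTiering = new AlterConfigOp(
                    new ConfigEntry("remote.storage.enable", "true"), AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates =
                    Collections.singletonMap(topic, Collections.singletonList(enableTiering));
            // Only the listed entry is changed; other topic configs are left untouched.
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}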
> 5009. Whenever a topic with tiered storage enabled is deleted, the underlying actions require the topic data to be deleted in the local store as well as the remote store, and eventually the topic metadata needs to be deleted too. What is the role of the controller in deleting a topic and its contents, while the topic has tiered storage enabled?

When a topic partition is deleted, there will be an event for its deletion in RLMM, and the controller considers the topic deleted only when all the remote log segments are also deleted.

> 5010. RLMM APIs are currently synchronous, for example RLMM.putRemoteLogSegmentData waits until the put operation is completed in the remote metadata store. It may also block until the leader has caught up to the metadata (not sure). Could we make these APIs asynchronous (ex: based on java.util.concurrent.Future) to provide room for tapping performance improvements such as non-blocking I/O?

> 5011. The same question as 5010 on sync vs async API for RSM. Have we considered the pros/cons of making the RSM APIs asynchronous?

Async methods are used to do other tasks while the result is not available. In this case, we need to have the result before proceeding to take the next actions. These APIs are evolving, and they can be updated as and when needed instead of making them asynchronous now.

Thanks,
Satish.
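Purely as an illustration of the Future-based option raised in 5010/5011 (a hypothetical interface, not the KIP's API), an asynchronous put might be shaped like this:

import java.util.concurrent.CompletableFuture;

// Placeholder for the KIP's metadata class; fields omitted for brevity.
class RemoteLogSegmentMetadata { }

// Hypothetical async counterpart to the synchronous RLMM.putRemoteLogSegmentData
// discussed above; not the KIP's actual API.
interface AsyncRemoteLogMetadataManager {
    // The returned future completes once the metadata event has been durably written
    // to (and, if needed, read back from) the metadata store, letting the caller
    // overlap segment uploads instead of blocking on each put.
    CompletableFuture<Void> putRemoteLogSegmentData(RemoteLogSegmentMetadata metadata);
}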
On Fri, Aug 14, 2020 at 4:30 AM Kowshik Prakasam <kprakasam@confluent.io> wrote:

Hi Harsha/Satish,

Thanks for the great KIP. Below are the first set of questions/suggestions I had after making a pass on the KIP.

5001. Under the section "Follower fetch protocol in detail", the next-local-offset is the offset up to which the segments are copied to remote storage. Instead, would last-tiered-offset be a better name than next-local-offset? last-tiered-offset seems to naturally align well with the definition provided in the KIP.

5002. After leadership is established for a partition, the leader would begin uploading a segment to remote storage. If successful, the leader would write the updated RemoteLogSegmentMetadata to the metadata topic (via RLMM.putRemoteLogSegmentData). However, for defensive reasons, it seems useful that before the first time the segment is uploaded by the leader for a partition, the leader should ensure to catch up to all the metadata events written so far in the metadata topic for that partition (ex: by the previous leader). To achieve this, the leader could start a lease (using an establish_leader metadata event) before commencing tiering, and wait until the event is read back. For example, this seems useful to avoid cases where zombie leaders can be active for the same partition. This can also prove useful to help avoid making decisions on which segments are to be uploaded for a partition, until the current leader has caught up to a complete view of all segments uploaded for the partition so far (otherwise this may cause the same segment to be uploaded twice -- once by the previous leader and then by the new leader).

5003. There is a natural interleaving between uploading a segment to the remote store, and writing a metadata event for the same (via RLMM.putRemoteLogSegmentData). There can be cases where a remote segment is uploaded, then the leader fails and a corresponding metadata event never gets written. In such cases, the orphaned remote segment has to be eventually deleted (since there is no confirmation of the upload). To handle this, we could use 2 separate metadata events, viz. copy_initiated and copy_completed, so that copy_initiated events that don't have a corresponding copy_completed event can be treated as garbage and deleted from the remote object store by the broker.

5004. In the default implementation of RLMM (using the internal topic __remote_log_metadata), a separate topic called __remote_segments_to_be_deleted is going to be used just to track failures in removing remote log segments. A separate topic (effectively another metadata stream) introduces some maintenance overhead and design complexity. It seems to me that the same can be achieved just by using the __remote_log_metadata topic with the following steps: 1) the leader writes a delete_initiated metadata event, 2) the leader deletes the segment and 3) the leader writes a delete_completed metadata event. Tiered segments that have a delete_initiated message and no delete_completed message can be considered a failure and retried.

5005. When a Kafka cluster is provisioned for the first time with KIP-405 tiered storage enabled, could you explain in the KIP how the bootstrap for the __remote_log_metadata topic will be performed in the default RLMM implementation?

5006. I currently do not see details in the KIP on why RocksDB was chosen as the default cache implementation, and how it is going to be used. Were alternatives compared/considered? For example, it would be useful to explain/evaluate the following: 1) debuggability of the RocksDB JNI interface, 2) performance, 3) portability across platforms and 4) interface parity of RocksDB's JNI API with its underlying C/C++ API.

5007. For the RocksDB cache (the default implementation of RLMM), what is the relationship/mapping between the following: 1) # of tiered partitions, 2) # of partitions of the metadata topic __remote_log_metadata and 3) # of RocksDB instances? i.e. is the plan to have a RocksDB instance per tiered partition, or per metadata topic partition, or just one per broker?
5008. The system-wide configuration 'remote.log.storage.enable' is used to enable tiered storage. Can this be made a topic-level configuration, so that the user can enable/disable tiered storage at a topic level rather than a system-wide default for an entire Kafka cluster?

5009. Whenever a topic with tiered storage enabled is deleted, the underlying actions require the topic data to be deleted in the local store as well as the remote store, and eventually the topic metadata needs to be deleted too. What is the role of the controller in deleting a topic and its contents, while the topic has tiered storage enabled?

5010. RLMM APIs are currently synchronous, for example RLMM.putRemoteLogSegmentData waits until the put operation is completed in the remote metadata store. It may also block until the leader has caught up to the metadata (not sure). Could we make these APIs asynchronous (ex: based on java.util.concurrent.Future) to provide room for tapping performance improvements such as non-blocking I/O?

5011. The same question as 5010 on sync vs async API for RSM. Have we considered the pros/cons of making the RSM APIs asynchronous?

Cheers,
Kowshik

On Thu, Aug 6, 2020 at 11:02 AM Satish Duggana <satish.duggana@gmail.com> wrote:

Hi Jun,
Thanks for your comments.

> At the high level, that approach sounds reasonable to me. It would be useful to document how RLMM handles overlapping archived offset ranges and how those overlapping segments are deleted through retention.

Sure, we will document that in the KIP.

> How is the remaining part of the KIP coming along? To me, the two biggest missing items are (1) more detailed documentation on how all the new APIs are being used and (2) metadata format and usage in the internal topic __remote_log_metadata.

We are working on updating the APIs based on the recent discussions and getting the perf numbers by plugging in RocksDB as a cache store for RLMM. We will update the KIP with the updated APIs and with the above requested details in a few days and let you know.

Thanks,
Satish.

On Wed, Aug 5, 2020 at 12:49 AM Jun Rao <jun@confluent.io> wrote:

Hi, Ying, Satish,

Thanks for the reply.
At the high level, that approach sounds reasonable to me. It would be useful to document how RLMM handles overlapping archived offset ranges and how those overlapping segments are deleted through retention.

How is the remaining part of the KIP coming along? To me, the two biggest missing items are (1) more detailed documentation on how all the new APIs are being used and (2) metadata format and usage in the internal topic __remote_log_metadata.

Thanks,

Jun

On Tue, Aug 4, 2020 at 8:32 AM Satish Duggana <satish.duggana@gmail.com> wrote:

Hi Jun,
Thanks for your comment.

> 1001. Using the new leader as the source of truth may be fine too. What's not clear to me is when a follower takes over as the new leader, from which offset does it start archiving to the block storage. I assume that the new leader starts from the latest archived offset by the previous leader, but it seems that's not the case. It would be useful to document this in the wiki.

When a follower becomes a leader it needs to find out the offset from which the segments are to be copied to remote storage. This is found by traversing from the latest leader epoch in the leader epoch history and finding the highest offset of a segment with that epoch copied into remote storage, using the respective RLMM APIs. If it can not find an entry then it checks for the previous leader epoch till it finds an entry. If there are no entries till the earliest leader epoch in the leader epoch cache then it starts copying the segments from the earliest epoch entry's offset.
Added an example in the KIP here[1]. We will update the RLMM APIs in the KIP.

[1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition

Satish.
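A rough sketch of that traversal, with placeholder types and method names standing in for the leader epoch cache and the RLMM lookup (assumptions, not actual KIP signatures):

import java.util.NavigableMap;
import java.util.Optional;

// Hypothetical helper illustrating the lookup described above. Assumes the leader
// epoch cache is non-empty and is given as epoch -> start offset of that epoch.
class ArchivingStartOffsetResolver {

    interface RemoteLogMetadataLookup {
        // Highest offset of any segment already copied to remote storage for this epoch.
        Optional<Long> highestOffsetCopiedForEpoch(int epoch);
    }

    long resolveStartOffset(NavigableMap<Integer, Long> epochToStartOffset,
                            RemoteLogMetadataLookup rlmm) {
        // Walk from the latest leader epoch backwards, looking for the most recent
        // epoch that already has segments in remote storage.
        for (Integer epoch : epochToStartOffset.descendingKeySet()) {
            Optional<Long> highestCopied = rlmm.highestOffsetCopiedForEpoch(epoch);
            if (highestCopied.isPresent()) {
                // Resume archiving right after the last offset copied by a previous leader.
                return highestCopied.get() + 1;
            }
        }
        // Nothing copied yet for any known epoch: start from the earliest epoch's offset.
        return epochToStartOffset.firstEntry().getValue();
    }
}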
On Tue, Aug 4, 2020 at 9:00 PM Satish Duggana <satish.duggana@gmail.com> wrote:

Hi Ying,
Thanks for your comment.

> 1001. Using the new leader as the source of truth may be fine too. What's not clear to me is when a follower takes over as the new leader, from which offset does it start archiving to the block storage. I assume that the new leader starts from the latest archived offset by the previous leader, but it seems that's not the case. It would be useful to document this in the wiki.

When a follower becomes a leader it needs to find out the offset from which the segments are to be copied to remote storage. This is found by traversing from the latest leader epoch in the leader epoch history and finding the highest offset of a segment with that epoch copied into remote storage, using the respective RLMM APIs. If it can not find an entry then it checks for the previous leader epoch till it finds an entry. If there are no entries till the earliest leader epoch in the leader epoch cache then it starts copying the segments from the earliest epoch entry's offset.
Added an example in the KIP here[1]. We will update the RLMM APIs in the KIP.

[1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition

Satish.

On Tue, Aug 4, 2020 at 10:28 AM Ying Zheng <yingz@uber.com.invalid> wrote:

Hi Jun,

Thank you for the comment! The current KIP is not very clear about this part.

1001. The new leader will start archiving from the earliest local segment that is not fully covered by the "valid" remote data. "valid" means the (offset, leader epoch) pair is valid based on the leader-epoch history.

There are some edge cases where the same offset range (with the same leader epoch) can be copied to the remote storage more than once. But this kind of duplication shouldn't be a problem.

Satish is going to explain the details in the KIP with examples.
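A simplified sketch of that coverage check, under assumed placeholder types (it treats each local segment as carrying a single leader epoch and expects the valid remote ranges sorted by base offset; the real check works off the full leader-epoch history and RLMM metadata):

import java.util.List;

// Illustrative only; not the KIP's actual classes or logic.
class RemoteCoverageCheck {

    static final class OffsetRange {
        final long baseOffset;
        final long endOffset;   // inclusive
        final int leaderEpoch;

        OffsetRange(long baseOffset, long endOffset, int leaderEpoch) {
            this.baseOffset = baseOffset;
            this.endOffset = endOffset;
            this.leaderEpoch = leaderEpoch;
        }
    }

    // A local segment is "fully covered" when valid remote ranges carrying the same
    // leader epoch span its entire offset range without gaps.
    static boolean fullyCovered(OffsetRange localSegment, List<OffsetRange> validRemoteRanges) {
        long next = localSegment.baseOffset;
        for (OffsetRange remote : validRemoteRanges) {
            if (remote.leaderEpoch == localSegment.leaderEpoch
                    && remote.baseOffset <= next && next <= remote.endOffset) {
                next = remote.endOffset + 1; // coverage extends past this remote range
                if (next > localSegment.endOffset) {
                    return true;
                }
            }
        }
        return false;
    }
}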
On Fri, Jul 31, 2020 at 2:55 PM Jun Rao <jun@confluent.io> wrote:

Hi, Ying,

Thanks for the reply.

1001. Using the new leader as the source of truth may be fine too. What's not clear to me is when a follower takes over as the new leader, from which offset does it start archiving to the block storage. I assume that the new leader starts from the latest archived offset by the previous leader, but it seems that's not the case. It would be useful to document this in the wiki.

Jun

On Tue, Jul 28, 2020 at 12:11 PM Ying Zheng <yingz@uber.com.invalid> wrote:

1001.

We did consider this approach. The concerns are:
1) This makes unclean-leader-election rely on remote storage. In case the remote storage is unavailable, Kafka will not be able to finish the