We did some basic feature tests at Uber. The test cases and results are shared in this Google sheet: https://docs.google.com/spreadsheets/d/1XhNJqjzwXvMCcAOhEH0sSXU6RTvyoSf93DHF-YMfGLk/edit?usp=sharing
The performance test results were already shared in the KIP last month.

On Mon, Aug 24, 2020 at 11:10 AM Harsha Ch <harsha...@gmail.com> wrote:

"Understand commitments towards driving design & implementation of the KIP further and how it aligns with participant interests in contributing to the efforts (ex: in the context of Uber’s Q3/Q4 roadmap)."

What is that about?

On Mon, Aug 24, 2020 at 11:05 AM Kowshik Prakasam <kpraka...@confluent.io> wrote:

Hi Harsha,

The following Google doc contains a proposal for a temporary agenda for the KIP-405 sync meeting tomorrow:
https://docs.google.com/document/d/1pqo8X5LU8TpwfC_iqSuVPezhfCfhGkbGN2TqiPA3LBU/edit
Please could you add it to the Google calendar invite?

Thank you.

Cheers,
Kowshik

On Thu, Aug 20, 2020 at 10:58 AM Harsha Ch <harsha...@gmail.com> wrote:

Hi All,

Scheduled a meeting for Tuesday 9am - 10am. I can record and upload it for the community to be able to follow the discussion.

Jun, please add the required folks on the Confluent side.

Thanks,
Harsha

On Thu, Aug 20, 2020 at 12:33 AM Alexandre Dupriez <alexandre.dupr...@gmail.com> wrote:

Hi Jun,

Many thanks for your initiative. If you like, I am happy to attend at the time you suggested.

Many thanks,
Alexandre

On Wed, Aug 19, 2020 at 22:00, Harsha Ch <harsha...@gmail.com> wrote:

Hi Jun,
Thanks. This will help a lot. Tuesday will work for us.
-Harsha

On Wed, Aug 19, 2020 at 1:24 PM Jun Rao <j...@confluent.io> wrote:

Hi, Satish, Ying, Harsha,

Do you think it would be useful to have a regular virtual meeting to discuss this KIP? The goal of the meeting will be sharing design/development progress and discussing any open issues to accelerate this KIP. If so, will every Tuesday (from next week) 9am-10am PT work for you? I can help set up a Zoom meeting, invite everyone who might be interested, have it recorded and shared, etc.

Thanks,
Jun

On Tue, Aug 18, 2020 at 11:01 AM Satish Duggana <satish.dugg...@gmail.com> wrote:

Hi Kowshik,
Thanks for looking into the KIP and sending your comments.

> 5001. Under the section "Follower fetch protocol in detail", the next-local-offset is the offset up to which the segments are copied to remote storage. Instead, would last-tiered-offset be a better name than next-local-offset? last-tiered-offset seems to naturally align well with the definition provided in the KIP.

Both next-local-offset and local-log-start-offset were introduced to talk about offsets related to the local log. We are fine with last-tiered-offset too, as you suggested.

> 5002. After leadership is established for a partition, the leader would begin uploading a segment to remote storage. If successful, the leader would write the updated RemoteLogSegmentMetadata to the metadata topic (via RLMM.putRemoteLogSegmentData). However, for defensive reasons, it seems useful that before the first time the segment is uploaded by the leader for a partition, the leader should ensure to catch up to all the metadata events written so far in the metadata topic for that partition (ex: by the previous leader). To achieve this, the leader could start a lease (using an establish_leader metadata event) before commencing tiering, and wait until the event is read back. For example, this seems useful to avoid cases where zombie leaders can be active for the same partition. This can also prove useful to help avoid making decisions on which segments are to be uploaded for a partition, until the current leader has caught up to a complete view of all segments uploaded for the partition so far (otherwise this may cause the same segment being uploaded twice -- once by the previous leader and then by the new leader).

We allow copying segments to remote storage which may have common offsets. Please go through the KIP to understand the follower fetch protocol (1) and the follower-to-leader transition (2).

1. https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-FollowerReplication
2. https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition

> 5003. There is a natural interleaving between uploading a segment to the remote store and writing a metadata event for the same (via RLMM.putRemoteLogSegmentData). There can be cases where a remote segment is uploaded, then the leader fails, and a corresponding metadata event never gets written. In such cases, the orphaned remote segment has to be eventually deleted (since there is no confirmation of the upload). To handle this, we could use 2 separate metadata events viz. copy_initiated and copy_completed, so that copy_initiated events that don't have a corresponding copy_completed event can be treated as garbage and deleted from the remote object store by the broker.

We are already updating RLMM with RemoteLogSegmentMetadata pre and post copying of log segments. We had a flag in RemoteLogSegmentMetadata indicating whether it is copied or not. But we are making changes in RemoteLogSegmentMetadata to introduce a state field which will have the respective started and finished states. This includes other operations like delete too.
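For readers following along, here is a minimal sketch of what such a state field could look like. The names are hypothetical placeholders, not necessarily the KIP's final API:

```java
// Hypothetical sketch of the lifecycle state added to RemoteLogSegmentMetadata;
// names are illustrative placeholders, not necessarily the KIP's final API.
public enum RemoteLogSegmentState {
    COPY_SEGMENT_STARTED,    // leader has begun copying the segment to remote storage
    COPY_SEGMENT_FINISHED,   // copy completed; the segment may be served from remote storage
    DELETE_SEGMENT_STARTED,  // deletion of the remote segment has been initiated
    DELETE_SEGMENT_FINISHED; // the remote segment and its metadata are fully removed

    // The 5003 scenario: a segment whose latest recorded state is COPY_SEGMENT_STARTED,
    // with no later COPY_SEGMENT_FINISHED event, is an orphan candidate that can be
    // garbage-collected from the remote object store by the broker.
    public static boolean isOrphanCandidate(RemoteLogSegmentState latestState) {
        return latestState == COPY_SEGMENT_STARTED;
    }
}
```

With a state like this recorded per segment in the metadata topic, the copy_initiated/copy_completed pair suggested in 5003 becomes two state transitions on the same metadata entry rather than two separate event types.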
> 5004. In the default implementation of RLMM (using the internal topic __remote_log_metadata), a separate topic called __remote_segments_to_be_deleted is going to be used just to track failures in removing remote log segments. A separate topic (effectively another metadata stream) introduces some maintenance overhead and design complexity. It seems to me that the same can be achieved by using just the __remote_log_metadata topic with the following steps: 1) the leader writes a delete_initiated metadata event, 2) the leader deletes the segment, and 3) the leader writes a delete_completed metadata event. Tiered segments that have a delete_initiated message and no delete_completed message can be considered a failure and retried.

Jun suggested in an earlier mail to keep this simple. We decided not to have this topic, as mentioned in our earlier replies, and updated the KIP. As I mentioned in an earlier comment, we are adding state entries for delete operations too.
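Building on the RemoteLogSegmentState enum sketched above, a rough illustration of the single-topic delete flow being discussed here; MetadataPublisher and RemoteStorage are hypothetical placeholders, not KIP-405 interfaces:

```java
// Illustrative sketch only: drive segment deletion through state events in
// __remote_log_metadata instead of a separate __remote_segments_to_be_deleted topic.
public class RemoteSegmentDeleter {
    private final MetadataPublisher metadataPublisher; // writes events to __remote_log_metadata
    private final RemoteStorage remoteStorage;         // deletes segment data from the object store

    public RemoteSegmentDeleter(MetadataPublisher metadataPublisher, RemoteStorage remoteStorage) {
        this.metadataPublisher = metadataPublisher;
        this.remoteStorage = remoteStorage;
    }

    public void delete(String remoteLogSegmentId) {
        // 1) Record that the deletion has started.
        metadataPublisher.publish(remoteLogSegmentId, RemoteLogSegmentState.DELETE_SEGMENT_STARTED);
        // 2) Delete the segment data from remote storage.
        remoteStorage.deleteSegment(remoteLogSegmentId);
        // 3) Record that the deletion has finished.
        metadataPublisher.publish(remoteLogSegmentId, RemoteLogSegmentState.DELETE_SEGMENT_FINISHED);
    }

    // A segment whose last recorded state is DELETE_SEGMENT_STARTED (no FINISHED event)
    // is treated as a failed deletion and retried.
    public void retryIfIncomplete(String remoteLogSegmentId, RemoteLogSegmentState lastState) {
        if (lastState == RemoteLogSegmentState.DELETE_SEGMENT_STARTED) {
            delete(remoteLogSegmentId);
        }
    }

    interface MetadataPublisher {
        void publish(String remoteLogSegmentId, RemoteLogSegmentState state);
    }

    interface RemoteStorage {
        void deleteSegment(String remoteLogSegmentId);
    }
}
```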
> 5005. When a Kafka cluster is provisioned for the first time with KIP-405 tiered storage enabled, could you explain in the KIP how the bootstrap for the __remote_log_metadata topic will be performed in the default RLMM implementation?

The __remote_log_segment_metadata topic is created by default with the respective topic properties like partitions/replication-factor, etc. Can you be more specific on what you are looking for?

> 5008. The system-wide configuration 'remote.log.storage.enable' is used to enable tiered storage. Can this be made a topic-level configuration, so that the user can enable/disable tiered storage at a topic level rather than a system-wide default for an entire Kafka cluster?

Yes, we mentioned in an earlier mail thread that it will be supported at the topic level too, and updated the KIP.
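To make the topic-level knob concrete, a sketch of enabling tiered storage for a single topic via the standard Admin client. The topic-level property name here simply reuses the flag quoted above and is an assumption, not the KIP's final name; the topic "payments" is likewise just an example:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateTieredTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("payments", 6, (short) 3)
                    // Hypothetical topic-level override; the system-wide default stays off.
                    .configs(Map.of("remote.log.storage.enable", "true",
                                    "retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```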
> 5009. Whenever a topic with tiered storage enabled is deleted, the underlying actions require the topic data to be deleted in the local store as well as the remote store, and eventually the topic metadata needs to be deleted too. What is the role of the controller in deleting a topic and its contents, while the topic has tiered storage enabled?

When a topic partition is deleted, there will be an event for its deletion in RLMM, and the controller considers the topic deleted only when all the remote log segments are also deleted.

> 5010. RLMM APIs are currently synchronous, for example RLMM.putRemoteLogSegmentData waits until the put operation is completed in the remote metadata store. It may also block until the leader has caught up to the metadata (not sure). Could we make these APIs asynchronous (ex: based on java.util.concurrent.Future) to provide room for tapping performance improvements such as non-blocking I/O?

> 5011. The same question as 5010 on sync vs async API for RSM. Have we considered the pros/cons of making the RSM APIs asynchronous?

Async methods are used to do other tasks while the result is not yet available. In this case, we need to have the result before proceeding to take the next actions. These APIs are evolving and can be updated as and when needed instead of making them asynchronous now.
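As an illustration of the trade-off being discussed, a sketch contrasting the synchronous calling pattern described above with the Future-based variant Kowshik suggests. RlmmClient and its method are hypothetical placeholders, not the KIP's RLMM interface:

```java
import java.util.concurrent.CompletableFuture;

public class RlmmCallPatterns {

    interface RlmmClient {
        // Synchronous style: returns only after the metadata write is durably
        // recorded, so the caller can safely act on the result.
        void putRemoteLogSegmentData(String remoteLogSegmentId);
    }

    // The segment copy path needs the metadata write to be complete before it can
    // advance offsets or clean up local data, so the call blocks by design.
    static void copyAndRecord(RlmmClient rlmm, String remoteLogSegmentId) {
        rlmm.putRemoteLogSegmentData(remoteLogSegmentId);
        // ... safe to proceed: update last-tiered-offset, schedule local cleanup, etc.
    }

    // What a Future-based variant could look like: the same operation wrapped so
    // callers may overlap other work, at the cost of extra complexity.
    static CompletableFuture<Void> putAsync(RlmmClient rlmm, String remoteLogSegmentId) {
        return CompletableFuture.runAsync(() -> rlmm.putRemoteLogSegmentData(remoteLogSegmentId));
    }
}
```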
Thanks,
Satish.

On Fri, Aug 14, 2020 at 4:30 AM Kowshik Prakasam <kpraka...@confluent.io> wrote:

Hi Harsha/Satish,

Thanks for the great KIP. Below are the first set of questions/suggestions I had after making a pass on the KIP.

5001. Under the section "Follower fetch protocol in detail", the next-local-offset is the offset up to which the segments are copied to remote storage. Instead, would last-tiered-offset be a better name than next-local-offset? last-tiered-offset seems to naturally align well with the definition provided in the KIP.

5002. After leadership is established for a partition, the leader would begin uploading a segment to remote storage. If successful, the leader would write the updated RemoteLogSegmentMetadata to the metadata topic (via RLMM.putRemoteLogSegmentData). However, for defensive reasons, it seems useful that before the first time the segment is uploaded by the leader for a partition, the leader should ensure to catch up to all the metadata events written so far in the metadata topic for that partition (ex: by the previous leader). To achieve this, the leader could start a lease (using an establish_leader metadata event) before commencing tiering, and wait until the event is read back. For example, this seems useful to avoid cases where zombie leaders can be active for the same partition. This can also prove useful to help avoid making decisions on which segments are to be uploaded for a partition, until the current leader has caught up to a complete view of all segments uploaded for the partition so far (otherwise this may cause the same segment being uploaded twice -- once by the previous leader and then by the new leader).

5003. There is a natural interleaving between uploading a segment to the remote store and writing a metadata event for the same (via RLMM.putRemoteLogSegmentData). There can be cases where a remote segment is uploaded, then the leader fails, and a corresponding metadata event never gets written. In such cases, the orphaned remote segment has to be eventually deleted (since there is no confirmation of the upload). To handle this, we could use 2 separate metadata events viz. copy_initiated and copy_completed, so that copy_initiated events that don't have a corresponding copy_completed event can be treated as garbage and deleted from the remote object store by the broker.

5004. In the default implementation of RLMM (using the internal topic __remote_log_metadata), a separate topic called __remote_segments_to_be_deleted is going to be used just to track failures in removing remote log segments. A separate topic (effectively another metadata stream) introduces some maintenance overhead and design complexity. It seems to me that the same can be achieved by using just the __remote_log_metadata topic with the following steps: 1) the leader writes a delete_initiated metadata event, 2) the leader deletes the segment, and 3) the leader writes a delete_completed metadata event. Tiered segments that have a delete_initiated message and no delete_completed message can be considered a failure and retried.

5005. When a Kafka cluster is provisioned for the first time with KIP-405 tiered storage enabled, could you explain in the KIP how the bootstrap for the __remote_log_metadata topic will be performed in the default RLMM implementation?

5006. I currently do not see details in the KIP on why RocksDB was chosen as the default cache implementation, and how it is going to be used. Were alternatives compared/considered? For example, it would be useful to explain/evaluate the following: 1) debuggability of the RocksDB JNI interface, 2) performance, 3) portability across platforms and 4) interface parity of RocksDB's JNI API with its underlying C/C++ API.

5007. For the RocksDB cache (the default implementation of RLMM), what is the relationship/mapping between the following: 1) # of tiered partitions, 2) # of partitions of the metadata topic __remote_log_metadata and 3) # of RocksDB instances? i.e. is the plan to have a RocksDB instance per tiered partition, or per metadata topic partition, or just one per broker?

5008. The system-wide configuration 'remote.log.storage.enable' is used to enable tiered storage. Can this be made a topic-level configuration, so that the user can enable/disable tiered storage at a topic level rather than a system-wide default for an entire Kafka cluster?

5009. Whenever a topic with tiered storage enabled is deleted, the underlying actions require the topic data to be deleted in the local store as well as the remote store, and eventually the topic metadata needs to be deleted too. What is the role of the controller in deleting a topic and its contents, while the topic has tiered storage enabled?

5010. RLMM APIs are currently synchronous, for example RLMM.putRemoteLogSegmentData waits until the put operation is completed in the remote metadata store. It may also block until the leader has caught up to the metadata (not sure). Could we make these APIs asynchronous (ex: based on java.util.concurrent.Future) to provide room for tapping performance improvements such as non-blocking I/O?

5011. The same question as 5010 on sync vs async API for RSM. Have we considered the pros/cons of making the RSM APIs asynchronous?

Cheers,
Kowshik

On Thu, Aug 6, 2020 at 11:02 AM Satish Duggana <satish.dugg...@gmail.com> wrote:

Hi Jun,
Thanks for your comments.

> At the high level, that approach sounds reasonable to me. It would be useful to document how RLMM handles overlapping archived offset ranges and how those overlapping segments are deleted through retention.

Sure, we will document that in the KIP.

> How is the remaining part of the KIP coming along? To me, the two biggest missing items are (1) more detailed documentation on how all the new APIs are being used and (2) metadata format and usage in the internal topic __remote_log_metadata.

We are working on updating the APIs based on the recent discussions and on getting the perf numbers by plugging in RocksDB as a cache store for RLMM. We will update the KIP with the updated APIs and with the above requested details in a few days and let you know.

Thanks,
Satish.

On Wed, Aug 5, 2020 at 12:49 AM Jun Rao <j...@confluent.io> wrote:

Hi, Ying, Satish,

Thanks for the reply. At the high level, that approach sounds reasonable to me. It would be useful to document how RLMM handles overlapping archived offset ranges and how those overlapping segments are deleted through retention.

How is the remaining part of the KIP coming along? To me, the two biggest missing items are (1) more detailed documentation on how all the new APIs are being used and (2) metadata format and usage in the internal topic __remote_log_metadata.

Thanks,
Jun

On Tue, Aug 4, 2020 at 8:32 AM Satish Duggana <satish.dugg...@gmail.com> wrote:

Hi Jun,
Thanks for your comment.

> 1001. Using the new leader as the source of truth may be fine too. What's not clear to me is when a follower takes over as the new leader, from which offset does it start archiving to the block storage. I assume that the new leader starts from the latest archived offset by the previous leader, but it seems that's not the case. It would be useful to document this in the wiki.

When a follower becomes a leader it needs to find out the offset from which the segments are to be copied to remote storage. This is found by traversing from the latest leader epoch in the leader epoch history and finding the highest offset of a segment with that epoch copied into remote storage, using the respective RLMM APIs. If it can not find an entry, it checks the previous leader epoch till it finds an entry. If there are no entries till the earliest leader epoch in the leader epoch cache, then it starts copying the segments from the earliest epoch entry's offset.

Added an example in the KIP here [1]. We will update the RLMM APIs in the KIP.

[1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition

Satish.
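A compact sketch of the traversal described above, under the assumption of a hypothetical RLMM lookup (highestCopiedOffsetForEpoch) and a simplified leader epoch cache; none of these names are from the KIP:

```java
import java.util.List;
import java.util.Optional;

public class TieringStartOffsetFinder {

    interface RemoteLogMetadata {
        // Highest offset of a segment with this leader epoch already copied to remote
        // storage, if any (hypothetical stand-in for the corresponding RLMM API).
        Optional<Long> highestCopiedOffsetForEpoch(int leaderEpoch);
    }

    record EpochEntry(int epoch, long startOffset) { }

    /**
     * @param epochHistory leader epoch cache entries, ordered from earliest to latest epoch
     * @return the offset from which the new leader should start copying segments
     */
    static long findStartOffset(List<EpochEntry> epochHistory, RemoteLogMetadata rlmm) {
        // Traverse from the latest epoch backwards.
        for (int i = epochHistory.size() - 1; i >= 0; i--) {
            Optional<Long> highest = rlmm.highestCopiedOffsetForEpoch(epochHistory.get(i).epoch());
            if (highest.isPresent()) {
                return highest.get() + 1; // resume right after the last tiered offset for this epoch
            }
        }
        // No epoch has anything in remote storage yet: start from the earliest epoch's offset.
        return epochHistory.get(0).startOffset();
    }
}
```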
On Tue, Aug 4, 2020 at 9:00 PM Satish Duggana <satish.dugg...@gmail.com> wrote:

Hi Ying,
Thanks for your comment.

> 1001. Using the new leader as the source of truth may be fine too. What's not clear to me is when a follower takes over as the new leader, from which offset does it start archiving to the block storage. I assume that the new leader starts from the latest archived offset by the previous leader, but it seems that's not the case. It would be useful to document this in the wiki.

When a follower becomes a leader it needs to find out the offset from which the segments are to be copied to remote storage. This is found by traversing from the latest leader epoch in the leader epoch history and finding the highest offset of a segment with that epoch copied into remote storage, using the respective RLMM APIs. If it can not find an entry, it checks the previous leader epoch till it finds an entry. If there are no entries till the earliest leader epoch in the leader epoch cache, then it starts copying the segments from the earliest epoch entry's offset.

Added an example in the KIP here [1]. We will update the RLMM APIs in the KIP.

[1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition

Satish.

On Tue, Aug 4, 2020 at 10:28 AM Ying Zheng <yi...@uber.com.invalid> wrote:

Hi Jun,

Thank you for the comment! The current KIP is not very clear about this part.

1001. The new leader will start archiving from the earliest local segment that is not fully covered by the "valid" remote data. "valid" means the (offset, leader epoch) pair is valid based on the leader-epoch history.

There are some edge cases where the same offset range (with the same leader epoch) can be copied to the remote storage more than once. But this kind of duplication shouldn't be a problem.

Satish is going to explain the details in the KIP with examples.
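A rough sketch of the "(offset, leader epoch) is valid" check referred to above, assuming the leader epoch cache is modeled as an epoch-to-start-offset map; the types and the example values are illustrative only:

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class RemoteDataValidity {

    /**
     * A remote segment's offset is considered valid for a leader epoch if it falls inside
     * the offset range that the local leader-epoch history assigns to that epoch.
     *
     * @param epochHistory leader epoch -> start offset of that epoch (from the leader epoch cache)
     * @param offset       offset recorded for the remote segment
     * @param leaderEpoch  leader epoch recorded for the remote segment
     */
    static boolean isValid(NavigableMap<Integer, Long> epochHistory, long offset, int leaderEpoch) {
        Long epochStart = epochHistory.get(leaderEpoch);
        if (epochStart == null) {
            return false; // the epoch never existed on this leader's lineage
        }
        // The epoch ends where the next epoch begins (or is still open if it is the latest).
        Map.Entry<Integer, Long> next = epochHistory.higherEntry(leaderEpoch);
        long epochEnd = (next == null) ? Long.MAX_VALUE : next.getValue();
        return offset >= epochStart && offset < epochEnd;
    }

    public static void main(String[] args) {
        NavigableMap<Integer, Long> history = new TreeMap<>();
        history.put(3, 0L);     // epoch 3 starts at offset 0
        history.put(5, 1000L);  // epoch 5 starts at offset 1000
        System.out.println(isValid(history, 1500, 5)); // true
        System.out.println(isValid(history, 1500, 3)); // false: offset 1500 belongs to epoch 5
        System.out.println(isValid(history, 500, 4));  // false: epoch 4 is not in this lineage
    }
}
```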
On Fri, Jul 31, 2020 at 2:55 PM Jun Rao <j...@confluent.io> wrote:

Hi, Ying,

Thanks for the reply.

1001. Using the new leader as the source of truth may be fine too. What's not clear to me is when a follower takes over as the new leader, from which offset does it start archiving to the block storage. I assume that the new leader starts from the latest archived offset by the previous leader, but it seems that's not the case. It would be useful to document this in the wiki.

Jun

On Tue, Jul 28, 2020 at 12:11 PM Ying Zheng <yi...@uber.com.invalid> wrote:

1001.

We did consider this approach. The concerns are:
1) This makes unclean-leader-election rely on remote storage. In case the remote storage is unavailable, Kafka will not be able to finish the