Hi Jun,

Thanks. This will help a lot. Tuesday will work for us.

-Harsha
On Wed, Aug 19, 2020 at 1:24 PM Jun Rao <j...@confluent.io> wrote: > Hi, Satish, Ying, Harsha, > > Do you think it would be useful to have a regular virtual meeting to > discuss this KIP? The goal of the meeting will be sharing > design/development progress and discussing any open issues to > accelerate this KIP. If so, will every Tuesday (from next week) 9am-10am PT > work for you? I can help set up a Zoom meeting, invite everyone who might > be interested, have it recorded and shared, etc. > > Thanks, > > Jun > > On Tue, Aug 18, 2020 at 11:01 AM Satish Duggana <satish.dugg...@gmail.com> > wrote: > > > Hi Kowshik, > > > > Thanks for looking into the KIP and sending your comments. > > > > 5001. Under the section "Follower fetch protocol in detail", the > > next-local-offset is the offset upto which the segments are copied to > > remote storage. Instead, would last-tiered-offset be a better name than > > next-local-offset? last-tiered-offset seems to naturally align well with > > the definition provided in the KIP. > > > > Both next-local-offset and local-log-start-offset were introduced to > > talk about offsets related to local log. We are fine with > > last-tiered-offset too as you suggested. > > > > 5002. After leadership is established for a partition, the leader would > > begin uploading a segment to remote storage. If successful, the leader > > would write the updated RemoteLogSegmentMetadata to the metadata topic > (via > > RLMM.putRemoteLogSegmentData). However, for defensive reasons, it seems > > useful that before the first time the segment is uploaded by the leader > for > > a partition, the leader should ensure to catch up to all the metadata > > events written so far in the metadata topic for that partition (ex: by > > previous leader). To achieve this, the leader could start a lease (using > an > > establish_leader metadata event) before commencing tiering, and wait > until > > the event is read back. For example, this seems useful to avoid cases > where > > zombie leaders can be active for the same partition. This can also prove > > useful to help avoid making decisions on which segments to be uploaded > for > > a partition, until the current leader has caught up to a complete view of > > all segments uploaded for the partition so far (otherwise this may cause > > same segment being uploaded twice -- once by the previous leader and then > > by the new leader). > > > > We allow copying segments to remote storage which may have common > > offsets. Please go through the KIP to understand the follower fetch > > protocol(1) and follower to leader transition(2). > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-FollowerReplication > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition > > > > > > 5003. There is a natural interleaving between uploading a segment to > remote > > store, and, writing a metadata event for the same (via > > RLMM.putRemoteLogSegmentData). There can be cases where a remote segment > is > > uploaded, then the leader fails and a corresponding metadata event never > > gets written. In such cases, the orphaned remote segment has to be > > eventually deleted (since there is no confirmation of the upload). To > > handle this, we could use 2 separate metadata events viz. 
copy_initiated > > and copy_completed, so that copy_initiated events that don't have a > > corresponding copy_completed event can be treated as garbage and deleted > > from the remote object store by the broker. > > > > We are already updating RMM with RemoteLogSegmentMetadata pre and post > > copying of log segments. We had a flag in RemoteLogSegmentMetadata > > whether it is copied or not. But we are making changes in > > RemoteLogSegmentMetadata to introduce a state field in > > RemoteLogSegmentMetadata which will have the respective started and > > finished states. This includes for other operations like delete too. > > > > 5004. In the default implementation of RLMM (using the internal topic > > __remote_log_metadata), a separate topic called > > __remote_segments_to_be_deleted is going to be used just to track > failures > > in removing remote log segments. A separate topic (effectively another > > metadata stream) introduces some maintenance overhead and design > > complexity. It seems to me that the same can be achieved just by using > just > > the __remote_log_metadata topic with the following steps: 1) the leader > > writes a delete_initiated metadata event, 2) the leader deletes the > segment > > and 3) the leader writes a delete_completed metadata event. Tiered > segments > > that have delete_initiated message and not delete_completed message, can > be > > considered to be a failure and retried. > > > > Jun suggested in earlier mail to keep this simple . We decided not to > > have this topic as mentioned in our earlier replies, updated the KIP. > > As I mentioned in an earlier comment, we are adding state entries for > > delete operations too. > > > > 5005. When a Kafka cluster is provisioned for the first time with KIP-405 > > tiered storage enabled, could you explain in the KIP about how the > > bootstrap for __remote_log_metadata topic will be performed in the the > > default RLMM implementation? > > > > __remote_log_segment_metadata topic is created by default with the > > respective topic like partitions/replication-factor etc. Can you be > > more specific on what you are looking for? > > > > 5008. The system-wide configuration 'remote.log.storage.enable' is used > to > > enable tiered storage. Can this be made a topic-level configuration, so > > that the user can enable/disable tiered storage at a topic level rather > > than a system-wide default for an entire Kafka cluster? > > > > Yes, we mentioned in an earlier mail thread that it will be supported > > at topic level too, updated the KIP. > > > > 5009. Whenever a topic with tiered storage enabled is deleted, the > > underlying actions require the topic data to be deleted in local store as > > well as remote store, and eventually the topic metadata needs to be > deleted > > too. What is the role of the controller in deleting a topic and it's > > contents, while the topic has tiered storage enabled? > > > > When a topic partition is deleted, there will be an event for that in > > RLMM for its deletion and the controller considers that topic is > > deleted only when all the remote log segments are also deleted. > > > > 5010. RLMM APIs are currently synchronous, for example > > RLMM.putRemoteLogSegmentData waits until the put operation is completed > in > > the remote metadata store. It may also block until the leader has caught > up > > to the metadata (not sure). 
Could we make these apis asynchronous (ex: > > based on java.util.concurrent.Future) to provide room for tapping > > performance improvements such as non-blocking i/o? > > 5011. The same question as 5009 on sync vs async api for RSM. Have we > > considered the pros/cons of making the RSM apis asynchronous? > > > > Async methods are used to do other tasks while the result is not > > available. In this case, we need to have the result before proceeding > > to take next actions. These APIs are evolving and these can be updated > > as and when needed instead of having them as asynchronous now. > > > > Thanks, > > Satish. > > > > On Fri, Aug 14, 2020 at 4:30 AM Kowshik Prakasam <kpraka...@confluent.io > > > > wrote: > > > > > > Hi Harsha/Satish, > > > > > > Thanks for the great KIP. Below are the first set of > > questions/suggestions > > > I had after making a pass on the KIP. > > > > > > 5001. Under the section "Follower fetch protocol in detail", the > > > next-local-offset is the offset upto which the segments are copied to > > > remote storage. Instead, would last-tiered-offset be a better name than > > > next-local-offset? last-tiered-offset seems to naturally align well > with > > > the definition provided in the KIP. > > > > > > 5002. After leadership is established for a partition, the leader would > > > begin uploading a segment to remote storage. If successful, the leader > > > would write the updated RemoteLogSegmentMetadata to the metadata topic > > (via > > > RLMM.putRemoteLogSegmentData). However, for defensive reasons, it seems > > > useful that before the first time the segment is uploaded by the leader > > for > > > a partition, the leader should ensure to catch up to all the metadata > > > events written so far in the metadata topic for that partition (ex: by > > > previous leader). To achieve this, the leader could start a lease > (using > > an > > > establish_leader metadata event) before commencing tiering, and wait > > until > > > the event is read back. For example, this seems useful to avoid cases > > where > > > zombie leaders can be active for the same partition. This can also > prove > > > useful to help avoid making decisions on which segments to be uploaded > > for > > > a partition, until the current leader has caught up to a complete view > of > > > all segments uploaded for the partition so far (otherwise this may > cause > > > same segment being uploaded twice -- once by the previous leader and > then > > > by the new leader). > > > > > > 5003. There is a natural interleaving between uploading a segment to > > remote > > > store, and, writing a metadata event for the same (via > > > RLMM.putRemoteLogSegmentData). There can be cases where a remote > segment > > is > > > uploaded, then the leader fails and a corresponding metadata event > never > > > gets written. In such cases, the orphaned remote segment has to be > > > eventually deleted (since there is no confirmation of the upload). To > > > handle this, we could use 2 separate metadata events viz. > copy_initiated > > > and copy_completed, so that copy_initiated events that don't have a > > > corresponding copy_completed event can be treated as garbage and > deleted > > > from the remote object store by the broker. > > > > > > 5004. In the default implementation of RLMM (using the internal topic > > > __remote_log_metadata), a separate topic called > > > __remote_segments_to_be_deleted is going to be used just to track > > failures > > > in removing remote log segments. 
A separate topic (effectively another > > > metadata stream) introduces some maintenance overhead and design > > > complexity. It seems to me that the same can be achieved just by using > > just > > > the __remote_log_metadata topic with the following steps: 1) the leader > > > writes a delete_initiated metadata event, 2) the leader deletes the > > segment > > > and 3) the leader writes a delete_completed metadata event. Tiered > > segments > > > that have delete_initiated message and not delete_completed message, > can > > be > > > considered to be a failure and retried. > > > > > > 5005. When a Kafka cluster is provisioned for the first time with > KIP-405 > > > tiered storage enabled, could you explain in the KIP about how the > > > bootstrap for __remote_log_metadata topic will be performed in the the > > > default RLMM implementation? > > > > > > 5006. I currently do not see details on the KIP on why RocksDB was > chosen > > > as the default cache implementation, and how it is going to be used. > Were > > > alternatives compared/considered? For example, it would be useful to > > > explain/evaulate the following: 1) debuggability of the RocksDB JNI > > > interface, 2) performance, 3) portability across platforms and 4) > > interface > > > parity of RocksDB’s JNI api with it's underlying C/C++ api. > > > > > > 5007. For the RocksDB cache (the default implementation of RLMM), what > is > > > the relationship/mapping between the following: 1) # of tiered > > partitions, > > > 2) # of partitions of metadata topic __remote_log_metadata and 3) # of > > > RocksDB instances? i.e. is the plan to have a RocksDB instance per > tiered > > > partition, or per metadata topic partition, or just 1 for per broker? > > > > > > 5008. The system-wide configuration 'remote.log.storage.enable' is used > > to > > > enable tiered storage. Can this be made a topic-level configuration, so > > > that the user can enable/disable tiered storage at a topic level rather > > > than a system-wide default for an entire Kafka cluster? > > > > > > 5009. Whenever a topic with tiered storage enabled is deleted, the > > > underlying actions require the topic data to be deleted in local store > as > > > well as remote store, and eventually the topic metadata needs to be > > deleted > > > too. What is the role of the controller in deleting a topic and it's > > > contents, while the topic has tiered storage enabled? > > > > > > 5010. RLMM APIs are currently synchronous, for example > > > RLMM.putRemoteLogSegmentData waits until the put operation is completed > > in > > > the remote metadata store. It may also block until the leader has > caught > > up > > > to the metadata (not sure). Could we make these apis asynchronous (ex: > > > based on java.util.concurrent.Future) to provide room for tapping > > > performance improvements such as non-blocking i/o? > > > > > > 5011. The same question as 5009 on sync vs async api for RSM. Have we > > > considered the pros/cons of making the RSM apis asynchronous? > > > > > > > > > Cheers, > > > Kowshik > > > > > > > > > On Thu, Aug 6, 2020 at 11:02 AM Satish Duggana < > satish.dugg...@gmail.com > > > > > > wrote: > > > > > > > Hi Jun, > > > > Thanks for your comments. > > > > > > > > > At the high level, that approach sounds reasonable to > > > > me. It would be useful to document how RLMM handles overlapping > > archived > > > > offset ranges and how those overlapping segments are deleted through > > > > retention. > > > > > > > > Sure, we will document that in the KIP. 
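To make the started/finished bookkeeping discussed for 5003/5004 concrete, here is a minimal sketch of what a per-segment state field could look like. The enum and its values are illustrative assumptions, not the KIP's final metadata format; the thread only says that copy and delete operations will each record a started and a finished state.

    public enum RemoteLogSegmentState {
        // Illustrative values only; the actual names are still being settled in the KIP.
        COPY_SEGMENT_STARTED,
        COPY_SEGMENT_FINISHED,
        DELETE_SEGMENT_STARTED,
        DELETE_SEGMENT_FINISHED;

        // A segment whose copy never reached the finished state has no confirmed upload,
        // so a cleaner can treat it as an orphan candidate and remove it from remote storage.
        public boolean isOrphanCandidate() {
            return this == COPY_SEGMENT_STARTED;
        }
    }

With a single state field, the orphan-cleanup question from 5003 reduces to scanning for segments stuck in a started state past some grace period, and the delete path from 5004 gets the same retry semantics without introducing a second topic.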
> > > > > > > > >How is the remaining part of the KIP coming along? To me, the two > > biggest > > > > missing items are (1) more detailed documentation on how all the new > > APIs > > > > are being used and (2) metadata format and usage in the internal > > > > topic __remote_log_metadata. > > > > > > > > We are working on updating APIs based on the recent discussions and > > > > get the perf numbers by plugging in rocksdb as a cache store for > RLMM. > > > > We will update the KIP with the updated APIs and with the above > > > > requested details in a few days and let you know. > > > > > > > > Thanks, > > > > Satish. > > > > > > > > > > > > > > > > > > > > On Wed, Aug 5, 2020 at 12:49 AM Jun Rao <j...@confluent.io> wrote: > > > > > > > > > > Hi, Ying, Satish, > > > > > > > > > > Thanks for the reply. At the high level, that approach sounds > > reasonable > > > > to > > > > > me. It would be useful to document how RLMM handles overlapping > > archived > > > > > offset ranges and how those overlapping segments are deleted > through > > > > > retention. > > > > > > > > > > How is the remaining part of the KIP coming along? To me, the two > > biggest > > > > > missing items are (1) more detailed documentation on how all the > new > > APIs > > > > > are being used and (2) metadata format and usage in the internal > > > > > topic __remote_log_metadata. > > > > > > > > > > Thanks, > > > > > > > > > > Jun > > > > > > > > > > On Tue, Aug 4, 2020 at 8:32 AM Satish Duggana < > > satish.dugg...@gmail.com> > > > > > wrote: > > > > > > > > > > > Hi Jun, > > > > > > Thanks for your comment, > > > > > > > > > > > > 1001. Using the new leader as the source of truth may be fine > too. > > > > What's > > > > > > not clear to me is when a follower takes over as the new leader, > > from > > > > which > > > > > > offset does it start archiving to the block storage. I assume > that > > the > > > > new > > > > > > leader starts from the latest archived ooffset by the previous > > leader, > > > > but > > > > > > it seems that's not the case. It would be useful to document this > > in > > > > the > > > > > > Wiki. > > > > > > > > > > > > When a follower becomes a leader it needs to findout the offset > > from > > > > > > which the segments to be copied to remote storage. This is found > by > > > > > > traversing from the the latest leader epoch from leader epoch > > history > > > > > > and find the highest offset of a segment with that epoch copied > > into > > > > > > remote storage by using respective RLMM APIs. If it can not find > an > > > > > > entry then it checks for the previous leader epoch till it finds > an > > > > > > entry, If there are no entries till the earliest leader epoch in > > > > > > leader epoch cache then it starts copying the segments from the > > > > > > earliest epoch entry’s offset. > > > > > > Added an example in the KIP here[1]. We will update RLMM APIs in > > the > > > > KIP. > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition > > > > > > > > > > > > Satish. > > > > > > > > > > > > > > > > > > On Tue, Aug 4, 2020 at 9:00 PM Satish Duggana < > > > > satish.dugg...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > > Hi Ying, > > > > > > > Thanks for your comment. > > > > > > > > > > > > > > 1001. Using the new leader as the source of truth may be fine > > too. 
> > > > What's > > > > > > > not clear to me is when a follower takes over as the new > leader, > > from > > > > > > which > > > > > > > offset does it start archiving to the block storage. I assume > > that > > > > the > > > > > > new > > > > > > > leader starts from the latest archived ooffset by the previous > > > > leader, > > > > > > but > > > > > > > it seems that's not the case. It would be useful to document > > this in > > > > the > > > > > > > Wiki. > > > > > > > > > > > > > > When a follower becomes a leader it needs to findout the offset > > from > > > > > > > which the segments to be copied to remote storage. This is > found > > by > > > > > > > traversing from the the latest leader epoch from leader epoch > > history > > > > > > > and find the highest offset of a segment with that epoch copied > > into > > > > > > > remote storage by using respective RLMM APIs. If it can not > find > > an > > > > > > > entry then it checks for the previous leader epoch till it > finds > > an > > > > > > > entry, If there are no entries till the earliest leader epoch > in > > > > > > > leader epoch cache then it starts copying the segments from the > > > > > > > earliest epoch entry’s offset. > > > > > > > Added an example in the KIP here[1]. We will update RLMM APIs > in > > the > > > > KIP. > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition > > > > > > > > > > > > > > > > > > > > > Satish. > > > > > > > > > > > > > > > > > > > > > On Tue, Aug 4, 2020 at 10:28 AM Ying Zheng > > <yi...@uber.com.invalid> > > > > > > wrote: > > > > > > > > > > > > > > > > Hi Jun, > > > > > > > > > > > > > > > > Thank you for the comment! The current KIP is not very clear > > about > > > > this > > > > > > > > part. > > > > > > > > > > > > > > > > 1001. The new leader will start archiving from the earliest > > local > > > > > > segment > > > > > > > > that is not fully > > > > > > > > covered by the "valid" remote data. "valid" means the > (offset, > > > > leader > > > > > > > > epoch) pair is valid > > > > > > > > based on the leader-epoch history. > > > > > > > > > > > > > > > > There are some edge cases where the same offset range (with > the > > > > same > > > > > > leader > > > > > > > > epoch) can > > > > > > > > be copied to the remote storage more than once. But this kind > > of > > > > > > > > duplication shouldn't be a > > > > > > > > problem. > > > > > > > > > > > > > > > > Staish is going to explain the details in the KIP with > > examples. > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Jul 31, 2020 at 2:55 PM Jun Rao <j...@confluent.io> > > wrote: > > > > > > > > > > > > > > > > > Hi, Ying, > > > > > > > > > > > > > > > > > > Thanks for the reply. > > > > > > > > > > > > > > > > > > 1001. Using the new leader as the source of truth may be > fine > > > > too. > > > > > > What's > > > > > > > > > not clear to me is when a follower takes over as the new > > leader, > > > > > > from which > > > > > > > > > offset does it start archiving to the block storage. I > assume > > > > that > > > > > > the new > > > > > > > > > leader starts from the latest archived ooffset by the > > previous > > > > > > leader, but > > > > > > > > > it seems that's not the case. It would be useful to > document > > > > this in > > > > > > the > > > > > > > > > wiki. 
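As an aside, the traversal described in the replies above (walk the leader epoch history from the latest epoch backwards, ask the RLMM for the highest offset already copied for that epoch, and fall back to the earliest epoch's start offset) can be sketched roughly as below. The lookup function stands in for an RLMM query whose exact name and signature are still being finalized in the KIP, so treat it as an assumption.

    import java.util.List;
    import java.util.OptionalLong;
    import java.util.function.Function;

    final class ArchiveStartOffset {
        // epochChain entries are {epoch, epochStartOffset}, ordered oldest to newest, as in
        // the leader epoch checkpoint. The lookup stands in for an RLMM query ("highest offset
        // copied to remote storage for this leader epoch"); that API is still evolving.
        static long startOffsetForNewLeader(List<long[]> epochChain,
                                            Function<Long, OptionalLong> highestTieredOffsetForEpoch) {
            // Walk leader epochs from the latest to the earliest until one has tiered data.
            for (int i = epochChain.size() - 1; i >= 0; i--) {
                OptionalLong highest = highestTieredOffsetForEpoch.apply(epochChain.get(i)[0]);
                if (highest.isPresent()) {
                    // Resume copying right after the last offset already in remote storage.
                    return highest.getAsLong() + 1;
                }
            }
            // No epoch has tiered data yet: start from the earliest epoch entry's start offset.
            return epochChain.isEmpty() ? 0L : epochChain.get(0)[1];
        }
    }

In other words, the new leader resumes archiving immediately after the last offset it can prove is already tiered, which is also why overlapping segments from the previous leader are tolerated rather than prevented.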
> > > > > > > > > > > > > > > > > > Jun > > > > > > > > > > > > > > > > > > On Tue, Jul 28, 2020 at 12:11 PM Ying Zheng > > > > <yi...@uber.com.invalid> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > 1001. > > > > > > > > > > > > > > > > > > > > We did consider this approach. The concerns are > > > > > > > > > > 1) This makes unclean-leader-election rely on remote > > storage. > > > > In > > > > > > case > > > > > > > > > the > > > > > > > > > > remote storage > > > > > > > > > > is unavailable, Kafka will not be able to finish the > > > > > > > > > > unclean-leader-election. > > > > > > > > > > 2) Since the user set local retention time (or local > > retention > > > > > > bytes), I > > > > > > > > > > think we are expected to > > > > > > > > > > keep that much local data when possible (avoid truncating > > all > > > > the > > > > > > local > > > > > > > > > > data). But, as you said, > > > > > > > > > > unclean leader elections are very rare, this may not be a > > big > > > > > > problem. > > > > > > > > > > > > > > > > > > > > The current design uses the leader broker as > > source-of-truth. > > > > This > > > > > > is > > > > > > > > > > consistent with the > > > > > > > > > > existing Kafka behavior. > > > > > > > > > > > > > > > > > > > > By using remote storage as the source-of-truth, the > > follower > > > > logic > > > > > > can > > > > > > > > > be a > > > > > > > > > > little simpler, > > > > > > > > > > but the leader logic is going to be more complex. > Overall, > > I > > > > don't > > > > > > see > > > > > > > > > > there many benefits > > > > > > > > > > of using remote storage as the source-of-truth. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 28, 2020 at 10:25 AM Jun Rao < > j...@confluent.io > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Hi, Satish, > > > > > > > > > > > > > > > > > > > > > > Thanks for the reply. > > > > > > > > > > > > > > > > > > > > > > 1001. In your example, I was thinking that you could > just > > > > > > download the > > > > > > > > > > > latest leader epoch from the object store. After that > you > > > > know > > > > > > the > > > > > > > > > leader > > > > > > > > > > > should end with offset 1100. The leader will delete all > > its > > > > > > local data > > > > > > > > > > > before offset 1000 and start accepting new messages at > > offset > > > > > > 1100. > > > > > > > > > > > Consumer requests for messages before offset 1100 will > be > > > > served > > > > > > from > > > > > > > > > the > > > > > > > > > > > object store. The benefit with this approach is that > it's > > > > > > simpler to > > > > > > > > > > reason > > > > > > > > > > > about who is the source of truth. The downside is > > slightly > > > > > > increased > > > > > > > > > > > unavailability window during unclean leader election. > > Since > > > > > > unclean > > > > > > > > > > leader > > > > > > > > > > > elections are rare, I am not sure if this is a big > > concern. > > > > > > > > > > > > > > > > > > > > > > 1008. Yes, I think introducing sth like > > local.retention.ms > > > > > > seems more > > > > > > > > > > > consistent. > > > > > > > > > > > > > > > > > > > > > > Jun > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 28, 2020 at 2:30 AM Satish Duggana < > > > > > > > > > satish.dugg...@gmail.com > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > HI Jun, > > > > > > > > > > > > Thanks for your comments. 
We put our inline replies > > below. > > > > > > > > > > > > > > > > > > > > > > > > 1001. I was thinking that you could just use the > tiered > > > > > > metadata to > > > > > > > > > do > > > > > > > > > > > the > > > > > > > > > > > > reconciliation. The tiered metadata contains offset > > ranges > > > > and > > > > > > epoch > > > > > > > > > > > > history. Those should be enough for reconciliation > > > > purposes. > > > > > > > > > > > > > > > > > > > > > > > > If we use remote storage as the source-of-truth > during > > > > > > > > > > > > unclean-leader-election, it's possible that after > > > > > > reconciliation the > > > > > > > > > > > > remote storage will have more recent data than the > new > > > > > > leader's local > > > > > > > > > > > > storage. For example, the new leader's latest message > > is > > > > > > offset 1000, > > > > > > > > > > > > while the remote storage has message 1100. In such a > > case, > > > > the > > > > > > new > > > > > > > > > > > > leader will have to download the messages from 1001 > to > > > > 1100, > > > > > > before > > > > > > > > > > > > accepting new messages from producers. Otherwise, > there > > > > would > > > > > > be a > > > > > > > > > gap > > > > > > > > > > > > in the local data between 1000 and 1101. > > > > > > > > > > > > > > > > > > > > > > > > Moreover, with the current design, leader epoch > > history is > > > > > > stored in > > > > > > > > > > > > remote storage, rather than the metadata topic. We > did > > > > consider > > > > > > > > > saving > > > > > > > > > > > > epoch history in remote segment metadata. But the > > concern > > > > is > > > > > > that > > > > > > > > > > > > there is currently no limit for the epoch history > size. > > > > > > > > > Theoretically, > > > > > > > > > > > > if a user has a very long remote retention time and > > there > > > > are > > > > > > very > > > > > > > > > > > > frequent leadership changes, the leader epoch history > > can > > > > > > become too > > > > > > > > > > > > long to fit into a regular Kafka message. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1003.3 Having just a serverEndpoint string is > probably > > not > > > > > > enough. > > > > > > > > > > > > Connecting to a Kafka cluster may need various > security > > > > > > credentials. > > > > > > > > > We > > > > > > > > > > > can > > > > > > > > > > > > make RLMM configurable and pass in the properties > > through > > > > the > > > > > > > > > > configure() > > > > > > > > > > > > method. Ditto for RSM. > > > > > > > > > > > > > > > > > > > > > > > > RLMM and RSM are already configurable and they take > > > > > > properties which > > > > > > > > > > > > start with "remote.log.metadata." and > > "remote.log.storage." > > > > > > > > > > > > respectively and a few others. We have listener-name > > as the > > > > > > config > > > > > > > > > for > > > > > > > > > > > > RLMM and other properties(like security) can be sent > > as you > > > > > > > > > suggested. > > > > > > > > > > > > We will update the KIP with the details. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1008.1 We started with log.retention.hours and > > > > > > log.retention.minutes, > > > > > > > > > > and > > > > > > > > > > > > added log.retention.ms later. If we are adding a new > > > > > > configuration, > > > > > > > > > ms > > > > > > > > > > > > level config alone is enough and is simpler. 
We can > > build > > > > > > tools to > > > > > > > > > make > > > > > > > > > > > the > > > > > > > > > > > > configuration at different granularities easier. The > > > > > > definition of > > > > > > > > > > > > log.retention.ms is "The number of milliseconds to > > keep a > > > > log > > > > > > file > > > > > > > > > > > before > > > > > > > > > > > > deleting it". The deletion is independent of whether > > > > tiering is > > > > > > > > > enabled > > > > > > > > > > > or > > > > > > > > > > > > not. If this changes to just the local portion of the > > > > data, we > > > > > > are > > > > > > > > > > > changing > > > > > > > > > > > > the meaning of an existing configuration. > > > > > > > > > > > > > > > > > > > > > > > > We are fine with either way. We can go with > > > > log.retention.xxxx > > > > > > as the > > > > > > > > > > > > effective log retention instead of local log > retention. > > > > With > > > > > > this > > > > > > > > > > > > convention, we need to introduce local.log.retention > > > > instead > > > > > > of > > > > > > > > > > > > remote.log.retention.ms that we proposed. If > > > > log.retention.ms > > > > > > as -1 > > > > > > > > > > > > then remote retention is also considered as unlimited > > but > > > > user > > > > > > should > > > > > > > > > > > > be able to set the local.retention.ms. > > > > > > > > > > > > So, we need to introduce local.log.retention.ms and > > > > > > > > > > > > local.log.retention.bytes which should always be <= > > > > > > > > > > > > log.retention.ms/bytes respectively. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Jul 24, 2020 at 3:37 AM Jun Rao < > > j...@confluent.io> > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, Satish, > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the reply. A few quick comments below. > > > > > > > > > > > > > > > > > > > > > > > > > > 1001. I was thinking that you could just use the > > tiered > > > > > > metadata to > > > > > > > > > > do > > > > > > > > > > > > the > > > > > > > > > > > > > reconciliation. The tiered metadata contains offset > > > > ranges > > > > > > and > > > > > > > > > epoch > > > > > > > > > > > > > history. Those should be enough for reconciliation > > > > purposes. > > > > > > > > > > > > > > > > > > > > > > > > > > 1003.3 Having just a serverEndpoint string is > > probably > > > > not > > > > > > enough. > > > > > > > > > > > > > Connecting to a Kafka cluster may need various > > security > > > > > > > > > credentials. > > > > > > > > > > We > > > > > > > > > > > > can > > > > > > > > > > > > > make RLMM configurable and pass in the properties > > > > through the > > > > > > > > > > > configure() > > > > > > > > > > > > > method. Ditto for RSM. > > > > > > > > > > > > > > > > > > > > > > > > > > 1008.1 We started with log.retention.hours and > > > > > > > > > log.retention.minutes, > > > > > > > > > > > and > > > > > > > > > > > > > added log.retention.ms later. If we are adding a > new > > > > > > > > > configuration, > > > > > > > > > > ms > > > > > > > > > > > > > level config alone is enough and is simpler. We can > > build > > > > > > tools to > > > > > > > > > > make > > > > > > > > > > > > the > > > > > > > > > > > > > configuration at different granularities easier. 
> The > > > > > > definition of > > > > > > > > > > > > > log.retention.ms is "The number of milliseconds to > > keep > > > > a > > > > > > log file > > > > > > > > > > > > before > > > > > > > > > > > > > deleting it". The deletion is independent of > whether > > > > tiering > > > > > > is > > > > > > > > > > enabled > > > > > > > > > > > > or > > > > > > > > > > > > > not. If this changes to just the local portion of > the > > > > data, > > > > > > we are > > > > > > > > > > > > changing > > > > > > > > > > > > > the meaning of an existing configuration. > > > > > > > > > > > > > > > > > > > > > > > > > > Jun > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Jul 23, 2020 at 11:04 AM Satish Duggana < > > > > > > > > > > > > satish.dugg...@gmail.com> > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Jun, > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you for the comments! Ying, Harsha and I > > > > discussed > > > > > > and put > > > > > > > > > > our > > > > > > > > > > > > > > comments below. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1001. The KIP described a few scenarios of > unclean > > > > leader > > > > > > > > > > elections. > > > > > > > > > > > > This > > > > > > > > > > > > > > is very useful, but I am wondering if this is the > > best > > > > > > approach. > > > > > > > > > My > > > > > > > > > > > > > > understanding of the proposed approach is to > allow > > the > > > > new > > > > > > > > > > (unclean) > > > > > > > > > > > > leader > > > > > > > > > > > > > > to take new messages immediately. While this > > increases > > > > > > > > > > availability, > > > > > > > > > > > it > > > > > > > > > > > > > > creates the problem that there could be multiple > > > > > > conflicting > > > > > > > > > > segments > > > > > > > > > > > > in > > > > > > > > > > > > > > the remote store for the same offset range. This > > seems > > > > to > > > > > > make it > > > > > > > > > > > > harder > > > > > > > > > > > > > > for RLMM to determine which archived log segments > > > > contain > > > > > > the > > > > > > > > > > correct > > > > > > > > > > > > data. > > > > > > > > > > > > > > For example, an archived log segment could at one > > time > > > > be > > > > > > the > > > > > > > > > > correct > > > > > > > > > > > > data, > > > > > > > > > > > > > > but be changed to incorrect data after an unclean > > > > leader > > > > > > > > > election. > > > > > > > > > > An > > > > > > > > > > > > > > alternative approach is to let the unclean leader > > use > > > > the > > > > > > > > > archived > > > > > > > > > > > > data as > > > > > > > > > > > > > > the source of truth. So, when the new (unclean) > > leader > > > > > > takes > > > > > > > > > over, > > > > > > > > > > it > > > > > > > > > > > > first > > > > > > > > > > > > > > reconciles the local data based on the archived > > data > > > > before > > > > > > > > > taking > > > > > > > > > > > new > > > > > > > > > > > > > > messages. This makes the job of RLMM a bit easier > > > > since all > > > > > > > > > > archived > > > > > > > > > > > > data > > > > > > > > > > > > > > are considered correct. This increases > > availability a > > > > bit. > > > > > > > > > However, > > > > > > > > > > > > since > > > > > > > > > > > > > > unclean leader elections are rare, this may be > ok. 
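For readers following the "valid (offset, leader epoch) pair" idea mentioned earlier in the thread, a rough sketch of that check is below: a tiered segment is trusted for an offset only if the segment's epoch still owns that offset range in the current leader epoch history. The class and method names are illustrative, not part of the proposed API.

    import java.util.NavigableMap;
    import java.util.TreeMap;

    final class EpochValidity {
        // epochChain maps leader epoch -> start offset of that epoch (the leader epoch history).
        static boolean isValid(NavigableMap<Integer, Long> epochChain, int segmentEpoch, long offset) {
            Long epochStart = epochChain.get(segmentEpoch);
            if (epochStart == null || offset < epochStart) {
                return false; // epoch unknown to this history, or offset precedes the epoch's start
            }
            Integer nextEpoch = epochChain.higherKey(segmentEpoch);
            long epochEnd = (nextEpoch == null) ? Long.MAX_VALUE : epochChain.get(nextEpoch);
            return offset < epochEnd; // the offset must fall before the next epoch begins
        }

        public static void main(String[] args) {
            NavigableMap<Integer, Long> chain = new TreeMap<>();
            chain.put(0, 0L);
            chain.put(2, 500L); // epoch 1 was lost after an unclean leader election
            System.out.println(isValid(chain, 1, 450)); // false: epoch 1 is not in the current history
            System.out.println(isValid(chain, 0, 450)); // true: (450, epoch 0) matches the history
        }
    }

Roughly speaking, this is why duplicated or conflicting segments stay harmless: any tiered segment whose epoch no longer appears in the history is simply not treated as valid data.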
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Firstly, We don't want to assume the remote > > storage is > > > > more > > > > > > > > > > reliable > > > > > > > > > > > > than > > > > > > > > > > > > > > Kafka. Kafka unclean leader election usually > > happens > > > > when > > > > > > there > > > > > > > > > is > > > > > > > > > > a > > > > > > > > > > > > large > > > > > > > > > > > > > > scale outage that impacts multiple racks (or even > > > > multiple > > > > > > > > > > > availability > > > > > > > > > > > > > > zones). In such a case, the remote storage may be > > > > > > unavailable or > > > > > > > > > > > > unstable. > > > > > > > > > > > > > > Pulling a large amount of data from the remote > > storage > > > > to > > > > > > > > > reconcile > > > > > > > > > > > the > > > > > > > > > > > > > > local data may also exacerbate the outage. With > the > > > > current > > > > > > > > > design, > > > > > > > > > > > > the new > > > > > > > > > > > > > > leader can start working even when the remote > > storage > > > > is > > > > > > > > > > temporarily > > > > > > > > > > > > > > unavailable. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Secondly, it is not easier to implement the > > reconciling > > > > > > logic at > > > > > > > > > > the > > > > > > > > > > > > leader > > > > > > > > > > > > > > side. It can take a long time for the new leader > to > > > > > > download the > > > > > > > > > > > remote > > > > > > > > > > > > > > data and rebuild local producer id / leader epoch > > > > > > information. > > > > > > > > > > During > > > > > > > > > > > > this > > > > > > > > > > > > > > period, the leader cannot accept any requests > from > > the > > > > > > clients > > > > > > > > > and > > > > > > > > > > > > > > followers. We have to introduce a new state for > the > > > > > > leader, and a > > > > > > > > > > new > > > > > > > > > > > > error > > > > > > > > > > > > > > code to let the clients / followers know what is > > > > happening. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1002. RemoteStorageManager. > > > > > > > > > > > > > > 1002.1 There seems to be some inconsistencies in > > > > > > > > > > > RemoteStorageManager. > > > > > > > > > > > > We > > > > > > > > > > > > > > pass in RemoteLogSegmentId copyLogSegment(). For > > all > > > > other > > > > > > > > > methods, > > > > > > > > > > > we > > > > > > > > > > > > pass > > > > > > > > > > > > > > in RemoteLogSegmentMetadata. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Nice catch, we can have the > > RemoteLogSegmentMetadata > > > > for > > > > > > > > > > > copyLogSegment > > > > > > > > > > > > > > too. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1002.2 Is endOffset in RemoteLogSegmentMetadata > > > > inclusive > > > > > > or > > > > > > > > > > > exclusive? > > > > > > > > > > > > > > > > > > > > > > > > > > > > It is inclusive. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1002.3 It seems that we need an api to get the > > > > leaderEpoch > > > > > > > > > history > > > > > > > > > > > for > > > > > > > > > > > > a > > > > > > > > > > > > > > partition. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Yes, updated the KIP with the new method. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1002.4 Could you define the type of > > > > > > RemoteLogSegmentContext? 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > This is removed in the latest code and it is not > > > > needed. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1003 RemoteLogMetadataManager > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1003.1 I am not sure why we need both of the > > following > > > > > > methods > > > > > > > > > > > > > > in RemoteLogMetadataManager. Could we combine > them > > into > > > > > > one that > > > > > > > > > > > takes > > > > > > > > > > > > in > > > > > > > > > > > > > > offset and returns RemoteLogSegmentMetadata? > > > > > > > > > > > > > > RemoteLogSegmentId > > > > getRemoteLogSegmentId(TopicPartition > > > > > > > > > > > > topicPartition, > > > > > > > > > > > > > > long offset) throws IOException; > > > > > > > > > > > > > > RemoteLogSegmentMetadata > > > > > > > > > > > > getRemoteLogSegmentMetadata(RemoteLogSegmentId > > > > > > > > > > > > > > remoteLogSegmentId) throws IOException; > > > > > > > > > > > > > > > > > > > > > > > > > > > > Good point, these can be merged for now. I guess > we > > > > needed > > > > > > them > > > > > > > > > in > > > > > > > > > > > > earlier > > > > > > > > > > > > > > version of the implementation but it is not > needed > > now. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1003.2 There seems to be some inconsistencies in > > the > > > > > > methods > > > > > > > > > > below. I > > > > > > > > > > > > am > > > > > > > > > > > > > > not sure why one takes RemoteLogSegmentMetadata > > and the > > > > > > other > > > > > > > > > > > > > > takes RemoteLogSegmentId. > > > > > > > > > > > > > > void > > > > putRemoteLogSegmentData(RemoteLogSegmentMetadata > > > > > > > > > > > > > > remoteLogSegmentMetadata) throws IOException; > > > > > > > > > > > > > > void > > > > deleteRemoteLogSegmentMetadata(RemoteLogSegmentId > > > > > > > > > > > > > > remoteLogSegmentId) throws IOException; > > > > > > > > > > > > > > > > > > > > > > > > > > > > RLMM stores RemoteLogSegmentMetadata which is > > > > identified by > > > > > > > > > > > > > > RemoteLogsSegmentId. So, when it is added it > takes > > > > > > > > > > > > > > RemoteLogSegmentMetadata. `delete` operation > needs > > only > > > > > > > > > > > > RemoteLogsSegmentId > > > > > > > > > > > > > > as RemoteLogSegmentMetadata can be identified > with > > > > > > > > > > > RemoteLogsSegmentId. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1003.3 In void onServerStarted(final String > > > > > > serverEndpoint), what > > > > > > > > > > > > > > is serverEndpoint used for? > > > > > > > > > > > > > > > > > > > > > > > > > > > > This can be used by RLMM implementation to > connect > > to > > > > the > > > > > > local > > > > > > > > > > Kafka > > > > > > > > > > > > > > cluster. Incase of default implementation, it is > > used > > > > in > > > > > > > > > > > initializing > > > > > > > > > > > > > > kafka clients connecting to the local cluster. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1004. It would be useful to document how all the > > new > > > > APIs > > > > > > are > > > > > > > > > being > > > > > > > > > > > > used. > > > > > > > > > > > > > > For example, when is > > > > > > RemoteLogSegmentMetadata.markedForDeletion > > > > > > > > > > being > > > > > > > > > > > > set > > > > > > > > > > > > > > and used? How are > > > > > > > > > > > > > > > > > > > > RemoteLogMetadataManager.earliestLogOffset/highestLogOffset being > > > > > > > > > > > used? 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > RLMM APIs are going through the changes and they > > > > should be > > > > > > ready > > > > > > > > > > in a > > > > > > > > > > > > few > > > > > > > > > > > > > > days. I will update the KIP and the mail thread > > once > > > > they > > > > > > are > > > > > > > > > > ready. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1005. Handling partition deletion: The KIP says > > "RLMM > > > > will > > > > > > > > > > eventually > > > > > > > > > > > > > > delete these segments by using > > RemoteStorageManager." > > > > Which > > > > > > > > > replica > > > > > > > > > > > > does > > > > > > > > > > > > > > this logic? > > > > > > > > > > > > > > > > > > > > > > > > > > > > This is a good point. When a topic is deleted, it > > will > > > > not > > > > > > have > > > > > > > > > any > > > > > > > > > > > > > > leader/followers to do the cleanup. We will have > a > > > > cleaner > > > > > > agent > > > > > > > > > > on a > > > > > > > > > > > > > > single broker in the cluster to do this cleanup, > we > > > > plan > > > > > > to add > > > > > > > > > > that > > > > > > > > > > > in > > > > > > > > > > > > > > controller broker. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1006. "If there are any failures in removing > > remote log > > > > > > segments > > > > > > > > > > then > > > > > > > > > > > > those > > > > > > > > > > > > > > are stored in a specific topic (default as > > > > > > > > > > > > __remote_segments_to_be_deleted) > > > > > > > > > > > > > > and user can consume the events(which contain > > > > > > > > > > remote-log-segment-id) > > > > > > > > > > > > from > > > > > > > > > > > > > > that topic and clean them up from remote storage. > > " > > > > Not > > > > > > sure if > > > > > > > > > > it's > > > > > > > > > > > > worth > > > > > > > > > > > > > > the complexity of adding another topic. Could we > > just > > > > > > retry? > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sure, we can keep this simpler for now by logging > > an > > > > error > > > > > > after > > > > > > > > > > > > retries. > > > > > > > > > > > > > > We can give users a better way to process this in > > > > future. > > > > > > Oneway > > > > > > > > > > can > > > > > > > > > > > > be a > > > > > > > > > > > > > > dead letter topic which can be configured by the > > user. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1007. RemoteFetchPurgatory: Could we just reuse > the > > > > > > existing > > > > > > > > > > > > > > fetchPurgatory? > > > > > > > > > > > > > > > > > > > > > > > > > > > > We have 2 types of delayed operations waiting > for 2 > > > > > > different > > > > > > > > > > events. > > > > > > > > > > > > > > DelayedFetch waits for new messages from > producers. > > > > > > > > > > > DelayedRemoteFetch > > > > > > > > > > > > > > waits for the remote-storage-read-task to finish. > > When > > > > > > either of > > > > > > > > > > the > > > > > > > > > > > 2 > > > > > > > > > > > > > > events happens, we only want to notify one type > of > > the > > > > > > delayed > > > > > > > > > > > > operations. > > > > > > > > > > > > > > It would be inefficient to put 2 types of delayed > > > > > > operations in > > > > > > > > > one > > > > > > > > > > > > > > purgatory, as the tryComplete() methods of the > > delayed > > > > > > operations > > > > > > > > > > can > > > > > > > > > > > > be > > > > > > > > > > > > > > triggered by irrelevant events. 
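Since 1003.3 came up again here, a rough sketch of how a pluggable RLMM might pick up its settings follows, using the two hooks mentioned in the thread: broker properties prefixed with "remote.log.metadata." and the serverEndpoint passed at startup. The class shape and field names are assumptions for illustration, not the KIP's interface.

    import java.util.HashMap;
    import java.util.Map;

    public class DefaultRlmmConfigSketch {
        private static final String PREFIX = "remote.log.metadata.";
        private final Map<String, Object> rlmmProps = new HashMap<>();

        // Collect only the RLMM-scoped properties from the broker configuration.
        public void configure(Map<String, ?> brokerConfigs) {
            brokerConfigs.forEach((key, value) -> {
                if (key.startsWith(PREFIX)) {
                    rlmmProps.put(key.substring(PREFIX.length()), value);
                }
            });
        }

        // Analogous to onServerStarted(serverEndpoint): the endpoint is what the default
        // implementation would use to create clients for the __remote_log_metadata topic
        // on the local cluster.
        public void onServerStarted(String serverEndpoint) {
            rlmmProps.put("bootstrap.servers", serverEndpoint);
        }
    }

A custom RLMM backed by an external store would read its own connection settings from the same prefixed properties and could ignore the endpoint entirely.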
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1008. Configurations: > > > > > > > > > > > > > > 1008.1 remote.log.retention.ms, > > > > > > remote.log.retention.minutes, > > > > > > > > > > > > > > remote.log.retention.hours: It seems that we just > > need > > > > the > > > > > > ms > > > > > > > > > one. > > > > > > > > > > > > Also, > > > > > > > > > > > > > > are we changing the meaning of existing config > > > > > > log.retention.ms > > > > > > > > > to > > > > > > > > > > > > mean > > > > > > > > > > > > > > the > > > > > > > > > > > > > > local retention? For backward compatibility, it's > > > > better > > > > > > to not > > > > > > > > > > > change > > > > > > > > > > > > the > > > > > > > > > > > > > > meaning of existing configurations. > > > > > > > > > > > > > > > > > > > > > > > > > > > > We agree that we only need > remote.log.retention.ms > > . > > > > But, > > > > > > the > > > > > > > > > > > existing > > > > > > > > > > > > > > Kafka > > > > > > > > > > > > > > configuration > > > > > > > > > > > > > > has 3 properties (log.retention.ms, > > > > log.retention.minutes, > > > > > > > > > > > > > > log.retention.hours). We just > > > > > > > > > > > > > > want to keep consistent with the existing > > properties. > > > > > > > > > > > > > > Existing log.retention.xxxx config is about log > > > > retention > > > > > > in > > > > > > > > > > broker’s > > > > > > > > > > > > > > storage which is local. It should be easy for > > users to > > > > > > configure > > > > > > > > > > > > partition > > > > > > > > > > > > > > storage with local retention and remote retention > > > > based on > > > > > > their > > > > > > > > > > > usage. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1008.2 Should remote.log.storage.enable be at the > > topic > > > > > > level? > > > > > > > > > > > > > > > > > > > > > > > > > > > > We can introduce topic level config for the same > > > > remote.log > > > > > > > > > > settings. > > > > > > > > > > > > User > > > > > > > > > > > > > > can set the desired config while creating the > > topic. > > > > > > > > > > > > > > remote.log.storage.enable property is not allowed > > to be > > > > > > updated > > > > > > > > > > after > > > > > > > > > > > > the > > > > > > > > > > > > > > topic is created. Other remote.log.* properties > > can be > > > > > > modified. > > > > > > > > > We > > > > > > > > > > > > will > > > > > > > > > > > > > > support flipping remote.log.storage.enable in > next > > > > > > versions. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1009. It would be useful to list all limitations > > in a > > > > > > separate > > > > > > > > > > > section: > > > > > > > > > > > > > > compacted topic, JBOD, etc. Also, is changing a > > topic > > > > from > > > > > > delete > > > > > > > > > > to > > > > > > > > > > > > > > compact and vice versa allowed when tiering is > > enabled? > > > > > > > > > > > > > > > > > > > > > > > > > > > > +1 to have limitations in a separate section. We > > will > > > > > > update the > > > > > > > > > > KIP > > > > > > > > > > > > with > > > > > > > > > > > > > > that. > > > > > > > > > > > > > > Topic created with effective value for > > > > remote.log.enabled > > > > > > as > > > > > > > > > true, > > > > > > > > > > > > can not > > > > > > > > > > > > > > change its retention policy from delete to > compact. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1010. Thanks for performance numbers. 
Are those > > with > > > > > > RocksDB as > > > > > > > > > the > > > > > > > > > > > > cache? > > > > > > > > > > > > > > > > > > > > > > > > > > > > No, We have not yet added RocksDB support. This > is > > > > based on > > > > > > > > > > in-memory > > > > > > > > > > > > map > > > > > > > > > > > > > > representation. We will add that support and > update > > > > this > > > > > > thread > > > > > > > > > > after > > > > > > > > > > > > > > updating the KIP with the numbers. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > Satish. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 21, 2020 at 6:49 AM Jun Rao < > > > > j...@confluent.io> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, Satish, Ying, Harsha, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the updated KIP. A few more comments > > > > below. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1000. Regarding Colin's question on querying > the > > > > metadata > > > > > > > > > > directly > > > > > > > > > > > > in the > > > > > > > > > > > > > > > remote block store. One issue is that not all > > block > > > > > > stores > > > > > > > > > offer > > > > > > > > > > > the > > > > > > > > > > > > > > needed > > > > > > > > > > > > > > > api to query the metadata. For example, S3 only > > > > offers > > > > > > an api > > > > > > > > > to > > > > > > > > > > > list > > > > > > > > > > > > > > > objects under a prefix and this api has the > > eventual > > > > > > > > > consistency > > > > > > > > > > > > > > semantic. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1001. The KIP described a few scenarios of > > unclean > > > > leader > > > > > > > > > > > elections. > > > > > > > > > > > > This > > > > > > > > > > > > > > > is very useful, but I am wondering if this is > the > > > > best > > > > > > > > > approach. > > > > > > > > > > My > > > > > > > > > > > > > > > understanding of the proposed approach is to > > allow > > > > the > > > > > > new > > > > > > > > > > > (unclean) > > > > > > > > > > > > > > leader > > > > > > > > > > > > > > > to take new messages immediately. While this > > > > increases > > > > > > > > > > > availability, > > > > > > > > > > > > it > > > > > > > > > > > > > > > creates the problem that there could be > multiple > > > > > > conflicting > > > > > > > > > > > > segments in > > > > > > > > > > > > > > > the remote store for the same offset range. > This > > > > seems > > > > > > to make > > > > > > > > > it > > > > > > > > > > > > harder > > > > > > > > > > > > > > > for RLMM to determine which archived log > segments > > > > > > contain the > > > > > > > > > > > correct > > > > > > > > > > > > > > data. > > > > > > > > > > > > > > > For example, an archived log segment could at > one > > > > time > > > > > > be the > > > > > > > > > > > correct > > > > > > > > > > > > > > data, > > > > > > > > > > > > > > > but be changed to incorrect data after an > unclean > > > > leader > > > > > > > > > > election. > > > > > > > > > > > An > > > > > > > > > > > > > > > alternative approach is to let the unclean > leader > > > > use the > > > > > > > > > > archived > > > > > > > > > > > > data > > > > > > > > > > > > > > as > > > > > > > > > > > > > > > the source of truth. 
So, when the new (unclean) > > > > leader > > > > > > takes > > > > > > > > > > over, > > > > > > > > > > > it > > > > > > > > > > > > > > first > > > > > > > > > > > > > > > reconciles the local data based on the archived > > data > > > > > > before > > > > > > > > > > taking > > > > > > > > > > > > new > > > > > > > > > > > > > > > messages. This makes the job of RLMM a bit > easier > > > > since > > > > > > all > > > > > > > > > > > archived > > > > > > > > > > > > data > > > > > > > > > > > > > > > are considered correct. This increases > > availability a > > > > > > bit. > > > > > > > > > > However, > > > > > > > > > > > > since > > > > > > > > > > > > > > > unclean leader elections are rare, this may be > > ok. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1002. RemoteStorageManager. > > > > > > > > > > > > > > > 1002.1 There seems to be some inconsistencies > in > > > > > > > > > > > > RemoteStorageManager. We > > > > > > > > > > > > > > > pass in RemoteLogSegmentId copyLogSegment(). > For > > all > > > > > > other > > > > > > > > > > methods, > > > > > > > > > > > > we > > > > > > > > > > > > > > pass > > > > > > > > > > > > > > > in RemoteLogSegmentMetadata. > > > > > > > > > > > > > > > 1002.2 Is endOffset in RemoteLogSegmentMetadata > > > > > > inclusive or > > > > > > > > > > > > exclusive? > > > > > > > > > > > > > > > 1002.3 It seems that we need an api to get the > > > > > > leaderEpoch > > > > > > > > > > history > > > > > > > > > > > > for a > > > > > > > > > > > > > > > partition. > > > > > > > > > > > > > > > 1002.4 Could you define the type of > > > > > > RemoteLogSegmentContext? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1003 RemoteLogMetadataManager > > > > > > > > > > > > > > > 1003.1 I am not sure why we need both of the > > > > following > > > > > > methods > > > > > > > > > > > > > > > in RemoteLogMetadataManager. Could we combine > > them > > > > into > > > > > > one > > > > > > > > > that > > > > > > > > > > > > takes in > > > > > > > > > > > > > > > offset and returns RemoteLogSegmentMetadata? > > > > > > > > > > > > > > > RemoteLogSegmentId > > > > > > getRemoteLogSegmentId(TopicPartition > > > > > > > > > > > > > > topicPartition, > > > > > > > > > > > > > > > long offset) throws IOException; > > > > > > > > > > > > > > > RemoteLogSegmentMetadata > > > > > > > > > > > > > > getRemoteLogSegmentMetadata(RemoteLogSegmentId > > > > > > > > > > > > > > > remoteLogSegmentId) throws IOException; > > > > > > > > > > > > > > > 1003.2 There seems to be some inconsistencies > in > > the > > > > > > methods > > > > > > > > > > below. > > > > > > > > > > > > I am > > > > > > > > > > > > > > > not sure why one takes RemoteLogSegmentMetadata > > and > > > > the > > > > > > other > > > > > > > > > > > > > > > takes RemoteLogSegmentId. > > > > > > > > > > > > > > > void > > > > putRemoteLogSegmentData(RemoteLogSegmentMetadata > > > > > > > > > > > > > > > remoteLogSegmentMetadata) throws IOException; > > > > > > > > > > > > > > > void > > > > > > deleteRemoteLogSegmentMetadata(RemoteLogSegmentId > > > > > > > > > > > > > > > remoteLogSegmentId) throws IOException; > > > > > > > > > > > > > > > 1003.3 In void onServerStarted(final String > > > > > > serverEndpoint), > > > > > > > > > what > > > > > > > > > > > > > > > is serverEndpoint used for? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1004. It would be useful to document how all > the > > new > > > > > > APIs are > > > > > > > > > > being > > > > > > > > > > > > used. 
> > > > > > > > > > > > > > > For example, when is > > > > > > RemoteLogSegmentMetadata.markedForDeletion > > > > > > > > > > > > being set > > > > > > > > > > > > > > > and used? How are > > > > > > > > > > > > > > > > > > > > > RemoteLogMetadataManager.earliestLogOffset/highestLogOffset > > > > > > > > > being > > > > > > > > > > > > used? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1005. Handling partition deletion: The KIP says > > "RLMM > > > > > > will > > > > > > > > > > > eventually > > > > > > > > > > > > > > > delete these segments by using > > RemoteStorageManager." > > > > > > Which > > > > > > > > > > replica > > > > > > > > > > > > does > > > > > > > > > > > > > > > this logic? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1006. "If there are any failures in removing > > remote > > > > log > > > > > > > > > segments > > > > > > > > > > > then > > > > > > > > > > > > > > those > > > > > > > > > > > > > > > are stored in a specific topic (default as > > > > > > > > > > > > > > __remote_segments_to_be_deleted) > > > > > > > > > > > > > > > and user can consume the events(which contain > > > > > > > > > > > remote-log-segment-id) > > > > > > > > > > > > from > > > > > > > > > > > > > > > that topic and clean them up from remote > > storage. " > > > > Not > > > > > > sure > > > > > > > > > if > > > > > > > > > > > it's > > > > > > > > > > > > > > worth > > > > > > > > > > > > > > > the complexity of adding another topic. Could > we > > just > > > > > > retry? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1007. RemoteFetchPurgatory: Could we just reuse > > the > > > > > > existing > > > > > > > > > > > > > > > fetchPurgatory? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1008. Configurations: > > > > > > > > > > > > > > > 1008.1 remote.log.retention.ms, > > > > > > remote.log.retention.minutes, > > > > > > > > > > > > > > > remote.log.retention.hours: It seems that we > just > > > > need > > > > > > the ms > > > > > > > > > > one. > > > > > > > > > > > > Also, > > > > > > > > > > > > > > > are we changing the meaning of existing config > > > > > > > > > log.retention.ms > > > > > > > > > > to > > > > > > > > > > > > mean > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > local retention? For backward compatibility, > it's > > > > better > > > > > > to not > > > > > > > > > > > > change > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > meaning of existing configurations. > > > > > > > > > > > > > > > 1008.2 Should remote.log.storage.enable be at > the > > > > topic > > > > > > level? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1009. It would be useful to list all > limitations > > in a > > > > > > separate > > > > > > > > > > > > section: > > > > > > > > > > > > > > > compacted topic, JBOD, etc. Also, is changing a > > topic > > > > > > from > > > > > > > > > delete > > > > > > > > > > > to > > > > > > > > > > > > > > > compact and vice versa allowed when tiering is > > > > enabled? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1010. Thanks for performance numbers. Are those > > with > > > > > > RocksDB as > > > > > > > > > > the > > > > > > > > > > > > > > cache? 
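On 1008.2 / 5008, a sketch of what per-topic enablement could look like from the Admin client is below. remote.log.storage.enable and the local retention setting are the names used in this thread but are still under discussion, so treat the exact keys as provisional; retention.ms is the existing topic config, and the topic name and sizing are just placeholders.

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.NewTopic;

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class CreateTieredTopicSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                NewTopic topic = new NewTopic("payments", 6, (short) 3)
                        .configs(Map.of(
                                "remote.log.storage.enable", "true",    // per the thread, not changeable after creation
                                "retention.ms", "604800000",            // overall retention: 7 days
                                "local.log.retention.ms", "86400000")); // keep roughly 1 day on local disks
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }

Per the replies above, remote.log.storage.enable would have to be set when the topic is created and could not be flipped back off in the first version, while the retention values remain modifiable and the local value must stay at or below the overall retention.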
Thanks,

Jun

On Wed, Jul 15, 2020 at 6:12 PM Harsha Ch <harsha...@gmail.com> wrote:

Hi Colin,
That's not what we said in the previous email. RLMM is pluggable storage and, running the numbers, even with 1PB of data you do not need more than 10GB of local storage.
If in future this becomes a blocker for any users we can revisit, but this does not warrant another implementation at this point to push the data to remote storage.
We can of course implement another RLMM that is optional for users to configure to push to remote. But that doesn't need to be addressed in this KIP.

Thanks,
Harsha

On Wed, Jul 15, 2020 at 5:50 PM Colin McCabe <cmcc...@apache.org> wrote:

Hi Ying,

Thanks for the response.

It sounds like you agree that storing the metadata in the remote storage would be a better design overall. Given that that's true, is there any reason to include the worse implementation based on RocksDB?

Choosing a long-term metadata store is not something that we should do lightly. It can take users years to migrate from one metadata store to the other. I also don't think it's realistic or desirable for users to write their own metadata stores.
Even assuming that they could do a good job at this, it would create huge fragmentation in the Kafka ecosystem.

best,
Colin

On Tue, Jul 14, 2020, at 09:39, Ying Zheng wrote:

Hi Jun,
Hi Colin,

Satish and I are still discussing some details about how to handle transactions / producer ids. Satish is going to make some minor changes to the RLMM API and other parts. Other than that, we have finished updating the KIP.

I agree with Colin that the current design of using RocksDB is not optimal. But this design is simple and should work for almost all the existing Kafka users. RLMM is a plugin. Users can replace RocksDB with their own RLMM implementation, if needed. So, I think we can keep RocksDB for now. What do you think?

Thanks,
Ying

On Tue, Jul 7, 2020 at 10:35 AM Jun Rao <j...@confluent.io> wrote:

Hi, Ying,

Thanks for the update. It's good to see the progress on this. Please let us know when you are done updating the KIP wiki.
Jun

On Tue, Jul 7, 2020 at 10:13 AM Ying Zheng <yi...@uber.com.invalid> wrote:

Hi Jun,

Satish and I have added more design details in the KIP, including how to keep consistency between replicas (especially when there are leadership changes / log truncations) and new metrics. We also made some other minor changes in the doc. We will finish the KIP changes in the next couple of days. We will let you know when we are done. Most of the changes are already updated in the wiki KIP. You can take a look. But it's not the final version yet.

As for the implementation, the code is mostly done and we already have some feature tests / system tests. I have added the performance test results in the KIP. However, the recent design changes (e.g. leader epoch info management / log truncation / some of the new metrics) have not been implemented yet. It will take about 2 weeks for us to implement them after you review and agree with those design changes.
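Since RLMM is pluggable, as Harsha and Ying point out above, a user who outgrows the RocksDB-backed default could supply their own implementation backed by an external store. Below is a bare-bones skeleton using the method signatures quoted earlier in this thread; the interface is still being revised, so treat the exact shape as an assumption:

    import java.io.IOException;
    import org.apache.kafka.common.TopicPartition;

    // Sketch of a custom RemoteLogMetadataManager backed by an external metadata store.
    // Signatures follow the draft quoted earlier in this thread and may change.
    public class ExternalStoreRemoteLogMetadataManager implements RemoteLogMetadataManager {

        @Override
        public void putRemoteLogSegmentData(RemoteLogSegmentMetadata remoteLogSegmentMetadata)
                throws IOException {
            // Write the segment metadata to the external store.
        }

        @Override
        public void deleteRemoteLogSegmentMetadata(RemoteLogSegmentId remoteLogSegmentId)
                throws IOException {
            // Remove (or mark as deleted) the metadata for this segment.
        }

        @Override
        public RemoteLogSegmentId getRemoteLogSegmentId(TopicPartition topicPartition, long offset)
                throws IOException {
            // Resolve the remote segment covering the given offset.
            return null;
        }

        @Override
        public RemoteLogSegmentMetadata getRemoteLogSegmentMetadata(RemoteLogSegmentId id)
                throws IOException {
            // Fetch the full metadata for the given segment id.
            return null;
        }

        @Override
        public void onServerStarted(final String serverEndpoint) {
            // One-time initialization once the broker is up.
        }
    }

The broker would then be pointed at such a class through the RLMM implementation-class configuration proposed in the KIP.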
On Tue, Jul 7, 2020 at 9:23 AM Jun Rao <j...@confluent.io> wrote:

Hi, Satish, Harsha,

Any new updates on the KIP? This feature is one of the most important and most requested features in Apache Kafka right now. It would be helpful if we can make sustained progress on this. Could you share how far along the design/implementation is right now? Is there anything that other people can help with to get it across the line?

As for "transactional support" and "follower requests/replication", no further comments from me as long as the producer state and leader epoch can be restored properly from the object store when needed.

Thanks,

Jun

On Tue, Jun 9, 2020 at 3:39 AM Satish Duggana <satish.dugg...@gmail.com> wrote:

We did not want to add many implementation details in the KIP. But we decided to add them in the KIP as an appendix or sub-sections (including the follower fetch protocol) to describe the flow for the main cases. That will answer most of the queries.
I will update on this mail thread when the respective sections are updated.

Thanks,
Satish.

On Sat, Jun 6, 2020 at 7:49 PM Alexandre Dupriez <alexandre.dupr...@gmail.com> wrote:

Hi Satish,

A couple of questions specific to the section "Follower Requests/Replication", pages 16-17 in the design document [1].

900. It is mentioned that followers fetch auxiliary states from the remote storage.

900.a Does the consistency model of the external storage impact reads of leader epochs and other auxiliary data?

900.b What are the benefits of using a mechanism to store and access the leader epochs which is different from other metadata associated with tiered segments? What are the benefits of retrieving this information on-demand from the follower rather than relying on propagation via the topic __remote_log_metadata? What are the advantages over using a dedicated control structure (e.g. a new record type) propagated via this topic?
Since, in the document, different control paths are operating in the system, how are the metadata stored in __remote_log_metadata [which also include the epoch of the leader which offloaded a segment] and the remote auxiliary states kept in sync?

900.c A follower can encounter an OFFSET_MOVED_TO_TIERED_STORAGE. Is this in response to a Fetch or OffsetForLeaderEpoch request?

900.d What happens if, after a follower encountered an OFFSET_MOVED_TO_TIERED_STORAGE response, its attempts to retrieve leader epochs fail (for instance, because the remote storage is temporarily unavailable)? Does the follower fall back to a mode where it ignores tiered segments and applies truncation using only locally available information? What happens when access to the remote storage is restored? How is the replica lineage inferred from the remote leader epochs reconciled with the follower's replica lineage, which has evolved? Does the follower remember that fetching auxiliary states failed in the past and attempt reconciliation?
Is there a plan to offer different strategies in this scenario, configurable via configuration?

900.e Is the leader epoch cache offloaded with every segment? Or when a new checkpoint is detected? If that information is not always offloaded to avoid duplicating data, how does the remote storage satisfy the request to retrieve it?

900.f Since the leader epoch cache covers the entire replica lineage, what happens if, after a leader epoch cache file is offloaded with a given segment, the local epoch cache is truncated [not necessarily for a range of offsets included in tiered segments]? How are remote and local leader epoch caches kept consistent?

900.g Consumers can also use leader epochs (e.g. to enable fencing to protect against stale leaders). What differences would there be between consumer and follower fetches? Especially, would consumers also fetch leader epoch information from the remote storage?
900.h Assume a newly elected leader of a topic-partition detects more recent segments are available in the external storage, with epochs greater than its local epoch. Does it ignore these segments and their associated epoch-to-offset vectors? Or does it try to reconstruct its local replica lineage based on the data remotely available?

Thanks,
Alexandre

[1] https://docs.google.com/document/d/18tnobSas3mKFZFr8oRguZoj_tkD_sGzivuLRlMloEMs/edit?usp=sharing

On Thu, Jun 4, 2020 at 7:55 PM Satish Duggana <satish.dugg...@gmail.com> wrote:

Hi Jun,
Please let us know if you have any comments on "transactional support" and "follower requests/replication" mentioned in the wiki.

Thanks,
Satish.

On Tue, Jun 2, 2020 at 9:25 PM Satish Duggana <satish.dugg...@gmail.com> wrote:

Thanks Jun for your comments.

>100. It would be useful to provide more details on how those apis are used.
Otherwise, it's kind of hard to really assess whether the new apis are sufficient/redundant. A few examples below.

We will update the wiki and let you know.

>100.1 deleteRecords seems to only advance the logStartOffset in Log. How does that trigger the deletion of remote log segments?

The RLMTask for a leader partition periodically checks whether there are remote log segments earlier than logStartOffset, and the respective remote log segment metadata and data are deleted by using RLMM and RSM (see the sketch below).

>100.2 stopReplica with deletion is used in 2 cases (a) replica reassignment; (b) topic deletion. We only want to delete the tiered metadata in the second case. Also, in the second case, who ...
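To illustrate the answer to 100.1 above, here is a minimal sketch of the periodic check the RLMTask could run on the leader to clean up remote segments that fall below logStartOffset. The listing call, the RSM delete call, and the accessor names are assumptions for illustration, not the final API:

    import java.io.IOException;
    import java.util.List;
    import org.apache.kafka.common.TopicPartition;

    // Sketch of the periodic retention check described for 100.1: on the leader, remote
    // segments that end before the current logStartOffset are deleted via RSM (data)
    // and RLMM (metadata). listSegmentsBefore() and deleteLogSegment() are hypothetical.
    class RemoteLogRetentionCheck {
        private final RemoteLogMetadataManager rlmm;
        private final RemoteStorageManager rsm;

        RemoteLogRetentionCheck(RemoteLogMetadataManager rlmm, RemoteStorageManager rsm) {
            this.rlmm = rlmm;
            this.rsm = rsm;
        }

        void cleanupSegmentsBeforeLogStartOffset(TopicPartition tp, long logStartOffset)
                throws IOException {
            // Hypothetical listing call: all remote segments entirely below logStartOffset.
            List<RemoteLogSegmentMetadata> expired = rlmm.listSegmentsBefore(tp, logStartOffset);
            for (RemoteLogSegmentMetadata segment : expired) {
                rsm.deleteLogSegment(segment);                                     // remote data
                rlmm.deleteRemoteLogSegmentMetadata(segment.remoteLogSegmentId());  // metadata
            }
        }
    }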