Hi Jun,

Gentle reminder: please go through the updated wiki and let us know your comments.
Thanks,
Satish.

On Tue, May 19, 2020 at 3:50 PM Satish Duggana <satish.dugg...@gmail.com> wrote:

> Hi Jun,
> Please go through the wiki, which has the latest updates. The Google doc is updated frequently to be in sync with the wiki.
>
> Thanks,
> Satish.
>
> On Tue, May 19, 2020 at 12:30 AM Jun Rao <j...@confluent.io> wrote:
>
>> Hi, Satish,
>>
>> Thanks for the update. Just to clarify: which doc has the latest updates, the wiki or the Google doc?
>>
>> Jun
>>
>> On Thu, May 14, 2020 at 10:38 AM Satish Duggana <satish.dugg...@gmail.com> wrote:
>>
>>> Hi Jun,
>>> Thanks for your comments. We updated the KIP with more details.
>>>
>>> >100. For each of the operations related to tiering, it would be useful to provide a description of how it works with the new API. These include things like consumer fetch, replica fetch, offsetForTimestamp, retention (remote and local) by size, time and logStartOffset, topic deletion, etc. This will tell us if the proposed APIs are sufficient.
>>>
>>> We addressed most of these APIs in the KIP. We can add more details if needed.
>>>
>>> >101. For the default implementation based on an internal topic, is it meant as a proof of concept or for production usage? I assume that it's the former. However, if it's the latter, then the KIP needs to describe the design in more detail.
>>>
>>> It is meant for production usage, as mentioned in an earlier mail. We plan to update this section in the next few days.
>>>
>>> >102. When tiering a segment, the segment is first written to the object store and then its metadata is written to RLMM using the API "void putRemoteLogSegmentData()". One potential issue with this approach is that if the system fails after the first operation, it leaves garbage in the object store that is never reclaimed. One way to improve this is to have two separate APIs, something like preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().
>>>
>>> That is a good point. We currently handle this differently, using markers in the segment, but your suggestion is much better.
>>>
>>> >103. It seems that the transactional support and the ability to read from a follower are missing.
>>>
>>> The KIP is updated with transactional support, follower fetch semantics, and reading from a follower.
>>>
>>> >104. It would be useful to provide a testing plan for this KIP.
>>>
>>> We added a few tests by introducing a test util for tiered storage in the PR. We will provide the testing plan in the next few days.
>>>
>>> Thanks,
>>> Satish.
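To make the two-step approach suggested in 102 above more concrete, here is a minimal sketch of what such an API could look like. The names and shapes below are purely illustrative (they are not the KIP's API); the point is that metadata is staged before any bytes reach the object store and only flipped to committed after the upload is acked, so a crash in between leaves a record that a background cleaner can use to reclaim the orphaned objects.

    import java.util.List;

    // Illustrative sketch only; names are hypothetical, not the KIP's API.
    public interface RemoteLogSegmentLifecycle {

        // Opaque identifier for a tiered segment (e.g. topic, partition and base offset).
        final class SegmentId {
            public final String topic;
            public final int partition;
            public final long baseOffset;

            public SegmentId(String topic, int partition, long baseOffset) {
                this.topic = topic;
                this.partition = partition;
                this.baseOffset = baseOffset;
            }
        }

        // Step 1: record the intent to upload before any bytes hit the object store.
        void preparePutRemoteLogSegmentData(SegmentId id, long endOffset, long maxTimestampMs);

        // Step 2: mark the segment as committed once the upload has been acked.
        // Readers only ever see segments after this call.
        void commitPutRemoteLogSegmentData(SegmentId id);

        // Entries that were prepared but never committed within the given age can be
        // garbage-collected together with any partially uploaded objects.
        List<SegmentId> listUncommittedSegments(long olderThanMs);
    }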
>>> On Wed, Feb 26, 2020 at 9:43 PM Harsha Chintalapani <ka...@harsha.io> wrote:
>>>>
>>>> On Tue, Feb 25, 2020 at 12:46 PM, Jun Rao <j...@confluent.io> wrote:
>>>>
>>>>> Hi, Satish,
>>>>>
>>>>> Thanks for the updated doc. The new API seems to be an improvement overall. A few more comments below.
>>>>>
>>>>> 100. For each of the operations related to tiering, it would be useful to provide a description of how it works with the new API. These include things like consumer fetch, replica fetch, offsetForTimestamp, retention (remote and local) by size, time and logStartOffset, topic deletion, etc. This will tell us if the proposed APIs are sufficient.
>>>>
>>>> Thanks for the feedback, Jun. We will add more details around this.
>>>>
>>>>> 101. For the default implementation based on an internal topic, is it meant as a proof of concept or for production usage? I assume that it's the former. However, if it's the latter, then the KIP needs to describe the design in more detail.
>>>>
>>>> Yes, it is meant for production use. Ideally it would be good to merge this in as the default implementation for the metadata service. We can add more details around design and testing.
>>>>
>>>>> 102. When tiering a segment, the segment is first written to the object store and then its metadata is written to RLMM using the API "void putRemoteLogSegmentData()". One potential issue with this approach is that if the system fails after the first operation, it leaves garbage in the object store that is never reclaimed. One way to improve this is to have two separate APIs, something like preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().
>>>>>
>>>>> 103. It seems that the transactional support and the ability to read from a follower are missing.
>>>>>
>>>>> 104. It would be useful to provide a testing plan for this KIP.
>>>>
>>>> We are working on adding more details around transactional support and coming up with a test plan, including system tests and integration tests.
>>>>
>>>>> Thanks,
>>>>>
>>>>> Jun
>>>>>
>>>>> On Mon, Feb 24, 2020 at 8:10 AM Satish Duggana <satish.dugg...@gmail.com> wrote:
>>>>>
>>>>> Hi Jun,
>>>>> Please look at the earlier reply and let us know your comments.
>>>>>
>>>>> Thanks,
>>>>> Satish.
>>>>>
>>>>> On Wed, Feb 12, 2020 at 4:06 PM Satish Duggana <satish.dugg...@gmail.com> wrote:
>>>>>
>>>>> Hi Jun,
>>>>> Thanks for your comments on the separation of remote log metadata storage and remote log storage.
>>>>> We have had a few discussions since early January on how to support eventually consistent stores like S3 by decoupling remote log segment metadata from remote log storage. It is written up in detail in the doc here [1]. Below is a brief summary of the discussion from that doc.
>>>>>
>>>>> The current approach consists of pulling the remote log segment metadata from remote log storage APIs. It worked fine for storages like HDFS. But one of the problems of relying on the remote storage to maintain metadata is that tiered storage needs it to be strongly consistent, with an impact not only on the metadata (e.g. LIST in S3) but also on the segment data (e.g. GET after a DELETE in S3). The cost of maintaining metadata in remote storage also needs to be factored in. This is true in the case of S3, where LIST APIs incur huge costs, as you raised earlier.
>>>>> So, it is good to separate the remote storage from the remote log metadata store. We refactored the existing RemoteStorageManager and introduced RemoteLogMetadataManager. The remote log metadata store should give strong consistency semantics, but the remote log storage can be eventually consistent.
>>>>> We can have a default implementation of RemoteLogMetadataManager which uses an internal topic (as mentioned in one of our earlier emails) as storage. But users can always plug in their own RemoteLogMetadataManager implementation based on their environment.
>>>>>
>>>>> Please go through the updated KIP and let us know your comments. We have started refactoring for the changes mentioned in the KIP, and there may be a few more updates to the APIs.
>>>>>
>>>>> [1] https://docs.google.com/document/d/1qfkBCWL1e7ZWkHU7brxKDBebq4ie9yK20XJnKbgAlew/edit?ts=5e208ec7#
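As a rough illustration of the default RemoteLogMetadataManager idea mentioned above (an internal topic as the metadata store), the publishing side could look roughly like the sketch below. The topic name, key layout and producer settings here are assumptions for illustration only; the actual default implementation may differ.

    import java.util.Properties;
    import java.util.concurrent.Future;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;
    import org.apache.kafka.common.serialization.ByteArraySerializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    // Sketch of the publishing side of a topic-backed RemoteLogMetadataManager.
    public class TopicBasedMetadataPublisher implements AutoCloseable {

        // Assumed internal topic name, for illustration only.
        private static final String METADATA_TOPIC = "__remote_log_segment_metadata";

        private final KafkaProducer<String, byte[]> producer;

        public TopicBasedMetadataPublisher(String bootstrapServers) {
            Properties props = new Properties();
            props.put("bootstrap.servers", bootstrapServers);
            props.put("acks", "all");                 // segment metadata must not be lost
            props.put("enable.idempotence", "true");
            producer = new KafkaProducer<>(props, new StringSerializer(), new ByteArraySerializer());
        }

        // Key by the user topic-partition so all metadata events for a partition land in
        // the same metadata-topic partition and are totally ordered for the materializer.
        public Future<RecordMetadata> publishSegmentMetadata(String topic, int partition,
                                                             byte[] serializedMetadata) {
            String key = topic + "-" + partition;
            return producer.send(new ProducerRecord<>(METADATA_TOPIC, key, serializedMetadata));
        }

        @Override
        public void close() {
            producer.close();
        }
    }

Each broker would additionally run a consumer on the same topic to materialize the metadata it needs locally, or into whatever store the implementation chooses.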
>>>>> On Fri, Dec 27, 2019 at 5:43 PM Ivan Yurchenko <ivan0yurche...@gmail.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Jun:
>>>>> > (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you have 100,000 partitions and want to pull the metadata for each partition at the rate of 1/sec, it can cost $0.5/sec, which is roughly $40K per day.
>>>>>
>>>>> I want to note here that no reasonably durable storage will be cheap at 100k RPS. For example, DynamoDB might give the same ballpark figures. If we want to keep the pull-based approach, we can try to reduce this number in several ways: doing listings less frequently (as Satish mentioned, with the current defaults it's ~3.33k RPS for your example), or batching listing operations in some way (depending on the storage; it might require a change to RSM's interface).
>>>>>
>>>>> > There are different ways for doing push based metadata propagation. Some object stores may support that already. For example, S3 supports events notification.
>>>>>
>>>>> This sounds interesting. However, I see a couple of issues using it:
>>>>> 1. As I understand the documentation, notification delivery is not guaranteed and it's recommended to periodically do a LIST to fill the gaps, which brings us back to the same LIST consistency guarantees issue.
>>>>> 2. The same goes for broker start: to get the current state, we need to LIST.
>>>>> 3. The dynamic set of multiple consumers (RSMs): AFAIK SQS and SNS aren't designed for such a case.
>>>>>
>>>>> Alexandre:
>>>>> > A.1 As commented on PR 7561, the S3 consistency model [1][2] implies RSM cannot rely solely on S3 APIs to guarantee the expected strong consistency. The proposed implementation [3] would need to be updated to take this into account. Let’s talk more about this.
>>>>>
>>>>> Thank you for the feedback. I clearly see the need for changing the S3 implementation to provide stronger consistency guarantees. As I see from this thread, there are several possible approaches to this. Let's discuss RemoteLogManager's contract and behavior (like pull vs push model) further before picking one (or several?) of them. I'm going to do some evaluation of DynamoDB for the pull-based approach, to see if it's possible to apply it while paying a reasonable bill, and also of the push-based approach with a Kafka topic as the medium.
>>>>>
>>>>> > A.2.3 Atomicity – what does an implementation of RSM need to provide with respect to atomicity of the APIs copyLogSegment, cleanupLogUntil and deleteTopicPartition? If a partial failure happens in any of those (e.g. in the S3 implementation, if one of the multiple uploads fails [4]),
>>>>>
>>>>> The S3 implementation is going to change, but it's worth clarifying anyway.
>>>>> The segment log file is uploaded only after S3 has acked the upload of all other files associated with the segment, and only after this does the whole segment file set become visible remotely for operations like listRemoteSegments [1]. In case of an upload failure, the files that have been successfully uploaded stay as invisible garbage that is collected by cleanupLogUntil (or overwritten successfully later). The opposite happens during deletion: the log files are deleted first.
>>>>> This approach should generally work when we solve the consistency issues by adding a strongly consistent storage: a segment's uploaded files remain invisible garbage until some metadata about them is written.
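The ordering Ivan describes can be sketched against a deliberately abstract object store. The ObjectStore interface below is hypothetical (not the AWS SDK or the PR's code) and only serves to show why the log file acts as the visibility marker: it goes up last and comes down first.

    import java.io.IOException;
    import java.nio.file.Path;
    import java.util.List;

    // Sketch of the upload/delete ordering, against a hypothetical object store API.
    public class SegmentUploader {

        public interface ObjectStore {
            void put(String key, Path file) throws IOException;
            void delete(String key) throws IOException;
        }

        private final ObjectStore store;

        public SegmentUploader(ObjectStore store) {
            this.store = store;
        }

        // Upload auxiliary files (indexes, etc.) first; the log file goes last and acts
        // as the marker that makes the whole set "visible". A failure before the last
        // put leaves only invisible garbage that a later cleanup pass can reclaim.
        public void upload(String segmentPrefix, List<Path> auxiliaryFiles, Path logFile) throws IOException {
            for (Path aux : auxiliaryFiles) {
                store.put(segmentPrefix + "/" + aux.getFileName(), aux);
            }
            store.put(segmentPrefix + "/" + logFile.getFileName(), logFile);
        }

        // Deletion mirrors the upload order: remove the log file first so the segment
        // disappears from listings before the auxiliary files are cleaned up.
        public void delete(String segmentPrefix, List<String> auxiliaryKeys, String logFileKey) throws IOException {
            store.delete(segmentPrefix + "/" + logFileKey);
            for (String key : auxiliaryKeys) {
                store.delete(segmentPrefix + "/" + key);
            }
        }
    }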
>>>>> > A.3 Caching – storing locally the segments retrieved from the remote storage is excluded, as it does not align with the original intent and even defeats some of its purposes (saving disk space, etc.). That said, could there be other types of use cases where the pattern of access to the remotely stored segments would benefit from local caching (and potentially read-ahead)? Consider the use case of a large pool of consumers which start a backfill at the same time for one day's worth of data from one year ago stored remotely. Caching the segments locally would allow uncoupling the load on the remote storage from the load on the Kafka cluster. Maybe the RLM could expose a configuration parameter to switch that feature on/off?
>>>>>
>>>>> I tend to agree here; caching remote segments locally and making this configurable sounds pretty practical to me. We should implement this, though maybe not in the first iteration.
>>>>>
>>>>> Br,
>>>>> Ivan
>>>>>
>>>>> [1] https://github.com/harshach/kafka/pull/18/files#diff-4d73d01c16caed6f2548fc3063550ef0R152
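For the caching idea in A.3, a bounded, opt-in cache on the broker could be as simple as the sketch below: least-recently-used eviction, capped by total bytes, sitting in front of whatever the RLM uses to fetch remote data. Everything here (names, byte[] values, the single size limit) is an assumption for illustration, not something proposed in the KIP.

    import java.util.Iterator;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal sketch of an opt-in local cache for segments fetched from remote storage.
    public class RemoteSegmentCache {

        private final long maxBytes;   // e.g. controlled by the on/off switch plus a size limit
        private long currentBytes = 0;

        // access-order LinkedHashMap gives least-recently-used iteration order for eviction
        private final LinkedHashMap<String, byte[]> cache =
                new LinkedHashMap<>(16, 0.75f, true);

        public RemoteSegmentCache(long maxBytes) {
            this.maxBytes = maxBytes;
        }

        public synchronized byte[] get(String segmentKey) {
            return cache.get(segmentKey);
        }

        public synchronized void put(String segmentKey, byte[] segmentBytes) {
            byte[] previous = cache.put(segmentKey, segmentBytes);
            if (previous != null) {
                currentBytes -= previous.length;
            }
            currentBytes += segmentBytes.length;
            evictIfNeeded();
        }

        private void evictIfNeeded() {
            Iterator<Map.Entry<String, byte[]>> it = cache.entrySet().iterator();
            while (currentBytes > maxBytes && it.hasNext()) {
                Map.Entry<String, byte[]> eldest = it.next();
                currentBytes -= eldest.getValue().length;
                it.remove();
            }
        }
    }

Read-ahead, as A.3 suggests, could be layered on top by prefetching the next segment whenever a cached segment is read near its end.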
>>>>> On Thu, 19 Dec 2019 at 19:49, Alexandre Dupriez <alexandre.dupr...@gmail.com> wrote:
>>>>>
>>>>> Hi Jun,
>>>>>
>>>>> Thank you for the feedback. I am trying to understand how a push-based approach would work.
>>>>> In order for the metadata to be propagated (under the assumption you stated), would you plan to add a new API in Kafka to allow the metadata store to send it directly to the brokers?
>>>>>
>>>>> Thanks,
>>>>> Alexandre
>>>>>
>>>>> On Wed, Dec 18, 2019 at 8:14 PM Jun Rao <j...@confluent.io> wrote:
>>>>>
>>>>> Hi, Satish and Ivan,
>>>>>
>>>>> There are different ways of doing push-based metadata propagation. Some object stores may support that already. For example, S3 supports event notifications (https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html). Otherwise one could use a separate metadata store that supports push-based change propagation. Other people have mentioned using a Kafka topic. The best approach may depend on the object store and the operational environment (e.g. whether an external metadata store is already available).
>>>>>
>>>>> The above discussion is based on the assumption that we need to cache the object metadata locally in every broker. I mentioned earlier that an alternative is to just store/retrieve that metadata in an external metadata store. That may simplify the implementation in some cases.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jun
>>>>>
>>>>> On Thu, Dec 5, 2019 at 7:01 AM Satish Duggana <satish.dugg...@gmail.com> wrote:
>>>>>
>>>>> Hi Jun,
>>>>> Thanks for your reply.
>>>>>
>>>>> Currently, `listRemoteSegments` is called at the configured interval (not every second; it defaults to 30 secs). Storing remote log metadata in a strongly consistent store for the S3 RSM is raised in a PR comment [1]. RLM invokes RSM at regular intervals and RSM can give remote segment metadata if it is available. RSM is responsible for maintaining and fetching those entries. It should be based on whatever mechanism is consistent and efficient with the respective remote storage.
>>>>>
>>>>> Can you give more details about the push-based mechanism from RSM?
>>>>>
>>>>> 1. https://github.com/apache/kafka/pull/7561#discussion_r344576223
>>>>>
>>>>> Thanks,
>>>>> Satish.
>>>>>
>>>>> On Thu, Dec 5, 2019 at 4:23 AM Jun Rao <j...@confluent.io> wrote:
>>>>>
>>>>> Hi, Harsha,
>>>>>
>>>>> Thanks for the reply.
>>>>>
>>>>> 40/41. I am curious which block storages you have tested. S3 seems to be one of the popular block stores. The concerns that I have with the pull-based approach are the following.
>>>>> (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you have 100,000 partitions and want to pull the metadata for each partition at the rate of 1/sec, it can cost $0.5/sec, which is roughly $40K per day.
>>>>> (b) Semantics: S3 list objects are eventually consistent. So, when you do a list object request, there is no guarantee that you can see all uploaded objects. This could impact the correctness of subsequent logic.
>>>>> (c) Efficiency: Blindly pulling metadata when there is no change adds unnecessary overhead in the broker as well as in the block store.
>>>>>
>>>>> So, have you guys tested S3? If so, could you share your experience in terms of cost, semantics and efficiency?
>>>>>
>>>>> Jun
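For reference, a small snippet spelling out the arithmetic behind the cost figure in (a) and the ~3.33k RPS variant Ivan quotes earlier in the thread (pure arithmetic on the numbers already given; nothing new is assumed beyond the 30-second default interval Satish mentions above):

    // Worked numbers behind "$0.5/sec, roughly $40K per day" and "~3.33k RPS".
    public class ListCostEstimate {
        public static void main(String[] args) {
            double costPerRequest = 0.005 / 1000;   // $0.005 per 1,000 LIST requests
            int partitions = 100_000;

            double rpsEverySecond = partitions * 1.0;            // one LIST per partition per second
            double perDayEverySecond = rpsEverySecond * costPerRequest * 86_400;
            System.out.printf("1s interval: %.0f req/s, $%.0f/day%n", rpsEverySecond, perDayEverySecond);
            // prints: 1s interval: 100000 req/s, $43200/day  (the "roughly $40K per day")

            double rpsDefault = partitions / 30.0;               // 30-second default interval
            double perDayDefault = rpsDefault * costPerRequest * 86_400;
            System.out.printf("30s interval: %.0f req/s, $%.0f/day%n", rpsDefault, perDayDefault);
            // prints: 30s interval: 3333 req/s, $1440/day
        }
    }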
>>>>> On Tue, Dec 3, 2019 at 10:11 PM Harsha Chintalapani <ka...@harsha.io> wrote:
>>>>>
>>>>> Hi Jun,
>>>>> Thanks for the reply.
>>>>>
>>>>> On Tue, Nov 26, 2019 at 3:46 PM, Jun Rao <j...@confluent.io> wrote:
>>>>>
>>>>> > Hi, Satish and Ying,
>>>>> >
>>>>> > Thanks for the reply.
>>>>> >
>>>>> > 40/41. There are two different ways that we can approach this. One is what you said. We can have an opinionated way of storing and populating the tier metadata that we think is good enough for everyone. I am not sure if this is the case based on what's currently proposed in the KIP. For example, I am not sure that (1) everyone always needs local metadata, (2) the current local storage format is general enough, and (3) everyone wants to use the pull-based approach to propagate the metadata. Another approach is to make this pluggable and let the implementor implement the best approach for a particular block storage. I haven't seen any comments from Slack/Airbnb in the mailing list on this topic. It would be great if they can provide feedback directly here.
>>>>>
>>>>> The current interfaces are designed with the most popular block storages available today in mind, and we did two implementations with these interfaces; both are yielding good results as we go through the testing of them. If there is ever a need for a pull-based approach, we can definitely evolve the interface. In the past we did mark interfaces as evolving to make room for unknowns in the future. If you have any suggestions around the current interfaces, please propose them and we are happy to see if we can work them in.
>>>>>
>>>>> > 43. To offer tier storage as a general feature, ideally all existing capabilities should still be supported. It's fine if the Uber implementation doesn't support all capabilities for internal usage. However, the framework should be general enough.
>>>>>
>>>>> We agree on that as a principle. But all of these major features are mostly coming right now, and having a new big feature such as tiered storage support all the new features would be a big ask. We can document how we approach solving these in future iterations. Our goal is to make this tiered storage feature work for everyone.
>>>>>
>>>>> > 43.3 This is more than just serving the tiered data from block storage. With KIP-392, the consumer can now resolve conflicts with the replica based on leader epoch. So, we need to make sure that the leader epoch can be recovered properly from tier storage.
>>>>>
>>>>> We are working on testing our approach and we will update the KIP with design details.
>>>>>
>>>>> > 43.4 For JBOD, if tier storage stores the tier metadata locally, we need to support moving such metadata across disk directories, since JBOD supports moving data across disks.
>>>>>
>>>>> The KIP is updated with JBOD details. Having said that, JBOD tooling needs to evolve to support production loads. Most users will be interested in using tiered storage without JBOD support on day 1.
>>>>>
>>>>> Thanks,
>>>>> Harsha
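On the pluggability point discussed above, the broker side of such a design can stay very small: it only needs an interface and a configured implementation class. The config key and the interface shape below are illustrative placeholders, not the KIP's final names.

    import java.util.Map;

    // Illustration of the pluggable option: the broker only knows an interface plus a
    // configured implementation class name supplied by the operator.
    public final class RemoteStorageManagerLoader {

        public interface RemoteStorageManager extends AutoCloseable {
            void configure(Map<String, ?> configs);
            // copyLogSegment, fetch, delete, ... elided for brevity
        }

        public static RemoteStorageManager load(Map<String, String> brokerConfig) throws Exception {
            // e.g. remote.log.storage.manager.class.name=com.example.S3RemoteStorageManager
            String className = brokerConfig.get("remote.log.storage.manager.class.name");
            RemoteStorageManager rsm = (RemoteStorageManager) Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
            rsm.configure(brokerConfig);
            return rsm;
        }
    }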
>>>>> As for a meeting, we could have a KIP e-meeting on this if needed, but it will be open to everyone and will be recorded and shared. Often, the details are still resolved through the mailing list.
>>>>>
>>>>> Jun
>>>>>
>>>>> On Tue, Nov 19, 2019 at 6:48 PM Ying Zheng <yi...@uber.com.invalid> wrote:
>>>>>
>>>>> Please ignore my previous email. I didn't know Apache requires all the discussions to be "open".
>>>>>
>>>>> On Tue, Nov 19, 2019, 5:40 PM Ying Zheng <yi...@uber.com> wrote:
>>>>>
>>>>> Hi Jun,
>>>>>
>>>>> Thank you very much for your feedback!
>>>>>
>>>>> Can we schedule a meeting in your Palo Alto office in December? I think a face-to-face discussion is much more efficient than emails. Both Harsha and I can visit you. Satish may be able to join us remotely.
>>>>>
>>>>> On Fri, Nov 15, 2019 at 11:04 AM Jun Rao <j...@confluent.io> wrote:
>>>>>
>>>>> Hi, Satish and Harsha,
>>>>>
>>>>> The following is more detailed high-level feedback for the KIP. Overall, the KIP seems useful. The challenge is how to design it such that it’s general enough to support different ways of implementing this feature and to support existing features.
>>>>>
>>>>> 40. Local segment metadata storage: The KIP makes the assumption that the metadata for the archived log segments is cached locally in every broker, and it provides a specific implementation for the local storage in the framework. We probably should discuss this more. For example, some tier storage providers may not want to cache the metadata locally and may just rely upon a remote key/value store if such a store is already present. If a local store is used, there could be different ways of implementing it (e.g., based on customized local files, an embedded local store like RocksDB, etc). An alternative way of designing this is to just provide an interface for retrieving the tier segment metadata and leave the details of how to get the metadata outside of the framework.
>>>>>
>>>>> 41. RemoteStorageManager interface and the usage of the interface in the framework: I am not sure if the interface is general enough. For example, it seems that RemoteLogIndexEntry is tied to a specific way of storing the metadata in remote storage. The framework uses the listRemoteSegments() API in a pull-based approach. However, in some other implementations, a push-based approach may be preferred. I don’t have a concrete proposal yet.
>>>>> But it would be useful to give this area some more thought and see if we can make the interface more general.
>>>>>
>>>>> 42. In the diagram, the RemoteLogManager is side by side with the LogManager. This KIP only discussed how the fetch request is handled between the two layers. However, we should also consider how other requests that touch the log are handled, e.g., list offsets by timestamp, delete records, etc. Also, in this model, it's not clear which component is responsible for managing the log start offset. It seems that the log start offset could be changed by both RemoteLogManager and LogManager.
>>>>>
>>>>> 43. There are quite a few existing features not covered by the KIP. It would be useful to discuss each of those.
>>>>> 43.1 I won’t say that compacted topics are rarely used and always small. For example, KStreams uses compacted topics for storing its state, and sometimes the size of the topic could be large. While it might be ok not to support compacted topics initially, it would be useful to have a high-level idea of how this might be supported down the road, so that we don’t have to make incompatible API changes in the future.
>>>>> 43.2 We need to discuss how EOS is supported. In particular, how is the producer state integrated with the remote storage?
>>>>> 43.3 Now that KIP-392 (allow consumers to fetch from the closest replica) is implemented, we need to discuss how reading from a follower replica is supported with tier storage.
>>>>> 43.4 We need to discuss how JBOD is supported with tier storage.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jun
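As a strawman for the alternative raised in 40 above (provide only an interface for retrieving the tier segment metadata and keep the storage details outside the framework), the broker-facing contract could be as small as the sketch below. The names and fields are illustrative assumptions, not anything from the KIP.

    import java.util.List;
    import java.util.Optional;

    // Strawman read-side contract the framework could depend on, leaving the choice of
    // backing store (local files, RocksDB, a remote key/value store, ...) to the implementation.
    public interface TierSegmentMetadataView {

        // Metadata the framework needs to route a fetch to the right remote segment.
        final class SegmentMetadata {
            public final long baseOffset;
            public final long endOffset;
            public final long maxTimestampMs;
            public final String remoteLocation;   // opaque pointer into the object store

            public SegmentMetadata(long baseOffset, long endOffset,
                                   long maxTimestampMs, String remoteLocation) {
                this.baseOffset = baseOffset;
                this.endOffset = endOffset;
                this.maxTimestampMs = maxTimestampMs;
                this.remoteLocation = remoteLocation;
            }
        }

        // Which remote segment, if any, contains the given offset.
        Optional<SegmentMetadata> segmentContaining(String topic, int partition, long offset);

        // All remote segments for a partition, e.g. for retention or deletion passes.
        List<SegmentMetadata> listSegments(String topic, int partition);
    }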
>>>>> On Fri, Nov 8, 2019 at 12:06 AM Tom Bentley <tbent...@redhat.com> wrote:
>>>>>
>>>>> Thanks for those insights Ying.
>>>>>
>>>>> On Thu, Nov 7, 2019 at 9:26 PM Ying Zheng <yi...@uber.com.invalid> wrote:
>>>>>
>>>>> > Thanks, I missed that point. However, there's still a point at which the consumer fetches start getting served from remote storage (even if that point isn't as soon as the local log retention time/size). This represents a kind of performance cliff edge, and what I'm really interested in is how easy it is for a consumer which falls off that cliff to catch up so that its fetches again come from local storage. Obviously this can depend on all sorts of factors (like production rate, consumption rate), so it's not guaranteed (just like it's not guaranteed for Kafka today), but this would represent a new failure mode.
>>>>>
>>>>> As I explained in the last mail, it's a very rare case that a consumer needs to read remote data. In our experience at Uber, this only happens when the consumer service had an outage for several hours.
>>>>>
>>>>> There is not a "performance cliff" as you assume. The remote storage is even faster than local disks in terms of bandwidth. Reading from remote storage is going to have higher latency than local disk, but since the consumer is catching up on several hours of data, it's not sensitive to sub-second-level latency, and each remote read request will read a large amount of data to make the overall performance better than reading from local disks.
>>>>>
>>>>> > Another aspect I'd like to understand better is the effect that serving fetch requests from remote storage has on the broker's network utilization. If we're just trimming the amount of data held locally (without increasing the overall local+remote retention), then we're effectively trading disk bandwidth for network bandwidth when serving fetch requests from remote storage (which I understand to be a good thing, since brokers are often/usually disk bound). But if we're increasing the overall local+remote retention, then it's more likely that the network itself becomes the bottleneck. I appreciate this is all rather hand-wavy; I'm just trying to understand how this would affect broker performance, so I'd be grateful for any insights you can offer.
>>>>>
>>>>> Network bandwidth is a function of produce speed; it has nothing to do with remote retention. As long as the data is shipped to remote storage, you can keep the data there for 1 day or 1 year or 100 years; it doesn't consume any network resources.