Hi Jun,

Thanks for your detailed review and comments.

> 40. Local segment metadata storage: The KIP makes the assumption that the metadata for the archived log segments are cached locally in every broker and provides a specific implementation for the local storage in the framework. We probably should discuss this more. For example, some tier storage providers may not want to cache the metadata locally and just rely upon a remote key/value store if such a store is already present. If a local store is used, there could be different ways of implementing it (e.g., based on customized local files, an embedded local store like RocksDB, etc). An alternative of designing this is to just provide an interface for retrieving the tier segment metadata and leave the details of how to get the metadata outside of the framework.

I am fine with providing a way for RSM implementors to handle remote log metadata themselves. But we should also ship a default implementation that RSM implementers can reuse if they want to. The default implementation can store the metadata locally, as mentioned in the KIP.
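Just to make the idea concrete, a rough sketch of such a pluggable metadata store with a broker-local default could look like the following. This is only illustrative; the interface and class names here are hypothetical and not what the KIP defines, and the default keeps the metadata in memory for brevity where the KIP's default would persist it under the log directory.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative only: names and signatures are hypothetical, not the KIP's final API.

/** Minimal metadata kept for a log segment that has been copied to remote storage. */
class RemoteSegmentInfo {
    final String topic;
    final int partition;
    final long baseOffset;
    final long endOffset;

    RemoteSegmentInfo(String topic, int partition, long baseOffset, long endOffset) {
        this.topic = topic;
        this.partition = partition;
        this.baseOffset = baseOffset;
        this.endOffset = endOffset;
    }
}

/** Pluggable store for remote log segment metadata; implementors may plug in their own. */
interface RemoteLogMetadataStore {
    void put(RemoteSegmentInfo segment);

    /** Segments of the partition that exist in remote storage, ordered by base offset. */
    List<RemoteSegmentInfo> listSegments(String topic, int partition);

    void delete(String topic, int partition, long baseOffset);
}

/** A default implementation the framework could ship: metadata cached on the broker. */
class LocalRemoteLogMetadataStore implements RemoteLogMetadataStore {
    private final Map<String, NavigableMap<Long, RemoteSegmentInfo>> byPartition = new ConcurrentHashMap<>();

    private static String key(String topic, int partition) {
        return topic + "-" + partition;
    }

    @Override
    public void put(RemoteSegmentInfo segment) {
        byPartition.computeIfAbsent(key(segment.topic, segment.partition), k -> new ConcurrentSkipListMap<>())
                .put(segment.baseOffset, segment);
    }

    @Override
    public List<RemoteSegmentInfo> listSegments(String topic, int partition) {
        NavigableMap<Long, RemoteSegmentInfo> segments = byPartition.get(key(topic, partition));
        return segments == null ? new ArrayList<>() : new ArrayList<>(segments.values());
    }

    @Override
    public void delete(String topic, int partition, long baseOffset) {
        NavigableMap<Long, RemoteSegmentInfo> segments = byPartition.get(key(topic, partition));
        if (segments != null) {
            segments.remove(baseOffset);
        }
    }
}

An implementer who already has a remote key/value store available could provide their own RemoteLogMetadataStore instead of the local default.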
> 41. RemoteStorageManager interface and the usage of the interface in the framework: I am not sure if the interface is general enough. For example, it seems that RemoteLogIndexEntry is tied to a specific way of storing the metadata in remote storage. The framework uses listRemoteSegments() api in a pull-based approach. However, in some other implementations, a push-based approach may be more preferred. I don’t have a concrete proposal yet. But, it would be useful to give this area some more thoughts and see if we can make the interface more general.

RemoteLogIndexEntry is aligned with the record batch representation, and through the RDI it also gives a generalized representation of the location of that batch in remote storage. If there are use cases that need to represent this differently, we can introduce an interface and refactor the current RemoteLogIndexEntry into the default implementation. The listRemoteSegments() API is for retrieving metadata about a specific topic partition’s remote log storage. We felt that RemoteLogManager should do as much of the heavy lifting as it can, and it should use RemoteStorageManager whenever it needs to retrieve remote log metadata/data. We can start with this pull-based approach in the initial version; if there are valid use cases for a push-based mechanism, we can add it later.
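To make the RDI idea a bit more concrete, here is a rough sketch of a record-batch-aligned index entry. The field names are only illustrative and may not match the final layout in the KIP; the key point is that the RDI is an opaque locator that only the RemoteStorageManager implementation knows how to resolve (for example, an object key plus a byte range).

import java.nio.ByteBuffer;

// Illustrative sketch; the exact fields and serialized layout are defined by the KIP and may change.
class RemoteLogIndexEntry {
    final long firstOffset;     // first offset of the record batch this entry points to
    final long lastOffset;      // last offset of the record batch
    final long firstTimestamp;  // first timestamp in the batch
    final long lastTimestamp;   // last timestamp in the batch
    final int dataLength;       // size of the batch data in remote storage
    final ByteBuffer rdi;       // Remote Data Identifier: opaque locator resolved only by the RSM

    RemoteLogIndexEntry(long firstOffset, long lastOffset, long firstTimestamp,
                        long lastTimestamp, int dataLength, ByteBuffer rdi) {
        this.firstOffset = firstOffset;
        this.lastOffset = lastOffset;
        this.firstTimestamp = firstTimestamp;
        this.lastTimestamp = lastTimestamp;
        this.dataLength = dataLength;
        this.rdi = rdi;
    }
}

In the pull-based flow, RemoteLogManager would periodically call listRemoteSegments() for the partitions it manages and fetch the corresponding index entries; since the RDI is opaque to the framework, an RSM backed by HDFS, S3, or anything else can encode whatever locator it needs.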
> 42. In the diagram, the RemoteLogManager is side by side with LogManager. This KIP only discussed how the fetch request is handled between the two layer. However, we should also consider how other requests that touch the log can be handled. e.g., list offsets by timestamp, delete records, etc. Also, in this model, it's not clear which component is responsible for managing the log start offset. It seems that the log start offset could be changed by both RemoteLogManager and LogManager.

Sure, we will add more details to the KIP about how the different request APIs that touch the log are handled. With tiered storage, a log will have a local-log-start-offset, a remote-log-start-offset, and an effective-log-start-offset. The existing log-start-offset field becomes the effective-log-start-offset of the Log: effective-log-start-offset = remote-log-start-offset if a remote log exists, else local-log-start-offset. Log still manages log-start-offset, but it can be updated by RemoteLogManager when tiering is enabled.
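In code form, that resolution is just the following. The class and accessor names here are hypothetical; only the rule itself is from the reply above.

// Sketch of the effective-log-start-offset resolution described above.
class LogStartOffsets {
    private final Long remoteLogStartOffset; // null when the partition has no remote log
    private final long localLogStartOffset;

    LogStartOffsets(Long remoteLogStartOffset, long localLogStartOffset) {
        this.remoteLogStartOffset = remoteLogStartOffset;
        this.localLogStartOffset = localLogStartOffset;
    }

    // effective-log-start-offset = remote-log-start-offset if a remote log exists,
    // else local-log-start-offset.
    long effectiveLogStartOffset() {
        return remoteLogStartOffset != null ? remoteLogStartOffset : localLogStartOffset;
    }
}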
> 43. There are quite a few existing features not covered by the KIP. It would be useful to discuss each of those.

> 43.1 I won’t say that compacted topics are rarely used and always small. For example, KStreams uses compacted topics for storing the states and sometimes the size of the topic could be large. While it might be ok to not support compacted topics initially, it would be useful to have a high level idea on how this might be supported down the road so that we don’t have to make incompatible API changes in the future.

As you know, any new APIs will evolve over the next couple of versions and may even be incompatible until they stabilize. But we will design the new APIs with the possible use cases thought through. We can discuss a high-level idea of how compacted topics could be supported, but this is a lower priority for now.

> 43.2 We need to discuss how EOS is supported. In particular, how is the producer state integrated with the remote storage.

Right, EOS needs the producer state snapshots of the log segments. These snapshots can be maintained in remote storage alongside the offset and time indexes. I will update the KIP with the details.

> 43.3 Now that KIP-392 (allow consumers to fetch from closest replica) is implemented, we need to discuss how reading from a follower replica is supported with tier storage.

We plan to support consumer fetch requests on follower replicas with remote log segments. The remote log contains only committed records (up to the log-stable-offset), and this constraint allows us to support the ask here. I will update the KIP to make it clear that this is supported.

> 43.4 We need to discuss how JBOD is supported with tier storage.

I will look into this.

Thanks,
Satish.