Hi Jun,

Thank you very much for your feedback!
Can we schedule a meeting in your Palo Alto office in December? I think a
face-to-face discussion is much more efficient than email. Both Harsha and I
can visit you. Satish may be able to join us remotely. In the meantime, to
make the discussion more concrete, I have put a couple of very rough
interface sketches for points 40 and 41 below the quoted thread.

On Fri, Nov 15, 2019 at 11:04 AM Jun Rao <[email protected]> wrote:

> Hi, Satish and Harsha,
>
> The following is more detailed high-level feedback on the KIP. Overall,
> the KIP seems useful. The challenge is how to design it such that it is
> general enough to support different ways of implementing this feature and
> to support existing features.
>
> 40. Local segment metadata storage: The KIP assumes that the metadata for
> the archived log segments is cached locally on every broker, and it
> provides a specific implementation of that local storage in the framework.
> We probably should discuss this more. For example, some tier storage
> providers may not want to cache the metadata locally and may instead rely
> on a remote key/value store if such a store is already present. If a local
> store is used, there are different ways of implementing it (e.g., based on
> customized local files, an embedded local store like RocksDB, etc.). An
> alternative design is to provide only an interface for retrieving the tier
> segment metadata and to leave the details of how the metadata is obtained
> outside the framework.
>
> 41. The RemoteStorageManager interface and how the framework uses it: I am
> not sure the interface is general enough. For example, RemoteLogIndexEntry
> seems tied to a specific way of storing the metadata in remote storage.
> The framework uses the listRemoteSegments() API in a pull-based approach;
> however, in some other implementations a push-based approach may be
> preferred. I don't have a concrete proposal yet, but it would be useful to
> give this area some more thought and see if we can make the interface more
> general.
>
> 42. In the diagram, the RemoteLogManager sits side by side with the
> LogManager. The KIP only discusses how fetch requests are handled between
> the two layers. However, we should also consider how other requests that
> touch the log are handled, e.g., list offsets by timestamp, delete
> records, etc. Also, in this model it is not clear which component is
> responsible for managing the log start offset; it seems the log start
> offset could be changed by both the RemoteLogManager and the LogManager.
>
> 43. There are quite a few existing features not covered by the KIP. It
> would be useful to discuss each of them.
> 43.1 I wouldn't say that compacted topics are rarely used and always
> small. For example, KStreams uses compacted topics for storing state, and
> sometimes those topics can be large. While it might be OK not to support
> compacted topics initially, it would be useful to have a high-level idea
> of how they might be supported down the road, so that we don't have to
> make incompatible API changes in the future.
> 43.2 We need to discuss how EOS is supported; in particular, how the
> producer state is integrated with the remote storage.
> 43.3 Now that KIP-392 (allow consumers to fetch from the closest replica)
> is implemented, we need to discuss how reading from a follower replica is
> supported with tier storage.
> 43.4 We need to discuss how JBOD is supported with tier storage.
>
> Thanks,
>
> Jun
>
> On Fri, Nov 8, 2019 at 12:06 AM Tom Bentley <[email protected]> wrote:
>
> > Thanks for those insights Ying.
> >
> > On Thu, Nov 7, 2019 at 9:26 PM Ying Zheng <[email protected]> wrote:
> >
> > > > Thanks, I missed that point. However, there's still a point at which
> > > > the consumer fetches start getting served from remote storage (even
> > > > if that point isn't as soon as the local log retention time/size).
> > > > This represents a kind of performance cliff edge, and what I'm
> > > > really interested in is how easy it is for a consumer which falls
> > > > off that cliff to catch up so that its fetches again come from local
> > > > storage. Obviously this can depend on all sorts of factors (like
> > > > production rate and consumption rate), so it's not guaranteed (just
> > > > as it's not guaranteed for Kafka today), but this would represent a
> > > > new failure mode.
> > >
> > > As I explained in the last mail, it is a very rare case that a
> > > consumer needs to read remote data. In our experience at Uber, this
> > > only happens when the consumer service has had an outage of several
> > > hours.
> > >
> > > There is not a "performance cliff" as you assume. The remote storage
> > > is even faster than local disks in terms of bandwidth. Reading from
> > > remote storage will have higher latency than local disk, but since the
> > > consumer is catching up on several hours of data, it is not sensitive
> > > to sub-second latency, and each remote read request reads a large
> > > amount of data, which makes the overall performance better than
> > > reading from local disks.
> > >
> > > > Another aspect I'd like to understand better is the effect that
> > > > serving fetch requests from remote storage has on the broker's
> > > > network utilization. If we're just trimming the amount of data held
> > > > locally (without increasing the overall local+remote retention),
> > > > then we're effectively trading disk bandwidth for network bandwidth
> > > > when serving fetch requests from remote storage (which I understand
> > > > to be a good thing, since brokers are often/usually disk bound). But
> > > > if we're increasing the overall local+remote retention, then it's
> > > > more likely that the network itself becomes the bottleneck. I
> > > > appreciate this is all rather hand-wavy; I'm just trying to
> > > > understand how this would affect broker performance, so I'd be
> > > > grateful for any insights you can offer.
> > >
> > > Network bandwidth is a function of the produce rate; it has nothing to
> > > do with the remote retention. As long as the data has been shipped to
> > > remote storage, you can keep it there for 1 day, 1 year, or 100 years
> > > without consuming any network resources.
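Here is the rough sketch for point 40, assuming the framework only asks a
plugin for tier segment metadata and leaves the storage of that metadata
entirely to the plugin. Every name and type below is invented for
illustration; this is not a proposal for the final API.

import java.util.Optional;

// Illustrative only: a pluggable interface for tier segment metadata. The
// framework would call these methods and would not care whether the plugin
// caches metadata in local files, keeps it in an embedded store such as
// RocksDB, or fetches it on demand from a remote key/value store.
public interface RemoteSegmentMetadataRetriever {

    // Minimal illustrative view of one archived segment.
    final class SegmentMetadata {
        public final long baseOffset;
        public final long endOffset;
        public final String remoteLocation; // opaque pointer into remote storage

        public SegmentMetadata(long baseOffset, long endOffset, String remoteLocation) {
            this.baseOffset = baseOffset;
            this.endOffset = endOffset;
            this.remoteLocation = remoteLocation;
        }
    }

    // Metadata of the archived segment containing the given offset, if any.
    Optional<SegmentMetadata> metadataForOffset(String topic, int partition, long offset);

    // Smallest offset available in remote storage for the partition; one
    // possible input for computing the effective log start offset (point 42).
    long earliestRemoteOffset(String topic, int partition);
}

With something along these lines, whether the metadata lives in local files,
RocksDB, or a remote store becomes an implementation detail of the plugin
rather than of the framework.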

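And here is a rough sketch for point 41, showing one way the framework could
support both the current pull-based listRemoteSegments() style and a
push-based style behind the same interface. Again, the names and signatures
are invented for illustration and this is not a concrete proposal.

import java.util.List;

// Illustrative only: supporting both pull-based and push-based discovery of
// remote segments behind a single interface.
public interface RemoteSegmentSource {

    // Callback used in the push model; a pull-only implementation simply
    // never invokes it.
    interface Listener {
        void onSegmentsAdded(String topic, int partition, List<String> segmentIds);
        void onSegmentsDeleted(String topic, int partition, List<String> segmentIds);
    }

    // Pull model: the framework periodically asks for the segments whose
    // offsets are at or beyond minOffset.
    List<String> listRemoteSegments(String topic, int partition, long minOffset);

    // Push model: the framework registers a listener and the plugin notifies
    // it whenever remote segments are added or deleted.
    void registerListener(Listener listener);
}

An implementation backed by a metadata store that can emit notifications
could then push updates to the broker instead of being polled. We can go
through the trade-offs of the two models when we meet.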