Hi Jun,

Thank you for the feedback. I am trying to understand how a push-based approach would work. In order for the metadata to be propagated (under the assumption you stated), would you plan to add a new API in Kafka to allow the metadata store to send it directly to the brokers?
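To make the question concrete, this is roughly the kind of API I am picturing. It is only a sketch for discussion: none of these interfaces exist in Kafka or in the current KIP, and the names (RemoteLogMetadataListener, RemoteLogMetadataSource, RemoteLogSegmentMetadata) are invented here for illustration.

    import java.util.Set;
    import org.apache.kafka.common.TopicPartition;

    // Invented placeholder for whatever per-segment metadata the remote tier
    // keeps for an uploaded segment (base/end offsets, object location,
    // leader epoch, ...).
    interface RemoteLogSegmentMetadata { }

    // Hypothetical push-style callback a broker could register so that the
    // metadata store (or, say, a handler fed by S3 event notifications)
    // notifies it of changes instead of the broker polling for them.
    interface RemoteLogMetadataListener {
        // Invoked when metadata for a newly uploaded remote segment is available.
        void onRemoteSegmentAdded(TopicPartition partition, RemoteLogSegmentMetadata metadata);

        // Invoked when a remote segment has been removed, e.g. by retention.
        void onRemoteSegmentDeleted(TopicPartition partition, RemoteLogSegmentMetadata metadata);
    }

    // Hypothetical subscription entry point a broker would use to receive
    // pushes for the partitions it hosts.
    interface RemoteLogMetadataSource {
        void subscribe(Set<TopicPartition> partitions, RemoteLogMetadataListener listener);
        void unsubscribe(Set<TopicPartition> partitions);
    }

With something along these lines, an external metadata store client (or a small service consuming S3 event notifications) could push changes to the brokers rather than each broker listing remote storage on an interval. Is that roughly the direction you had in mind?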
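For contrast, my mental model of the current pull-based flow, reduced to a sketch (the 30-second default comes from Satish's message below; I am not reproducing the real RemoteStorageManager / listRemoteSegments() signatures here, just the polling shape):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Sketch of the polling shape only: on each tick the RLM would ask the
    // RSM for the remote segment metadata of the partitions it manages,
    // e.g. via listRemoteSegments(), and reconcile its locally cached view.
    public class PullBasedMetadataSketch {
        public static void main(String[] args) {
            ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();
            long intervalSecs = 30; // configured poll interval, 30s by default per the thread

            scheduler.scheduleWithFixedDelay(() -> {
                // With S3, each poll ends up as a LIST-style request, which is
                // where the cost and eventual-consistency concerns raised
                // below come from.
                System.out.println("poll remote segment metadata and reconcile local cache");
            }, 0, intervalSecs, TimeUnit.SECONDS);
        }
    }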
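Separately, just to sanity-check the cost figure mentioned further down the thread: 100,000 partitions polled once per second is 100,000 list requests/sec; at $0.005 per 1,000 requests that is 100,000 x $0.000005 = $0.50/sec, and $0.50/sec x 86,400 sec/day is about $43,200/day, which indeed matches the "roughly $40K per day" estimate.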
Thanks,
Alexandre

On Wed, Dec 18, 2019 at 8:14 PM Jun Rao <j...@confluent.io> wrote:
>
> Hi, Satish and Ivan,
>
> There are different ways of doing push-based metadata propagation. Some object stores may support that already. For example, S3 supports event notifications (https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html). Otherwise one could use a separate metadata store that supports push-based change propagation. Other people have mentioned using a Kafka topic. The best approach may depend on the object store and the operational environment (e.g. whether an external metadata store is already available).
>
> The above discussion is based on the assumption that we need to cache the object metadata locally in every broker. I mentioned earlier that an alternative is to just store/retrieve that metadata in an external metadata store. That may simplify the implementation in some cases.
>
> Thanks,
>
> Jun
>
> On Thu, Dec 5, 2019 at 7:01 AM Satish Duggana <satish.dugg...@gmail.com> wrote:
> >
> > Hi Jun,
> > Thanks for your reply.
> >
> > Currently, `listRemoteSegments` is called at the configured interval (not every second; it defaults to 30 secs). Storing remote log metadata in a strongly consistent store for the S3 RSM was raised in a PR comment[1]. RLM invokes RSM at regular intervals and RSM can give remote segment metadata if it is available. RSM is responsible for maintaining and fetching those entries. It should be based on whatever mechanism is consistent and efficient for the respective remote storage.
> >
> > Can you give more details about the push-based mechanism from RSM?
> >
> > 1. https://github.com/apache/kafka/pull/7561#discussion_r344576223
> >
> > Thanks,
> > Satish.
> >
> > On Thu, Dec 5, 2019 at 4:23 AM Jun Rao <j...@confluent.io> wrote:
> > >
> > > Hi, Harsha,
> > >
> > > Thanks for the reply.
> > >
> > > 40/41. I am curious which block storages you have tested. S3 seems to be one of the popular block stores. The concerns that I have with the pull-based approach are the following.
> > > (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you have 100,000 partitions and want to pull the metadata for each partition at the rate of 1/sec, it can cost $0.5/sec, which is roughly $40K per day.
> > > (b) Semantics: S3 list objects are eventually consistent. So, when you do a list object request, there is no guarantee that you can see all uploaded objects. This could impact the correctness of subsequent logic.
> > > (c) Efficiency: Blindly pulling metadata when there is no change adds unnecessary overhead in the broker as well as in the block store.
> > >
> > > So, have you guys tested S3? If so, could you share your experience in terms of cost, semantics and efficiency?
> > >
> > > Jun
> > >
> > > On Tue, Dec 3, 2019 at 10:11 PM Harsha Chintalapani <ka...@harsha.io> wrote:
> > > >
> > > > Hi Jun,
> > > > Thanks for the reply.
> > > >
> > > > On Tue, Nov 26, 2019 at 3:46 PM, Jun Rao <j...@confluent.io> wrote:
> > > >
> > > > > Hi, Satish and Ying,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > 40/41. There are two different ways that we can approach this. One is what you said: we can have an opinionated way of storing and populating the tier metadata that we think is good enough for everyone. I am not sure this is the case based on what's currently proposed in the KIP. For example, I am not sure that (1) everyone always needs local metadata, (2) the current local storage format is general enough, and (3) everyone wants to use the pull-based approach to propagate the metadata. Another approach is to make this pluggable and let the implementor implement the best approach for a particular block storage. I haven't seen any comments from Slack/AirBnb in the mailing list on this topic. It would be great if they can provide feedback directly here.
> > > >
> > > > The current interfaces are designed around the most popular block storages available today. We did two implementations with these interfaces, and both are yielding good results as we go through testing. If there is ever a need for a pull-based approach we can definitely evolve the interface; in the past we have marked interfaces as evolving to make room for unknowns in the future. If you have any suggestions around the current interfaces, please propose them and we are happy to see if we can work them in.
> > > >
> > > > > 43. To offer tier storage as a general feature, ideally all existing capabilities should still be supported. It's fine if the Uber implementation doesn't support all capabilities for internal usage. However, the framework should be general enough.
> > > >
> > > > We agree with that as a principle. But most of these major features are only arriving now, and asking a new feature as big as tiered storage to support all of them is a big ask. We can document how we would approach solving these in future iterations. Our goal is to make this tiered storage feature work for everyone.
> > > >
> > > > > 43.3 This is more than just serving the tiered data from block storage. With KIP-392, the consumer can now resolve conflicts with the replica based on leader epoch. So, we need to make sure that the leader epoch can be recovered properly from tier storage.
> > > >
> > > > We are working on testing our approach and we will update the KIP with design details.
> > > >
> > > > > 43.4 For JBOD, if tier storage stores the tier metadata locally, we need to support moving such metadata across disk directories since JBOD supports moving data across disks.
> > > >
> > > > The KIP is updated with JBOD details. Having said that, JBOD tooling needs to evolve to support production loads; most users will be interested in using tiered storage without JBOD support on day 1.
> > > >
> > > > Thanks,
> > > > Harsha
> > > >
> > > > > As for a meeting, we could have a KIP e-meeting on this if needed, but it will be open to everyone and will be recorded and shared. Often, the details are still resolved through the mailing list.
> > > > >
> > > > > Jun
> > > > >
> > > > > On Tue, Nov 19, 2019 at 6:48 PM Ying Zheng <yi...@uber.com.invalid> wrote:
> > > > >
> > > > > Please ignore my previous email. I didn't know Apache requires all the discussions to be "open".
> > > > >
> > > > > On Tue, Nov 19, 2019, 5:40 PM Ying Zheng <yi...@uber.com> wrote:
> > > > >
> > > > > Hi Jun,
> > > > >
> > > > > Thank you very much for your feedback!
> > > > >
> > > > > Can we schedule a meeting in your Palo Alto office in December? I think a face-to-face discussion is much more efficient than emails. Both Harsha and I can visit you. Satish may be able to join us remotely.
> > > > >
> > > > > On Fri, Nov 15, 2019 at 11:04 AM Jun Rao <j...@confluent.io> wrote:
> > > > >
> > > > > Hi, Satish and Harsha,
> > > > >
> > > > > The following is more detailed high-level feedback for the KIP. Overall, the KIP seems useful. The challenge is how to design it such that it's general enough to support different ways of implementing this feature and to support existing features.
> > > > >
> > > > > 40. Local segment metadata storage: The KIP makes the assumption that the metadata for the archived log segments is cached locally in every broker and provides a specific implementation for the local storage in the framework. We probably should discuss this more. For example, some tier storage providers may not want to cache the metadata locally and would just rely upon a remote key/value store if such a store is already present. If a local store is used, there could be different ways of implementing it (e.g., based on customized local files, an embedded local store like RocksDB, etc.). An alternative design is to just provide an interface for retrieving the tier segment metadata and leave the details of how to get the metadata outside of the framework.
> > > > >
> > > > > 41. RemoteStorageManager interface and the usage of the interface in the framework: I am not sure if the interface is general enough. For example, it seems that RemoteLogIndexEntry is tied to a specific way of storing the metadata in remote storage. The framework uses the listRemoteSegments() api in a pull-based approach. However, in some other implementations, a push-based approach may be preferred. I don't have a concrete proposal yet, but it would be useful to give this area some more thought and see if we can make the interface more general.
> > > > >
> > > > > 42. In the diagram, the RemoteLogManager is side by side with LogManager. This KIP only discussed how the fetch request is handled between the two layers. However, we should also consider how other requests that touch the log can be handled, e.g., list offsets by timestamp, delete records, etc. Also, in this model, it's not clear which component is responsible for managing the log start offset. It seems that the log start offset could be changed by both RemoteLogManager and LogManager.
> > > > >
> > > > > 43. There are quite a few existing features not covered by the KIP. It would be useful to discuss each of those.
> > > > > 43.1 I won't say that compacted topics are rarely used and always small. For example, KStreams uses compacted topics for storing the states, and sometimes the size of the topic could be large. While it might be ok not to support compacted topics initially, it would be useful to have a high-level idea on how this might be supported down the road so that we don't have to make incompatible API changes in the future.
> > > > > 43.2 We need to discuss how EOS is supported. In particular, how is the producer state integrated with the remote storage?
> > > > > 43.3 Now that KIP-392 (allow consumers to fetch from the closest replica) is implemented, we need to discuss how reading from a follower replica is supported with tier storage.
> > > > > 43.4 We need to discuss how JBOD is supported with tier storage.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Fri, Nov 8, 2019 at 12:06 AM Tom Bentley <tbent...@redhat.com> wrote:
> > > > >
> > > > > Thanks for those insights Ying.
> > > > >
> > > > > On Thu, Nov 7, 2019 at 9:26 PM Ying Zheng <yi...@uber.com.invalid> wrote:
> > > > >
> > > > > > Thanks, I missed that point. However, there's still a point at which the consumer fetches start getting served from remote storage (even if that point isn't as soon as the local log retention time/size). This represents a kind of performance cliff edge, and what I'm really interested in is how easy it is for a consumer which falls off that cliff to catch up so that its fetches again come from local storage. Obviously this can depend on all sorts of factors (like production rate, consumption rate), so it's not guaranteed (just like it's not guaranteed for Kafka today), but this would represent a new failure mode.
> > > > >
> > > > > As I have explained in the last mail, it's a very rare case that a consumer needs to read remote data. In our experience at Uber, this only happens when the consumer service had an outage for several hours.
> > > > >
> > > > > There is not a "performance cliff" as you assume. The remote storage is even faster than local disks in terms of bandwidth. Reading from remote storage is going to have higher latency than local disk, but since the consumer is catching up on several hours of data, it's not sensitive to sub-second level latency, and each remote read request will read a large amount of data, making the overall performance better than reading from local disks.
> > > > >
> > > > > > Another aspect I'd like to understand better is the effect that serving fetch requests from remote storage has on the broker's network utilization. If we're just trimming the amount of data held locally (without increasing the overall local+remote retention), then we're effectively trading disk bandwidth for network bandwidth when serving fetch requests from remote storage (which I understand to be a good thing, since brokers are often/usually disk bound). But if we're increasing the overall local+remote retention then it's more likely that the network itself becomes the bottleneck. I appreciate this is all rather hand-wavy, I'm just trying to understand how this would affect broker performance, so I'd be grateful for any insights you can offer.
> > > > >
> > > > > Network bandwidth is a function of produce speed; it has nothing to do with remote retention. As long as the data is shipped to remote storage, you can keep it there for 1 day or 1 year or 100 years, and it doesn't consume any network resources.