This is a huge milestone, but also a challenge for implementing pluggable metadata storage. Will the plan eventually move from providing pluggable metadata storage to internalizing the distributed coordination functionality within Pulsar itself?
Lari Hotari <lhot...@apache.org> wrote on Tue, Aug 16, 2022 at 11:17:
> Bumping up this thread.
>
> -Lari
>
> On Fri, May 20, 2022 at 1:57 Lari Hotari <lhot...@apache.org> wrote:
> >
> > Hi all,
> >
> > I started writing this email as feedback to "PIP-157: Bucketing topic metadata to allow more topics per namespace" [3]. It expanded to cover some analysis of the "PIP-45: Pluggable metadata interface" [4] design. (A good introduction to PIP-45 is the StreamNative blog post "Moving Toward a ZooKeeper-Less Apache Pulsar" [5].)
> >
> > The intention is to start discussions for Pulsar 3.0 and beyond, bouncing ideas and challenging the existing design with good intentions, for the benefit of all.
> >
> > I'll share some thoughts that have come up in discussions with my colleague Michael Marshall. We have been bouncing ideas around, and that has been very helpful in building an understanding of the existing challenges and of possible directions for solving them. I hope we can have broader conversations in the Pulsar community about improving Pulsar's metadata management and load balancing designs in the long term.
> >
> > There are a few areas where there are challenges with the current Metadata Store / PIP-45 solution:
> >
> > 1) Metadata consistency from the user's point of view
> > - Summarized well in this great analysis and comment [1] by Zac Bentley: "Ideally, the resolution of all of these issues would be the same: a management API operation--any operation--should not return successfully until all observable side effects of that operation across a Pulsar cluster (including brokers, proxies, bookies, and ZK) were completed." (See [1] for the full analysis and comment.)
> >
> > 2) Metadata consistency issues within Pulsar
> > - There are cases where a single broker gets left in a bad state as a result of consistency and concurrency issues in metadata handling and caching. A possible example: https://github.com/apache/pulsar/issues/13946
> >
> > 3) Scalability issue: all metadata changes are broadcast to all brokers - this model doesn't scale out
> > - This is due to the change made in https://github.com/apache/pulsar/pull/11198, "Use ZK persistent watches".
> > - Broadcasting metadata changes globally doesn't follow typical scalable design principles such as the "Scale Cube". It will pose limits on Pulsar clusters with a large number of brokers; a notification design that broadcasts every change to every participant cannot scale out.
> >
> > Initial analysis and brainstorming on the above areas suggests that the PIP-45 Metadata Store API [2] abstractions are not optimal.
> >
> > A lot of the functionality provided by the PIP-45 Metadata Store API interface [2] could be implemented more efficiently in a way where Pulsar itself is a key part of the metadata storage solution.
> >
> > For example, listing the topics in a namespace could be a "scatter-gather" query to all "metadata shards" that hold the namespace's topics. There is not necessarily a need for a centralized external Metadata Store API interface [2] that answers all queries. Pulsar metadata handling could move towards a distributed-database type of design where consistent hashing plays a key role.
> >
> > Since metadata handling is an internal concern, the interface doesn't need to provide services directly to external users of Pulsar. The Pulsar Admin API should also be improved to scale for queries and listings of namespaces with millions of topics, and should have pagination to limit result sizes. Its implementation can internally handle possible "scatter-gather" queries when the metadata backend is not centralized. The point is that the Metadata Store API [2] abstraction doesn't necessarily need to provide this service, since it is a separate concern.
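To make the scatter-gather and consistent-hashing idea above concrete, here is a rough Java sketch. Everything in it is hypothetical (none of these types exist in Pulsar today); it only illustrates routing bundles to shards via a hash ring and paging a fanned-out listing:

    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.concurrent.CompletableFuture;
    import java.util.stream.Collectors;

    // Hypothetical: each shard can answer queries for the bundles it owns,
    // without coordinating with other shards.
    interface MetadataShard {
        CompletableFuture<List<String>> listTopics(String namespace);
    }

    class ShardedMetadataRouter {
        // Consistent-hash ring: hash point -> shard owning that range.
        private final TreeMap<Long, MetadataShard> ring = new TreeMap<>();

        MetadataShard shardFor(String namespaceBundle) {
            long hash = namespaceBundle.hashCode() & 0xffffffffL; // placeholder hash fn
            Map.Entry<Long, MetadataShard> entry = ring.ceilingEntry(hash);
            return (entry != null ? entry : ring.firstEntry()).getValue();
        }

        // Scatter-gather: ask every shard, merge the partial results, then
        // apply offset-based pagination to the merged, sorted list.
        CompletableFuture<List<String>> listTopics(String namespace, int pageIndex, int pageSize) {
            List<CompletableFuture<List<String>>> partials = ring.values().stream()
                    .distinct()
                    .map(shard -> shard.listTopics(namespace))
                    .collect(Collectors.toList());
            return CompletableFuture.allOf(partials.toArray(new CompletableFuture[0]))
                    .thenApply(ignored -> partials.stream()
                            .flatMap(f -> f.join().stream())
                            .sorted()
                            .skip((long) pageIndex * pageSize)
                            .limit(pageSize)
                            .collect(Collectors.toList()));
        }
    }

A real implementation would presumably use continuation tokens rather than offsets, so each shard can resume its own cursor instead of re-listing everything per page, but the sketch shows the shape of the fan-out.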
> > Most of the complexity in the current PIP-45 Metadata Store comes from data consistency challenges. The solution is heavily based on caches, on handling cache expiration, and on keeping cached data consistent. There are gaps in the caching solution, since there are metadata consistency problems as described in 1) and 2) above. A lot of the problems go away in a model where most processing and data access is local, similar to how the broker handles topics: a topic is owned by a single broker at a time. That approach could be extended to cover metadata changes and queries.
> >
> > What is interesting regarding PIP-157 is that our brainstorming led to a sharding (aka "bucketing") solution, where there are metadata shards in the system:
> >
> > metadata shard
> >  |
> > namespace bundle (existing)
> >  |
> > namespace (existing)
> >
> > Instead of having a specific solution in mind for storing the metadata, the main idea is that each metadata shard is independent and able to perform operations without coordinating with other metadata shards. This does impact the storage of metadata, so that operations against the storage system can be isolated (for example, it must be possible to list the topics of a bundle without listing everything; PIP-157 provides one type of solution for this). We didn't let the existing solution limit our brainstorming.
> >
> > Since some metadata needs to be available in multiple locations in the system, such as tenant- and namespace-level policies, it would be easier to handle the consistency aspects with a model that is not based on CRUD-type operations but is instead event sourced, where the state can be rebuilt from events (with the possibility of state snapshots). There could be an internal metadata replication protocol that ensures consistency when needed (some type of acknowledgement once followers have caught up with changes from the leader).
> >
> > metadata shard leader
> >  |
> > metadata shard follower (namespace bundle, for example)
> >
> > The core principle is that all write operations are always redirected to the leader, which is the single writer for a shard. The followers get events for changes, and they can notify the leader each time they have caught up. This would be one way to solve "1) Metadata consistency from the user's point of view" without a complex metadata cache invalidation solution. It would also solve "2) Metadata consistency issues within Pulsar". In event sourcing, the events are the truth, and there are better ways to ensure "cache consistency" in a leader-follower model based on event sourcing.
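Again, a purely hypothetical sketch of that single-writer, replay-based shape (these classes don't exist; an in-memory list stands in for the event log, which could be a BookKeeper ledger as suggested below):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Events are the source of truth; shard state is a materialized view of them.
    abstract class MetadataEvent {
        final long sequenceId;
        MetadataEvent(long sequenceId) { this.sequenceId = sequenceId; }
    }

    class TopicCreated extends MetadataEvent {
        final String topic;
        TopicCreated(long seq, String topic) { super(seq); this.topic = topic; }
    }

    // Single writer for the shard: every mutation goes through the leader.
    class MetadataShardLeader {
        private final List<MetadataEvent> log = new ArrayList<>(); // stand-in for a BK ledger
        private long nextSeq = 0;

        synchronized MetadataEvent createTopic(String topic) {
            MetadataEvent event = new TopicCreated(nextSeq++, topic);
            log.add(event); // persist before fanning out to followers
            return event;   // optionally wait for follower acks before replying to the caller
        }
    }

    // Followers never accept writes; they replay the leader's log, so their
    // local view cannot diverge the way an invalidation-based cache can.
    class MetadataShardFollower {
        private final Set<String> topics = new HashSet<>(); // local materialized view
        private long lastApplied = -1;

        void apply(MetadataEvent event) {
            if (event.sequenceId <= lastApplied) {
                return; // idempotent replay
            }
            if (event instanceof TopicCreated) {
                topics.add(((TopicCreated) event).topic);
            }
            lastApplied = event.sequenceId;
            // here the follower would ack lastApplied back to the leader,
            // giving the leader the catch-up signal described above
        }
    }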
> > Everything above is just initial brainstorming, but it seems to be going in a different direction than where PIP-45 currently is. Abstractions for coordination, such as leader election and distributed locks, will still be necessary, and some external metadata would have to be managed in a centralized fashion. In general, the model would be somewhat different from PIP-45's. Since the core idea would be to use an event-sourced model, it would be optimal to use BookKeeper ledgers (Pulsar managed ledgers) for storing the events.
> >
> > Given the nature of event sourcing, it would be possible to create point-in-time backup and restore solutions for Pulsar metadata. Even today, it is very rare for Pulsar users to go directly to ZooKeeper to observe the state of the metadata. In an event-sourced system, this state could be stored in flat files on disk if that is needed for debugging and observability purposes besides backup and restore. Metadata events could possibly also be exposed externally for building efficient management tooling for Pulsar.
> >
> > Metadata handling also extends to Pulsar load balancing, and that should be considered as well when revisiting the design of PIP-45 to address the current challenges. There are also aspects of metadata where changes aren't immediate. For example, deleting a topic requires deleting the underlying data stored in BookKeeper; if that operation fails, there should be a way to keep retrying. A similar approach applies to creation. Some operations may be asynchronous, and support for a state machine for creation and deletion could be helpful. The point is that it is not optimal to model topic deletion as an atomic operation. The state change should be atomic, but the removal from the metadata storage should not happen until all asynchronous operations have completed. The caller of the metadata admin interface should be able to proceed once the topic is marked deleted, while the system keeps managing the deletion in the background. Similarly, topic creation could have more states, to support efficient creation of a large number of topics.
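A minimal sketch of that non-atomic deletion idea (hypothetical names throughout; a real implementation would persist the state in the metadata store rather than a field):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    enum TopicState { ACTIVE, DELETING, DELETED }

    class TopicLifecycle {
        private volatile TopicState state = TopicState.ACTIVE;
        private final ScheduledExecutorService retryExecutor =
                Executors.newSingleThreadScheduledExecutor();

        // The admin call returns as soon as the DELETING state change is
        // recorded; cleanup of BookKeeper data happens in the background.
        void delete() {
            state = TopicState.DELETING; // the atomic, persisted state change
            scheduleCleanup(0);
        }

        private void scheduleCleanup(long delaySeconds) {
            retryExecutor.schedule(() -> {
                try {
                    deleteUnderlyingLedgers();  // may fail; failure is expected
                    state = TopicState.DELETED; // only now remove the metadata entry
                } catch (Exception e) {
                    scheduleCleanup(Math.min(delaySeconds * 2 + 1, 60)); // retry with backoff
                }
            }, delaySeconds, TimeUnit.SECONDS);
        }

        private void deleteUnderlyingLedgers() throws Exception {
            // placeholder for the asynchronous BookKeeper deletion
        }
    }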
> > This was a long email, covering a subject we haven't dealt with before in the Pulsar community. Usually our discussions are about very targeted solutions. It isn't common to transparently discuss existing design challenges or problems and find ways to solve them together, yet sharing observations about problems would be valuable. High-level problems don't get reported in the GitHub issue tracker, since they aren't individual bugs. We should find ways to address this type of challenge in the community as well.
> >
> > I hope we can change this, and also take the opportunity to meet at Pulsar community meetings and have more of these in-depth discussions that will help us improve Pulsar for the benefit of us all in the Apache Pulsar community.
> >
> > Since PIP-157 [3] is proceeding, I see it as an opportunity to start taking the design of Pulsar metadata handling in a direction where we can address the current challenges in metadata handling and load balancing. We must decide together what that direction is. I hope this email adds some new aspects to the basis of these decisions. I'm hoping that you, the reader of this email, will participate, share your views, and help develop this direction.
> >
> > PIP-157 [3] assumes that "Pulsar is able to manage millions of topics but the number of topics within a single namespace is limited by metadata storage." Does this assumption hold?
> >
> > For example, "3) Scalability issue: all metadata changes are broadcast to all brokers" will become a challenge in a large system with a high number of brokers. Together with the other metadata consistency challenges (1 and 2 above), I suspect that after PIP-157 is implemented, the bottlenecks will move to these areas. In that sense, it might be a band-aid that doesn't address the root cause of Pulsar's metadata handling scalability challenges.
> >
> > Let's discuss and address the challenges together!
> >
> > Regards,
> >
> > -Lari
> >
> > [1] - Analysis of metadata consistency from the user's point of view - https://github.com/apache/pulsar/issues/12555#issuecomment-955748744
> > [2] - MetadataStore interface - https://github.com/apache/pulsar/blob/master/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/api/MetadataStore.java
> > [3] - PIP-157: Bucketing topic metadata to allow more topics per namespace - https://github.com/apache/pulsar/issues/15254
> > [4] - PIP-45: Pluggable metadata interface - https://github.com/apache/pulsar/wiki/PIP-45%3A-Pluggable-metadata-interface
> > [5] - StreamNative's blog "Moving Toward a ZooKeeper-Less Apache Pulsar" - https://streamnative.io/blog/release/2022-01-25-moving-toward-a-zookeeperless-apache-pulsar/

--
BR,
Qiang Huang