Bumping up this thread. -Lari
On Fri, May 20, 2022 at 1:57 Lari Hotari <lhot...@apache.org> wrote:

Hi all,

I started writing this email as feedback to "PIP-157: Bucketing topic metadata to allow more topics per namespace" [3]. It expanded to cover some analysis of the "PIP-45: Pluggable metadata interface" [4] design. (A good introduction to PIP-45 is the StreamNative blog post "Moving Toward a ZooKeeper-Less Apache Pulsar" [5].)

The intention is to start discussions for Pulsar 3.0 and beyond: to bounce ideas and challenge the existing design with good intentions and for the benefit of all.

I'll share some thoughts that have come up in discussions with my colleague Michael Marshall. We have been bouncing ideas together, and that has been very helpful in building an understanding of the existing challenges and possible directions for solving them. I hope we can have broader conversations in the Pulsar community about improving Pulsar's metadata management and load balancing designs in the long term.

There are a few areas where there are challenges with the current Metadata Store / PIP-45 solution:

1) Metadata consistency from the user's point of view
- Summarized well in this great analysis and comment [1] by Zac Bentley: "Ideally, the resolution of all of these issues would be the same: a management API operation--any operation--should not return successfully until all observable side effects of that operation across a Pulsar cluster (including brokers, proxies, bookies, and ZK) were completed." (See [1] for the full analysis and comment.)

2) Metadata consistency issues within Pulsar
- There are issues where a single broker is left in a bad state as a result of consistency and concurrency problems in metadata handling and caching. Possible example: https://github.com/apache/pulsar/issues/13946

3) Scalability issue: all metadata changes are broadcast to all brokers
- This model doesn't scale out.
- It is due to the change made in https://github.com/apache/pulsar/pull/11198, "Use ZK persistent watches".
- The global broadcasting of metadata changes doesn't follow typical scalable design principles such as the "Scale cube". It will impose limits on Pulsar clusters with a large number of brokers. The current metadata change notification solution doesn't support scaling out, since it is based on a design that broadcasts all notifications to every participant.

While doing some initial analysis and brainstorming on the above areas, the thought has come up that the PIP-45 Metadata Store API [2] abstractions are not optimal.

A lot of the functionality provided by the PIP-45 Metadata Store API [2] could be handled more efficiently in a way where Pulsar itself is a key part of the metadata storage solution.

For example, listing the topics in a namespace could be a "scatter-gather" query to all "metadata shards" that hold the namespace's topics. There isn't necessarily a need for a centralized external Metadata Store API [2] that replies to all queries. Pulsar metadata handling could move towards a distributed database type of design where consistent hashing plays a key role. Since metadata handling is an internal concern, the interface doesn't need to provide services directly to external users of Pulsar.
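To make the scatter-gather idea a bit more concrete, here is a minimal Java sketch. None of these types (MetadataShard, ShardRouter, ScatterGatherTopicLister) exist in Pulsar; they are hypothetical names, and the routing scheme is only assumed to be something like consistent hashing:

    import java.util.List;
    import java.util.Set;
    import java.util.TreeSet;
    import java.util.concurrent.CompletableFuture;
    import java.util.stream.Collectors;

    // Hypothetical: each shard only knows the topics of the bundles it owns.
    interface MetadataShard {
        CompletableFuture<List<String>> listTopics(String namespace);
    }

    // Hypothetical: maps a namespace's bundles to shards, e.g. via consistent hashing.
    interface ShardRouter {
        List<MetadataShard> shardsFor(String namespace);
    }

    class ScatterGatherTopicLister {
        private final ShardRouter router;

        ScatterGatherTopicLister(ShardRouter router) {
            this.router = router;
        }

        CompletableFuture<Set<String>> listTopics(String namespace) {
            // Scatter: ask every shard that owns a part of the namespace.
            List<CompletableFuture<List<String>>> partials = router.shardsFor(namespace).stream()
                    .map(shard -> shard.listTopics(namespace))
                    .collect(Collectors.toList());
            // Gather: once every shard has answered, merge and sort the partial results.
            return CompletableFuture.allOf(partials.toArray(new CompletableFuture[0]))
                    .thenApply(ignored -> partials.stream()
                            .flatMap(f -> f.join().stream())
                            .collect(Collectors.toCollection(TreeSet::new)));
        }
    }

The point of the sketch is only that the aggregation can live above the shards, so no single centralized store needs to be able to answer the whole query by itself.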
The Pulsar Admin API should also be improved to scale for queries and listings of namespaces with millions of topics, and it should have pagination to limit result sizes. The implementation can internally handle possible "scatter-gather" queries when the metadata handling backend is not centralized. The point is that the Metadata Store API [2] abstraction doesn't necessarily need to provide a service for this, since it could be a separate concern.

Most of the complexity in the current PIP-45 Metadata Store comes from data consistency challenges. The solution is heavily based on caches and on ways to handle cache expiration and keep data consistent. There are gaps in the caching solution, since there are metadata consistency problems, as described in 1) and 2) above. A lot of these problems go away in a model where most processing and data access is local, similar to how the broker handles topics: a topic is owned by a single broker at a time. That approach could be extended to cover metadata changes and queries.

What is interesting here regarding PIP-157 is that the brainstorming led to a sharding (aka "bucketing") solution, where there are metadata shards in the system:

    metadata shard
      |
    namespace bundle (existing)
      |
    namespace (existing)

Instead of having a specific solution in mind for the storage of the metadata, the main idea is that each metadata shard is independent and able to perform operations without coordination with other metadata shards. This does impact the storage of metadata, so that operations on the storage system can be isolated (for example, it must be possible to list the topics of a bundle without listing everything; PIP-157 provides one type of solution for this). We didn't let the existing solution limit our brainstorming.

Since there is metadata that needs to be available in multiple locations in the system, such as tenant and namespace level policies, it would be easier to handle the consistency aspects with a model that is not based on CRUD-type operations, but is instead event sourced, where the state can be rebuilt from events (with the possibility of state snapshots). There could be an internal metadata replication protocol which ensures consistency when needed (some type of acknowledgement when followers have caught up with changes from the leader):

    metadata shard leader
      |
    metadata shard follower (namespace bundle, for example)

The core principle is that all write operations are always redirected to the leader, which is the single writer for a shard. The followers get events for the changes, and they can also notify the leader each time they have caught up. This would be one way to solve "1) Metadata consistency from the user's point of view" without a complex metadata cache invalidation solution. It would also solve "2) Metadata consistency issues within Pulsar". In event sourcing, events are the truth, and there are better ways to ensure "cache consistency" in a leader-follower model based on event sourcing.
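As a rough illustration of that idea (hypothetical names only, nothing below exists in Pulsar), a follower's state could be a pure function of the event log, and the offset of the last applied event is what the follower reports back to the leader when acknowledging that it has caught up:

    import java.util.Set;
    import java.util.TreeSet;

    // Hypothetical event type; a real protocol would carry many more event kinds.
    class MetadataEvent {
        enum Type { TOPIC_CREATED, TOPIC_DELETED }
        final Type type;
        final String topicName;

        MetadataEvent(Type type, String topicName) {
            this.type = type;
            this.topicName = topicName;
        }
    }

    class MetadataShardState {
        private final Set<String> topics = new TreeSet<>();
        private long lastAppliedOffset = -1;

        // Events are the source of truth; the state is rebuilt purely by replaying them.
        void apply(long offset, MetadataEvent event) {
            if (offset <= lastAppliedOffset) {
                return; // already applied, e.g. during replay after restoring a snapshot
            }
            switch (event.type) {
                case TOPIC_CREATED -> topics.add(event.topicName);
                case TOPIC_DELETED -> topics.remove(event.topicName);
            }
            lastAppliedOffset = offset;
        }

        // Reported back to the leader so it knows how far this follower has caught up.
        long lastAppliedOffset() {
            return lastAppliedOffset;
        }

        Set<String> topics() {
            return Set.copyOf(topics);
        }
    }

A leader that tracks these acknowledged offsets could then hold an admin API response until the relevant followers have applied the change, which is exactly the user-visible consistency described in 1).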
Everything above is just initial brainstorming, but it seems to be going in a different direction than where PIP-45 currently is. Abstractions for coordination, such as leader election and distributed locks, will still be necessary, and some external metadata would have to be managed in a centralized fashion. In general, the model would be somewhat different from what PIP-45 has. Since the core idea would be to use an event-sourced model, it would be optimal to use BookKeeper ledgers (Pulsar managed ledgers) for storing the events.

With the nature of event sourcing, it would be possible to create point-in-time backup and restore solutions for Pulsar metadata. Even today, it is very rare for Pulsar users to go directly to ZooKeeper to observe the state of the metadata. In an event-sourced system, this state could be stored in flat files on disk if that is needed for debugging and observability purposes besides backup and restore. Metadata events could possibly also be exposed externally for building efficient management tooling for Pulsar.

Metadata handling also extends to Pulsar load balancing, and that should be considered as well when revisiting the design of PIP-45 to address the current challenges. There are also aspects of metadata where changes aren't immediate. For example, deleting a topic requires deleting the underlying data stored in BookKeeper. If that operation fails, there should be ways to keep retrying, and a similar approach applies to creation. Some operations might be asynchronous, and having support for a state machine for creation and deletion could be helpful. The point is that it's not optimal to model a topic deletion as an atomic operation. The state change should be atomic, but the deletion from the metadata storage should not happen until all asynchronous operations have completed. The caller of the metadata admin interface should be able to proceed once the topic is marked deleted, while the system keeps managing the deletions in the background. Similarly, topic creation could have more states, to deal with efficient creation of a large number of topics.
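A minimal sketch of what such a state machine could look like (again, hypothetical names only): the admin call atomically flips the topic to a DELETING state and returns, while a background worker keeps retrying the ledger deletion and only then moves the topic to DELETED, which here stands in for removing the entry from metadata storage:

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    enum TopicLifecycleState { ACTIVE, DELETING, DELETED }

    class TopicMetadata {
        volatile TopicLifecycleState state = TopicLifecycleState.ACTIVE;
    }

    class TopicDeletionWorker {
        private final ScheduledExecutorService executor =
                Executors.newSingleThreadScheduledExecutor();

        // Called by the admin API: the atomic, user-visible step is the state change.
        CompletableFuture<Void> markDeleted(TopicMetadata topic) {
            topic.state = TopicLifecycleState.DELETING;
            scheduleCleanup(topic, 0);
            return CompletableFuture.completedFuture(null);
        }

        // Background cleanup, retried with a capped backoff until the data is gone.
        private void scheduleCleanup(TopicMetadata topic, long delaySeconds) {
            executor.schedule(() -> {
                deleteLedgers(topic)
                        .thenRun(() -> topic.state = TopicLifecycleState.DELETED)
                        .exceptionally(ex -> {
                            scheduleCleanup(topic, Math.min(Math.max(1, delaySeconds * 2), 60));
                            return null;
                        });
            }, delaySeconds, TimeUnit.SECONDS);
        }

        // Placeholder for the asynchronous deletion of the topic's BookKeeper ledgers.
        private CompletableFuture<Void> deleteLedgers(TopicMetadata topic) {
            return CompletableFuture.completedFuture(null);
        }
    }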
This was a long email covering a subject that we haven't dealt with before in the Pulsar community. Usually, we have discussions about solutions that are very targeted. It isn't common to transparently discuss existing design challenges or problems and find ways to solve them together. Sharing observations about problems would be valuable. High-level problems don't get reported in the GitHub issue tracker since they aren't individual bugs. We should also find ways to address this type of challenge in the community.

I hope we can change this and take the opportunity to meet at Pulsar community meetings and have more of these in-depth discussions that will help us improve Pulsar for the benefit of us all in the Apache Pulsar community.

Since PIP-157 [3] is proceeding, I see it as an opportunity to start taking the design of Pulsar metadata handling in a direction where we can address the challenges that Pulsar currently has with metadata handling and load balancing. We must decide together what that direction is. I hope this email adds some new aspects to the basis of these decisions, and I hope that you, the reader, will participate, share your views, and help develop this direction.

PIP-157 [3] assumes that "Pulsar is able to manage millions of topics but the number of topics within a single namespace is limited by metadata storage." Does this assumption hold?

For example, "3) Scalability issue: all metadata changes are broadcast to all brokers" will become a challenge in a large system with a high number of brokers. Together with the other metadata consistency challenges (1 and 2 above), I suspect that after PIP-157 is implemented, the bottlenecks will move to these areas. In that sense, it might be a band-aid that doesn't address the root cause of Pulsar's metadata handling scalability challenges.

Let's discuss and address the challenges together!

Regards,

-Lari

[1] Analysis about metadata consistency from the user's point of view: https://github.com/apache/pulsar/issues/12555#issuecomment-955748744
[2] MetadataStore interface: https://github.com/apache/pulsar/blob/master/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/api/MetadataStore.java
[3] PIP-157: Bucketing topic metadata to allow more topics per namespace: https://github.com/apache/pulsar/issues/15254
[4] PIP-45: Pluggable metadata interface: https://github.com/apache/pulsar/wiki/PIP-45%3A-Pluggable-metadata-interface
[5] StreamNative blog, "Moving Toward a ZooKeeper-Less Apache Pulsar": https://streamnative.io/blog/release/2022-01-25-moving-toward-a-zookeeperless-apache-pulsar/