Hi all,

I started writing this email as feedback to "PIP-157: Bucketing topic metadata to allow more topics per namespace" [3]. The email expanded to also cover some analysis of the "PIP-45: Pluggable metadata interface" [4] design. (A good introduction to PIP-45 is the StreamNative blog post "Moving Toward a ZooKeeper-Less Apache Pulsar" [5].)
The intention is to start discussions for Pulsar 3.0 and beyond: to bounce ideas and challenge the existing design with good intentions and for the benefit of all. I'll share some thoughts that have come up in discussions with my colleague Michael Marshall. Bouncing ideas together has been very helpful in starting to build an understanding of the existing challenges and a possible direction for solving them. I hope we can have broader conversations in the Pulsar community about improving Pulsar's metadata management and load balancing designs in the long term.

There are a few areas where the current Metadata Store / PIP-45 solution has challenges:

1) Metadata consistency from the user's point of view
- Summarized well in this great analysis and comment [1] by Zac Bentley: "Ideally, the resolution of all of these issues would be the same: a management API operation--any operation--should not return successfully until all observable side effects of that operation across a Pulsar cluster (including brokers, proxies, bookies, and ZK) were completed." (see [1] for the full analysis and comment)

2) Metadata consistency issues within Pulsar
- There are issues where the state in a single broker is left in a bad state as a result of consistency and concurrency issues with metadata handling and caching. A possible example: https://github.com/apache/pulsar/issues/13946

3) Scalability issue: all metadata changes are broadcast to all brokers - the model doesn't scale out
- This is due to the change made in https://github.com/apache/pulsar/pull/11198, "Use ZK persistent watches".
- The global broadcasting of metadata changes doesn't follow typical scalable design principles such as the "scale cube". This will pose limits on Pulsar clusters with a large number of brokers. The current metadata change notification solution cannot scale out when its design broadcasts every notification to every participant.

While doing some initial analysis and brainstorming on the above areas, we have come to think that the PIP-45 Metadata Store API [2] abstractions are not optimal. A lot of the functionality provided by the PIP-45 Metadata Store API interface [2] could be handled more efficiently in a way where Pulsar itself would be a key part of the metadata storage solution. For example, listing topics in a namespace could be a "scatter-gather" query to all "metadata shards" that hold the namespace's topics; there's not necessarily a need for a centralized external Metadata Store API interface [2] that replies to all queries (see the sketch below). Pulsar metadata handling could move towards a distributed-database type of design where consistent hashing plays a key role. Since metadata handling is an internal concern, the interface doesn't need to provide services directly to external users of Pulsar.

The Pulsar Admin API should also be improved to scale for queries and listings of namespaces with millions of topics, and it should have pagination to limit results. That implementation can internally handle the possible "scatter-gather" queries when the metadata handling backend is not centralized. The point is that the Metadata Store API [2] abstraction doesn't necessarily need to provide a service for this, since it could be a separate concern.

Most of the complexity in the current PIP-45 Metadata Store comes from data consistency challenges.
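Coming back to the scatter-gather idea above, here is a minimal sketch in Java of what listing topics across independent metadata shards could look like, with consistent hashing deciding which shard owns which topic. Every name here (MetadataShard, ShardRing, NamespaceTopicLister) is hypothetical and only illustrates the direction; none of this is an existing Pulsar API.

    import java.util.*;
    import java.util.concurrent.CompletableFuture;
    import java.util.stream.Collectors;

    /** Hypothetical view of a single, independently owned metadata shard. */
    interface MetadataShard {
        /** Lists the topics of the given namespace that this shard owns. */
        CompletableFuture<List<String>> listTopics(String namespace);
    }

    /** Minimal consistent-hash ring that maps keys (e.g. topic names) to shards. */
    class ShardRing {
        private final NavigableMap<Integer, MetadataShard> ring = new TreeMap<>();

        ShardRing(List<MetadataShard> shards, int virtualNodes) {
            for (int s = 0; s < shards.size(); s++) {
                for (int v = 0; v < virtualNodes; v++) {
                    ring.put(Objects.hash("shard-" + s + "-" + v), shards.get(s));
                }
            }
        }

        /** The shard that owns a given key; used for single-topic lookups and writes. */
        MetadataShard shardFor(String key) {
            Map.Entry<Integer, MetadataShard> entry = ring.ceilingEntry(key.hashCode());
            return (entry != null ? entry : ring.firstEntry()).getValue();
        }

        Collection<MetadataShard> allShards() {
            return new HashSet<>(ring.values());
        }
    }

    /** Scatter-gather: query every shard in parallel and merge the partial results. */
    class NamespaceTopicLister {
        private final ShardRing ring;

        NamespaceTopicLister(ShardRing ring) {
            this.ring = ring;
        }

        CompletableFuture<List<String>> listTopics(String namespace) {
            List<CompletableFuture<List<String>>> partials = ring.allShards().stream()
                    .map(shard -> shard.listTopics(namespace))
                    .collect(Collectors.toList());
            return CompletableFuture.allOf(partials.toArray(new CompletableFuture[0]))
                    .thenApply(ignored -> partials.stream()
                            .flatMap(future -> future.join().stream())
                            .sorted()
                            .collect(Collectors.toList()));
        }
    }

The Admin API could expose the same query with a pagination limit so that the merge step returns a bounded page instead of materializing millions of topic names at once.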
The current PIP-45 solution is heavily based on caches and on ways to handle cache expiration while keeping data consistent. There are gaps in the caching solution, since there are metadata consistency problems, as described in 1) and 2) above. A lot of the problems go away in a model where most processing and data access is local, similar to how the broker handles topics: a topic is owned by a single broker at a time. The same approach could be extended to cover metadata changes and queries.

What is interesting here regarding PIP-157 is that the brainstorming led to a sharding (aka "bucketing") solution, where there are metadata shards in the system:

    metadata shard | namespace bundle (existing) | namespace (existing)

Instead of having a specific solution in mind for storing the metadata, the main idea is that each metadata shard is independent and able to perform operations without coordination with other metadata shards. This does impact the storage of metadata so that operations against the storage system can be isolated (for example, it is necessary to be able to list the topics for a bundle without listing everything; PIP-157 provides one type of solution for this). We didn't let the existing solution limit our brainstorming.

Since there is metadata that needs to be available in multiple locations in the system, such as tenant- and namespace-level policies, it would be easier to handle the consistency aspects with a model that is not based on CRUD-type operations but is instead event sourced, where the state can be rebuilt from events (with the possibility of state snapshots). There could be an internal metadata replication protocol which ensures consistency when needed (some type of acknowledgement when followers have caught up with changes from the leader):

    metadata shard leader | metadata shard follower (namespace bundle, for example)

The core principle is that all write operations are always redirected to the leader, which is the single writer for a shard. The followers would receive events for changes, and they could also notify the leader each time they have caught up with the changes. This would be one way to solve "1) Metadata consistency from the user's point of view" without a complex metadata cache invalidation solution. It would also solve "2) Metadata consistency issues within Pulsar". In event sourcing, the events are the truth, and there are better ways to ensure "cache consistency" in a leader-follower model based on event sourcing. Rough sketches of both the event-sourced state and the leader-follower protocol are included further below.

Everything above is just initial brainstorming, but it seems to be going in a different direction than where PIP-45 currently is. Abstractions for coordination, such as leader election and distributed locks, will still be necessary, and some external metadata would still have to be managed in a centralized fashion. In general, the model would be somewhat different from what PIP-45 has. Since the core idea would be to use an event-sourced model, it would be optimal to use BookKeeper ledgers (Pulsar managed ledgers) for storing the events. Given the nature of event sourcing, it would be possible to create point-in-time backup and restore solutions for Pulsar metadata. Even today, it is very rare for Pulsar users to go directly to ZooKeeper to observe the state of the metadata. In an event-sourced system, this state could be stored in flat files on disk if that is needed for debugging and observability purposes besides backup and restore.
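As a rough illustration of the event-sourced model, the following sketch (Java 17 syntax, all names hypothetical) shows shard state that is only ever derived from an ordered event log, with an optional snapshot to shorten recovery. In the direction described above, the event log itself could live in a BookKeeper ledger.

    import java.util.*;

    /** Hypothetical metadata events; the ordered log of these events is the source of truth. */
    sealed interface MetadataEvent permits TopicCreated, TopicDeleted {}
    record TopicCreated(String topic) implements MetadataEvent {}
    record TopicDeleted(String topic) implements MetadataEvent {}

    /** Shard state that can always be rebuilt by replaying events, optionally from a snapshot. */
    class MetadataShardState {
        private final Set<String> topics = new HashSet<>();
        private long appliedOffset = -1;

        /** Applies a single event at the given log offset; replay is idempotent. */
        void apply(long offset, MetadataEvent event) {
            if (offset <= appliedOffset) {
                return; // already applied during an earlier replay
            }
            if (event instanceof TopicCreated created) {
                topics.add(created.topic());
            } else if (event instanceof TopicDeleted deleted) {
                topics.remove(deleted.topic());
            }
            appliedOffset = offset;
        }

        /** Rebuilds the state from a snapshot plus the events recorded after it. */
        static MetadataShardState rebuild(Snapshot snapshot,
                                          SortedMap<Long, MetadataEvent> eventsAfterSnapshot) {
            MetadataShardState state = new MetadataShardState();
            state.topics.addAll(snapshot.topics());
            state.appliedOffset = snapshot.offset();
            eventsAfterSnapshot.forEach(state::apply);
            return state;
        }

        /** A point-in-time snapshot; with the event log this also enables backup and restore. */
        Snapshot snapshot() {
            return new Snapshot(appliedOffset, Set.copyOf(topics));
        }

        record Snapshot(long offset, Set<String> topics) {}
    }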
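The single-writer, leader-follower idea could look roughly like the sketch below: writes go only to the shard leader, followers replay the resulting events and acknowledge the offsets they have caught up to, and a management operation completes only once every follower is up to date. It reuses the hypothetical MetadataEvent type from the previous sketch; the replication transport is deliberately left as a placeholder.

    import java.util.Map;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.concurrent.atomic.AtomicLong;

    /** Hypothetical single-writer shard leader that tracks follower catch-up acknowledgements. */
    class MetadataShardLeader {
        private final AtomicLong nextOffset = new AtomicLong();
        private final CopyOnWriteArrayList<String> followers = new CopyOnWriteArrayList<>();
        private final Map<String, Long> followerAckedOffsets = new ConcurrentHashMap<>();
        private final Map<Long, CompletableFuture<Void>> pendingWrites = new ConcurrentHashMap<>();

        void addFollower(String followerId) {
            followerAckedOffsets.put(followerId, -1L);
            followers.add(followerId);
        }

        /**
         * Appends an event as the single writer for this shard. The returned future completes
         * only once every follower has acknowledged the event's offset, so the caller of a
         * management operation observes a consistent view across the shard. (Quorum-based
         * completion and the zero-follower case are left out of this sketch.)
         */
        CompletableFuture<Void> write(MetadataEvent event) {
            long offset = nextOffset.getAndIncrement();
            CompletableFuture<Void> allCaughtUp = new CompletableFuture<>();
            pendingWrites.put(offset, allCaughtUp);
            replicateToFollowers(offset, event);
            return allCaughtUp;
        }

        /** Called when a follower reports the highest offset it has applied. */
        void onFollowerAck(String followerId, long ackedOffset) {
            followerAckedOffsets.put(followerId, ackedOffset);
            long minAcked = followers.stream()
                    .mapToLong(f -> followerAckedOffsets.getOrDefault(f, -1L))
                    .min()
                    .orElse(ackedOffset);
            pendingWrites.entrySet().removeIf(entry -> {
                if (entry.getKey() <= minAcked) {
                    entry.getValue().complete(null);
                    return true;
                }
                return false;
            });
        }

        private void replicateToFollowers(long offset, MetadataEvent event) {
            // Placeholder: in the sketched design the event would be appended to the shard's
            // event log (for example a BookKeeper ledger) and streamed to followers from there.
        }
    }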
Metadata events could possibly also be exposed externally for building efficient management tooling for Pulsar.

Metadata handling also extends to Pulsar load balancing, and that should be considered as well when revisiting the design of PIP-45 to address the current challenges.

There are also aspects of metadata where changes aren't immediate. For example, deleting a topic requires deleting the underlying data stored in BookKeeper. If that operation fails, there should be ways to keep retrying; a similar approach applies to creation. Some operations might be asynchronous, and support for a state machine for creation and deletion could be helpful. The point is that it's not optimal to model topic deletion as an atomic operation. The state change should be atomic, but the deletion from the metadata storage should not happen until all asynchronous operations have completed. The caller of the metadata admin interface should be able to proceed once the topic is marked deleted, while the system keeps managing the deletion in the background (a sketch of such a state machine is included at the end of this email, before the references). Similarly, the creation of topics could have more states to support efficient creation of a large number of topics.

This was a long email covering a subject that we haven't dealt with before in the Pulsar community. Usually, we have discussions about solutions that are very targeted. It isn't common to transparently discuss existing design challenges or problems and find ways to solve them together. Sharing observations about problems would be valuable; high-level problems don't get reported in the GitHub issue tracker since they aren't individual bugs. We should also find ways to address this type of challenge in the community. I hope we can change this and take the opportunity to meet at Pulsar community meetings and have more of these in-depth discussions that will help us improve Pulsar for the benefit of us all in the Apache Pulsar community.

Since PIP-157 [3] is proceeding, I see that as an opportunity to start taking the design of Pulsar metadata handling in a direction where we can address the current challenges with metadata handling and load balancing. We must decide together what that direction is, and I hope this email opens some new aspects as a basis for these decisions. I'm hoping that you, the reader of this email, will participate, share your views, and help develop this direction.

PIP-157 [3] assumes that "Pulsar is able to manage millions of topics but the number of topics within a single namespace is limited by metadata storage." Does this assumption hold? For example, "3) Scalability issue: all metadata changes are broadcast to all brokers" will become a challenge in a large system with a high number of brokers. Together with the other metadata consistency challenges (1 and 2 above), I suspect that after PIP-157 is implemented, the bottlenecks will simply move to these areas. In that sense, it might be a band-aid that doesn't address the root cause of Pulsar's metadata handling scalability challenges.

Let's discuss and address the challenges together!
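To illustrate the non-atomic deletion discussed above, here is a minimal sketch of a deletion state machine with background retries. The states, the TopicDeletionTask interface, and the retry interval are all hypothetical and only show the shape of the idea; this is not an existing Pulsar mechanism.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    /** Hypothetical lifecycle states for a topic whose deletion is not atomic. */
    enum TopicLifecycleState {
        ACTIVE,             // topic is in normal use
        DELETION_REQUESTED, // admin call has returned; topic is already hidden from clients
        DATA_DELETED,       // underlying BookKeeper ledgers have been removed
        METADATA_DELETED    // metadata entry removed; terminal state
    }

    /** Hypothetical handle to a single topic's deletion, persisted together with the metadata. */
    interface TopicDeletionTask {
        TopicLifecycleState state();
        void transitionTo(TopicLifecycleState newState);
        void deleteLedgers() throws Exception;
        void deleteMetadata() throws Exception;
    }

    /** Drives a topic from DELETION_REQUESTED to METADATA_DELETED with background retries. */
    class TopicDeletionWorker {
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        /** Marks the topic deleted immediately and keeps retrying the cleanup in the background. */
        void requestDeletion(TopicDeletionTask task) {
            task.transitionTo(TopicLifecycleState.DELETION_REQUESTED); // atomic state change
            scheduler.submit(() -> runStep(task)); // the admin caller does not wait for cleanup
        }

        private void runStep(TopicDeletionTask task) {
            try {
                switch (task.state()) {
                    case DELETION_REQUESTED -> {
                        task.deleteLedgers(); // may fail, e.g. BookKeeper temporarily unavailable
                        task.transitionTo(TopicLifecycleState.DATA_DELETED);
                    }
                    case DATA_DELETED -> {
                        task.deleteMetadata(); // only now remove the entry from metadata storage
                        task.transitionTo(TopicLifecycleState.METADATA_DELETED);
                    }
                    default -> { /* ACTIVE or METADATA_DELETED: nothing to do */ }
                }
            } catch (Exception retryLater) {
                // keep retrying in the background instead of failing the original admin call
                scheduler.schedule(() -> runStep(task), 30, TimeUnit.SECONDS);
                return;
            }
            if (task.state() != TopicLifecycleState.METADATA_DELETED) {
                scheduler.submit(() -> runStep(task));
            }
        }
    }

The important property is that the admin call can return as soon as the topic is marked DELETION_REQUESTED, while the metadata entry is removed only after the ledger cleanup has actually succeeded.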
Regards,

-Lari

[1] Analysis about metadata consistency from the user's point of view: https://github.com/apache/pulsar/issues/12555#issuecomment-955748744
[2] MetadataStore interface: https://github.com/apache/pulsar/blob/master/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/api/MetadataStore.java
[3] PIP-157: Bucketing topic metadata to allow more topics per namespace: https://github.com/apache/pulsar/issues/15254
[4] PIP-45: Pluggable metadata interface: https://github.com/apache/pulsar/wiki/PIP-45%3A-Pluggable-metadata-interface
[5] StreamNative blog, "Moving Toward a ZooKeeper-Less Apache Pulsar": https://streamnative.io/blog/release/2022-01-25-moving-toward-a-zookeeperless-apache-pulsar/