Hi all,

In case someone missed the first message in this thread, it can be read at https://lists.apache.org/thread/tvco1orf0hsyt59pjtfbwoq0vf6hfrcj .
The intention is to start discussions about improving the Apache Pulsar design in Pulsar 3.0. This seems to be the first time that we are discussing Pulsar 3.0 design in the community on the dev mailing list. I haven't asked permission to start this discussion, since there is no need for that. In Apache projects, all individuals, regardless of committer or PMC member status, have the opportunity to participate in discussions and make proposals without asking permission. In the Apache way, private decisions on project direction are indeed disallowed [1], which may be surprising to those who are new to it.

I'd like to clarify that there is no decision about Pulsar 3.0 or any timelines related to it. This is the beginning of the discussion, and no decisions have been made.

The current PIP process in the Apache Pulsar project seems to be largely about documenting design decisions, and it's hard to make changes that would require reversing a previous design decision. In some cases, problems or challenges could be better solved by reversing or replacing some element of the current design.

One example of a fundamental foundation of the current Pulsar design is the concept of "namespace bundles". "Namespace bundles" are defined in the Pulsar reference documentation in this way:

"The assignment of topics to brokers is not done at the topic level but at the bundle level (a higher level). Instead of individual topic assignments, each broker takes ownership of a subset of the topics for a namespace. This subset is called a bundle and effectively this subset is a sharding mechanism."
"Topics are assigned to a particular bundle by taking the hash of the topic name and checking in which bundle the hash falls."
(source: https://pulsar.apache.org/docs/administration-load-balance#dynamic-assignments )

There's also an excellent blog post explaining in detail how Pulsar load balancing currently works: https://streamnative.io/blog/engineering/2022-07-21-achieving-broker-load-balancing-with-apache-pulsar/ .

There is a clear rationale behind the current design. With namespace bundles, topic load balancing operations can be handled in groups (batches) to minimize the overhead of coordination and load balancing operations. However, this design also imposes fundamental limitations: a topic cannot be freely assigned to a bundle or a broker. In the current design, a topic's bundle is determined by calculating a hash of the topic's name, and hash ranges define the bundle boundaries. There is no explicit way to assign a topic to a bundle; splitting a bundle is the only way to influence bundle assignment. This adds a lot of unnecessary complexity to load balancing decisions and prevents optimal topic placement.

The current Pulsar load balancer makes load balancing decisions dynamically, although there are features where placement with predefined rules is used (broker isolation, anti-affinity namespaces across failure domains). The current load balancing actions are heavily impacted and limited by the namespace bundles.

There may be future requirements for affinity and anti-affinity rules for topic placement. It might be useful to have placement rules that take availability zones (or data center racks) into account. Free-form topic placement would also be useful for capacity management.
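To make the hash-based assignment concrete, here is a minimal sketch of the idea in Java. This is a simplification for illustration only, not Pulsar's actual implementation (which lives in the NamespaceBundles code and uses a different, configurable hash function):

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    // Simplified illustration of hash-range based bundle assignment.
    public class BundleAssignmentSketch {

        // N bundles partition the 32-bit hash space into contiguous ranges.
        static long[] boundaries(int numBundles) {
            long maxHash = 0x100000000L; // 2^32
            long[] b = new long[numBundles + 1];
            for (int i = 0; i < numBundles; i++) {
                b[i] = (maxHash / numBundles) * i;
            }
            b[numBundles] = maxHash;
            return b;
        }

        // A topic belongs to the bundle whose hash range contains the hash
        // of its name. There is no way to pick the bundle explicitly; only
        // splitting a bundle (moving a boundary) changes assignments.
        static int bundleFor(String topicName, long[] boundaries) {
            CRC32 crc = new CRC32();
            crc.update(topicName.getBytes(StandardCharsets.UTF_8));
            long hash = crc.getValue(); // value in [0, 2^32)
            for (int i = 0; i < boundaries.length - 1; i++) {
                if (hash >= boundaries[i] && hash < boundaries[i + 1]) {
                    return i;
                }
            }
            throw new IllegalStateException("unreachable");
        }

        public static void main(String[] args) {
            long[] b = boundaries(4);
            System.out.println(bundleFor("persistent://tenant/ns/my-topic", b));
        }
    }

The point is structural: the only control knob in this scheme is where the boundaries sit, which is exactly why splitting a bundle is the only way to influence placement.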
Rate limits could be used as part of load balancing decisions, and there could be rules for how much total possible throughput is allowed on a specific broker when considering the rate limits of the assigned topics.

Rule-based placement could also be useful when it is known that a topic's traffic pattern is bursty (common in batch-oriented processing) and maximum throughput is desired while the traffic burst is being processed. Dynamic load balancing is simply too slow in reacting to such changes in traffic patterns. In these cases, it might be useful to pre-assign the partitions of a partitioned topic across the available brokers in the cluster with anti-affinity rules, instead of making this decision dynamically after the traffic is already running; the traffic burst could be over by the time the load balancer reacts. (A sketch of this idea follows below, before the quoted thread.)

Broker isolation is another use case for free-form topic placement. In the case of capacity issues, or noisy-neighbor performance issues that threaten SLAs, it would be useful to be able to assign a specific topic to a dedicated set of brokers. Broker isolation is currently possible at the namespace level, but having this possibility at the topic level would increase flexibility. On a multi-tenant SaaS platform, this would give more possibilities for meeting QoS/SLA targets, with better ways of ensuring guaranteed throughput and latency through better capacity management.

Achieving autoscaling also requires better capacity management. Unless there's a way to run the existing capacity at a certain level, and to measure and control this, there is no basis for making relevant scale-up and scale-down decisions. These requirements could also be considered in Pulsar load balancing improvements.

In a graceful broker shutdown, the namespace bundles that are owned by the broker are unloaded and released. This operation causes a short interruption in message delivery. The benefit of free-form topic placement would be that in a graceful broker shutdown, producers and consumers could be migrated from one broker to another in a seamless handover, independently of any "namespace bundle". With free-form placement, each topic can start serving producers and consumers on the other broker immediately, without waiting for all topics in the bundle to be migrated. This is the key to preventing downtime and service interruptions in graceful broker shutdowns.

When it's cheap and non-intrusive to migrate topics across brokers, Pulsar load balancing will become more effective. Effective load balancing will help meet SLAs and the defined QoS levels in a cost-effective way. It will also make autoscaling an option without the risk of causing service interruptions or SLA violations.

The current design of namespace bundles is becoming a limitation for future Pulsar improvements. If we instead replace "namespace bundles" by rethinking and revisiting the Pulsar load balancing design, there will be a better way forward for Pulsar 3.0 and beyond. The Pulsar load balancing design depends on the Pulsar metadata solution; these two cannot be separated in a performant, reliable, and cost-efficient solution.

Let's start active discussions about improving the Pulsar design for Pulsar 3.0 and beyond. One practical way to participate in this discussion is to answer these questions: Should we replace namespace bundles in the Pulsar 3.0 design? If so, what should the replacement be?
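Finally, the rule-based pre-assignment of partitions mentioned earlier, as a minimal sketch. This assumes a free-form placement API existed; nothing like this exists in Pulsar today, and all names here are made up. A simple anti-affinity rule spreads partitions round-robin so that no broker owns a second partition before every broker owns one:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch: pre-assign the partitions of a partitioned topic
    // across the available brokers before the traffic burst starts, instead
    // of relying on the dynamic load balancer to react mid-burst.
    public class AntiAffinityPlacementSketch {

        static Map<String, String> assignPartitions(
                String topic, int partitions, List<String> brokers) {
            Map<String, String> assignment = new HashMap<>();
            for (int p = 0; p < partitions; p++) {
                // Round-robin anti-affinity: partition p -> broker p mod N.
                assignment.put(topic + "-partition-" + p,
                        brokers.get(p % brokers.size()));
            }
            return assignment;
        }

        public static void main(String[] args) {
            assignPartitions("persistent://tenant/ns/bursty-topic", 6,
                    List.of("broker-1", "broker-2", "broker-3"))
                .forEach((partition, broker) ->
                        System.out.println(partition + " -> " + broker));
        }
    }

Note that this kind of rule is impossible with hash-based bundles, since the hashes of the partition names, not a placement rule, decide the owning broker.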
Best Regards,

-Lari

[1] expressed in https://www.apache.org/theapacheway/ : "Private decisions
on code, policies, or project direction are disallowed; off-list discourse
and transactions must be brought on-list."

On Tue, Aug 16, 2022 at 6:15 AM Lari Hotari <lhot...@apache.org> wrote:
> Bumping up this thread.
>
> -Lari
>
> On Fri, May 20, 2022 at 1:57 AM Lari Hotari <lhot...@apache.org> wrote:
>
>> Hi all,
>>
>> I started writing this email as feedback to "PIP-157: Bucketing topic
>> metadata to allow more topics per namespace" [3]. It expanded to cover
>> some analysis of the "PIP-45: Pluggable metadata interface" [4] design.
>> (A good introduction to PIP-45 is the StreamNative blog post "Moving
>> Toward a ZooKeeper-Less Apache Pulsar" [5].)
>>
>> The intention is to start discussions for Pulsar 3.0 and beyond:
>> bouncing ideas and challenging the existing design with good intentions,
>> for the benefit of all.
>>
>> I'll share some thoughts that have come up in discussions with my
>> colleague Michael Marshall. We have been bouncing ideas together, and
>> that has been very helpful in building an understanding of the existing
>> challenges and of a possible direction for solving them. I hope that we
>> can have broader conversations in the Pulsar community about improving
>> Pulsar's metadata management and load balancing designs in the long
>> term.
>>
>> There are a few areas where there are challenges with the current
>> Metadata Store / PIP-45 solution:
>>
>> 1) Metadata consistency from the user's point of view
>> - Summarized well in this great analysis and comment [1] by Zac
>> Bentley: "Ideally, the resolution of all of these issues would be the
>> same: a management API operation--any operation--should not return
>> successfully until all observable side effects of that operation across
>> a Pulsar cluster (including brokers, proxies, bookies, and ZK) were
>> completed." (see [1] for the full analysis and comment)
>>
>> 2) Metadata consistency issues within Pulsar
>> - There are issues where a single broker is left in a bad state as a
>> result of consistency and concurrency issues in metadata handling and
>> caching. A possible example:
>> https://github.com/apache/pulsar/issues/13946
>>
>> 3) Scalability issue: all metadata changes are broadcast to all brokers
>> - The model doesn't scale out.
>> - This is due to the change made in
>> https://github.com/apache/pulsar/pull/11198 , "Use ZK persistent
>> watches".
>> - Globally broadcasting metadata changes doesn't follow typical
>> scalable design principles such as the "scale cube". This will pose
>> limits on Pulsar clusters with a large number of brokers: a metadata
>> change notification solution cannot scale out as long as it is based on
>> a design that broadcasts all notifications to every participant.
>>
>> Some initial analysis and brainstorming on the above areas led to the
>> thought that the PIP-45 Metadata Store API [2] abstractions are not
>> optimal. A lot of the functionality provided by the PIP-45 Metadata
>> Store API interface [2] could be implemented more efficiently in a way
>> where Pulsar itself is a key part of the metadata storage solution.
>>
>> For example, listing the topics in a namespace could be a
>> "scatter-gather" query to all "metadata shards" that hold the
>> namespace's topics.
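[Annotation to the quoted email: a minimal sketch of what such a
"scatter-gather" listing could look like. ShardClient and its listTopics
method are imagined interfaces, and pagination is left out for brevity.]

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.stream.Collectors;

    // Hypothetical sketch: fan the listing query out to every metadata
    // shard that may hold topics of the namespace, then merge the partial
    // results. Nothing like ShardClient exists in Pulsar today.
    interface ShardClient {
        CompletableFuture<List<String>> listTopics(String namespace);
    }

    class ScatterGatherListing {
        static CompletableFuture<List<String>> listNamespaceTopics(
                String namespace, List<ShardClient> shards) {
            List<CompletableFuture<List<String>>> partials = shards.stream()
                    .map(shard -> shard.listTopics(namespace)) // scatter
                    .collect(Collectors.toList());
            return CompletableFuture
                    .allOf(partials.toArray(new CompletableFuture[0]))
                    .thenApply(done -> partials.stream()       // gather
                            .flatMap(f -> f.join().stream())
                            .sorted()
                            .collect(Collectors.toList()));
        }
    }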
>> There is not necessarily a need for a centralized external Metadata
>> Store API interface [2] that replies to all queries. Pulsar metadata
>> handling could move towards a distributed-database type of design where
>> consistent hashing plays a key role. Since metadata handling is an
>> internal concern, the interface doesn't need to provide services
>> directly to external users of Pulsar. The Pulsar Admin API should also
>> be improved to scale for queries and listings of namespaces with
>> millions of topics, and it should have pagination to limit result
>> sizes. The implementation can internally handle the possible
>> "scatter-gather" queries when the metadata handling backend is not
>> centralized. The point is that the Metadata Store API [2] abstraction
>> doesn't necessarily need to provide a service for this, since it could
>> be a separate concern.
>>
>> Most of the complexity in the current PIP-45 Metadata Store comes from
>> data consistency challenges. The solution is heavily based on caches,
>> with mechanisms for handling cache expiration and keeping data
>> consistent. There are gaps in the caching solution, since there are
>> metadata consistency problems, as described in 1) and 2) above. A lot
>> of the problems go away in a model where most processing and data
>> access is local, similar to how a broker handles topics: a topic is
>> owned by a single broker at a time. That approach could be extended to
>> cover metadata changes and queries.
>>
>> What is interesting here regarding PIP-157 is that the brainstorming
>> led to a sharding (aka "bucketing") solution, where there are metadata
>> shards in the system:
>>
>> metadata shard
>> |
>> namespace bundle (existing)
>> |
>> namespace (existing)
>>
>> Instead of having a specific solution in mind for the storage of the
>> metadata, the main idea is that each metadata shard is independent and
>> able to perform operations without coordination with other metadata
>> shards. This does impact the storage of metadata, so that operations to
>> the storage system can be isolated (for example, it must be possible to
>> list the topics of a bundle without listing everything; PIP-157
>> provides one type of solution for this). We didn't let the existing
>> solution limit our brainstorming.
>>
>> Since there is metadata that needs to be available in multiple
>> locations in the system, such as tenant and namespace level policies,
>> it would be easier to handle the consistency aspects with a model that
>> is not based on CRUD-type operations, but is instead event-sourced,
>> where the state can be rebuilt from events (with the possibility of
>> state snapshots). There could be an internal metadata replication
>> protocol which ensures consistency when needed (some type of
>> acknowledgement when followers have caught up with changes from the
>> leader):
>>
>> metadata shard leader
>> |
>> metadata shard follower (namespace bundle, for example)
>>
>> The core principle is that all write operations are always redirected
>> to the leader, which is the single writer for a shard. The followers
>> would receive events for changes, and the followers could also notify
>> the leader each time they have caught up with changes. This would be
>> one way to solve "1) Metadata consistency from the user's point of
>> view" without a complex metadata cache invalidation solution. It would
>> also solve "2) Metadata consistency issues within Pulsar".
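[Annotation: a rough sketch of the single-writer, event-sourced
leader-follower idea just described. All names here are made up, and a
real implementation would persist the log, for example in a BookKeeper
ledger, rather than in memory.]

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch: the leader is the single writer for a shard and
    // appends events to a log; followers rebuild their state by applying
    // the same events in order and acknowledge how far they have caught up.
    class MetadataEvent {
        final long sequenceId;
        final String key;
        final byte[] value; // a null value represents a deletion

        MetadataEvent(long sequenceId, String key, byte[] value) {
            this.sequenceId = sequenceId;
            this.key = key;
            this.value = value;
        }
    }

    class ShardLeader {
        private final List<MetadataEvent> log = new ArrayList<>();
        private final Map<String, Long> followerAcks = new HashMap<>();

        // All writes are redirected here: single writer per shard.
        synchronized MetadataEvent write(String key, byte[] value) {
            MetadataEvent event = new MetadataEvent(log.size(), key, value);
            log.add(event);
            return event;
        }

        // Followers report the last sequence id they have applied, so the
        // leader knows when a change is visible on all followers.
        synchronized void ack(String followerId, long sequenceId) {
            followerAcks.merge(followerId, sequenceId, Math::max);
        }
    }

    class ShardFollower {
        private final Map<String, byte[]> state = new HashMap<>();
        private long lastApplied = -1;

        // "Events are the truth": state is rebuilt purely from the log.
        void apply(MetadataEvent event) {
            if (event.value == null) {
                state.remove(event.key);
            } else {
                state.put(event.key, event.value);
            }
            lastApplied = event.sequenceId;
        }

        long lastApplied() { return lastApplied; }
    }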
>> In event sourcing, events are the truth, and there are better ways to
>> ensure "cache consistency" in a leader-follower model based on event
>> sourcing.
>>
>> Everything above is just initial brainstorming, but it seems to be
>> going in a different direction than where PIP-45 currently is.
>> Abstractions for coordination, such as leader election and distributed
>> locks, will still be necessary, and some metadata would have to be
>> managed in a centralized fashion. In general, the model would be
>> somewhat different from what PIP-45 has. Since the core idea would be
>> to use an event-sourced model, it would be natural to use BookKeeper
>> ledgers (Pulsar managed ledgers) for storing the events.
>>
>> With the nature of event sourcing, it would be possible to create
>> point-in-time backup and restore solutions for Pulsar metadata. Even
>> today, it is very rare for Pulsar users to go directly to ZooKeeper to
>> observe the state of the metadata. In an event-sourced system, this
>> state could be stored in flat files on disk if that is needed for
>> debugging and observability purposes, besides backup and restore.
>> Metadata events could possibly also be exposed externally for building
>> efficient management tooling for Pulsar.
>>
>> The metadata handling also extends to Pulsar load balancing, and that
>> should also be considered when revisiting the design of PIP-45 to
>> address the current challenges.
>>
>> There are also aspects of metadata where changes aren't immediate. For
>> example, deleting a topic requires deleting the underlying data stored
>> in BookKeeper. If the operation fails, there should be ways to keep
>> retrying; a similar approach applies to creation. Some operations might
>> be asynchronous, and having support for a state machine for creation
>> and deletion could be helpful (a small sketch of this idea follows
>> below as an annotation). The point is that it's not optimal to model
>> topic deletion as an atomic operation. The state change should be
>> atomic, but the deletion from the metadata storage should not happen
>> until all asynchronous operations have completed. The caller of the
>> admin interface should be able to proceed once the topic is marked
>> deleted, but the system should keep managing the deletion in the
>> background. Similarly, the creation of topics could have more states to
>> support efficient creation of a large number of topics.
>>
>> This was a long email covering a subject that we haven't dealt with
>> before in the Pulsar community. Usually, we have discussions about
>> solutions that are very targeted. It isn't common to transparently
>> discuss existing design challenges or problems and find ways to solve
>> them together. Sharing observations about problems would be valuable.
>> High-level problems don't get reported in the GitHub issue tracker,
>> since they aren't individual bugs. We should also find ways to address
>> this type of challenge in the community.
>>
>> I hope we can change this and also take the opportunity to meet at
>> Pulsar community meetings and have more of these in-depth discussions
>> that will help us improve Pulsar for the benefit of us all in the
>> Apache Pulsar community.
>>
>> Since PIP-157 [3] is proceeding, I see it as an opportunity to start
>> taking the design of Pulsar metadata handling in a direction where we
>> can address the current challenges in Pulsar metadata handling and load
>> balancing. We must decide together what that direction is.
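[Annotation: the creation/deletion state machine mentioned above, as a
small sketch. The states are made up for illustration and are not part of
any existing Pulsar design.]

    // Hypothetical sketch: topic deletion modeled as explicit states rather
    // than one atomic operation. The admin API caller can return once the
    // topic reaches MARKED_DELETED; background tasks drive (and retry) the
    // remaining transitions until the metadata entry is finally removed.
    enum TopicLifecycleState {
        ACTIVE,
        MARKED_DELETED,   // atomic state change; topic stops serving clients
        DELETING_LEDGERS, // background: delete BookKeeper data, retry on failure
        DELETED           // metadata entry can now be removed
    }

    class TopicDeletionStateMachine {
        private TopicLifecycleState state = TopicLifecycleState.ACTIVE;

        // Only the next forward transition is legal; background tasks retry
        // failed work and call this again once the work has succeeded.
        synchronized void transitionTo(TopicLifecycleState next) {
            if (next.ordinal() != state.ordinal() + 1) {
                throw new IllegalStateException(state + " -> " + next);
            }
            state = next;
        }

        synchronized TopicLifecycleState current() { return state; }
    }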
>> I hope this email opens some new aspects as a basis for these
>> decisions. I'm hoping that you, the reader of this email, will
>> participate, share your views, and help develop this direction.
>>
>> PIP-157 [3] assumes that "Pulsar is able to manage millions of topics
>> but the number of topics within a single namespace is limited by
>> metadata storage". Does this assumption hold?
>>
>> For example, "3) Scalability issue: all metadata changes are broadcast
>> to all brokers" will become a challenge in a large system with a high
>> number of brokers. Together with the other metadata consistency
>> challenges (1 and 2 above), I suspect that after PIP-157 is
>> implemented, the bottlenecks will move to these areas. In that sense,
>> it might be a band-aid that doesn't address the root cause of the
>> Pulsar metadata handling scalability challenges.
>>
>> Let's discuss and address the challenges together!
>>
>> Regards,
>>
>> -Lari
>>
>> [1] - analysis of metadata consistency from the user's point of view:
>> https://github.com/apache/pulsar/issues/12555#issuecomment-955748744
>> [2] - MetadataStore interface:
>> https://github.com/apache/pulsar/blob/master/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/api/MetadataStore.java
>> [3] - PIP-157: Bucketing topic metadata to allow more topics per
>> namespace: https://github.com/apache/pulsar/issues/15254
>> [4] - PIP-45: Pluggable metadata interface:
>> https://github.com/apache/pulsar/wiki/PIP-45%3A-Pluggable-metadata-interface
>> [5] - StreamNative's blog "Moving Toward a ZooKeeper-Less Apache Pulsar":
>> https://streamnative.io/blog/release/2022-01-25-moving-toward-a-zookeeperless-apache-pulsar/