Bumping up this thread.

-Lari

On Fri, 20 May 2022 at 1:57, Lari Hotari <lhot...@apache.org> wrote:

> Hi all,
>
> I started writing this email as feedback to "PIP-157: Bucketing topic
> metadata to allow more topics per namespace" [3].
> This email expanded to cover some analysis of "PIP-45: Pluggable metadata
> interface" [4] design. (A good introduction to PIP-45 is the StreamNative
> blog post "Moving Toward a ZooKeeper-Less Apache Pulsar" [5]).
>
> The intention is to start discussions for Pulsar 3.0 and beyond, bouncing
> ideas and challenging the existing design with good intentions and for the
> benefit of all.
>
> I'll share some thoughts that have come up in discussions with my colleague
> Michael Marshall. Bouncing ideas together has been very helpful for building
> an understanding of the existing challenges and of possible directions for
> solving them. I hope we can have broader conversations in the Pulsar
> community about improving Pulsar's metadata management and load balancing
> designs in the long term.
>
> There are a few areas where there are challenges with the current Metadata
> Store / PIP-45 solution:
>
> 1) Metadata consistency from user's point of view
>   - Summarized well in this great analysis and comment [1] by Zac Bentley
>    "Ideally, the resolution of all of these issues would be the same: a
> management API operation--any operation--should not return successfully
> until all observable side effects of that operation across a Pulsar cluster
> (including brokers, proxies, bookies, and ZK) were completed." (see [1] for
> the full analysis and comment)
>
> 2) Metadata consistency issues within Pulsar
>   - There are issues where a single broker gets left in a bad state as a
> result of consistency and concurrency issues with metadata handling and
> caching.
>     A possible example: https://github.com/apache/pulsar/issues/13946
>
> 3) Scalability issue: all metadata changes are broadcast to all brokers
> - this model doesn't scale out
>    - This is due to the change made in
> https://github.com/apache/pulsar/pull/11198 , "Use ZK persistent watches".
>    - The global broadcasting of metadata changes doesn't follow typical
> scalable design principles such as the "Scale Cube". This will pose limits
> on Pulsar clusters with a large number of brokers. The current metadata
> change notification solution cannot scale out because it is based on a
> design that broadcasts all notifications to every participant.
>
> During the initial analysis and brainstorming on the above areas, the
> thought has come up that the PIP-45 Metadata Store API [2] abstractions are
> not optimal.
>
> A lot of the functionality that the PIP-45 Metadata Store API interface [2]
> provides could be handled more efficiently in a way where Pulsar itself is
> a key part of the metadata storage solution.
>
> For example, listing topics in a namespace could be a "scatter-gather"
> query to all "metadata shards" that hold the namespace's topics. There's
> not necessarily a need for a centralized external Metadata Store API
> interface [2] that replies to all queries. Pulsar metadata handling could
> move towards a distributed database type of design where consistent
> hashing plays a key role. Since metadata handling is an internal concern,
> the interface doesn't need to provide services directly to external users
> of Pulsar. The Pulsar Admin API should also be improved to scale for
> queries and listings of namespaces with millions of topics, and should
> have pagination to limit results. The implementation can internally handle
> possible "scatter-gather" queries when the metadata handling backend is
> not centralized; a minimal sketch of such a fan-out follows below. The
> point is that the Metadata Store API [2] abstraction doesn't necessarily
> need to provide a service for this, since it could be a different concern.
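>
> As a rough illustration of the scatter-gather idea, here is a minimal
> sketch in Java. The MetadataShardClient interface, the way shards are
> discovered, and the absence of pagination and error handling are all
> assumptions made just for this example; it is not a proposed API.
>
>     import java.util.List;
>     import java.util.concurrent.CompletableFuture;
>     import java.util.stream.Collectors;
>
>     // Hypothetical client for a single metadata shard.
>     interface MetadataShardClient {
>         CompletableFuture<List<String>> listTopics(String namespace);
>     }
>
>     class ScatterGatherTopicLister {
>         private final List<MetadataShardClient> shards;
>
>         ScatterGatherTopicLister(List<MetadataShardClient> shards) {
>             this.shards = shards;
>         }
>
>         // Fan the query out to every shard, then merge and sort the
>         // partial results once all shards have replied.
>         CompletableFuture<List<String>> listTopics(String namespace) {
>             List<CompletableFuture<List<String>>> partials = shards.stream()
>                     .map(shard -> shard.listTopics(namespace))
>                     .collect(Collectors.toList());
>             return CompletableFuture
>                     .allOf(partials.toArray(new CompletableFuture[0]))
>                     .thenApply(ignore -> partials.stream()
>                             .flatMap(f -> f.join().stream())
>                             .sorted()
>                             .collect(Collectors.toList()));
>         }
>     }
>
> A paginated Admin API endpoint could sit on top of something like this and
> hide the fan-out entirely from external callers.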
>
> Most of the complexity in the current PIP-45 Metadata Store comes from
> data consistency challenges. The solution is heavily based on caches and
> on ways of handling cache expiration and keeping data consistent. There
> are gaps in the caching solution, since there are metadata consistency
> problems, as described in 1) and 2) above. A lot of the problems go away
> in a model where most processing and data access is local, similar to how
> a broker handles topics: a topic is owned by a single broker at a time.
> That approach could be extended to cover metadata changes and queries.
>
> What is interesting here regarding PIP-157 is that brainstorming led to a
> sharding (aka "bucketing") solution, where there are metadata shards in the
> system.
>
> metadata shard
>            |
> namespace bundle  (existing)
>            |
> namespace  (existing)
>
> Instead of having a specific solution in mind for storing the metadata,
> the main idea is that each metadata shard is independent and able to
> perform operations without coordination with other metadata shards. This
> does impact how metadata is stored, since operations against the storage
> system must be isolatable (for example, it must be possible to list the
> topics of a bundle without listing everything; PIP-157 provides one type
> of solution for this, and a key-layout sketch follows below). We didn't
> let the existing solution limit our brainstorming.
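>
> As an illustration of what "isolatable" could mean for the key layout,
> here is a small Java sketch where each topic's metadata key is prefixed by
> the id of the shard that owns it, so listing one shard's topics only scans
> that prefix. The path layout, the hash function and the shard count are
> assumptions made for this example and are not the PIP-157 scheme.
>
>     import java.nio.charset.StandardCharsets;
>     import java.util.zip.CRC32;
>
>     class ShardedTopicKeys {
>         // Map a topic name to one of numShards buckets.
>         static int shardOf(String topicName, int numShards) {
>             CRC32 crc = new CRC32();
>             crc.update(topicName.getBytes(StandardCharsets.UTF_8));
>             return (int) (crc.getValue() % numShards);
>         }
>
>         // e.g. /metadata/<namespace>/shard-7/<topic>; listing the topics
>         // of shard 7 is then a single prefix scan.
>         static String keyFor(String namespace, String topic, int numShards) {
>             return "/metadata/" + namespace + "/shard-"
>                     + shardOf(topic, numShards) + "/" + topic;
>         }
>     }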
>
> Since there is metadata that needs to be available in multiple locations
> in the system, such as tenant / namespace level policies, it would be
> easier to handle the consistency aspects with a model that is not based on
> CRUD-type operations but is instead event sourced, where the state can be
> rebuilt from events (with the possibility of state snapshots). There could
> be an internal metadata replication protocol which ensures consistency
> when needed (some type of acknowledgement when followers have caught up
> with changes from the leader).
>
> metadata shard leader
>               |
> metadata shard follower  (namespace bundle, for example)
>
> The core principle is that all write operations are always redirected to
> the leader, which is the single writer for a shard. The followers would
> receive events for changes and could notify the leader each time they have
> caught up with those changes. This would be one way to solve "1) Metadata
> consistency from user's point of view" without a complex metadata cache
> invalidation solution. It would also solve "2) Metadata consistency issues
> within Pulsar". In event sourcing, the events are the truth, and a
> leader-follower model based on event sourcing offers better ways to ensure
> "cache consistency".
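>
> To make the "followers acknowledge when they have caught up" idea
> concrete, here is a minimal Java sketch. The MetadataEvent shape, the
> CatchUpListener callback and the in-memory state map are assumptions made
> only for this example; a real protocol would also need ordering,
> durability and snapshots.
>
>     import java.util.Map;
>     import java.util.concurrent.ConcurrentHashMap;
>
>     // A single entry in the shard's event log.
>     record MetadataEvent(long offset, String key, byte[] value) {}
>
>     // Callback used to acknowledge progress back to the shard leader.
>     interface CatchUpListener {
>         void caughtUpTo(String followerId, long offset);
>     }
>
>     class MetadataShardFollower {
>         private final String followerId;
>         private final CatchUpListener leaderAck;
>         // Local state is only a projection; the events are the truth.
>         private final Map<String, byte[]> state = new ConcurrentHashMap<>();
>         private long lastApplied = -1;
>
>         MetadataShardFollower(String followerId, CatchUpListener leaderAck) {
>             this.followerId = followerId;
>             this.leaderAck = leaderAck;
>         }
>
>         // Apply events in offset order and report the latest applied
>         // offset, so the leader knows when this follower has caught up.
>         synchronized void apply(MetadataEvent event) {
>             state.put(event.key(), event.value());
>             lastApplied = event.offset();
>             leaderAck.caughtUpTo(followerId, lastApplied);
>         }
>     }
>
> With something like this, a write on the leader could be considered
> observable cluster-wide once every follower has acknowledged that write's
> offset.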
>
> Everything above is just initial brainstorming, but it seems to be going
> in a different direction than PIP-45 currently does. Abstractions for
> coordination such as leader election and distributed locks will still be
> necessary, and some metadata would still have to be managed externally in
> a centralized fashion. In general, the model would be somewhat different
> from PIP-45's. Since the core idea would be to use an event sourced model,
> it would be optimal to use BookKeeper ledgers (Pulsar managed ledgers) for
> storing the events.
> With the nature of event sourcing, it would be possible to create
> point-in-time backup and restore solutions for Pulsar metadata. Even
> today, it is very rare for Pulsar users to go directly to ZooKeeper to
> observe the state of the metadata. In an event sourced system, this state
> could also be stored in flat files on disk if that is needed for debugging
> and observability purposes besides backup and restore. Metadata events
> could possibly also be exposed externally for building efficient
> management tooling for Pulsar.
>
> The metadata handling also extends to Pulsar load balancing, which should
> also be considered when revisiting the design of PIP-45 to address the
> current challenges. There are also aspects of metadata where changes
> aren't immediate. For example, deleting a topic requires deleting the
> underlying data stored in BookKeeper, and if that operation fails, there
> should be ways to keep retrying. A similar approach applies to creation.
> Some operations might be asynchronous, and having support for a state
> machine for creation and deletion could be helpful. The point is that it's
> not optimal to model a topic deletion as a single atomic operation. Each
> state change should be atomic, but the deletion from the metadata storage
> should not happen until all asynchronous operations have completed. The
> caller of the metadata admin interface should be able to proceed once the
> topic is marked deleted, while the system keeps managing the deletions in
> the background. Similarly, the creation of topics could have more states
> to support efficient creation of a large number of topics. A sketch of
> such a deletion state machine follows below.
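>
> As a minimal sketch of that idea, the lifecycle could be modelled roughly
> like the following Java enum and transitions. The state names and methods
> are illustrative assumptions, not existing Pulsar code.
>
>     // Hypothetical lifecycle of a topic whose deletion is asynchronous.
>     enum TopicLifecycleState {
>         ACTIVE,
>         MARKED_DELETED,    // the admin caller can return at this point
>         DELETING_LEDGERS,  // background BookKeeper cleanup, retried on failure
>         DELETED            // metadata entry can finally be removed
>     }
>
>     class TopicDeletionStateMachine {
>         private TopicLifecycleState state = TopicLifecycleState.ACTIVE;
>
>         // Atomic state change that lets the admin call complete early.
>         synchronized TopicLifecycleState markDeleted() {
>             if (state == TopicLifecycleState.ACTIVE) {
>                 state = TopicLifecycleState.MARKED_DELETED;
>             }
>             return state;
>         }
>
>         synchronized void onLedgerDeletionStarted() {
>             if (state == TopicLifecycleState.MARKED_DELETED) {
>                 state = TopicLifecycleState.DELETING_LEDGERS;
>             }
>         }
>
>         // Only now is it safe to remove the topic from metadata storage.
>         synchronized void onLedgerDeletionCompleted() {
>             if (state == TopicLifecycleState.DELETING_LEDGERS) {
>                 state = TopicLifecycleState.DELETED;
>             }
>         }
>     }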
>
> This was a long email covering a subject that we haven't dealt with before
> in the Pulsar community. Usually, our discussions are about very targeted
> solutions. It isn't common to transparently discuss existing design
> challenges or problems and find ways to solve them together. Sharing
> observations about problems would be valuable. High-level problems don't
> get reported in the GitHub issue tracker since they aren't individual
> bugs. We should find ways to also address this type of challenge in the
> community.
>
> I hope we can change this and also take the opportunity to meet at Pulsar
> Community meetings and have more of these in-depth discussions that will
> help us improve Pulsar for the benefit of us all in the Apache Pulsar
> community.
>
> Since PIP-157 [3] is proceeding, I see that as an opportunity to start
> taking the design of Pulsar metadata handling in a direction that
> addresses the current challenges in Pulsar around metadata handling and
> load balancing. We must decide together what that direction is. I hope
> this email adds some new aspects to the basis of these decisions, and I'm
> hoping that you, the reader of this email, will participate, share your
> views, and help develop this direction.
>
> PIP-157 [3] assumes that "Pulsar is able to manage millions of topics but
> the number of topics within a single namespace is limited by metadata
> storage." Does this assumption hold?
>
> For example, "3) Scalability issue: all metadata changes are broadcast
> to all brokers" will become a challenge in a large system with a high
> number of brokers. Together with the other metadata consistency challenges
> (1 and 2 above), I suspect that after PIP-157 is implemented, the
> bottlenecks will simply move to these areas. In that sense, PIP-157 might
> be a band-aid that doesn't address the root cause of Pulsar's metadata
> handling scalability challenges.
>
> Let's discuss and address the challenges together!
>
> Regards,
>
> -Lari
>
> [1] - analysis about Metadata consistency from user's point of view -
> https://github.com/apache/pulsar/issues/12555#issuecomment-955748744
> [2] - MetadataStore interface:
> https://github.com/apache/pulsar/blob/master/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/api/MetadataStore.java
> [3] - PIP-157: Bucketing topic metadata to allow more topics per namespace
> - https://github.com/apache/pulsar/issues/15254
> [4] - PIP-45: Pluggable metadata interface -
> https://github.com/apache/pulsar/wiki/PIP-45%3A-Pluggable-metadata-interface
>
> [5] - StreamNative's blog "Moving Toward a ZooKeeper-Less Apache Pulsar" -
> https://streamnative.io/blog/release/2022-01-25-moving-toward-a-zookeeperless-apache-pulsar/
>
