This is a huge milestone, but also a challenge for implementing pluggable metadata storage. Will the plan eventually move from providing pluggable metadata storage to internalizing the distributed coordination functionality within Pulsar itself?
Lari Hotari <lhot...@apache.org> wrote on Tue, Aug 16, 2022 at 11:17:
> Bumping up this thread.
>
> -Lari
>
> On Fri, May 20, 2022 at 1:57 Lari Hotari <lhot...@apache.org> wrote:
> >
> > Hi all,
> >
> > I started writing this email as feedback to "PIP-157: Bucketing topic metadata to allow more topics per namespace" [3]. It expanded to cover some analysis of the "PIP-45: Pluggable metadata interface" [4] design. (A good introduction to PIP-45 is the StreamNative blog post "Moving Toward a ZooKeeper-Less Apache Pulsar" [5].)
> >
> > The intention is to start discussions for Pulsar 3.0 and beyond, bouncing ideas and challenging the existing design with good intentions, for the benefit of all.
> >
> > I'll share some thoughts that have come up in discussions with my colleague Michael Marshall. We have been bouncing ideas around, and that has been very helpful in building an understanding of the existing challenges and of possible directions for solving them. I hope we can have broader conversations in the Pulsar community about improving Pulsar's metadata management and load balancing designs in the long term.
> >
> > There are a few areas where there are challenges with the current Metadata Store / PIP-45 solution:
> >
> > 1) Metadata consistency from the user's point of view
> > - Summarized well in this great analysis and comment [1] by Zac Bentley: "Ideally, the resolution of all of these issues would be the same: a management API operation--any operation--should not return successfully until all observable side effects of that operation across a Pulsar cluster (including brokers, proxies, bookies, and ZK) were completed." (See [1] for the full analysis and comment.)
> >
> > 2) Metadata consistency issues within Pulsar
> > - There are cases where a single broker gets left in a bad state as a result of consistency and concurrency issues in metadata handling and caching. A possible example: https://github.com/apache/pulsar/issues/13946
> >
> > 3) Scalability issue: all metadata changes are broadcast to all brokers - this model doesn't scale out
> > - This is due to the change made in https://github.com/apache/pulsar/pull/11198, "Use ZK persistent watches".
> > - Broadcasting metadata changes globally doesn't follow typical scalable design principles such as the "Scale Cube". It will pose limits on Pulsar clusters with a large number of brokers; a notification design that broadcasts every change to every participant cannot scale out.
> >
> > Initial analysis and brainstorming on the above areas suggests that the PIP-45 Metadata Store API [2] abstractions are not optimal.
> >
> > A lot of the functionality provided by the PIP-45 Metadata Store API interface [2] could be implemented more efficiently in a way where Pulsar itself is a key part of the metadata storage solution.
> >
> > For example, listing the topics in a namespace could be a "scatter-gather" query to all "metadata shards" that hold the namespace's topics. There is not necessarily a need for a centralized external Metadata Store API interface [2] that answers all queries. Pulsar metadata handling could move towards a distributed-database type of design where consistent hashing plays a key role.
> >
> > Since metadata handling is an internal concern, the interface doesn't need to provide services directly to external users of Pulsar. The Pulsar Admin API should also be improved to scale for queries and listings of namespaces with millions of topics, and should have pagination to limit result sizes. Its implementation can internally handle possible "scatter-gather" queries when the metadata backend is not centralized. The point is that the Metadata Store API [2] abstraction doesn't necessarily need to provide this service, since it is a separate concern.
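To make the scatter-gather and consistent-hashing idea above concrete, here is a rough Java sketch. Everything in it is hypothetical (none of these types exist in Pulsar today); it only illustrates routing bundles to shards via a hash ring and paging a fanned-out listing:

    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.concurrent.CompletableFuture;
    import java.util.stream.Collectors;

    // Hypothetical: each shard can answer queries for the bundles it owns,
    // without coordinating with other shards.
    interface MetadataShard {
        CompletableFuture<List<String>> listTopics(String namespace);
    }

    class ShardedMetadataRouter {
        // Consistent-hash ring: hash point -> shard owning that range.
        private final TreeMap<Long, MetadataShard> ring = new TreeMap<>();

        MetadataShard shardFor(String namespaceBundle) {
            long hash = namespaceBundle.hashCode() & 0xffffffffL; // placeholder hash fn
            Map.Entry<Long, MetadataShard> entry = ring.ceilingEntry(hash);
            return (entry != null ? entry : ring.firstEntry()).getValue();
        }

        // Scatter-gather: ask every shard, merge the partial results, then
        // apply offset-based pagination to the merged, sorted list.
        CompletableFuture<List<String>> listTopics(String namespace, int pageIndex, int pageSize) {
            List<CompletableFuture<List<String>>> partials = ring.values().stream()
                    .distinct()
                    .map(shard -> shard.listTopics(namespace))
                    .collect(Collectors.toList());
            return CompletableFuture.allOf(partials.toArray(new CompletableFuture[0]))
                    .thenApply(ignored -> partials.stream()
                            .flatMap(f -> f.join().stream())
                            .sorted()
                            .skip((long) pageIndex * pageSize)
                            .limit(pageSize)
                            .collect(Collectors.toList()));
        }
    }

A real implementation would presumably use continuation tokens rather than offsets, so each shard can resume its own cursor instead of re-listing everything per page, but the sketch shows the shape of the fan-out.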
> > Most of the complexity in the current PIP-45 Metadata Store comes from data consistency challenges. The solution is heavily based on caches, on handling cache expiration, and on keeping cached data consistent. There are gaps in the caching solution, since there are metadata consistency problems as described in 1) and 2) above. A lot of the problems go away in a model where most processing and data access is local, similar to how the broker handles topics: a topic is owned by a single broker at a time. That approach could be extended to cover metadata changes and queries.
> >
> > What is interesting regarding PIP-157 is that our brainstorming led to a sharding (aka "bucketing") solution, where there are metadata shards in the system:
> >
> > metadata shard
> >  |
> > namespace bundle (existing)
> >  |
> > namespace (existing)
> >
> > Instead of having a specific solution in mind for storing the metadata, the main idea is that each metadata shard is independent and able to perform operations without coordinating with other metadata shards. This does impact the storage of metadata, so that operations against the storage system can be isolated (for example, it must be possible to list the topics of a bundle without listing everything; PIP-157 provides one type of solution for this). We didn't let the existing solution limit our brainstorming.
> >
> > Since some metadata needs to be available in multiple locations in the system, such as tenant- and namespace-level policies, it would be easier to handle the consistency aspects with a model that is not based on CRUD-type operations but is instead event sourced, where the state can be rebuilt from events (with the possibility of state snapshots). There could be an internal metadata replication protocol that ensures consistency when needed (some type of acknowledgement once followers have caught up with changes from the leader).
> >
> > metadata shard leader
> >  |
> > metadata shard follower (namespace bundle, for example)
> >
> > The core principle is that all write operations are always redirected to the leader, which is the single writer for a shard. The followers get events for changes, and they can notify the leader each time they have caught up. This would be one way to solve "1) Metadata consistency from the user's point of view" without a complex metadata cache invalidation solution. It would also solve "2) Metadata consistency issues within Pulsar". In event sourcing, the events are the truth, and there are better ways to ensure "cache consistency" in a leader-follower model based on event sourcing.
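Again, a purely hypothetical sketch of that single-writer, replay-based shape (these classes don't exist; an in-memory list stands in for the event log, which could be a BookKeeper ledger as suggested below):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Events are the source of truth; shard state is a materialized view of them.
    abstract class MetadataEvent {
        final long sequenceId;
        MetadataEvent(long sequenceId) { this.sequenceId = sequenceId; }
    }

    class TopicCreated extends MetadataEvent {
        final String topic;
        TopicCreated(long seq, String topic) { super(seq); this.topic = topic; }
    }

    // Single writer for the shard: every mutation goes through the leader.
    class MetadataShardLeader {
        private final List<MetadataEvent> log = new ArrayList<>(); // stand-in for a BK ledger
        private long nextSeq = 0;

        synchronized MetadataEvent createTopic(String topic) {
            MetadataEvent event = new TopicCreated(nextSeq++, topic);
            log.add(event); // persist before fanning out to followers
            return event;   // optionally wait for follower acks before replying to the caller
        }
    }

    // Followers never accept writes; they replay the leader's log, so their
    // local view cannot diverge the way an invalidation-based cache can.
    class MetadataShardFollower {
        private final Set<String> topics = new HashSet<>(); // local materialized view
        private long lastApplied = -1;

        void apply(MetadataEvent event) {
            if (event.sequenceId <= lastApplied) {
                return; // idempotent replay
            }
            if (event instanceof TopicCreated) {
                topics.add(((TopicCreated) event).topic);
            }
            lastApplied = event.sequenceId;
            // here the follower would ack lastApplied back to the leader,
            // giving the leader the catch-up signal described above
        }
    }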
> > Everything above is just initial brainstorming, but it seems to be going in a different direction than where PIP-45 currently is. Abstractions for coordination, such as leader election and distributed locks, will still be necessary, and some external metadata would have to be managed in a centralized fashion. In general, the model would be somewhat different from PIP-45's. Since the core idea would be to use an event-sourced model, it would be optimal to use BookKeeper ledgers (Pulsar managed ledgers) for storing the events.
> >
> > Given the nature of event sourcing, it would be possible to create point-in-time backup and restore solutions for Pulsar metadata. Even today, it is very rare for Pulsar users to go directly to ZooKeeper to observe the state of the metadata. In an event-sourced system, this state could be stored in flat files on disk if that is needed for debugging and observability purposes besides backup and restore. Metadata events could possibly also be exposed externally for building efficient management tooling for Pulsar.
> >
> > Metadata handling also extends to Pulsar load balancing, and that should be considered as well when revisiting the design of PIP-45 to address the current challenges. There are also aspects of metadata where changes aren't immediate. For example, deleting a topic requires deleting the underlying data stored in BookKeeper; if that operation fails, there should be a way to keep retrying. A similar approach applies to creation. Some operations may be asynchronous, and support for a state machine for creation and deletion could be helpful. The point is that it is not optimal to model topic deletion as an atomic operation. The state change should be atomic, but the removal from the metadata storage should not happen until all asynchronous operations have completed. The caller of the metadata admin interface should be able to proceed once the topic is marked deleted, while the system keeps managing the deletion in the background. Similarly, topic creation could have more states, to support efficient creation of a large number of topics.
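A minimal sketch of that non-atomic deletion idea (hypothetical names throughout; a real implementation would persist the state in the metadata store rather than a field):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    enum TopicState { ACTIVE, DELETING, DELETED }

    class TopicLifecycle {
        private volatile TopicState state = TopicState.ACTIVE;
        private final ScheduledExecutorService retryExecutor =
                Executors.newSingleThreadScheduledExecutor();

        // The admin call returns as soon as the DELETING state change is
        // recorded; cleanup of BookKeeper data happens in the background.
        void delete() {
            state = TopicState.DELETING; // the atomic, persisted state change
            scheduleCleanup(0);
        }

        private void scheduleCleanup(long delaySeconds) {
            retryExecutor.schedule(() -> {
                try {
                    deleteUnderlyingLedgers();  // may fail; failure is expected
                    state = TopicState.DELETED; // only now remove the metadata entry
                } catch (Exception e) {
                    scheduleCleanup(Math.min(delaySeconds * 2 + 1, 60)); // retry with backoff
                }
            }, delaySeconds, TimeUnit.SECONDS);
        }

        private void deleteUnderlyingLedgers() throws Exception {
            // placeholder for the asynchronous BookKeeper deletion
        }
    }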
> > This was a long email, covering a subject we haven't dealt with before in the Pulsar community. Usually our discussions are about very targeted solutions. It isn't common to transparently discuss existing design challenges or problems and find ways to solve them together, yet sharing observations about problems would be valuable. High-level problems don't get reported in the GitHub issue tracker, since they aren't individual bugs. We should find ways to address this type of challenge in the community as well.
> >
> > I hope we can change this, and also take the opportunity to meet at Pulsar community meetings and have more of these in-depth discussions that will help us improve Pulsar for the benefit of us all in the Apache Pulsar community.
> >
> > Since PIP-157 [3] is proceeding, I see it as an opportunity to start taking the design of Pulsar metadata handling in a direction where we can address the current challenges in metadata handling and load balancing. We must decide together what that direction is. I hope this email adds some new aspects to the basis of these decisions. I'm hoping that you, the reader of this email, will participate, share your views, and help develop this direction.
> >
> > PIP-157 [3] assumes that "Pulsar is able to manage millions of topics but the number of topics within a single namespace is limited by metadata storage." Does this assumption hold?
> >
> > For example, "3) Scalability issue: all metadata changes are broadcast to all brokers" will become a challenge in a large system with a high number of brokers. Together with the other metadata consistency challenges (1 and 2 above), I suspect that after PIP-157 is implemented, the bottlenecks will move to these areas. In that sense, it might be a band-aid that doesn't address the root cause of Pulsar's metadata handling scalability challenges.
> >
> > Let's discuss and address the challenges together!
> >
> > Regards,
> >
> > -Lari
> >
> > [1] - Analysis of metadata consistency from the user's point of view - https://github.com/apache/pulsar/issues/12555#issuecomment-955748744
> > [2] - MetadataStore interface - https://github.com/apache/pulsar/blob/master/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/api/MetadataStore.java
> > [3] - PIP-157: Bucketing topic metadata to allow more topics per namespace - https://github.com/apache/pulsar/issues/15254
> > [4] - PIP-45: Pluggable metadata interface - https://github.com/apache/pulsar/wiki/PIP-45%3A-Pluggable-metadata-interface
> > [5] - StreamNative's blog "Moving Toward a ZooKeeper-Less Apache Pulsar" - https://streamnative.io/blog/release/2022-01-25-moving-toward-a-zookeeperless-apache-pulsar/

--
BR,
Qiang Huang