Hi all,

I started writing this email as feedback to "PIP-157: Bucketing topic
metadata to allow more topics per namespace" [3].
This email expanded to cover some analysis of "PIP-45: Pluggable metadata
interface" [4] design. (A good introduction to PIP-45 is the StreamNative
blog post "Moving Toward a ZooKeeper-Less Apache Pulsar" [5]).

The intention is to start discussions for Pulsar 3.0 and beyond: to bounce
ideas and challenge the existing design with good intentions, for the
benefit of all.

I'll share some thoughts that have come up in discussions with my
colleague Michael Marshall. Bouncing ideas back and forth has been very
helpful for building an understanding of the existing challenges and
possible directions for solving them. I hope we can have broader
conversations in the Pulsar community about improving Pulsar's metadata
management and load balancing designs in the long term.

There are a few areas where there are challenges with the current Metadata
Store / PIP-45 solution:

1) Metadata consistency from user's point of view
  - Summarized well in this great analysis and comment [1] by Zac Bentley
   "Ideally, the resolution of all of these issues would be the same: a
management API operation--any operation--should not return successfully
until all observable side effects of that operation across a Pulsar cluster
(including brokers, proxies, bookies, and ZK) were completed." (see [1] for
the full analysis and comment)

2) Metadata consistency issues within Pulsar
  - There are cases where a single broker is left in a bad state as a
result of consistency and concurrency issues in metadata handling and
caching.
    A possible example: https://github.com/apache/pulsar/issues/13946

3) Scalability issue: all metadata changes are broadcast to all brokers -
this model doesn't scale out
   - This is due to the change made in
https://github.com/apache/pulsar/pull/11198 , "Use ZK persistent watches".
   - The global broadcasting of metadata changes doesn't follow typical
scalable design principles such as the "Scale cube". It will impose limits
on Pulsar clusters with a large number of brokers. The current metadata
change notification solution cannot scale out as long as it is based on a
design that broadcasts every notification to every participant.

While doing some initial analysis and brainstorming on the above areas, we
have come to think that the PIP-45 Metadata Store API [2] abstractions are
not optimal.

A lot of the functionality that is provided in the PIP-45 Metadata Store
API interface [2] could be solved more efficiently in a way where Pulsar
itself would be a key part of the metadata storage solution.

For example, listing topics in a namespace could be a "scatter-gather"
query to all "metadata shards" that hold namespace topics. There's not
necessarily a need to have a centralized external Metadata Store API
interface [2] that answers all queries. Pulsar metadata handling could
move towards a distributed-database type of design where consistent
hashing plays a key role. Since metadata handling is an internal
concern, the interface doesn't need to provide services directly to
external users of Pulsar. The Pulsar Admin API should also be improved to
scale for queries and listing of namespaces with millions of topics, and
should have pagination to limit results. This implementation can internally
handle possible "scatter-gather" queries when the metadata handling backend
is not centralized. The point is that the Metadata Store API [2]
abstraction doesn't necessarily need to provide a service for this, since
it could be a separate concern.
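
To make the scatter-gather idea a bit more concrete, here is a minimal
sketch in Java. It is purely illustrative: the MetadataShard interface and
its listTopics() method are assumptions made for this example and do not
exist in the current codebase.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Hypothetical interface: each shard only knows the topics of the
// bundles it owns.
interface MetadataShard {
    CompletableFuture<List<String>> listTopics(String namespace);
}

class ScatterGatherTopicListing {
    static CompletableFuture<List<String>> listTopics(
            List<MetadataShard> shards, String namespace) {
        // Scatter: ask every shard for its part of the namespace.
        List<CompletableFuture<List<String>>> futures = shards.stream()
                .map(shard -> shard.listTopics(namespace))
                .collect(Collectors.toList());
        // Gather: complete once every shard has answered, then merge.
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(ignore -> futures.stream()
                        .flatMap(f -> f.join().stream())
                        .sorted()
                        .collect(Collectors.toList()));
    }
}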

Most of the complexity in the current PIP-45 Metadata Store comes from
data consistency challenges. The solution relies heavily on caches, cache
expiration handling, and ways to keep the cached data consistent. There
are gaps in the caching solution, as the metadata consistency problems
described in 1) and 2) above show. A lot of the problems go away in a
model where most processing and data access is local, similar to how
brokers handle topics: a topic is owned by a single broker at a time. The
same approach could be extended to cover metadata changes and queries.

What is interesting here regarding PIP-157 is that our brainstorming led
to a sharding (aka "bucketing") solution, where there are metadata shards
in the system:

metadata shard
           |
namespace bundle  (existing)
           |
namespace  (existing)

Instead of having a specific solution in mind for dealing with the storage
of the metadata, the main idea is that each metadata shard is independent
and would be able to perform operations without coordination with other
metadata shards. This does impact the storage of metadata so that
operations to the storage system can be isolated (for example, it must be
possible to list the topics of a bundle without listing everything;
PIP-157 provides one type of solution for this). We didn't let the
existing solution limit our brainstorming.
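
To make this concrete, below is a purely illustrative sketch of how each
namespace bundle could map to one independent metadata shard. It assumes a
fixed shard count and simple hash-based placement; real consistent hashing
would add a ring with virtual nodes to limit reshuffling when shards are
added or removed.

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch only: the shard count and hashing scheme are illustrative
// assumptions, not part of PIP-157 or the current Pulsar code.
class MetadataShardSelector {
    private final int numShards;

    MetadataShardSelector(int numShards) {
        this.numShards = numShards;
    }

    int shardFor(String namespaceBundle) {
        CRC32 crc = new CRC32();
        crc.update(namespaceBundle.getBytes(StandardCharsets.UTF_8));
        // All operations for the same bundle land on the same shard, so a
        // shard can serve them without coordinating with other shards.
        return (int) (crc.getValue() % numShards);
    }
}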

Since there is metadata that needs to be available in multiple locations
in the system, such as tenant and namespace level policies, it would be
easier to handle the consistency aspects with a model that is not based on
CRUD-type operations, but is instead event sourced, where the state can be
rebuilt from events (with the possibility of state snapshots). There could
be an internal metadata replication protocol which ensures consistency
when needed (some type of acknowledgement when followers have caught up
with changes from the leader).

metadata shard leader
              |
metadata shard follower  (namespace bundle, for example)

The core principle is that all write operations are always redirected to
the leader, which is the single writer for a shard. The followers would
receive events for changes and could notify the leader each time they have
caught up. This would be one way to solve "1) Metadata consistency from
user's point of view" without a complex metadata cache invalidation
solution. It would also solve "2) Metadata consistency issues within
Pulsar". In event sourcing, events are the source of truth, and a
leader-follower model built on them offers better ways to ensure "cache
consistency".

Everything above is just initial brainstorming, but it seems to be going
in a different direction than where PIP-45 currently is. Abstractions for
coordination such as leader election and distributed locks will still be
necessary, and some external metadata would have to be managed in a
centralized fashion. In general, the model would be somewhat different
from what PIP-45 has. Since the core idea would be to use an event sourced
model, it would be optimal to use BookKeeper ledgers (Pulsar managed
ledgers) for storing the events.
Given the nature of event sourcing, it would be possible to create
point-in-time backup and restore solutions for Pulsar metadata. Even
today, it is very rare for Pulsar users to go directly to ZooKeeper to
observe the state of the metadata. In an event sourced system, this state
could be stored in flat files on disk if that is needed for debugging and
observability purposes besides backup and restore. Metadata events could
possibly also be exposed externally for building efficient management
tooling for Pulsar.

The metadata handling also expands to Pulsar load balancing, and that
should also be considered when revisiting the design of PIP-45 to address
the current challenges. There are also aspects of metadata where changes
aren't immediate. For example, deleting a topic requires deleting the
underlying data stored in BookKeeper. If that operation fails, there
should be ways to keep retrying. A similar approach applies to creation.
Some operations might be asynchronous, and having support for a state
machine for creation and deletion could be helpful. The point is that it's
not optimal to model a topic deletion as an atomic operation. The state
change should be atomic, but the deletion from the metadata storage should
not happen until all asynchronous operations have completed. The caller of
the metadata admin interface should be able to proceed after the topic is
marked deleted, while the system keeps managing the deletions in the
background. Similarly, the creation of topics could have more states to
support efficient creation of a large number of topics.
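
To illustrate the point, here is a sketch of topic deletion modelled as a
state machine rather than a single atomic metadata delete. The state names
and transitions are assumptions made up for this example, not an existing
Pulsar API.

// Sketch only: state names are hypothetical, not an existing Pulsar API.
enum TopicLifecycleState {
    ACTIVE,
    DELETION_REQUESTED,  // admin call can return once this state is recorded
    DELETING_LEDGERS,    // BookKeeper ledgers are deleted in the background,
                         // retried on failure
    DELETED              // only now is the metadata entry actually removed
}

class TopicDeletionStateMachine {
    private TopicLifecycleState state = TopicLifecycleState.ACTIVE;

    // Only forward transitions are allowed; a failed background step keeps
    // the current state so the deletion can be retried later.
    synchronized boolean advance(TopicLifecycleState next) {
        if (next.ordinal() == state.ordinal() + 1) {
            state = next;
            return true;
        }
        return false;
    }

    synchronized TopicLifecycleState current() {
        return state;
    }
}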

This was a long email covering a subject that we haven't dealt with before
in the Pulsar community. Usually, we have discussions about solutions that
are very targeted. It isn't common to transparently discuss existing
design challenges or problems and find ways to solve them together. Sharing
observations about problems would be valuable. High-level problems don't
get reported in the GitHub issue tracker since they aren't individual
bugs, so we should also find ways to address this type of challenge in the
community.

I hope we can change this and also take the opportunity to meet at Pulsar
Community meetings and have more of these in-depth discussions that will
help us improve Pulsar for the benefit of us all in the Apache Pulsar
community.

Since PIP-157 [3] is proceeding, I see it as an opportunity to start
taking the design of Pulsar metadata handling in a direction that
addresses the current challenges in Pulsar's metadata handling and load
balancing. We must decide together what that direction is. I hope this
email brings some new aspects into the basis of those decisions. I'm
hoping that you, the reader of this email, will participate, share your
views, and help develop this direction.

PIP-157 [3] assumes that "Pulsar is able to manage millions of topics but
the number of topics within a single namespace is limited by metadata
storage." Does this assumption hold?

For example, "3) Scalability issue: all metadata changes are broadcasted to
all brokers" will become a challenge in a large system with a high number
of brokers. Together with the other Metadata consistency challenges ( 1 and
2 above), I have a doubt that after PIP-157 is implemented, the bottlenecks
will move to these areas. In that sense, it might be a band-aid that won't
address the root cause of the Pulsar Metadata handling scalability
challenges.

Let's discuss and address the challenges together!

Regards,

-Lari

[1] - Analysis of metadata consistency from the user's point of view -
https://github.com/apache/pulsar/issues/12555#issuecomment-955748744
[2] - MetadataStore interface -
https://github.com/apache/pulsar/blob/master/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/api/MetadataStore.java
[3] - PIP-157: Bucketing topic metadata to allow more topics per namespace -
https://github.com/apache/pulsar/issues/15254
[4] - PIP-45: Pluggable metadata interface -
https://github.com/apache/pulsar/wiki/PIP-45%3A-Pluggable-metadata-interface
[5] - StreamNative's blog "Moving Toward a ZooKeeper-Less Apache Pulsar" -
https://streamnative.io/blog/release/2022-01-25-moving-toward-a-zookeeperless-apache-pulsar/
