https://github.com/apache/pulsar/issues/12651
--- Pasted here for quoting convenience ---

## Motivation

It's known that concurrent topic loading can create back pressure on ZooKeeper (zk) and may increase zk latency, and Pulsar has introduced `maxConcurrentTopicLoadRequest` to limit the concurrency of the topic loading procedure.

Currently we are running a Pulsar broker cluster with about 1 million topics on it, and it's possible that 20 percent of our brokers could break down at the same time (per the SLA of our infrastructure provider). When that happens, about 200K topics need to be loaded as soon as possible.

Here are the problems:

1. It's hard to determine the proper value for `maxConcurrentTopicLoadRequest`. For example, halving this value does not halve the loading speed.
2. With a concurrency limit, it's not easy to estimate the recovery time when a broker shuts down unexpectedly, and in our case our SLA mostly depends on this.

## Goal

So I propose adding a rate limit to the topic loading throttling.

1. It's easy to choose a proper value in a perf test. The rate has a roughly linear relationship with zk CPU usage.
2. It's easy to estimate the maximum time that topic loading takes.

## API Changes

Add a config `topicLoadRateLimit`, defining the maximum number of topics that can be loaded in 1 second.

## Implementation

The basic procedure is the same as throttling by concurrency. Add a field of type RateLimiter in BrokerService and try to acquire a permit before `createPersistentTopic` in `BrokerService#loadOrCreatePersistentTopic`.

## Rejected Alternatives

No alternatives yet.
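
To illustrate the Implementation section above, here is a minimal sketch of a permit-per-topic rate limiter together with the recovery-time estimate it makes possible. This is not Pulsar code: the class `TopicLoadRateLimiter`, its methods, and the injected clock are hypothetical names introduced for illustration, and the proposal does not say which `RateLimiter` type it would use, so the sketch hand-rolls a simple token bucket to stay self-contained. The rate of 1,000 topics per second in the usage note is an assumed value, not one taken from the issue.

```java
import java.util.function.LongSupplier;

/** Hypothetical token-bucket limiter; a sketch, not Pulsar's implementation. */
final class TopicLoadRateLimiter {
    private final long permitsPerSecond;   // would come from topicLoadRateLimit
    private final LongSupplier nanoClock;  // injectable clock, e.g. System::nanoTime
    private double available;              // permits currently in the bucket
    private long lastRefillNanos;

    TopicLoadRateLimiter(long permitsPerSecond, LongSupplier nanoClock) {
        if (permitsPerSecond <= 0) {
            throw new IllegalArgumentException("permitsPerSecond must be positive");
        }
        this.permitsPerSecond = permitsPerSecond;
        this.nanoClock = nanoClock;
        this.available = permitsPerSecond; // start with a full bucket
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    /** Takes one permit if available; the caller would load the topic only on true. */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        // Refill in proportion to elapsed time, capped at one second's worth.
        double refill = (now - lastRefillNanos) / 1e9 * permitsPerSecond;
        available = Math.min(permitsPerSecond, available + refill);
        lastRefillNanos = now;
        if (available >= 1.0) {
            available -= 1.0;
            return true;
        }
        return false;
    }

    /** Rough time for the limiter to admit all topics: ceil(topics / rate). */
    static long estimatedLoadSeconds(long topics, long permitsPerSecond) {
        return (topics + permitsPerSecond - 1) / permitsPerSecond;
    }
}
```

With the 200K topics from the Motivation section and an assumed limit of 1,000 topics per second, `TopicLoadRateLimiter.estimatedLoadSeconds(200_000, 1_000)` gives 200 seconds. That figure covers only the throttling delay, not the time each individual load takes, but it is the kind of direct estimate that a concurrency limit does not offer.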