BewareMyPower opened a new pull request, #24577: URL: https://github.com/apache/pulsar/pull/24577
### Motivation I observed a topic has been loaded for more than 30 seconds in production environment recently after its namespace bundle ownership was transferred. From the logs and heap dump, the conclusion is that the topic has been stuck before calling `asyncOpen` in `createPersistentTopic0`. After revisiting the long code path of topic loading, I found many places are not efficient, and the existing `pulsar_topic_load_times` metric is incorrect. This metric only counts the time from the beginning of `createPersistentTopic0`. However, there are some other time-consuming tasks before it's called. https://github.com/apache/pulsar/blob/c96f27a4dde9e4211a5b83e91f5e384d7ac8d904/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java#L1171 It waits for topic policies of this topic are available before inserting a topic future to `BrokerService#topics`. In extreme cases, there could be many concurrent `TopicPoliciesService#getTopicPoliciesAsync` call. https://github.com/apache/pulsar/blob/c96f27a4dde9e4211a5b83e91f5e384d7ac8d904/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java#L1670 In `loadOrCreatePersistentTopic`, it will check the ownership via `checkTopicNsOwnership` before acquiring the topic load semaphore here: https://github.com/apache/pulsar/blob/c96f27a4dde9e4211a5b83e91f5e384d7ac8d904/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java#L1674 This violates the semantics of the `maxConcurrentTopicLoadRequest` config. After acquiring the semaphore, it calls `checkOwnershipAndCreatePersistentTopic`, which validates the ownership again via `NamespaceService#isServiceUnitActiveAsync`: https://github.com/apache/pulsar/blob/c96f27a4dde9e4211a5b83e91f5e384d7ac8d904/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java#L1739 Actually, the implementation of `isServiceUnitActiveAsync` is exactly the same with `checkTopicNsOwnership`, where the only difference is that the previous one returns a boolean future, while the latter one returns a failed future if it's false. Even after that, it could fetch the topic policies before `createPersistentTopic0`: https://github.com/apache/pulsar/blob/c96f27a4dde9e4211a5b83e91f5e384d7ac8d904/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java#L1745 ### Modifications Major changes: - Remove the `getTopicPoliciesBypassSystemTopic` call before inserting a topic future. This method is only used in `getManagedLedgerConfig`. - Remove the duplicated ownership validation before acquiring the topic load semaphore - Avoid calling `getManagedLedgerConfig` and `fetchPartitionedTopicMetadataAsync` repeatedly by executing these tasks before other tasks that depend on them. - Perform the validations concurrently, including `checkMaxTopicsPerNamespace`, `checkTopicAlreadyMigrated`, `validateTopicConsistency`. Though many tasks are metadata store access with `MetadataCache` used, executing them concurrently is still more efficient. For observability, - Take all tasks after acquiring the topic load semaphore into consideration of the `pulsar_topic_load_times` metric. - Add a log for topic policies get latency specifically when loading a topic. From my experience, using a reader with `hasMessageAvailable` and `readNext` loop could have poor performance when CPU pressure is high. This read loop is also too heavy. Other changes and refactoring: - Pass the `TopicName` instance across the whole flow to avoid unnecessary conversions between `TopicName` and `String`. - Move the `isTransactionInternalName` at the beginning - Replace `isServiceUnitActiveAsync` with `checkTopicNsOwnership` and remove this method to avoid users using this method, which makes code hard to read. ### Documentation <!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. --> - [ ] `doc` <!-- Your PR contains doc changes. --> - [ ] `doc-required` <!-- Your PR changes impact docs and you will update later --> - [x] `doc-not-needed` <!-- Your PR changes do not impact docs --> - [ ] `doc-complete` <!-- Docs have been already added --> ### Matching PR in forked repository PR in forked repository: https://github.com/BewareMyPower/pulsar/pull/50 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
