While researching https://github.com/apache/pulsar/issues/6054, I discovered several key behaviors of a major performance issue that manifests in high-velocity workloads:
1. The problem can be reproduced when batching is disabled on all Pulsar functions.
2. When a subscription is "frozen," it is actually still processing messages, but extremely slowly.
3. When a subscription is "frozen," the broker that owns the "frozen" topic is doing nothing but processing pendingAddEntries in the do loop of *ManagedLedgerImpl.updateLedgersIdsComplete(..)*, here: https://github.com/apache/pulsar/blob/30d48cbbec8b522e8ab595f70fb1edf8b31bb51b/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L1507 (a simplified sketch of this pattern follows below).
4. Restarting the broker that owns the topic will unblock the "freeze" for a non-deterministic amount of time (sometimes minutes, sometimes hours).
5. Functions get blocked from producing while waiting for acks from the broker on pending messages, which fills the function's send semaphore and results in large backlog accumulation (see the producer sketch below).
6. If the input topics stop being written to, the backlogs will eventually clear (after some number of hours), even on topics that are "frozen."
7. The issue is easiest to reproduce in high-velocity/high-volume workloads, especially when partitioned topics and parallel function instances are used.
8. Permit behavior is not abnormal.

Here are some thoughts:

1. I noticed that *ManagedLedgerImpl.updateLedgersIdsComplete(..)* is synchronized. I'm concerned that the mutex may be locking up too much of ManagedLedgerImpl, but I suspect this is not the root cause of the issue, and I'm not sure whether it's possible to add parallelization in this area. (Someone with a deeper understanding of ManagedLedger could chime in here.)
2. Load isn't sufficiently balanced across the brokers. Counting executions of *ManagedLedgerImpl.updateLedgersIdsComplete(..)* in the debug logs showed:
   broker1: 0 executions
   broker2: 2228 executions
   broker3: 8221 executions (this is the broker that owns the "frozen" topic)

In the flow that reproduced the issue, there were 3 topics separated by Pulsar Functions. The first topic (the input topic) has 4 partitions, both Pulsar Functions filter messages (a minimal stand-in for such a filter is sketched below), and the first function (consuming from the first topic) has 4 parallel instances. The topics are all persistent without retention or tiered storage, and the subscriptions are all shared. Here is a graphic that shows the flow:

[image: image.png]

(The green numbers are backlog counts.) In the image, the topic that is "frozen" is the "siege-filtered" topic on broker3.

Looking at the sizes of the debug logs that accumulated on the brokers (where the most frequently executed debug log line occurs inside the do loop of *ManagedLedgerImpl.updateLedgersIdsComplete(..)*), the load imbalance is even clearer:
   broker1: 7.1 MB
   broker2: 242 MB
   broker3: 6.6 GB

Broker CPU isn't fully utilized:
   broker1: 35.30%
   broker2: 53.24%
   broker3: 36.49%

The bookies' CPU also isn't fully utilized:
   bookie1 (on broker1): 14.27%
   bookie2 (on broker2): 16.72%
   bookie3 (on broker3): 9.40%

What's not yet clear to me is why unloading the "frozen" topic clears the blockage; I'd expect it to just move the performance issue to another broker. Any ideas on why unloading the "frozen" topic to another broker improves performance (often enough of a boost to allow the backlogs to catch up) would be helpful.
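To make observation 3 and thought 1 concrete, here is a deliberately simplified sketch of the pattern in question. This is not the actual Pulsar code; it only illustrates how a synchronized method that drains pendingAddEntries in a do loop holds the object monitor for the entire drain, so anything else that synchronizes on the same ManagedLedgerImpl instance has to wait it out:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified illustration (NOT the actual Pulsar code) of the pattern in
// ManagedLedgerImpl.updateLedgersIdsComplete(..): a synchronized method
// draining a queue of pending add operations in a do loop.
public class PendingAddsDrainSketch {
    private final Deque<Runnable> pendingAddEntries = new ArrayDeque<>();

    public synchronized void enqueue(Runnable addOp) {
        pendingAddEntries.add(addOp);
    }

    // Analogue of updateLedgersIdsComplete: every queued add is processed
    // while the object monitor is held for the whole drain, so other
    // synchronized paths on the same instance are blocked in the meantime.
    public synchronized void updateLedgersIdsComplete() {
        Runnable op;
        do {
            op = pendingAddEntries.poll();
            if (op != null) {
                op.run(); // each iteration re-initiates one pending add
            }
        } while (op != null);
    }
}
```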
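Regarding observation 5, here is a minimal sketch of how a producer with a bounded pending-message queue ends up blocked when broker acks stall. The service URL, namespace, and queue size are assumptions for illustration; the relevant settings are maxPendingMessages (the "semaphore" that fills up) and blockIfQueueFull:

```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class BlockedProducerSketch {
    public static void main(String[] args) throws Exception {
        // assumption: local broker URL for illustration
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // assumption: namespace; "siege-filtered" is the topic from the flow above
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/siege-filtered")
                .enableBatching(false)     // batching disabled, as in the repro
                .maxPendingMessages(1000)  // bounded pending queue (the "semaphore")
                .blockIfQueueFull(true)    // block the caller instead of failing fast
                .create();

        for (int i = 0; i < 1_000_000; i++) {
            // If the broker stops acking adds, the pending queue fills and
            // this call blocks, which is how the functions stall and backlog
            // accumulates upstream.
            producer.sendAsync(("msg-" + i).getBytes());
        }
        producer.flush();
        producer.close();
        client.close();
    }
}
```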
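For completeness, both functions in the flow are simple filters. A minimal stand-in would look like the following (the actual filter predicate isn't shown here; the "keep" marker is hypothetical). In the Pulsar Functions Java API, returning null drops the message:

```java
import java.util.function.Function;

// Minimal stand-in for the filtering functions in the flow above.
public class FilterFunctionSketch implements Function<String, String> {
    @Override
    public String apply(String input) {
        // hypothetical predicate: pass through messages containing "keep";
        // returning null produces no output message (the message is dropped)
        return input.contains("keep") ? input : null;
    }
}
```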
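Finally, for anyone who wants to repeat the unload experiment, this is the admin-API equivalent of what we did (the CLI form is pulsar-admin topics unload <topic>). The service URL and namespace below are assumptions:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class UnloadFrozenTopic {
    public static void main(String[] args) throws Exception {
        // assumption: admin endpoint for illustration
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        // Unloading transfers topic ownership to another broker without a
        // full broker restart; in our tests this temporarily relieved the
        // "freeze" for a non-deterministic amount of time.
        admin.topics().unload("persistent://public/default/siege-filtered");
        admin.close();
    }
}
```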
I'm planning to continue the investigation into the root cause of the performance issue, but I wanted to bring this to the attention of the community, since I know a number of you (such as Rajan, Penghui, Jerry, Sijie, Lari Hotari, and others) have assisted with various aspects of the investigation at various points in time.

Devin G. Bost