While researching https://github.com/apache/pulsar/issues/6054, I
discovered several things that point to a major performance issue that
manifests in high-velocity workloads:

1. The problem can be reproduced when batching is disabled on all Pulsar
functions
2. When a subscription is "frozen," it's actually still processing messages
but extremely slowly
3. When a subscription is "frozen," the broker that owns the "frozen" topic
is doing nothing but processing pendingAddEntries in the do loop of
*ManagedLedgerImpl.updateLedgersIdsComplete(..)* (a simplified sketch of
that loop follows this list), here:
https://github.com/apache/pulsar/blob/30d48cbbec8b522e8ab595f70fb1edf8b31bb51b/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L1507
4. Restarting the broker that owns the topic will unblock the "freeze" for
a non-deterministic amount of time (sometimes minutes, sometimes hours).
5. Functions get blocked from producing because they are waiting for acks
from the broker on pending messages; this blocks the function's semaphore
and results in large backlog accumulation. (A sketch of this client-side
blocking also follows this list.)
6. If input topics stop being written to, the backlogs will eventually
clear (after several hours), even on topics that are "frozen."
7. The issue is easiest to reproduce in high-velocity/high-volume
workloads, especially if partitioned topics and parallel function instances
are used.
8. Permit behavior is not abnormal.
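
To make observation 3 concrete, here is a much-simplified sketch of the
pattern I believe is in play. This is not the actual Pulsar code; the
PendingAddEntry type and queue layout are placeholders. The point is that
the rollover-completion method is synchronized, so the entire drain of
pendingAddEntries runs while holding the ManagedLedgerImpl monitor:

import java.util.concurrent.ConcurrentLinkedQueue;

// Simplified illustration only; not the actual ManagedLedgerImpl code.
// PendingAddEntry and the queue layout are placeholders for the real types.
class LedgerRolloverSketch {

    private final ConcurrentLinkedQueue<PendingAddEntry> pendingAddEntries =
            new ConcurrentLinkedQueue<>();

    // The real updateLedgersIdsComplete(..) is synchronized, so the whole
    // drain below happens while holding the ManagedLedgerImpl monitor.
    synchronized void updateLedgersIdsComplete() {
        PendingAddEntry entry;
        // Replay every add that was queued while the new ledger was created.
        while ((entry = pendingAddEntries.poll()) != null) {
            // If producers enqueue entries faster than this loop re-initiates
            // them, the broker can spend nearly all of its time here.
            entry.initiateOnNewLedger();
        }
    }

    interface PendingAddEntry {
        void initiateOnNewLedger();
    }
}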
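
And for observation 5, below is a sketch of the client-side mechanism that
stalls the functions, using the standard producer API. The service URL,
topic name, and queue size are placeholders, and this simplifies what the
functions runtime actually does: with a bounded pending-message queue and
blockIfQueueFull enabled, send() blocks as soon as the broker stops acking,
so the function instance stops pulling new input and the backlog grows.

import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class BlockingProducerSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder service URL and topic name.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("persistent://public/default/siege-filtered")
                .enableBatching(false)     // batching disabled, as in observation 1
                .maxPendingMessages(1000)  // bounded in-flight queue
                .blockIfQueueFull(true)    // block the caller instead of failing
                .create();

        // If the broker stops acknowledging adds (the "frozen" state), the
        // pending queue fills up and this call blocks, which in a function
        // instance means no further input messages are consumed.
        MessageId id = producer.send("example-payload");
        System.out.println("acked: " + id);

        producer.close();
        client.close();
    }
}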

Here are some thoughts:
1. I noticed that *ManagedLedgerImpl.updateLedgersIdsComplete(..)* is
synchronized. I'm concerned that holding the ManagedLedgerImpl monitor for
the entire drain may be blocking too much other work on the managed ledger,
but I suspect this is not the root cause of the issue, and I'm not sure
whether it's possible to add parallelization in this area. (Someone with a
deeper understanding of ManagedLedger could chime in here.)
2. Load isn't sufficiently balanced across the brokers. Checking debug logs
for executions of *ManagedLedgerImpl.updateLedgersIdsComplete(..)*
indicated:
   broker1: 0 executions
   broker2: 2228 executions
   broker3: 8221 executions. (This is the broker that owns the "frozen"
topic.)
In the flow that reproduced the issue, there are three topics connected by
two Pulsar Functions: the first topic (the input topic) has 4 partitions,
both functions filter messages, and the first function (consuming from the
first topic) runs 4 parallel instances. The topics are all persistent
without retention or tiered storage, and the subscriptions are all shared.
Here is a graphic that shows the flow:
[image: image.png]
(The green numbers are backlog counts.)
In the image, the topic that is "frozen" is the "siege-filtered" topic on
broker3.
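
For reference, the backlog counts and topic ownership shown in the graphic
can be pulled with the admin API. Here's a rough sketch (the admin URL and
topic name are placeholders, and it uses the getter-style stats accessors
from a recent client; on older versions these are public fields instead):

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.TopicStats;

public class BacklogAndOwnerCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder admin URL and topic name.
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {

            String topic = "persistent://public/default/siege-filtered";

            // Which broker currently owns the topic (broker3 in this scenario).
            System.out.println("owner: " + admin.lookups().lookupTopic(topic));

            // Per-subscription backlog, i.e. the green numbers in the graphic.
            TopicStats stats = admin.topics().getStats(topic);
            stats.getSubscriptions().forEach((name, sub) ->
                    System.out.println(name + " backlog=" + sub.getMsgBacklog()));
        }
    }
}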

Looking at the sizes of the debug logs that accumulated on the brokers
(with the most frequently executed debug log line occurring inside the do
loop of *ManagedLedgerImpl.updateLedgersIdsComplete(..)*), the load
imbalance is even clearer:
   broker1: 7.1 MB
   broker2: 242 MB
   broker3: 6.6 GB

The brokers' CPU isn't fully utilized:
  broker1: 35.30%
  broker2: 53.24%
  broker3: 36.49%

The bookies' CPU also isn't fully utilized:
  bookie1 (on broker1): 14.27%
  bookie2 (on broker2): 16.72%
  bookie3 (on broker3): 9.40%

The thing that's not clear to me yet is why unloading the "frozen" topic
clears the blockage; I'd expect it to just move the performance issue to
another broker. Any ideas on why unloading the "frozen" topic onto another
broker improves performance (often by enough of a boost for the backlogs to
catch up) would be helpful.
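
To be clear, the unload I'm referring to is just the standard admin
operation, roughly as below (admin URL and topic name are placeholders). It
makes the current owner release the topic so the next lookup reassigns it,
which is why I'd expect the problem to follow the topic rather than go
away. The equivalent CLI is pulsar-admin topics unload <topic>.

import org.apache.pulsar.client.admin.PulsarAdmin;

public class UnloadFrozenTopic {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")  // placeholder URL
                .build()) {
            // Releases ownership of the topic on its current broker; the
            // next lookup reassigns it, possibly to a different broker.
            admin.topics().unload("persistent://public/default/siege-filtered");
        }
    }
}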

I'm planning to continue the investigation into the root cause of the
performance issue, but I wanted to bring this to the attention of the
community since I know a number of you (such as Rajan, Penghui, Jerry,
Sijie, Lari Hotari, and others) have assisted with various aspects of the
investigation at various points in time.

Devin G. Bost
