Hi all

I want to start a discussion related to PR #24423: Handling
Overloaded Netty Channels in Apache Pulsar.

Problem Statement
We've encountered a critical issue in our Apache Pulsar clusters where
brokers experience Out-Of-Memory (OOM) errors and continuous restarts under
specific load patterns. This occurs when Netty channel write buffers become
full, leading to a buildup of unacknowledged responses in the broker's
memory.
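
For context on what "write buffers become full" means at the Netty level:
each channel has write-buffer water marks, and once the queued outbound
bytes exceed the high water mark the channel reports itself as not
writable, while the queued data keeps occupying heap. A small illustration
of how such limits are typically configured (the values and class name are
placeholders, not Pulsar's actual settings):

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.WriteBufferWaterMark;

public class WriteBufferWaterMarkExample {
    // Illustrative values only: once more than 1 MiB of outbound data is
    // queued on a channel, Channel.isWritable() returns false until the
    // backlog drains below 512 KiB. Bytes queued beyond the limit still
    // occupy broker memory, which is the buildup described above.
    static ServerBootstrap configure(ServerBootstrap bootstrap) {
        return bootstrap.childOption(
                ChannelOption.WRITE_BUFFER_WATER_MARK,
                new WriteBufferWaterMark(512 * 1024, 1024 * 1024));
    }
}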

Background
Our clusters are configured with numerous namespaces, each containing
approximately 8,000 to 10,000 topics. Our consumer applications are
large-scale: each consumer uses a regular expression (regex) pattern to
subscribe to all topics within a namespace.
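
For illustration, here is a minimal sketch of the kind of regex
subscription each consumer performs (the service URL, tenant/namespace,
and subscription name are placeholders):

import java.util.regex.Pattern;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class RegexSubscriptionExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // placeholder URL
                .build();

        // Subscribing by pattern triggers topic discovery
        // (getTopicsOfNamespace) for every matching topic in the namespace.
        Consumer<byte[]> consumer = client.newConsumer(Schema.BYTES)
                .topicsPattern(Pattern.compile(
                        "persistent://my-tenant/my-namespace/.*"))
                .subscriptionName("my-subscription")
                .subscribe();

        // ... consume messages ...
        consumer.close();
        client.close();
    }
}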

The problem manifests particularly during consumer application restarts.
When a consumer restarts, it issues a getTopicsOfNamespace request. Due to
the sheer number of topics, the response size is extremely large. This
massive response overwhelms the socket output buffer, causing it to fill up
rapidly. Consequently, the broker's responses get backlogged in memory,
eventually leading to the broker's OOM and subsequent restart loop.

Why "Returning an Error" Is Not a Solution
A common approach to handling overload is to simply return an error when
the broker cannot process a request. However, in this specific scenario,
this solution is ineffective. If a consumer application fails to start due
to an error, it triggers a user pod restart, which then leads to the same
getTopicsOfNamespace request being reissued, resulting in a continuous loop
of errors and restarts. This creates an unrecoverable state for the
consumer application and puts immense pressure on the brokers.

Proposed Solution and Justification
We believe the solution proposed in
https://github.com/apache/pulsar/pull/24423 is well suited to
addressing this issue. The core mechanism introduced in this PR,
pausing acceptance of new requests when a channel cannot handle more
output, directly addresses the root cause of the memory pressure.

This approach prevents the broker from accepting new requests when its
write buffers are full, effectively backpressuring the client and
preventing the memory buildup that leads to OOMs. Furthermore, we
anticipate that this mechanism will not significantly increase future
maintenance costs, as it handles overload scenarios at the network
layer.
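
For readers who have not looked at the PR yet, my understanding is that
the mechanism follows the standard Netty backpressure pattern: stop
reading from a channel while its outbound buffer is not writable and
resume once it drains. A minimal sketch of that pattern (not the exact
code from the PR; the handler name is mine):

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Sketch of channel-level backpressure: pause reads while the outbound
// buffer is above the high water mark, resume once it drains.
public class BackpressureHandler extends ChannelInboundHandlerAdapter {

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg)
            throws Exception {
        // If the outbound buffer is full, stop accepting new requests
        // before passing this one along.
        if (!ctx.channel().isWritable()) {
            ctx.channel().config().setAutoRead(false);
        }
        ctx.fireChannelRead(msg);
    }

    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx)
            throws Exception {
        // The outbound buffer dropped below the low water mark again:
        // resume reading so the client can continue.
        if (ctx.channel().isWritable()) {
            ctx.channel().config().setAutoRead(true);
        }
        ctx.fireChannelWritabilityChanged();
    }
}

Because setAutoRead(false) stops reading from the socket, TCP flow
control eventually pushes back on the client, so the client slows down
naturally instead of receiving errors and restarting.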

I invite the community to discuss this solution and its potential benefits
for the overall stability and resilience of Apache Pulsar.

Thanks
Yubiao Feng

