Connection-level rate limiting has the following drawbacks:

Lacks a global perspective. Limits are enforced independently per
connection, so when the number of clients is sufficiently high, the
aggregate buffered data can still lead to Out-Of-Memory (OOM) errors.

Mutual interference. A single connection multiplexes many operations,
so throttling it penalizes production, consumption, heartbeats, and
other operations on that connection together.

Inconsistent design. There is no clear rule for when rate limiting
should be TCP-based (pausing reads on the socket) and when it should
go through a RateLimiter.

Instead, we can establish a multi-layered mechanism:

API layer restrictions: limit the number of in-flight requests, e.g.
producer throttling and consumer throttling (a sketch follows this
list).

Memory layer restrictions: a bounded memory pool; when the pool is
exhausted, block processing of new requests rather than allocating
more memory (see the second sketch below).
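
To make the API-layer idea concrete, here is a minimal sketch of an
in-flight request limiter. It is illustrative only, not code from the
PR; the class name and the MAX_IN_FLIGHT constant are made up:

    import java.util.concurrent.Semaphore;

    // Hypothetical sketch: cap how many requests the broker will
    // hold in flight at once, independent of which connection they
    // arrived on.
    public class InFlightRequestLimiter {
        private static final int MAX_IN_FLIGHT = 10_000;
        private final Semaphore permits = new Semaphore(MAX_IN_FLIGHT);

        // Called before dispatching a request; the caller defers or
        // rejects the request when this returns false.
        public boolean tryAcquire() {
            return permits.tryAcquire();
        }

        // Called exactly once after the response has been flushed.
        public void release() {
            permits.release();
        }
    }

Because the bound is broker-wide rather than per connection, it
provides the global perspective that connection-level limits lack.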
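
For the memory/network layer, the mechanism described in PR #24423
(pausing acceptance of new requests when a channel cannot handle more
output) can be modeled by reacting to Netty channel writability. The
following handler is a simplified sketch of that idea, not the actual
patch:

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;

    // Sketch: stop reading new requests from a connection while its
    // write buffer is full, and resume once it drains.
    public class WritabilityBackpressureHandler
            extends ChannelInboundHandlerAdapter {
        @Override
        public void channelWritabilityChanged(ChannelHandlerContext ctx)
                throws Exception {
            // isWritable() flips according to the channel's
            // write-buffer high/low water marks.
            ctx.channel().config().setAutoRead(ctx.channel().isWritable());
            ctx.fireChannelWritabilityChanged();
        }
    }

With autoRead disabled, Netty stops pulling bytes off the socket,
inbound data backs up in the kernel buffer, and TCP flow control
pushes the pressure back to the client instead of letting responses
pile up in broker memory.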

On 2025/07/08 06:07:10 Yubiao Feng wrote:
> Hi all
> 
> I want to start a discussion related to PR #24423: Handling
> Overloaded Netty Channels in Apache Pulsar
> 
> Problem Statement
> We've encountered a critical issue in our Apache Pulsar clusters where
> brokers experience Out-Of-Memory (OOM) errors and continuous restarts under
> specific load patterns. This occurs when Netty channel write buffers become
> full, leading to a buildup of unacknowledged responses in the broker's
> memory.
> 
> Background
> Our clusters are configured with numerous namespaces, each containing
> approximately 8,000 to 10,000 topics. Our consumer applications are quite
> large, with each consumer using a regular expression (regex) pattern to
> subscribe to all topics within a namespace.
> 
> The problem manifests particularly during consumer application restarts.
> When a consumer restarts, it issues a getTopicsOfNamespace request. Due to
> the sheer number of topics, the response size is extremely large. This
> massive response overwhelms the socket output buffer, causing it to fill up
> rapidly. Consequently, the broker's responses get backlogged in memory,
> eventually leading to the broker's OOM and subsequent restart loop.
> 
> Why "Returning an Error" Is Not a Solution
> A common approach to handling overload is to simply return an error when
> the broker cannot process a request. However, in this specific scenario,
> this solution is ineffective. If a consumer application fails to start due
> to an error, it triggers a user pod restart, which then leads to the same
> getTopicsOfNamespace request being reissued, resulting in a continuous loop
> of errors and restarts. This creates an unrecoverable state for the
> consumer application and puts immense pressure on the brokers.
> 
> Proposed Solution and Justification
> We believe the solution proposed in
> https://github.com/apache/pulsar/pull/24423 is highly suitable for
> addressing this issue. The core mechanism introduced in this PR – pausing
> acceptance of new requests when a channel cannot handle more output – is
> exceptionally reasonable and addresses the root cause of the memory
> pressure.
> 
> This approach prevents the broker from accepting new requests when its
> write buffers are full, effectively backpressuring the client and
> preventing the memory buildup that leads to OOMs. Furthermore, we
> anticipate that this mechanism will not significantly increase future
> maintenance costs, as it elegantly handles overload scenarios at a
> fundamental network layer.
> 
> I invite the community to discuss this solution and its potential benefits
> for the overall stability and resilience of Apache Pulsar.
> 
> Thanks
> Yubiao Feng
> 
