Connection-level rate limiting has the following drawbacks:

- Lacks a global perspective: when the number of clients is sufficiently high, it can still lead to Out-Of-Memory (OOM) errors.
- Mutual interference: it can negatively impact production, consumption, heartbeat, and other operations.
- Design consistency: when should rate limiting be TCP-based, and when should it be based on a RateLimiter?

We can instead establish a multi-layered mechanism:

- API layer restrictions: limit based on in-flight requests, such as producer throttling and consumer throttling.
- Memory layer restrictions: memory pool limitations, blocking new request processing rather than allocating more memory when the pool is exhausted.

A rough sketch of the pause/resume idea behind both layers is shown below.
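To illustrate the API-layer limit (this is not existing Pulsar code; the class and method names here are made up), a broker-side Netty handler could cap in-flight requests per connection and stop reading from the socket once the cap is reached, resuming once responses have been flushed:

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: cap in-flight requests per connection and apply
// backpressure by pausing socket reads instead of buffering more work.
public class InflightLimitHandler extends ChannelInboundHandlerAdapter {

    private final int maxInflight;
    private final AtomicInteger inflight = new AtomicInteger();

    public InflightLimitHandler(int maxInflight) {
        this.maxInflight = maxInflight;
    }

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
        if (inflight.incrementAndGet() >= maxInflight) {
            // Stop reading new requests from this connection; the client
            // backs up on its own socket instead of on the broker's heap.
            ctx.channel().config().setAutoRead(false);
        }
        ctx.fireChannelRead(msg);
    }

    // Hypothetical hook, called once a request's response has been written
    // and flushed, so reads can resume.
    public void onResponseFlushed(ChannelHandlerContext ctx) {
        if (inflight.decrementAndGet() < maxInflight) {
            ctx.channel().config().setAutoRead(true);
        }
    }
}
```

The memory-layer limit would reuse the same pause/resume switch, only keyed on a shared memory pool rather than a per-connection counter.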
On 2025/07/08 06:07:10 Yubiao Feng wrote:
> Hi all
>
> I want to start a discussion, which relates to PR #24423: Handling
> Overloaded Netty Channels in Apache Pulsar
>
> Problem Statement
> We've encountered a critical issue in our Apache Pulsar clusters where
> brokers experience Out-Of-Memory (OOM) errors and continuous restarts under
> specific load patterns. This occurs when Netty channel write buffers become
> full, leading to a buildup of unacknowledged responses in the broker's
> memory.
>
> Background
> Our clusters are configured with numerous namespaces, each containing
> approximately 8,000 to 10,000 topics. Our consumer applications are quite
> large, with each consumer using a regular expression (regex) pattern to
> subscribe to all topics within a namespace.
>
> The problem manifests particularly during consumer application restarts.
> When a consumer restarts, it issues a getTopicsOfNamespace request. Due to
> the sheer number of topics, the response size is extremely large. This
> massive response overwhelms the socket output buffer, causing it to fill up
> rapidly. Consequently, the broker's responses get backlogged in memory,
> eventually leading to the broker's OOM and subsequent restart loop.
>
> Why "Returning an Error" Is Not a Solution
> A common approach to handling overload is to simply return an error when
> the broker cannot process a request. However, in this specific scenario,
> this solution is ineffective. If a consumer application fails to start due
> to an error, it triggers a user pod restart, which then leads to the same
> getTopicsOfNamespace request being reissued, resulting in a continuous loop
> of errors and restarts. This creates an unrecoverable state for the
> consumer application and puts immense pressure on the brokers.
>
> Proposed Solution and Justification
> We believe the solution proposed in
> https://github.com/apache/pulsar/pull/24423 is highly suitable for
> addressing this issue. The core mechanism introduced in this PR – pausing
> acceptance of new requests when a channel cannot handle more output – is
> exceptionally reasonable and addresses the root cause of the memory
> pressure.
>
> This approach prevents the broker from accepting new requests when its
> write buffers are full, effectively backpressuring the client and
> preventing the memory buildup that leads to OOMs. Furthermore, we
> anticipate that this mechanism will not significantly increase future
> maintenance costs, as it elegantly handles overload scenarios at a
> fundamental network layer.
>
> I invite the community to discuss this solution and its potential benefits
> for the overall stability and resilience of Apache Pulsar.
>
> Thanks
> Yubiao Feng
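For completeness, the pausing mechanism described in the quoted mail (stop accepting new requests while the channel cannot handle more output) corresponds roughly to Netty's channel-writability callbacks. A minimal sketch, assuming a handler in the broker's channel pipeline; this is illustrative only and not the actual implementation in PR #24423:

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Illustrative only: pause reads while the outbound buffer is full.
public class WritabilityBackpressureHandler extends ChannelInboundHandlerAdapter {

    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) {
        // isWritable() flips to false once pending outbound bytes exceed the
        // channel's high write-buffer watermark, and back to true once they
        // drain below the low watermark.
        ctx.channel().config().setAutoRead(ctx.channel().isWritable());
        ctx.fireChannelWritabilityChanged();
    }
}
```

How aggressively reads are paused is governed by the channel's WriteBufferWaterMark, so the thresholds are tunable per connection.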