Hi all,

I want to start a discussion related to PR #24423: Handling Overloaded Netty Channels in Apache Pulsar.
Problem Statement

We have encountered a critical issue in our Apache Pulsar clusters where brokers hit Out-Of-Memory (OOM) errors and restart continuously under specific load patterns. This happens when Netty channel write buffers fill up, causing unsent responses to accumulate in the broker's memory.

Background

Our clusters are configured with many namespaces, each containing roughly 8,000 to 10,000 topics. Our consumer applications are large, and each consumer subscribes to all topics in a namespace using a regular expression (regex) pattern (a sketch of such a subscription is appended below my signature).

The problem shows up most clearly during consumer application restarts. When a consumer restarts, it issues a getTopicsOfNamespace request. Because of the sheer number of topics, the response is extremely large and quickly fills the socket output buffer. The broker's responses then back up in memory, eventually driving the broker into OOM and a restart loop.

Why "Returning an Error" Is Not a Solution

A common way to handle overload is to simply return an error when the broker cannot process a request. In this scenario, however, that approach is ineffective. If a consumer application fails to start because of the error, the user's pod restarts and reissues the same getTopicsOfNamespace request, producing a continuous loop of errors and restarts. This leaves the consumer application in an unrecoverable state and keeps the brokers under heavy pressure.

Proposed Solution and Justification

We believe the solution proposed in https://github.com/apache/pulsar/pull/24423 is well suited to this issue. The core mechanism introduced in the PR, pausing acceptance of new requests when a channel cannot handle more output, is reasonable and addresses the root cause of the memory pressure. By refusing to read new requests while its write buffers are full, the broker effectively backpressures the client and prevents the memory buildup that leads to OOMs (a minimal sketch of this general Netty backpressure pattern is also appended below my signature). We also expect this mechanism to add little future maintenance cost, because it handles overload at the network layer where the problem originates.

I invite the community to discuss this solution and its potential benefits for the overall stability and resilience of Apache Pulsar.

Thanks,
Yubiao Feng
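
Sketch 1: Pattern-based subscription (for illustration only)

This is a minimal example of how our consumers subscribe by pattern; the service URL, tenant/namespace, and subscription name are placeholders, not our real configuration. Subscribing by pattern is what makes the client issue a getTopicsOfNamespace lookup on every (re)start.

import java.util.regex.Pattern;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class PatternSubscribeExample {
    public static void main(String[] args) throws Exception {
        // Placeholder service URL; replace with your broker address.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Subscribing by pattern: the client resolves the pattern with a
        // getTopicsOfNamespace lookup when it (re)starts. With 8,000-10,000
        // topics per namespace, that response becomes very large.
        Consumer<byte[]> consumer = client.newConsumer()
                .topicsPattern(Pattern.compile("persistent://my-tenant/my-namespace/.*"))
                .subscriptionName("my-subscription")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        // ... consume as usual ...
        consumer.close();
        client.close();
    }
}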
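
Sketch 2: Netty writability backpressure (general pattern, not the exact PR code)

The following is a minimal sketch, under my own naming, of the general Netty writability/auto-read backpressure pattern that the PR's approach builds on; it is not the actual implementation in #24423.

import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;

// Pauses reads while the channel's outbound buffer is above the high write
// watermark and resumes once it drains below the low watermark.
public class WritabilityBackpressureHandler extends ChannelDuplexHandler {

    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) throws Exception {
        // isWritable() flips to false when pending outbound bytes exceed the
        // channel's high watermark, and back to true once below the low one.
        boolean writable = ctx.channel().isWritable();

        // Stop (or resume) reading new requests from the socket. With autoRead
        // disabled, no new commands are pulled in, so no new responses can pile
        // up on the heap while the socket cannot drain.
        ctx.channel().config().setAutoRead(writable);

        ctx.fireChannelWritabilityChanged();
    }
}

With auto-read disabled, the broker stops pulling new commands from that connection, so TCP flow control eventually slows the client down instead of letting responses accumulate in broker memory.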