Hi Yubiao:

If a sufficient number of PulsarClients are started at the same time, the
number of channels grows with them, and concurrent requests can still cause a
broker OOM. For example, with each channel limited to 100MB, 1 million
channels can still buffer up to roughly 100TB in the worst case. This is what
I meant by the lack of a global perspective: this throttling is not
calculated across all connections.

If the root cause is a direct memory overflow, should we instead impose
limits on direct memory allocation? All components share the same direct
memory pool, which is counted globally, so throttling there would be more
effective.

If the root cause is excessive retries, the client should back off. Even
after this feature is added, the client keeps retrying; the broker will no
longer OOM, but produce and consume are still affected, so the unavailability
issue remains.
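To make the "globally counted" throttling I have in mind concrete, here is a
rough sketch. The class below is hypothetical (it is not existing Pulsar
code); it only illustrates charging pending response bytes against one
broker-wide budget instead of a per-channel one:

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical sketch: one counter shared by every connection, so the
    // limit is enforced against the broker-wide total, not per channel.
    public final class GlobalPendingBytesLimiter {
        private final AtomicLong pendingBytes = new AtomicLong();
        private final long maxPendingBytes; // e.g. a fraction of -XX:MaxDirectMemorySize

        public GlobalPendingBytesLimiter(long maxPendingBytes) {
            this.maxPendingBytes = maxPendingBytes;
        }

        /** Call before buffering a response; false means pause reads / shed load. */
        public boolean tryReserve(long bytes) {
            if (pendingBytes.addAndGet(bytes) > maxPendingBytes) {
                pendingBytes.addAndGet(-bytes); // roll back the reservation
                return false;
            }
            return true;
        }

        /** Call after the response has actually been flushed to the socket. */
        public void release(long bytes) {
            pendingBytes.addAndGet(-bytes);
        }
    }

The budget could be derived from the direct memory limit
(-XX:MaxDirectMemorySize), since that is the pool every component ultimately
draws from.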
On 2025/07/11 01:50:40 Yubiao Feng wrote:
> Hi LinLin
>
> > We can establish a multi-layered mechanism:
> > API layer restrictions: Limiting based on in-flight requests, such as
> > producer throttling and consumer throttling.
> > Memory layer restrictions: Memory pool limitations, blocking new request
> > processing rather than allocating more memory when the pool is exhausted.
>
> PR #24423 happens to be an implementation of layer 2 of your suggested
> mechanism. We can do layer 1 in a separate PR.
>
> > Lacks a global perspective. When the number of clients is sufficiently
> > high, it can still lead to Out-Of-Memory (OOM) errors.
>
> As answered above, this fix only addresses the OOM caused by a large
> number of responses accumulating in memory while a channel is unwritable.
> BTW, if you have a large number of clients, you can reduce the backlog of
> responses allowed for each channel by adjusting the parameter
> `connectionMaxPendingWriteBytes`.
>
> > Mutual interference. It can negatively impact production, consumption,
> > heartbeat, and other operations.
>
> As answered above, once a channel is not writable, no request sent on that
> channel can receive a reply anyway, because the response cannot be written
> out. The result is the same: clients receive replies late. To improve
> performance, users should consider using more channels; in other words,
> they can set a bigger `connectionsPerBroker` or separate clients.
>
> On Fri, Jul 11, 2025 at 8:54 AM Lin Lin <lin...@apache.org> wrote:
>
> > Connection-level rate limiting has the following drawbacks:
> >
> > Lacks a global perspective. When the number of clients is sufficiently
> > high, it can still lead to Out-Of-Memory (OOM) errors.
> >
> > Mutual interference. It can negatively impact production, consumption,
> > heartbeat, and other operations.
> >
> > Design consistency. When should rate limiting be TCP-based, and when
> > should it be based on a RateLimiter?
> >
> > We can establish a multi-layered mechanism:
> >
> > API layer restrictions: Limiting based on in-flight requests, such as
> > producer throttling and consumer throttling.
> >
> > Memory layer restrictions: Memory pool limitations, blocking new request
> > processing rather than allocating more memory when the pool is exhausted.
> >
> > On 2025/07/08 06:07:10 Yubiao Feng wrote:
> > > Hi all
> > >
> > > I want to start a discussion that relates to PR #24423: Handling
> > > Overloaded Netty Channels in Apache Pulsar.
> > >
> > > Problem Statement
> > > We've encountered a critical issue in our Apache Pulsar clusters where
> > > brokers experience Out-Of-Memory (OOM) errors and continuous restarts
> > > under specific load patterns.
> > > This occurs when Netty channel write buffers become full, leading to a
> > > buildup of unacknowledged responses in the broker's memory.
> > >
> > > Background
> > > Our clusters are configured with numerous namespaces, each containing
> > > approximately 8,000 to 10,000 topics. Our consumer applications are
> > > quite large, with each consumer using a regular expression (regex)
> > > pattern to subscribe to all topics within a namespace.
> > >
> > > The problem manifests particularly during consumer application
> > > restarts. When a consumer restarts, it issues a getTopicsOfNamespace
> > > request. Due to the sheer number of topics, the response size is
> > > extremely large. This massive response overwhelms the socket output
> > > buffer, causing it to fill up rapidly. Consequently, the broker's
> > > responses get backlogged in memory, eventually leading to the broker's
> > > OOM and a subsequent restart loop.
> > >
> > > Why "Returning an Error" Is Not a Solution
> > > A common approach to handling overload is to simply return an error
> > > when the broker cannot process a request. However, in this specific
> > > scenario, that solution is ineffective. If a consumer application
> > > fails to start due to an error, it triggers a user pod restart, which
> > > then leads to the same getTopicsOfNamespace request being reissued,
> > > resulting in a continuous loop of errors and restarts. This creates an
> > > unrecoverable state for the consumer application and puts immense
> > > pressure on the brokers.
> > >
> > > Proposed Solution and Justification
> > > We believe the solution proposed in
> > > https://github.com/apache/pulsar/pull/24423 is highly suitable for
> > > addressing this issue. The core mechanism introduced in this PR
> > > (pausing acceptance of new requests when a channel cannot handle more
> > > output) is exceptionally reasonable and addresses the root cause of
> > > the memory pressure.
> > >
> > > This approach prevents the broker from accepting new requests when its
> > > write buffers are full, effectively backpressuring the client and
> > > preventing the memory buildup that leads to OOMs. Furthermore, we
> > > anticipate that this mechanism will not significantly increase future
> > > maintenance costs, as it elegantly handles overload scenarios at a
> > > fundamental network layer.
> > >
> > > I invite the community to discuss this solution and its potential
> > > benefits for the overall stability and resilience of Apache Pulsar.
> > >
> > > Thanks
> > > Yubiao Feng
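For anyone following the thread who has not read the PR: the "pausing
acceptance of new requests when a channel cannot handle more output"
mechanism described above maps onto Netty's channel-writability callback. A
generic sketch of that pattern (not the PR's actual code, which may differ)
looks like this:

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;

    // Generic Netty backpressure pattern: stop reading new requests while the
    // outbound buffer is above the high water mark, resume once it drains.
    public class PauseReadsWhenUnwritableHandler extends ChannelInboundHandlerAdapter {

        @Override
        public void channelWritabilityChanged(ChannelHandlerContext ctx) throws Exception {
            // isWritable() flips to false once pending write bytes cross the
            // configured high water mark, and back to true below the low one.
            ctx.channel().config().setAutoRead(ctx.channel().isWritable());
            ctx.fireChannelWritabilityChanged();
        }
    }

    // The water marks that define "unwritable" are set on the server bootstrap,
    // e.g. bootstrap.childOption(ChannelOption.WRITE_BUFFER_WATER_MARK,
    //         new WriteBufferWaterMark(32 * 1024, 64 * 1024));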
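For completeness, the two knobs mentioned in the quoted reply:
`connectionsPerBroker` is an existing option on the Java client builder,
while `connectionMaxPendingWriteBytes` is the per-channel backlog limit
discussed in the PR (its final name and default are whatever the PR lands
with). A minimal client-side example, with a placeholder service URL:

    import org.apache.pulsar.client.api.PulsarClient;
    import org.apache.pulsar.client.api.PulsarClientException;

    public class ConnectionsPerBrokerExample {
        public static void main(String[] args) throws PulsarClientException {
            // connectionsPerBroker (default 1) controls how many TCP connections
            // the client opens to each broker; with more channels, one unwritable
            // channel stalls a smaller share of the traffic.
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://broker.example.com:6650") // placeholder address
                    .connectionsPerBroker(4)
                    .build();
            client.close();
        }
    }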