Thank you Mattison for starting the discussion. Just adding some more context.
In general, every type of processing pipeline needs a solution for backpressure. This can mean different kinds of solutions for different goals, depending on the context. Usually a backpressure solution is some type of flow control solution (Wikipedia: https://en.wikipedia.org/wiki/Flow_control_(data) ).

There are multiple backpressure solutions in Pulsar. In the broker, flow control is typically achieved by throttling based on explicit rate limits. There is also feedback-based flow control in the Pulsar binary protocol between the broker and consumers, where "permits" are exchanged between the broker and the consumer (documentation: https://pulsar.apache.org/docs/develop-binary-protocol/#flow-control ).

In Netty servers and clients, a common technical solution for backpressure is to toggle the Netty "auto read" state of the TCP/IP connection to pause or resume reading more input. It is also common in Netty servers and clients to stop reading input from the TCP/IP connection when the output buffer of the connection is full and to resume when the output buffer level drops below a threshold. Pulsar uses these TCP/IP connection level controls to achieve backpressure for producers. The challenge is that Pulsar can share a single TCP/IP connection across multiple producers and consumers. Because of this, multiple producer and dispatcher rate limiters can be operating on the same connection in the broker, and they will make conflicting decisions, which results in inconsistent behavior.
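To make those connection-level mechanisms more concrete, here is a minimal Netty sketch (a hypothetical handler, not Pulsar's actual code): the output buffer water marks drive the channel's writability, and the handler toggles "auto read" based on it so that input is paused while the output buffer is full.

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.WriteBufferWaterMark;

public class ConnectionBackpressureHandler extends ChannelInboundHandlerAdapter {

    @Override
    public void handlerAdded(ChannelHandlerContext ctx) {
        // Writability flips to false when pending output exceeds the high
        // water mark and back to true when it drains below the low water mark.
        ctx.channel().config()
           .setWriteBufferWaterMark(new WriteBufferWaterMark(32 * 1024, 64 * 1024));
    }

    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) {
        // Pause reading more input from the TCP/IP connection while the
        // output buffer is full, and resume once it has drained.
        ctx.channel().config().setAutoRead(ctx.channel().isWritable());
        ctx.fireChannelWritabilityChanged();
    }
}

Since this operates at the connection level, it necessarily affects every producer and consumer multiplexed on that connection, which is exactly the source of the conflicting decisions mentioned above.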
For backpressure between Pulsar brokers and BookKeeper, there are multiple backpressure/flow control solutions. There is also an explicit backpressure feature in BookKeeper that can be enabled on the server and client side. There is an issue https://github.com/apache/pulsar/issues/10439 to track enabling this feature in the Pulsar broker and in the default BookKeeper configuration for Pulsar deployments. Even without this explicit feature, there are forms of backpressure in place between brokers and bookies. It would be helpful if the flow control between brokers and bookies were explained and documented in detail. This would help both Pulsar users and developers build an understanding of the performance model of Pulsar.

One property of flow control is that it can fail if the whole pipeline isn't covered end-to-end. Gaps in flow control can also lead to amplification issues, where the system has to cope with even more load because of retries that clients send when previous requests time out or fail. There can also be thundering herd issues, where the system gets high spikes of traffic when requests get buffered for various reasons, and this can lead to cascading failures when the total load of the system increases in unexpected ways because of amplification effects.

Touching upon the Pulsar Admin REST API: the context for backpressure is very different there. Before the PIP-149 async changes, there was explicit backpressure in the REST API implementation: the system could handle a limited amount of work and it would process downstream work items one by one. With "PIP-149 Making the REST Admin API fully async" (https://github.com/apache/pulsar/issues/14365), there are different challenges related to backpressure. It is usually about how to limit the in-progress work in the system.

An async system will accept a lot of work compared to the previous solution, and this accepted work will eventually get processed in the async REST API backend, even when the clients have already closed the connection and sent a new retry. One possible solution to this issue is to limit incoming requests at the HTTP server level with the features that Jetty provides for limiting concurrency. PRs https://github.com/apache/pulsar/pull/14353 and https://github.com/apache/pulsar/pull/15637 added this support to Pulsar. The values might have to be tuned to much lower values to prevent issues in practice.

This is not a complete solution for the REST API backend. It would also be useful to have a solution that cancels downstream requests made on behalf of incoming HTTP requests that no longer exist because the client stopped waiting for the response (a rough sketch of this idea is at the end of this mail, below the quoted message). The main downstream requests are towards the metadata store. It might also be necessary to limit the number of outstanding downstream requests, although with batching in the metadata store, that might not be an issue.

I hope it was useful to share some thoughts around backpressure in Pulsar. I'm looking forward to learning more from others about the design for backpressure in the Pulsar broker. Please share your thoughts too!

-Lari

On Tue, Sep 13, 2022 at 7:36 AM mattison chao <mattisonc...@apache.org> wrote:
>
> Hi, All
>
> Since Pulsar has many asynchronous operations, we need to talk about
> backpressure. Because this is something to watch out for in async and I've
> seen comments about it in some other PRs.
>
> I think we need to open a separate discussion to collect the problems we
> are having and the solutions we need to provide.
>
>
> Best.
> Mattison
>
> FYI:
> https://github.com/apache/pulsar/issues/16998#issuecomment-1244833965
> https://github.com/apache/pulsar/pull/17349#discussion_r964399426
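P.S. Here is the rough sketch of the request-cancellation idea mentioned above. It is only an illustration of the approach, not Pulsar code: the resource class and lookupFromMetadataStore() are hypothetical, and ConnectionCallback support for detecting client disconnects depends on the JAX-RS container, so it would need to be verified against Pulsar's Jetty/Jersey setup.

import java.util.concurrent.CompletableFuture;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.container.AsyncResponse;
import javax.ws.rs.container.ConnectionCallback;
import javax.ws.rs.container.Suspended;

@Path("/example")
public class ExampleResource {

    @GET
    public void get(@Suspended AsyncResponse asyncResponse) {
        // Hypothetical async downstream call, e.g. towards the metadata store.
        CompletableFuture<String> downstream = lookupFromMetadataStore();

        // If the container reports that the client has disconnected, cancel the
        // pending future so no more work is chained for a response nobody reads.
        // Note: cancel() only completes the future exceptionally; it does not
        // abort work that is already running downstream.
        asyncResponse.register((ConnectionCallback) disconnected -> downstream.cancel(false));

        downstream.whenComplete((value, error) -> {
            if (error != null) {
                asyncResponse.resume(error);
            } else {
                asyncResponse.resume(value);
            }
        });
    }

    private CompletableFuture<String> lookupFromMetadataStore() {
        // Placeholder standing in for a real async metadata store lookup.
        return CompletableFuture.completedFuture("value");
    }
}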