Hi Girish,

Replies inline.
> > The current state of rate limiting is not acceptable in Pulsar. We
> > need to fix things in the core.
> >
>
> I wouldn't say that it's not acceptable. The precise one works as expected
> as a basic rate limiter. Its only when there are complex requirements, the
> current rate limiters fail.

Just to clarify, I consider the situation with the default rate limiter to be suboptimal. The CPU overhead is significant. You must explicitly enable the "precise" rate limiters to resolve this issue, and that is not obvious to a Pulsar user. I don't think that this situation makes sense. If there's an abstraction of a rate limiter, it should be efficient and usable in the default configuration of Pulsar.

> The key here being "complex requirement". I am with Rajan here that
> whatever improvements we do to the core built-in rate limiter, would always
> miss one or the other complex requirement.

I haven't yet seen very complex requirements that relate directly to the rate limiter. The scope could expand to the area of capacity management, and I'm pretty sure that it will get there as we go further.

Capacity management is a broader concern than rate limiting. We all know that capacity management is necessary in multi-tenant systems. Rate limiting and throttling are one way to handle that. When going to more complex requirements, it might be useful to go beyond rate limiting in the conceptual design as well.

For example, DynamoDB has the concept of capacity units (CU). DynamoDB's conceptual design for capacity management is well described in the paper "Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service" and a related presentation [1]. There are also other related blog posts, such as "Surprising Scalability of Multitenancy" [2] and "The Road To Serverless: Multi-tenancy" [3], which have been inspirational to me.
The paper "Kora: A Cloud-Native Event Streaming Platform For Kafka" [4] is also a very useful read to learn about serverless capacity management. Capacity management goes beyond rate limiting since it has a tight relation to end-to-end flow control, load balancing, service levels and possible auto-scaling solutions. One of the goals of capacity management in a multi-tenant system is to address the dreaded "noisy neighbor" problem in a cost optimal and efficient way.

1 - https://www.usenix.org/conference/atc22/presentation/elhemali
2 - https://brooker.co.za/blog/2023/03/23/economics.html
3 - https://me.0xffff.me/dbaas3.html
4 - https://www.vldb.org/pvldb/vol16/p3822-povzner.pdf

>
> I feel like we need both the things here. We can work on improving the
> built in rate limiter which does not try to solve all of my needs like
> blacklisting support, limiting the number of bursts in a window etc. The
> built in one can be improved with respect to code refactoring,
> extensibility, modularization etc. along with implementing a more well
> known rate limiting algorithm, like token bucket.
> Along with this, the newly improved interface can now make way for
> pluggable support.

Yes, I agree. Improving the rate limiter and exposing an interface aren't exclusive.

> This is assuming that we do not improve the default build in rate limiter
> at all. Think of this like AuthorizationProvider - there is a built in one.
> it's good enough, but each organization would have its own requirements wrt
> how they handle authorization, and thus, most likely, any organization with
> well defined AuthN/AuthZ constructs would be plugging their own providers
> there.

I don't think that this is a valid comparison. The current rate limiter has an explicit "contract". The user sets the maximum rate in bytes and/or messages and the rate limiter takes care of enforcing that limit. It's hard to see why that "contract" would have too many interpretations of what it means.
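
To make that "contract" concrete, here is a minimal token bucket sketch in Java. It is only an illustration under my own assumptions, not Pulsar's actual rate limiter code: the class name TokenBucket, the tryConsume method and the explicit nowNanos clock parameter are all hypothetical and chosen to keep the example deterministic.

```java
/** Illustrative token bucket; hypothetical names, not a Pulsar API. */
final class TokenBucket {
    private static final long NANOS_PER_SECOND = 1_000_000_000L;

    private final long capacity;      // maximum burst size, in tokens
    private final long ratePerSecond; // refill rate, in tokens per second
    private long tokens;              // tokens currently available
    private long lastRefillNanos;     // time up to which tokens were credited

    TokenBucket(long capacity, long ratePerSecond, long nowNanos) {
        this.capacity = capacity;
        this.ratePerSecond = ratePerSecond;
        this.tokens = capacity; // start with a full bucket
        this.lastRefillNanos = nowNanos;
    }

    /** Takes 'amount' tokens if available; returns false if the caller should be throttled. */
    synchronized boolean tryConsume(long amount, long nowNanos) {
        refill(nowNanos);
        if (tokens >= amount) {
            tokens -= amount;
            return true;
        }
        return false;
    }

    private void refill(long nowNanos) {
        long elapsedNanos = nowNanos - lastRefillNanos;
        long newTokens = elapsedNanos * ratePerSecond / NANOS_PER_SECOND;
        if (newTokens > 0) {
            // Cap at capacity so that idle time cannot build up an unbounded burst.
            tokens = Math.min(capacity, tokens + newTokens);
            // Advance only by the time that produced whole tokens,
            // so the fractional remainder carries over to the next call.
            lastRefillNanos += newTokens * NANOS_PER_SECOND / ratePerSecond;
        }
    }
}
```

In this sketch, a limit on both messages and bytes would simply be two such buckets (one counting messages, one counting bytes) that both have to admit the publish. That is an assumption about how the contract could be expressed, not a description of the existing code.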
Another reason is that I haven't seen any other messaging product where there would be a need to add support for a user-provided rate limiter algorithm. What makes Pulsar such a special case that it would be needed?

For authentication and authorization, it's a completely different story. The abstractions require that you pick a specific implementation for your way of doing authentication and authorization. Many other systems out there do it in a somewhat similar way as Pulsar.

> The key thing here would be to make the public interface as minimal as
> possible while allowing for custom complex rate limiters as well. I believe
> this is easily doable without actually making the internal code more
> complex.

It's doable, but a different question is whether this is necessary in the end. We'll see over time how we can improve the Pulsar core rate limiter and whether there's a need to override it.

The current interfaces will change when the Pulsar core rate limiter is improved. This work won't easily meet in the middle unless we start by improving the core rate limiter.

What Pulsar does right now might have to change. In Pulsar, there's no actual control of the client application, like there is in Kafka Quotas [1]. You mentioned in your complex requirements that you want to block individual producers. That isn't supported in the current model of rate limiters in Pulsar. Does it even belong there? There are many questions to cover when extending the current model. It's not only about a "pluggable rate limiter" interface.

1 - https://docs.confluent.io/kafka/design/quotas.html

> In fact, the way I envision the pluggable proposal, things become simpler
> with respect to code flow, code ownership and custom if/else.

Yes, that might work for you. For the Pulsar open source project, it's better to have a high quality rate limiter implementation which covers the majority of use cases. I don't see why this wouldn't be possible.
Your expectations of what a pluggable rate limiter can do haven't been defined. Adding support for doing more than what a rate limiter does would be a larger change.

It's hard to see why there would be a need for a better variation of the token bucket algorithm when many of the most advanced systems (such as network routers) use the token bucket algorithm for rate limiting. It's always possible that the messaging domain is very distinct...

> I will give another example here, In big organizations, where pulsar is
> actually being used at scale - and not just in terms of QPS/MBps, but also
> in terms of number of teams, tenants, namespaces, number of unique features
> being used, there always would be an in-house schema registry. Thus, while
> pulsar already has a built in schema service and registry - it is important
> that it also supports custom ones. This does not speak badly about the
> based pulsar package, but actually speaks more about the adaptability of
> the product.

Sure, that makes sense for such features. I just don't see what you see with rate limiting and the variations that organizations need for it. My opinion might change once I learn more about your use case and what the "complex requirements" are that cannot be covered by the core functionality of a rate limiter.

I noticed that a new message arrived while I was writing this email. Perhaps it contains more details about your specific requirements and how they could impact the design?

> I really respect and appreciate the discussions we have had. One of the
> problems I've had in the pulsar community earlier is the lack of
> participation. But I am getting a lot of participation this time, so its
> really good.

Thanks. Yes, we need a more active Pulsar community. It's improving every day.
:)

> I am willing to take this forward to improve the default rate limiters, but
> since we would _have_ to meeting somewhere in the middle, at the end of all
> this - our organizational requirements would still remain unfulfilled until
> we build _all_ of the things that I have spoken about.

Exactly, that's the challenge that I referred to above with the pluggable interface design. Meeting in the middle won't happen easily unless there's a focus on improving the core rate limiter at least simultaneously.

> Just as Rajan was late to the discussion and he pointed out that they also
> needed custom rate limiter a while back, there may he others who are either
> unknownst to this mailing list, or are yet to look into rate limiter all
> together who may find that whatever we have built/modified/improved is
> still lacking.

It will be interesting to hear what the details of Rajan's requirements for a custom rate limiter were. Perhaps those requirements could also be taken into account in improving the default core rate limiter?

> I personally would like to work on improving the interface and making it
> pluggable first. This is a smaller task both from a design and coding
> perspective. Meanwhile, I will create another proposal for improving the
> built in rate limiter. Since we have had a lot of discussion about how to
> improve the rate limiter, and I will continue discussing which rate limiter
> works best in my opinion, I think we can be in a liberty to take a bit of
> extra time in discussing and closing the improved rate limiter design. Of
> course, I will keep the interface definition in mind while proposing the
> improved rate limiter and vice versa.

I accept that you take that perspective, and it's understandable that it seems like the straightest path for you. We'll see how things go, and eventually we will all adapt along the way. Anyway, we are making progress at the moment. :)

-Lari