Hello Lari, inline once again.
On Mon, Nov 6, 2023 at 5:44 PM Lari Hotari <lhot...@apache.org> wrote:

> Hi Girish,
>
> Replies inline. We are getting into a very detailed discussion. We
> could also discuss this topic in one of the upcoming Pulsar Community
> meetings. However, I might miss the next meeting that is scheduled
> this Thursday.

Is this every Thursday? I am willing to meet at a separate time as well if enough folks with a viewpoint on this can meet together. I assume that the community meeting has a much bigger agenda, with detailed discussions not possible?

> Although I am currently opposing to your proposal PIP-310, I am
> supporting solving your problems related to rate limiting. :)
> Let's continue the discussion since that is necessary so that we could
> make progress. I hope this makes sense from your perspective.

It is all good, as long as the final goal is met within reasonable timelines.

> I acknowledge that there are different usages, but my assumption is
> that we could implement a generic solution that could be configured to
> handle each specific use case.
> I haven't yet seen any evidence that the requirements in your case are
> so special that it justifies adding a pluggable interface for rate
> limiters.

Well, the blacklisting use case is a very specific use case. I am explaining below why that can't be done using metrics and a separate blacklisting API.

> Exposing yet another pluggable interface in Pulsar will add
> complexity without gains. Each supported public interface is a
> maintenance burden if we care about the quality of the exposed
> interfaces and put effort in ensuring that the interfaces are
> supported in future versions. Exposing an interface will also lock
> down or slow down some future refactorings.

This might actually be a blessing in disguise: at least for RateLimiter and PublishRateLimiter.java, being internal interfaces, they have grown out of hand and unchecked. Explained more below.

> One concrete example of this is the desired behavior of bursting. In
> token bucket rate limiting, bursting is about using the buffered
> tokens in the "token bucket" and having a configurable limit for the
> buffer (the "bucket"). This buffer will usually only contain tokens
> when the actual rate has been lower than the configured maximum rate
> for some duration.
>
> However, there could be an expectation for a different type of
> bursting which is more like "auto scaling" of the rate limit in a way
> where the end-to-end latency of the produced messages
> is taken into account. The expected behavior might be about scaling
> the rate temporarily to a higher rate so that the queues can be

I would like to keep auto-scaling out of scope for this discussion. That opens up another huge can of worms, especially given the gaps in proper scale-down support in Pulsar.

> I don't know what "bursting" means for you. Would it be possible to
> provide concrete examples of desired behavior? That would be very
> helpful in making progress.

Here are a few different use cases:

- A producer (or producers) is producing at a near constant rate into a topic, with equal distribution among partitions. Due to a hiccup in their downstream component, the produce rate goes to 0 for a few seconds, and thus, to compensate, the produce rate tries to double up for the next few seconds.
- In a visitor-based produce rate (where the produce rate goes up during the day and down at night, think popular-website hourly view count patterns), there are cases when, due to certain external or internal triggers, the views - and thus the produce rate - spike for a few minutes. It is also important to keep this in check so as to not allow bots to DDoS your system; while that might be the responsibility of an upstream system like an API gateway, we cannot be completely ignorant about it either.
- In streaming systems with micro batches, there might be constant fluctuations in the produce rate from time to time, based on batch failures or retries.

In all of these situations, setting the throughput of the topic to the absolute maximum of the various spikes observed during the day is very suboptimal. Moreover, once bursting support is present in the system, it would also need proper checks in place to penalize producers that try to misuse it. In a true multi-tenant platform, this is very critical. Thus, blacklisting actually goes hand in hand here. Explained more below.
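To make the token-bucket style of bursting concrete, here is a minimal sketch (not the actual Pulsar implementation; the class and parameter names are made up for illustration) of a bucket whose capacity is larger than its steady refill rate, so tokens accumulated during a quiet period can be spent as a short burst afterwards:

```java
// Minimal sketch only; not Pulsar code. Names are hypothetical.
public class BurstableTokenBucket {
    private final long refillRatePerSecond; // steady-state publish rate
    private final long capacity;            // bucket size; capacity > rate permits short bursts
    private double tokens;
    private long lastRefillNanos;

    public BurstableTokenBucket(long refillRatePerSecond, long capacity) {
        this.refillRatePerSecond = refillRatePerSecond;
        this.capacity = capacity;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    /** Returns true if {@code permits} messages may be published now. */
    public synchronized boolean tryAcquire(long permits) {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        // Refill with tokens earned since the last call, capped at the bucket capacity.
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillRatePerSecond);
        lastRefillNanos = now;
        if (tokens >= permits) {
            tokens -= permits;
            return true;
        }
        return false;
    }
}
```

With something like `new BurstableTokenBucket(1000, 5000)`, the topic sustains 1000 msg/s, but after a few idle seconds (the downstream hiccup in the first use case above) the producer could catch up with up to 5000 messages at once without being throttled.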
> It's interesting that you mention that you would like to improve the
> PublishRateLimiter interface.
> How would you change it?

The current PublishRateLimiter interface has duplicate methods. I am assuming that after the initial (poller-based) implementation, the next implementation simply added more methods to the interface rather than using the ones already there. For instance, there are both `tryAcquire` and `incrementPublishCount` methods, and there are both `checkPublishRate` and `isPublishRateExceeded`. Then there is the issue of misplaced responsibilities: in the precise case, the complete responsibility of checking and responding with whether a produce is rate limited lies with `PrecisePublishLimiter.java`, but in the poller-based case, some of the logic sits inside `PublishRateLimiterImpl` and the rest is spread across `AbstractTopic.java`.

> > short and medium term blacklisting of topics based on breach of rate
> > limiter beyond a given SOP. I feel this is very very specific to our
> > organization right now to be included inside pulsar itself.
>
> This is outside of rate limiters. IIRC, there have been some
> discussions in the community that an API for blocking individual
> producers or producing to specific topics could be useful in some
> cases.
> An external component could observe metrics and control blocking if
> there's an API for doing so.

Actually, putting this in an external component driven by metrics is neither a scalable nor a responsive solution. First, it puts a lot of pressure on the metrics system (Prometheus), where we would now be querying thousands of metrics every minute or sub-minute, mostly uselessly: since the majority of the time not even a single topic may need blacklisting, this is very inefficient. Second, that external component would need to stay in sync with the existing topics and the rates set inside Pulsar's ZooKeeper, which also puts extra pressure on ZK reads. Lastly, the response time for blacklisting a topic increases a lot in this approach. This would be a much simpler and more efficient model if it were reactive, triggered directly from within the rate limiter component. It can then be fast and responsive, which is critical when trying to protect the system from abuse.
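To illustrate what "reactive, triggered from within the rate limiter" could look like, here is a minimal sketch; `RateLimitBreachListener`, `onBreach` and `BlacklistingBreachListener` are hypothetical names, none of this exists in Pulsar today:

```java
import java.time.Duration;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only; not an existing Pulsar interface.
public interface RateLimitBreachListener {
    // Invoked by the rate limiter itself when a topic keeps exceeding its
    // configured rate beyond the allowed burst duration.
    void onBreach(String topic, double observedRate, double configuredRate, Duration breachDuration);
}

class BlacklistingBreachListener implements RateLimitBreachListener {
    private static final Duration MAX_BREACH = Duration.ofSeconds(30);
    private final Set<String> blacklisted = ConcurrentHashMap.newKeySet();

    @Override
    public void onBreach(String topic, double observedRate, double configuredRate,
                         Duration breachDuration) {
        // React in-process: no Prometheus scraping, no re-reading topic policies from ZooKeeper.
        if (breachDuration.compareTo(MAX_BREACH) > 0 && blacklisted.add(topic)) {
            // e.g. disconnect producers / reject publishes for this topic for a cool-off period
        }
    }
}
```

The point of the sketch is only the control flow: the decision is made at the moment of the breach, with the observed rate and the configured policy already in memory, instead of a separate service polling metrics and topic policies on a schedule.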
> > Actually, by default, and for 99% of the cases, multiplexing isn't an
> > issue assuming:
> > * A single producer object is producing to a single topic (one or more
> >   partition)
> > * Produce is happening in a round robin manner (by default)
>
> Connection multiplexing issue is also a problem in the cases you
> listed since multiple partitions might be served by the same broker
> and the connection to that broker could be shared. Producing in a
> round-robin manner does not eliminate the issue because the sending
> process is asynchronous in the background. Therefore, it is a problem

It's only async if the producers use `sendAsync`. Moreover, even if it is async in nature, in practice it ends up being quite linear and well-homogenised. I am speaking from the experience of running thousands of partitioned topics in production.

> Let's continue in getting into more details of the intended behavior
> and define what bursting really means and how we believe that solves a
> problem with "hot source" producers.
> Makes sense?

I am happy if the discussion concludes either way - a pluggable implementation or a configurable rate limiter that captures 99% of the use cases. But from what I've seen, participation in OSS threads can be very random, and thus I am afraid it might take a while before more folks pitch in their inputs and a clear direction for the discussion forms.

--
Girish Sharma