Hi Jun,

Coming back to your question regarding the differences between the token bucket algorithm and our current quota mechanism: I ran some tests and they confirmed my initial intuition that our current mechanism does not work well with a bursty workload. Let me try to illustrate the difference with an example. One important aspect to keep in mind is that we don't want to reject the request that exhausts the quota: it is still admitted, and only subsequent requests are rejected until the quota recovers.
Let's say that we want to guarantee an average rate R = 5 partitions/sec while allowing a burst of B = 500 partitions.

With our current mechanism, this translates to the following parameters:
- Quota = 5
- Samples = B / R + 1 = 101 (to allow the burst)
- Time Window = 1s (the default)

Now, let's say that a client wants to create 7 topics with 80 partitions each at time T, i.e. 560 partitions in one shot. This brings the measured rate to 5.6 (7 * 80 / 100, over the 100-sec window defined by the samples), which is above the quota, so any new request is rejected until the rate falls back to R. In theory, the client should only have to wait 12 secs ((5.6 - 5) / 5 * 100) for that. In practice, because the samples are so sparse (a single sample worth 560), the rate won't decrease until that sample is dropped, and that only happens after 101 secs. It gets worse as the burst is increased.

With the token bucket algorithm, this translates to the following parameters:
- Rate = 5
- Tokens = 500

The same request decreases the number of available tokens to -60 (500 - 560), which is below 0, so any new request is rejected until the number of available tokens gets back above 0. This takes roughly 12 secs (60 / 5). The token bucket algorithm is therefore better suited to bursty workloads, which is exactly our case here.

I hope that this example helps to clarify the choice.
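To make the arithmetic concrete, here is a rough, self-contained Java sketch of the two behaviours. It only illustrates the idea, not our actual implementation, and all names in it (QuotaComparison, windowedRate, TokenBucket, acceptAndThrottleMs) are made up:

public class QuotaComparison {

    // Windowed-rate quota: the measured rate is the sum of the samples
    // divided by the window length. With 101 samples of 1 sec each, the
    // effective window is 100 secs, so a burst of 560 recorded at time T
    // keeps the measured rate at 5.6 until that sample is dropped.
    static double windowedRate(double burst, int samples, double windowSec) {
        return burst / ((samples - 1) * windowSec);
    }

    // Token bucket: tokens refill at `rate` per sec up to `capacity`; one
    // request may drive the balance negative, and subsequent requests are
    // rejected until the balance is positive again.
    static final class TokenBucket {
        final double rate;     // R = 5 tokens/sec
        final double capacity; // B = 500 tokens
        double tokens;
        long lastMs;

        TokenBucket(double rate, double capacity, long nowMs) {
            this.rate = rate;
            this.capacity = capacity;
            this.tokens = capacity;
            this.lastMs = nowMs;
        }

        // Admit the request (we never reject the one that exhausts the
        // quota) and return how long the client must wait before the
        // balance is back above zero.
        long acceptAndThrottleMs(double cost, long nowMs) {
            tokens = Math.min(capacity, tokens + rate * (nowMs - lastMs) / 1000.0);
            lastMs = nowMs;
            tokens -= cost;
            return tokens >= 0 ? 0 : (long) Math.ceil(-tokens / rate * 1000.0);
        }
    }

    public static void main(String[] args) {
        // Current mechanism: the burst pins the rate at 5.6 for ~101 secs.
        System.out.printf("windowed rate after burst: %.1f%n",
                windowedRate(7 * 80, 101, 1.0));

        // Token bucket: the same burst leaves the bucket at -60 tokens,
        // which is paid back in 60 / 5 = 12 secs.
        TokenBucket bucket = new TokenBucket(5, 500, 0);
        long throttleMs = bucket.acceptAndThrottleMs(7 * 80, 0);
        System.out.printf("tokens: %.0f, throttle: %d ms%n", bucket.tokens, throttleMs);
    }
}

Running it prints a measured rate of 5.6 for the windowed mechanism, and a balance of -60 tokens with a 12000 ms throttle for the token bucket, matching the numbers above.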
Best,
David

On Tue, May 12, 2020 at 3:19 PM Tom Bentley <tbent...@redhat.com> wrote:
> Hi David,
>
> Thanks for the reply.
>
> >> If I understand the proposed throttling algorithm, an initial request
> >> would be allowed (possibly making K negative) and only subsequent
> >> requests (before K became positive) would receive the QUOTA_VIOLATED.
> >> That would mean it was still possible to block the controller from
> >> handling other events – you just need to do so via making one big
> >> request.
> >
> > That is correct. One could still create one big topic (not request) and
> > that would create some load on the controller. All options suffer from
> > this issue as we can't stop clients from creating a very large topic.
> > At least, when it happens, the client will have to wait to pay back its
> > credits, which guarantees that we control the average load on the
> > controller.
>
> I can see that the admission throttling is better than nothing. It's just
> that it doesn't fully solve the problem described in the KIP's motivation.
> That doesn't mean it's not worth doing. I certainly prefer this approach
> over that taken by KIP-578 (which cites the effects of deleting a single
> topic with many partitions).
>
> >> While the reasons for rejecting execution throttling make sense given
> >> the RPCs we have today, that seems to be at the cost of still allowing
> >> harm to the cluster, or did I misunderstand?
> >
> > Execution throttling would also suffer from large topics being created.
> > We have rejected it due to the current RPCs and also because it does not
> > limit the amount of work queued up in the controller. Imagine a low
> > quota; that would result in a huge backlog of pending operations.
>
> What exactly is the problem with having a huge backlog of pending
> operations? I can see that the backlog would need persisting so that the
> controller could change without losing track of the topics to be mutated,
> and the mutations would need to be submitted in batches to the controller
> event queue (thus allowing other controller requests to be interleaved).
> I realise this is not feasible right now; I'm just trying to understand if
> it's feasible at all and if there's any appetite for making the requisite
> API changes in the future in order to prevent these problems even for
> large single requests.
>
> Kind regards,
>
> Tom