Hi Tom,

>> What exactly is the problem with having a huge backlog of pending
>> operations? I can see that the backlog would need persisting so that the
>> controller could change without losing track of the topics to be mutated,
>> and the mutations would need to be submitted in batches to the controller
>> event queue (thus allowing other controller requests to be interleaved). I
>> realise this is not feasible right now, I'm just trying to understand if
>> it's feasible at all and if there's any appetite for making the requisite
>> API changes in the future in order to prevent these problems even for large
>> single requests.
It is definitely feasible. My concern with the approach is about the way our
current API works. Let me try to illustrate it with an example.

When the admin client sends a CreateTopicsRequest to the controller, the
request sits in the purgatory until all the topics are created or the timeout
specified in the request is reached. If the timeout is reached, a
RequestTimeoutException is returned to the client and is used to fail the
future of the caller. In parallel, the admin client fails any pending request
with a TimeoutException once the request timeout (30s by default) is reached.
In the former case, the caller will likely retry; in the latter case, the
admin client retries automatically. In both cases, the broker will respond to
the retry with a TopicExistsException.

Having a huge backlog of pending operations would amplify this behavior:
clients would tend to get TopicExistsException errors when creating topics
for the first time, which is really confusing. I think that our current API
is not well suited for this. An asynchronous workflow, with one API to
create/delete and another one to query the completion status, would be better
suited. We can definitely evolve our API in this direction, but we need to
figure out a compatibility story for existing clients.

Another aspect is fairness among clients. Imagine a client that continuously
creates and deletes topics in a tight loop. This would flood the queue and
delay the creations and deletions of the other clients. Throttling at
admission time mitigates this directly; throttling at execution time would
need to take it into account explicitly to ensure fairness. That is harder to
do in the controller because the controller is completely agnostic to
principals and client ids.

These reasons made me lean towards the current proposal. Does that make
sense?
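For illustration, here is a minimal Python sketch of the admission-time
throttling discussed in this thread (illustrative only, not Kafka's actual
implementation; class and method names are invented): a token bucket that
admits a burst even when it drives the balance negative, then rejects with a
throttle time until the credits have been paid back.

```python
class TokenBucket:
    """Toy admission-time throttle: a bucket refilled at `rate` tokens/sec
    and capped at `burst` tokens. A request is admitted while the balance
    is positive, even if its cost drives the balance negative; subsequent
    requests are rejected until the deficit has been repaid at `rate`."""

    def __init__(self, rate, burst):
        self.rate = float(rate)    # average mutations/sec allowed
        self.burst = float(burst)  # maximum accumulated credit
        self.tokens = float(burst)
        self.last = 0.0            # time of the last refill, in seconds

    def _refill(self, now):
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_admit(self, cost, now):
        """Return (admitted, throttle_seconds)."""
        self._refill(now)
        if self.tokens > 0:
            self.tokens -= cost    # the whole burst is admitted at once
            return True, 0.0
        # Rejected (QUOTA_VIOLATED): report how long the client should
        # wait until the balance is positive again.
        return False, -self.tokens / self.rate


# Rate=5, Tokens=500, and one burst of 560 partitions, matching the
# numbers quoted later in this thread:
bucket = TokenBucket(rate=5, burst=500)
print(bucket.try_admit(560, now=0.0))   # (True, 0.0): balance drops to -60
print(bucket.try_admit(1, now=6.0))     # (False, 6.0): 30 tokens still owed
print(bucket.try_admit(1, now=12.1))    # (True, 0.0): deficit repaid
```

The throttle time returned on rejection is what lets a well-behaved client
back off for exactly as long as it takes to repay its credits.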
Best,
David

On Wed, May 13, 2020 at 10:05 AM David Jacot <dja...@confluent.io> wrote:

> Hi Jun,
>
> Coming back to your question regarding the differences between the token
> bucket algorithm and our current quota mechanism. I did some tests and
> they confirmed my first intuition that our current mechanism does not work
> well with a bursty workload. Let me try to illustrate the difference with
> an example. One important aspect to keep in mind is that we don't want to
> reject requests when the quota is exhausted.
>
> Let's say that we want to guarantee an average rate R=5 partitions/sec
> while allowing a burst B=500 partitions.
>
> With our current mechanism, this translates to the following parameters:
> - Quota = 5
> - Samples = B / R + 1 = 101 (to allow the burst)
> - Time Window = 1s (the default)
>
> Now, let's say that a client wants to create 7 topics with 80 partitions
> each at the time T. It brings the rate to 5.6 (7 * 80 / 100), which is
> above the quota, so any new request is rejected until the measured rate
> gets back down to R. In theory, the client must wait 12 secs
> ((5.6 - 5) / 5 * 100) to get it back to R. In practice, due to the sparse
> samples (one sample worth 560), the rate won't decrease until that sample
> is dropped, and that only happens after 101 secs. It gets worse if the
> burst is increased.
>
> With the token bucket algorithm, this translates to the following
> parameters:
> - Rate = 5
> - Tokens = 500
>
> The same request decreases the number of available tokens to -60, which is
> below 0, so any new request is rejected until the number of available
> tokens gets back above 0. This takes 12 secs ((60 + 1) / 5).
>
> The token bucket algorithm is more suited for bursty workloads, which is
> our case here. I hope that this example helps to clarify the choice.
>
> Best,
> David
>
> On Tue, May 12, 2020 at 3:19 PM Tom Bentley <tbent...@redhat.com> wrote:
>
>> Hi David,
>>
>> Thanks for the reply.
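The arithmetic in the sampled-quota example quoted above can be checked with
a toy model (a deliberate simplification; Kafka's actual Rate/SampledStat
logic differs in details such as window alignment and elapsed-time handling):

```python
WINDOW_SEC = 1.0    # Time Window = 1s, as in the example above
NUM_SAMPLES = 101   # Samples = B / R + 1 = 101

samples = []        # (timestamp, partitions) for each recorded mutation

def record(now, value):
    samples.append((now, value))

def measured_rate(now):
    # A sample keeps contributing to the measured rate until it ages out
    # of the NUM_SAMPLES * WINDOW_SEC span and is purged. The rate is
    # measured over the full (NUM_SAMPLES - 1) * WINDOW_SEC = 100s span.
    live = sum(v for t, v in samples if now - t < NUM_SAMPLES * WINDOW_SEC)
    return live / ((NUM_SAMPLES - 1) * WINDOW_SEC)

record(0.0, 7 * 80)           # 7 topics x 80 partitions at time T
print(measured_rate(0.5))     # 5.6: above the quota of 5, so throttled
print(measured_rate(100.0))   # 5.6: the one big sample still lingers
print(measured_rate(101.5))   # 0.0: only now has the sample been dropped
```

This reproduces the behaviour described above: with one sparse sample worth
560, the measured rate stays pinned at 5.6 until the sample is purged after
101 seconds, instead of decaying back to R after 12 seconds.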
>> >> If I understand the proposed throttling algorithm, an initial request
>> >> would be allowed (possibly making K negative) and only subsequent
>> >> requests (before K became positive) would receive the QUOTA_VIOLATED.
>> >> That would mean it was still possible to block the controller from
>> >> handling other events – you just need to do so via making one big
>> >> request.
>> >
>> > That is correct. One could still create one big topic (not request) and
>> > that would create some load on the controller. All options suffer from
>> > this issue as we can't stop clients from creating a very large topic.
>> > At least, when it happens, the client will have to wait to pay back its
>> > credits, which guarantees that we control the average load on the
>> > controller.
>>
>> I can see that the admission throttling is better than nothing. It's just
>> that it doesn't fully solve the problem described in the KIP's motivation.
>> That doesn't mean it's not worth doing. I certainly prefer this approach
>> over that taken by KIP-578 (which cites the effects of deleting a single
>> topic with many partitions).
>>
>> >> While the reasons for rejecting execution throttling make sense given
>> >> the RPCs we have today, that seems to be at the cost of still allowing
>> >> harm to the cluster, or did I misunderstand?
>> >
>> > Execution throttling would also suffer from large topics being created.
>> > We have rejected it due to the current RPCs and also because it does
>> > not limit the amount of work queued up in the controller. Imagine a low
>> > quota; that would result in a huge backlog of pending operations.
>>
>> What exactly is the problem with having a huge backlog of pending
>> operations?
>> I can see that the backlog would need persisting so that the controller
>> could change without losing track of the topics to be mutated, and the
>> mutations would need to be submitted in batches to the controller event
>> queue (thus allowing other controller requests to be interleaved). I
>> realise this is not feasible right now, I'm just trying to understand if
>> it's feasible at all and if there's any appetite for making the requisite
>> API changes in the future in order to prevent these problems even for
>> large single requests.
>>
>> Kind regards,
>>
>> Tom
>>