Hi Tom,

Thanks for the feedback.

>> If I understand the proposed throttling algorithm, an initial request would
>> be allowed (possibly making K negative) and only subsequent requests
>> (before K became positive) would receive the QUOTA_VIOLATED. That would
>> mean it was still possible to block the controller from handling other
>> events – you just need to do so via making one big request.

That is correct. One could still create one big topic (not request) and that
would create some load on the controller. All options suffer from this issue,
as we can't stop clients from creating a very large topic. At least, when it
happens, the client will have to wait to pay back its credits, which
guarantees that we control the average load on the controller.
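To make that behavior concrete, here is a rough sketch of the accounting I
have in mind (all names are illustrative only, this is not the actual
implementation):

// Illustrative only: a token bucket with rate R (mutations/sec) and burst B.
// Credits (the K in the discussion above) may go negative, so a single large
// request is always admitted once, and subsequent requests are rejected
// until the credits have been paid back.
class PartitionMutationBucket {
    private final double rate;   // R, in partition mutations per second
    private final double burst;  // B, the maximum credit that can be accumulated
    private double credits;
    private long lastUpdateMs;

    PartitionMutationBucket(double rate, double burst, long nowMs) {
        this.rate = rate;
        this.burst = burst;
        this.credits = burst;
        this.lastUpdateMs = nowMs;
    }

    // Returns 0 if the mutations are admitted, otherwise the time in ms that
    // the client must wait before retrying (it gets QUOTA_VIOLATED back).
    synchronized long tryRecord(int mutations, long nowMs) {
        // Refill credits based on the elapsed time, capped at the burst.
        credits = Math.min(burst, credits + rate * (nowMs - lastUpdateMs) / 1000.0);
        lastUpdateMs = nowMs;
        if (credits <= 0) {
            // Still paying back a previous (possibly large) request: reject.
            return (long) Math.ceil(-credits / rate * 1000.0);
        }
        // Admit the request even if it drives the credits negative.
        credits -= mutations;
        return 0;
    }
}

The real implementation will differ, but the important property is that the
bucket can go negative, so a single large request is admitted once and the
time to pay it back can be computed exactly from the rate.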
>> While the reasons for rejecting execution throttling make sense given the
>> RPCs we have today that seems to be at the cost of still allowing harm to
>> the cluster, or did I misunderstand?

Execution throttling would also suffer from large topics being created. We
have rejected it due to the current RPCs and also because it does not limit
the amount of work queued up in the controller. Imagine a low quota: that
would result in a huge backlog of pending operations.

Best,
David

On Mon, May 11, 2020 at 4:57 PM David Jacot <dja...@confluent.io> wrote:

> Hi Anna and Jun,
>
> Anna, thanks for your thoughtful feedback. Overall, I agree with what you
> said. If I summarize, you said that using time on server threads is not
> easier to tune than a rate based approach and that it does not really
> capture all of the load either, as the control requests are not taken into
> account.
>
> I have been thinking about using an approach similar to the request quota
> and here are my thoughts:
>
> 1. With the current architecture of the controller, the ZK writes
> resulting from a topic being created, expanded or deleted are spread among
> the api handler threads and the controller thread. This makes it
> problematic to measure the real thread usage. I suppose that this will
> change with KIP-500 though.
>
> 2. I do agree with Anna that we don't only care about requests tying up
> the controller thread but also care about the control requests. A rate
> based approach would allow us to define a value which copes with both
> dimensions.
>
> 3. Doing the accounting in the controller requires attaching a principal
> and a client id to each topic, as the quotas are expressed per principal
> and/or client id. I find this a little odd. This information would have to
> be updated whenever a topic is expanded or deleted, and any subsequent
> operations for that topic in the controller would be accounted to the last
> principal and client id. As an example, imagine a client that deletes
> topics which have partitions on a dead broker. In this case, the deletion
> would be retried until it succeeds. If that client has a low quota, that
> may prevent it from doing any new operations until the delete succeeds.
> This is a strange behavior.
>
> 4. One important aspect of the proposal is that we want to be able to
> reject requests when the quota is exhausted. With a usage based approach,
> it is difficult to compute the time to wait before being able to create
> the topics. The only thing that we could do is to ask the client to wait
> until the measured time drops back to the quota to ensure that new
> operations can be accepted. With a rate based approach, we can precisely
> compute the time to wait.
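> To give a concrete (purely illustrative) example: with a quota of 10
> partition mutations/sec, a request that leaves the bucket at -50 means
> that the client has to wait 50 / 10 = 5 seconds before new mutations can
> be accepted, and that is exactly the value we can return as the throttle
> time.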
> 5. I find the experience for users slightly better with a rate based
> approach. When I say users here, I mean the users of the admin client or
> the cli. When throttled, they would get an error saying: "can't create
> because you have reached X partition mutations/sec". With the other
> approach, we could only say: "quota exceeded".
>
> 6. I do agree that the rate based approach is less generic in the long
> term, especially if other resource types are added in the controller.
>
> Altogether, I am not convinced by a usage based approach and I would
> rather lean towards keeping the current proposal.
>
> Best,
> David
>
> On Thu, May 7, 2020 at 2:11 AM Anna Povzner <a...@confluent.io> wrote:
>
>> Hi David and Jun,
>>
>> I wanted to add to the discussion about using requests/sec vs. time on
>> server threads (similar to request quota) for expressing quota for topic
>> ops.
>>
>> I think request quota does not protect the brokers from overload by
>> itself -- it still requires tuning and sometimes re-tuning, because it
>> depends on the workload behavior of all users (like the relative share
>> of requests exempt from throttling). This makes it not that easy to set.
>> Let me give you more details:
>>
>> 1. The amount of work that the user can get from the request quota
>> depends on the load from other users. We measure and enforce the user's
>> clock time on threads---the time between 2 timestamps, one when the
>> operation starts and one when the operation ends. If the user is the
>> only load on the broker, it is less likely that their operation will be
>> interrupted by the kernel to switch to another thread, and time away
>> from the thread still counts.
>>    1. Pros: this makes it more work-conserving; the user is less limited
>>    when there are more resources available.
>>    2. Cons: harder to capacity plan for the user, and it could be
>>    confusing when the broker suddenly stops supporting the load which it
>>    was supporting before.
>>
>> 2. For the above reason, it makes most sense to maximize the user's
>> quota and set it as a percent of the maximum thread capacity (1100 with
>> default broker config).
>>
>> 3. However, the actual maximum thread capacity is not really 1100:
>>    1. Some of it will be taken by requests exempt from throttling, and
>>    the amount depends on the workload. We have seen cases (somewhat
>>    rare) where requests exempt from throttling take like ⅔ of the time
>>    on threads.
>>    2. We have also seen cases of an overloaded cluster (full queues,
>>    timeouts, etc.) due to high request rate while the time used on
>>    threads was way below the max (1100), like 600 or 700 (total exempt +
>>    non-exempt usage). Basically, when a broker is close to 100% CPU, it
>>    takes more and more time for the "unaccounted" work like a thread
>>    getting a chance to pick up a request from the queue and get a
>>    timestamp.
>>
>> 4. As a result, there will be some tuning to decide on a safe value for
>> the total thread capacity, from where users can carve out their quotas.
>> Some changes in users' workloads may require re-tuning if, for example,
>> they dramatically change the relative share of non-exempt load.
>>
>> I think request quota works well for client request load in the sense
>> that it ensures that different users get a fair/proportional share of
>> resources during high broker load. If the user cannot get enough
>> resources from their quota to support their request rate anymore, they
>> can monitor their load and expand the cluster if needed (or rebalance).
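>> To make the measurement in point 1 concrete, it is essentially the
>> following (pseudo-code only, not the actual broker code):
>>
>>   // Wall-clock time between two timestamps is charged to the user's
>>   // quota, so time spent descheduled by the kernel still counts.
>>   static long timeOnThreadNanos(Runnable operation) {
>>       long startNs = System.nanoTime();  // handler thread picks up the operation
>>       operation.run();                   // may be interrupted by the kernel here
>>       return System.nanoTime() - startNs;
>>   }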
>> However, I think using time on threads for topic ops could be even more
>> difficult than simple request rate (as proposed):
>>
>> 1. I understand that we don't only care about topic requests tying up
>> the controller thread, but we also care that they do not create a large
>> extra load on the cluster due to LeaderAndIsr and other related requests
>> (this is more important for small clusters).
>> 2. For that reason, tuning a quota in terms of time on threads can be
>> harder, because there is no easy way to say how this quota would
>> translate to a number of operations (because that would depend on other
>> broker load).
>>
>> Since tuning would be required anyway, I see the following workflow if
>> we express the controller quota in terms of partition mutations per
>> second:
>>
>> 1. Run the topic workload in isolation (the most expensive one, like
>> create topic vs. add partitions) and see how much load it adds based on
>> the incoming rate. Choose the quota depending on how much extra load
>> your cluster can sustain in addition to its normal load.
>> 2. It could be useful to publish some experimental results to give some
>> ballpark numbers to make this sizing easier.
>>
>> I am interested to see if you agree with the listed assumptions here. I
>> may have missed something, especially if there is an easier workflow for
>> setting a quota based on time on threads.
>>
>> Thanks,
>>
>> Anna
>>
>> On Thu, Apr 30, 2020 at 8:13 AM Tom Bentley <tbent...@redhat.com> wrote:
>>
>> > Hi David,
>> >
>> > Thanks for the KIP.
>> >
>> > If I understand the proposed throttling algorithm, an initial request
>> > would be allowed (possibly making K negative) and only subsequent
>> > requests (before K became positive) would receive the QUOTA_VIOLATED.
>> > That would mean it was still possible to block the controller from
>> > handling other events – you just need to do so via making one big
>> > request.
>> >
>> > While the reasons for rejecting execution throttling make sense given
>> > the RPCs we have today, that seems to be at the cost of still allowing
>> > harm to the cluster, or did I misunderstand?
>> >
>> > Kind regards,
>> >
>> > Tom
>> >
>> > On Tue, Apr 28, 2020 at 1:49 AM Jun Rao <j...@confluent.io> wrote:
>> >
>> > > Hi, David,
>> > >
>> > > Thanks for the reply. A few more comments.
>> > >
>> > > 1. I am actually not sure if a quota based on request rate is easier
>> > > for the users. For context, in KIP-124, we started with a request
>> > > rate quota, but ended up not choosing it. The main issues are (a)
>> > > requests are not equal; some are more expensive than others; (b) the
>> > > users typically don't know how expensive each type of request is.
>> > > For example, a big part of the controller cost is ZK writes.
>> > > Creating a new topic with 1 partition requires 4 ZK writes (1 for
>> > > each segment in
>> > > /brokers/topics/[topic]/partitions/[partitionId]/state). Adding one
>> > > partition to an existing topic requires 2 ZK writes. Deleting a
>> > > topic with 1 partition requires 6 to 7 ZK writes. It's unlikely for
>> > > a user to know the exact cost associated with those implementation
>> > > details. If users don't know the cost, it's not clear if they can
>> > > set the rate properly.
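>> > > To put rough (purely illustrative) numbers on it: two users each
>> > > sending 10 requests/sec can generate very different controller
>> > > load -- 10 single-partition topic creations/sec is about 10 * 4 = 40
>> > > ZK writes/sec, while 10 single-partition topic deletions/sec is
>> > > about 10 * 7 = 70 ZK writes/sec, nearly twice the cost for the same
>> > > request rate.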
>> > > 2. I think that depends on the goal. To me, the common problem is
>> > > that you have many applications running on a shared Kafka cluster
>> > > and one of the applications abuses the broker by issuing too many
>> > > requests. In this case, a global quota will end up throttling every
>> > > application. However, what we really want in this case is to only
>> > > throttle the application that causes the problem. A user level
>> > > quota solves this problem more effectively. We may still need some
>> > > sort of global quota when the total usage from all applications
>> > > exceeds the broker resources. But that seems to be secondary since
>> > > it's uncommon for all applications' usage to go up at the same
>> > > time. We can also solve this problem by reducing the per user quota
>> > > for every application if there is a user level quota.
>> > >
>> > > 3. Not sure that I fully understand the difference in burst
>> > > balance. The current throttling logic works as follows. Each quota
>> > > is measured over a number of time windows. Suppose the quota is
>> > > X/sec. If time passes and the quota is not being used, we are
>> > > accumulating credit at the rate of X/sec. If the quota is being
>> > > used, we are reducing that credit based on the usage. The credit
>> > > expires when the corresponding window rolls out. The max credit
>> > > that can be accumulated is X * number of windows * window size. So,
>> > > in some sense, the current logic also supports burst and a way to
>> > > cap the burst. Could you explain the difference with Token Bucket a
>> > > bit more? Also, the current quota system always executes the
>> > > current request even if it's being throttled. It just informs the
>> > > client to back off a throttled amount of time before sending
>> > > another request.
>> > >
>> > > Jun
>> > >
>> > > On Mon, Apr 27, 2020 at 5:15 AM David Jacot <dja...@confluent.io>
>> > > wrote:
>> > >
>> > > > Hi Jun,
>> > > >
>> > > > Thank you for the feedback.
>> > > >
>> > > > 1. You are right. In the end, we do care about the percentage of
>> > > > time that an operation ties up the controller thread. I thought
>> > > > about this but I was not entirely convinced by it for the
>> > > > following reasons:
>> > > >
>> > > > 1.1. While I do agree that setting up a rate and a burst is a bit
>> > > > harder than allocating a percentage for the administrator of the
>> > > > cluster, I believe that a rate and a burst are way easier to
>> > > > understand for the users of the cluster.
>> > > >
>> > > > 1.2. Measuring the time that a request ties up the controller
>> > > > thread is not as straightforward as it sounds because the
>> > > > controller reacts to ZK TopicChange and TopicDeletion events in
>> > > > lieu of handling requests directly. These events carry neither
>> > > > the client id nor the user information, so the best would be to
>> > > > refactor the controller to accept requests instead of reacting to
>> > > > the events. This will be possible with KIP-590. It obviously has
>> > > > other side effects in the controller (e.g. batching).
>> > > >
>> > > > I leaned towards the current proposal mainly due to 1.1. as 1.2.
>> > > > can be (or will be) fixed. Does 1.1. sound like a reasonable
>> > > > trade off to you?
>> > > >
>> > > > 2. It is not in the current proposal. I thought that a global
>> > > > quota would be enough to start with. We can definitely make it
>> > > > work like the other quotas.
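>> > > > As a sketch (the quota name here is purely illustrative, not the
>> > > > final one), it could then be managed with the same tooling and
>> > > > entity hierarchy as the existing quotas, e.g.:
>> > > >
>> > > >   bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
>> > > >     --add-config 'partition_mutations_rate=10' \
>> > > >     --entity-type users --entity-name app-1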
>> > > > 3. The main difference is that the Token Bucket algorithm defines
>> > > > an explicit burst B while guaranteeing an average rate R, whereas
>> > > > our existing quota guarantees an average rate R as well but starts
>> > > > to throttle as soon as the rate goes above the defined quota.
>> > > >
>> > > > Creating and deleting topics is bursty by nature. Applications
>> > > > create or delete topics occasionally, usually by sending one
>> > > > request with multiple topics. The reasoning behind allowing a
>> > > > burst is to allow such requests with a reasonable size to pass
>> > > > without being throttled, whereas our current quota mechanism would
>> > > > reject any topics as soon as the rate is above the quota,
>> > > > requiring the applications to send subsequent requests to create
>> > > > or to delete all the topics.
>> > > >
>> > > > Best,
>> > > > David
>> > > >
>> > > > On Fri, Apr 24, 2020 at 9:03 PM Jun Rao <j...@confluent.io> wrote:
>> > > >
>> > > > > Hi, David,
>> > > > >
>> > > > > Thanks for the KIP. A few quick comments.
>> > > > >
>> > > > > 1. About quota.partition.mutations.rate. I am not sure if it's
>> > > > > very easy for the user to set the quota as a rate. For example,
>> > > > > each partition mutation could take a different number of ZK
>> > > > > operations (depending on things like retry). The time to
>> > > > > process each ZK operation may also vary from cluster to
>> > > > > cluster. An alternative way to model this is to do sth similar
>> > > > > to the request (CPU) quota, which exposes the quota as a
>> > > > > percentage of the server threads that can be used. The current
>> > > > > request quota doesn't include the controller thread. We could
>> > > > > add something that measures/exposes the percentage of time that
>> > > > > a request ties up the controller thread, which seems to be what
>> > > > > we really care about.
>> > > > >
>> > > > > 2. Is the new quota per user? Intuitively, we want to only
>> > > > > penalize applications that overuse the broker resources, but
>> > > > > not others. Also, in existing types of quotas (request,
>> > > > > bandwidth), there is a hierarchy among clientId vs user and
>> > > > > default vs customized (see
>> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-55%3A+Secure+Quotas+for+Authenticated+Users
>> > > > > ). Does the new quota fit into the existing hierarchy?
>> > > > >
>> > > > > 3. It seems that you are proposing a new quota mechanism based
>> > > > > on the Token Bucket algorithm. Could you describe its tradeoff
>> > > > > with the existing quota mechanism? Ideally, it would be better
>> > > > > if we have a single quota mechanism within Kafka.
>> > > > >
>> > > > > Jun
>> > > > >
>> > > > > On Fri, Apr 24, 2020 at 9:52 AM David Jacot <dja...@confluent.io>
>> > > > > wrote:
>> > > > >
>> > > > > > Hi folks,
>> > > > > >
>> > > > > > I'd like to start the discussion for KIP-599:
>> > > > > >
>> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-599%3A+Throttle+Create+Topic%2C+Create+Partition+and+Delete+Topic+Operations
>> > > > > >
>> > > > > > It proposes to introduce quotas for the create topics, create
>> > > > > > partitions and delete topics operations. Let me know what you
>> > > > > > think, thanks.
>> > > > > >
>> > > > > > Best,
>> > > > > > David