Hi David,

Thanks for the reply. A few more comments.
1. I am actually not sure that a quota based on request rate is easier for the
users. For context, in KIP-124 we started with a request rate quota but ended
up not choosing it. The main issues are (a) requests are not equal; some are
more expensive than others; and (b) users typically don't know how expensive
each type of request is. For example, a big part of the controller cost is ZK
writes. Creating a new topic with 1 partition takes 4 ZK writes (1 for each
segment in /brokers/topics/[topic]/partitions/[partitionId]/state). Adding one
partition to an existing topic takes 2 ZK writes. Deleting a topic with 1
partition takes 6 to 7 ZK writes. It's unlikely that a user knows the exact
cost associated with those implementation details, and if users don't know the
cost, it's not clear that they can set the rate properly.

2. I think that depends on the goal. To me, the common problem is that many
applications run on a shared Kafka cluster and one of them abuses the broker
by issuing too many requests. In this case, a global quota ends up throttling
every application, whereas what we really want is to throttle only the
application causing the problem. A user-level quota solves this problem more
effectively. We may still need some sort of global quota for when the total
usage from all applications exceeds the broker's resources, but that seems
secondary since it's uncommon for all applications' usage to go up at the same
time. With a user-level quota, we could also solve that case by reducing the
per-user quota for every application.

3. Not sure that I fully understand the difference in burst behavior. The
current throttling logic works as follows. Each quota is measured over a
number of time windows. Suppose the quota is X/sec. If time passes and the
quota is not being used, we are accumulating credit at the rate of X/sec.
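To make the comparison concrete, here is a rough sketch of the two models under discussion: the windowed-credit accounting described in this point, and the Token Bucket (rate R, explicit burst B) proposed in the KIP. The class names and parameters are hypothetical simplifications for illustration, not Kafka's actual quota code:

```python
class WindowedRateQuota:
    """Sketch of the existing windowed quota: usage is recorded into
    rolling time windows. Unused capacity accumulates as "credit" at the
    quota rate and expires when its window rolls out, so the maximum
    accumulated credit (the implicit burst) is
    quota * num_windows * window_size.
    """

    def __init__(self, quota_per_sec, num_windows=11, window_size_sec=1.0):
        self.quota = quota_per_sec
        self.num_windows = num_windows
        self.window_size = window_size_sec
        self.samples = {}  # window index -> usage recorded in that window

    def record(self, now, value):
        idx = int(now // self.window_size)
        oldest = idx - self.num_windows + 1
        # Expire samples whose window has rolled out of the measured range.
        self.samples = {i: v for i, v in self.samples.items() if i >= oldest}
        self.samples[idx] = self.samples.get(idx, 0.0) + value

    def throttle_time(self):
        """Seconds the client should back off; 0 when within quota.

        The request that exceeded the quota is still executed; the broker
        only tells the client to wait before sending the next request.
        """
        total = sum(self.samples.values())
        allowed = self.quota * self.num_windows * self.window_size
        return 0.0 if total <= allowed else (total - allowed) / self.quota


class TokenBucket:
    """Sketch of a Token Bucket: average rate R with an explicit burst B
    that is configured independently of any window settings."""

    def __init__(self, rate, burst):
        self.rate = rate      # R: tokens refilled per second
        self.burst = burst    # B: maximum tokens the bucket can hold
        self.tokens = burst
        self.last = 0.0

    def record(self, now, value):
        # Refill first, then spend. Tokens may go negative so that, as in
        # the current quota system, the offending request still executes
        # and only subsequent requests are delayed.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= value

    def throttle_time(self):
        return 0.0 if self.tokens >= 0 else -self.tokens / self.rate
```

For example, with quota X = 2/sec over 5 one-second windows, a burst of 10 passes untouched in both models; the visible difference is that the token bucket's burst B is an explicit, separately tunable knob, while the windowed model's burst capacity is a by-product of the window configuration.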
If a quota is being used, we reduce that credit based on the usage. The credit
expires when the corresponding window rolls out. The max credit that can be
accumulated is X * number of windows * window size. So, in some sense, the
current logic also supports bursts and a way to cap them. Could you explain
the difference with Token Bucket a bit more? Also, the current quota system
always executes the current request even if it's being throttled; it just
informs the client to back off for the throttle time before sending another
request.

Jun

On Mon, Apr 27, 2020 at 5:15 AM David Jacot <dja...@confluent.io> wrote:

> Hi Jun,
>
> Thank you for the feedback.
>
> 1. You are right. In the end, we do care about the percentage of time that
> an operation ties up the controller thread. I thought about this but I was
> not entirely convinced by it for the following reasons:
>
> 1.1. While I do agree that setting up a rate and a burst is a bit harder
> than allocating a percentage for the administrator of the cluster, I
> believe that a rate and a burst are way easier to understand for the users
> of the cluster.
>
> 1.2. Measuring the time that a request ties up the controller thread is
> not as straightforward as it sounds because the controller reacts to ZK
> TopicChange and TopicDeletion events in lieu of handling requests
> directly. These events carry neither the client id nor the user
> information, so the best option would be to refactor the controller to
> accept requests instead of reacting to the events. This will be possible
> with KIP-590. It obviously has other side effects in the controller (e.g.
> batching).
>
> I leaned towards the current proposal mainly due to 1.1., as 1.2. can be
> (or will be) fixed. Does 1.1. sound like a reasonable trade-off to you?
>
> 2. It is not in the current proposal. I thought that a global quota would
> be enough to start with. We can definitely make it work like the other
> quotas.
>
> 3.
> The main difference is that the Token Bucket algorithm defines an
> explicit burst B while guaranteeing an average rate R, whereas our
> existing quota also guarantees an average rate R but starts to throttle
> as soon as the rate goes above the defined quota.
>
> Creating and deleting topics is bursty by nature. Applications create or
> delete topics occasionally, usually by sending one request with multiple
> topics. The reasoning behind allowing a burst is to let such requests of
> a reasonable size pass without being throttled, whereas our current quota
> mechanism would reject any topics as soon as the rate is above the quota,
> requiring the applications to send subsequent requests to create or to
> delete all the topics.
>
> Best,
> David
>
>
> On Fri, Apr 24, 2020 at 9:03 PM Jun Rao <j...@confluent.io> wrote:
>
> > Hi, David,
> >
> > Thanks for the KIP. A few quick comments.
> >
> > 1. About quota.partition.mutations.rate. I am not sure if it's very
> > easy for the user to set the quota as a rate. For example, each
> > partition mutation could take a different number of ZK operations
> > (depending on things like retries). The time to process each ZK
> > operation may also vary from cluster to cluster. An alternative way to
> > model this is to do something similar to the request (CPU) quota, which
> > exposes the quota as a percentage of the server threads that can be
> > used. The current request quota doesn't include the controller thread.
> > We could add something that measures/exposes the percentage of time
> > that a request ties up the controller thread, which seems to be what we
> > really care about.
> >
> > 2. Is the new quota per user? Intuitively, we want to only penalize
> > applications that overuse the broker resources, but not others.
> > Also, in existing types of quotas (request, bandwidth), there is a
> > hierarchy among clientId vs user and default vs customized (see
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-55%3A+Secure+Quotas+for+Authenticated+Users
> > ). Does the new quota fit into the existing hierarchy?
> >
> > 3. It seems that you are proposing a new quota mechanism based on the
> > Token Bucket algorithm. Could you describe its tradeoff with the
> > existing quota mechanism? Ideally, it would be better if we have a
> > single quota mechanism within Kafka.
> >
> > Jun
> >
> >
> > On Fri, Apr 24, 2020 at 9:52 AM David Jacot <dja...@confluent.io> wrote:
> >
> > > Hi folks,
> > >
> > > I'd like to start the discussion for KIP-599:
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-599%3A+Throttle+Create+Topic%2C+Create+Partition+and+Delete+Topic+Operations
> > >
> > > It proposes to introduce quotas for the create topics, create
> > > partitions and delete topics operations. Let me know what you think,
> > > thanks.
> > >
> > > Best,
> > > David