Hi, Rajini, Ismael,

Yes, I can see the argument for making CreateAcls/DeleteAcls async. I am OK with that if you feel the implementation is not too complicated. Should we consider adding an additional metric to reflect the portion of time spent waiting for the async operation to complete?

Thanks,

Jun

On Mon, Sep 9, 2019 at 5:20 AM Ismael Juma <isma...@gmail.com> wrote:

> Hi Jun,
>
> As you say, even though the average number of operations may be low, the
> request rate can be high in bursts. This can overwhelm the request queue
> and cause an outage for no good reason. Or the backing storage system may
> be slow for a period of time, causing a similar issue.
>
> I've seen several such issues in production affecting Kafka's
> create/alter/delete paths (e.g. people creating/deleting thousands/tens of
> thousands of topics in a hot loop). As you say, it's not just ACLs that are
> vulnerable. In my opinion, we should learn from these experiences when
> designing new interfaces and we should adjust the existing ones to be more
> resilient. It's also worth mentioning that create/delete topics already
> relies on the purgatory if the request timeout is greater than 0.
>
> Can you elaborate on how CPU throttling would help? It seems to me that it
> would not trigger fast enough if there was a slow back-end, for example
> (the number of request threads is often lower than 10, so it doesn't take
> many blocked requests to cause major problems).
>
> Ismael
>
> On Sun, Sep 8, 2019, 9:52 PM Jun Rao <j...@confluent.io> wrote:
>
> > Hi, Rajini,
> >
> > Thanks for the reply. The 4-step approach that you outlined seems to work.
> > Overall, I can see that the async authorize() API could lead to an overall
> > more efficient implementation. The tradeoff is that we have to code every
> > request with an extra stage. To me, this optimization seems too early. With
> > 100K topics, each with 1KB worth of users, the required ACL space is about
> > 100MB and we should be able to cache everything. Beyond that, we still have
> > the option of dividing up the topic and ACL metadata such that every broker
> > is only required to cache a subset of all metadata.
> >
> > Regarding making the create/delete API async, I still have mixed feelings on
> > this. I am wondering if the benefit is worth the added complexity. To me,
> > both operations are rare. If one request thread is blocked occasionally for
> > a few milliseconds, it's probably OK since it mostly just affects one client.
> > If a particular user issues many of those operations in a short window,
> > could we just use CPU throttling to prevent too many resources from being
> > used? There are a few other similar types of requests such as create/alter
> > configs, create/alter topic, etc. Do we plan to add an extra processing
> > stage for each of them too in the future?
> >
> > Thanks,
> >
> > Jun
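
For illustration only, the async hand-off and the wait-time metric discussed in this thread might be sketched as below. This is a language-neutral sketch written in Python, not Kafka broker code: `AsyncWaitMetric`, `slow_backend_create_acls`, and `handle_create_acls` are hypothetical names, and the sleep merely stands in for a slow backing store. The idea is that the request thread returns as soon as the update is handed off, while the time spent waiting on the backend is recorded separately.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class AsyncWaitMetric:
    """Hypothetical metric: count and total time spent waiting on async ACL updates."""

    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0
        self.total_wait_s = 0.0

    def record(self, wait_s):
        # Callbacks may run on several backend threads, so guard the counters.
        with self._lock:
            self.count += 1
            self.total_wait_s += wait_s


def slow_backend_create_acls(acls, delay_s):
    """Stand-in for a slow backing store (assumption, not a Kafka API)."""
    time.sleep(delay_s)
    return len(acls)


def handle_create_acls(acls, backend_pool, metric, delay_s=0.05):
    """Hand the update to a separate pool and return without blocking.

    The wait for the backend is recorded in `metric` once the future completes,
    so it can be reported separately from request-thread time.
    """
    submitted_at = time.monotonic()
    future = backend_pool.submit(slow_backend_create_acls, acls, delay_s)
    future.add_done_callback(
        lambda _f: metric.record(time.monotonic() - submitted_at)
    )
    return future


if __name__ == "__main__":
    metric = AsyncWaitMetric()
    started = time.monotonic()
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [handle_create_acls(["acl-%d" % i], pool, metric) for i in range(4)]
        handoff_s = time.monotonic() - started  # time the "request thread" was busy
    # Leaving the `with` block waits for the pool, so all callbacks have run here.
    print("hand-off took %.4fs" % handoff_s)
    print("async waits recorded: %d, total %.4fs" % (metric.count, metric.total_wait_s))
```

With only two backend threads and four requests, the later requests queue behind the earlier ones, so their recorded wait includes queueing time as well as backend latency. That is the portion of time a metric like this would expose.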