Hi Jun,

As you say, even though the average number of operations may be low, the
request rate can be high in bursts. This can overwhelm the request queue
and cause an outage for no good reason. Or the backing storage system may
be slow for a period of time, causing a similar issue.

I've seen several such issues in production affecting Kafka's
create/alter/delete paths (e.g. people creating/deleting thousands/tens of
thousands of topics in a hot loop). As you say, it's not just ACLs that are
vulnerable. In my opinion, we should learn from these experiences when
designing new interfaces and we should adjust the existing ones to be more
resilient. It's also worth mentioning that create/delete topics already
relies on the purgatory if the request timeout is greater than 0.

Can you elaborate on how CPU throttling would help? It seems to me that it
would not trigger fast enough if there was a slow back-end, for example
(the number of request threads is often lower than 10, so it doesn't take
many blocked requests to cause major problems).
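
To make the thread-count concern concrete, here is a minimal sketch of a toy
FIFO model of a fixed pool of request threads. It is not Kafka code: the
simulate() helper, the 8-thread pool, and the 5 second back-end latency are
all made-up assumptions for illustration only.

```python
import heapq


def simulate(num_threads, requests):
    """Toy FIFO model of a fixed pool of request handler threads.

    requests: list of (arrival_ms, service_ms) tuples, sorted by arrival.
    Returns, per request, how long (ms) it waited for a free thread.
    """
    free_at = [0] * num_threads  # time at which each thread becomes free
    heapq.heapify(free_at)
    delays = []
    for arrival, service in requests:
        earliest = heapq.heappop(free_at)   # next thread to become free
        start = max(arrival, earliest)      # cannot start before arrival
        delays.append(start - arrival)
        heapq.heappush(free_at, start + service)
    return delays


# Hypothetical numbers: 8 request threads, back-end calls blocking for 5 s.
slow = [(0, 5000)] * 8   # 8 create/delete requests hit the slow back-end
fast = [(10, 1)]         # an unrelated 1 ms request arrives 10 ms later
print(simulate(8, slow + fast)[-1])  # wait time of the unrelated request
```

In this model, eight blocked requests are enough to make the unrelated
request wait for the full back-end delay, while seven leave one thread free
and it is served immediately.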

Ismael

On Sun, Sep 8, 2019, 9:52 PM Jun Rao <j...@confluent.io> wrote:

> Hi, Rajini,
>
> Thanks for the reply. The 4-step approach that you outlined seems to work.
> Overall, I can see that the async authorize() api could lead to an overall
> more efficient implementation. The tradeoff is that we have to code every
> request with an extra stage. To me, this optimization seems too early. With
> 100K topics, each with 1KB worth of users, the required ACL space is about
> 100MB and we should be able to cache everything. Beyond that, we still have
> the option of dividing up the topic and ACL metadata such that every broker
> is only required to cache a subset of all metadata.
>
> Regarding making create/delete api async, I still have mixed feelings on
> this. I am wondering if the benefit is worth the added complexity. To me,
> both operations are rare. If one request thread is blocked occasionally for
> a few millisecs, it's probably ok since it mostly just affects one client.
> In the case that a particular user issues many of those operations in a
> short window, could we just use CPU throttling to prevent too many resources
> from being used? There are a few other similar types of requests such as create/alter
> configs, create/alter topic, etc. Do we plan to add an extra processing
> stage for each of them too in the future?
>
> Thanks,
>
> Jun
>
