Hi Jun,

Thanks for the response. If we use the existing purgatory implementation,
we should get additional purgatory metrics for ACL updates with the new
purgatory name as tag, consistent with what we have for other delayed
operations. I will add these to the KIP. We also have request metrics which
indicate local vs remote time with the API key as tag, so that should
indicate the portion of time waiting for async operations.

Regards,

Rajini

On Mon, Sep 9, 2019 at 4:31 PM Jun Rao <j...@confluent.io> wrote:

> Hi, Rajini, Ismael,
>
> Yes, I can see the argument for making CreateAcls/DeleteAcls async. I am ok
> with that if you feel the implementation is not too complicated. Should we
> consider adding some additional metric to reflect the portion of the time
> spent in waiting for the async operation to complete?
>
> Thanks,
>
> Jun
>
> On Mon, Sep 9, 2019 at 5:20 AM Ismael Juma <isma...@gmail.com> wrote:
>
> > Hi Jun,
> >
> > As you say, even though the average number of operations may be low, the
> > request rate can be high in bursts. This can overwhelm the request queue
> > and cause an outage for no good reason. Or the backing storage system may
> > be slow for a period of time, causing a similar issue.
> >
> > I've seen several such issues in production affecting Kafka's
> > create/alter/delete paths (e.g people creating/deleting thousands/tens of
> > thousands of topics in a hot loop). As you say, it's not just ACLs that
> are
> > vulnerable. In my opinion, we should learn from these experiences when
> > designing new interfaces and we should adjust the existing ones to be
> more
> > resilient. It's also worth mentioning that create/delete topics already
> > relies on the purgatory if the request timeout is greater than 0.
> >
> > Can you elaborate on how CPU throttling would help? It seems to me that
> it
> > would not trigger fast enough if there was a slow back-end, for example
> > (the number of request threads is often lower than 10, so it doesn't take
> > many blocked requests to cause major problems).
> >
> > Ismael
> >
> > On Sun, Sep 8, 2019, 9:5k2 PM Jun Rao <j...@confluent.io> wrote:
> >
> > > Hi, Rajini,
> > >
> > > Thanks for the reply. The 4-step approach that you outlined seems to
> > work.
> > > Overall, I can see that the async authorize() api could lead to an
> > overall
> > > more efficient implementation. The tradeoff is that we have to code
> every
> > > request with an extra stage. To me, this optimization seems too early.
> > With
> > > 100K topics, each with 1KB worth of users, the required ACL space is
> > about
> > > 100MB and we should be able to cache everything. Beyond that, we still
> > have
> > > the option of dividing up the topic and ACL metadata such that every
> > broker
> > > is only required to cache a subset of all metadata.
> > >
> > > Regarding making create/delete api async, I still have mixed feelings
> on
> > > this. I am wondering if the benefit is worth the added complexity. To
> me,
> > > both operations are rare. If one request thread is blocked occasionally
> > for
> > > a few millisecs, it's probably ok since it mostly just affects one
> > client.
> > > In the case that a particular user issues many those operations in a
> > short
> > > window. Could we just use CPU throttling to prevent too much resource
> > being
> > > used? There are a few other similar types of requests such as
> > create/alter
> > > configs, create/alter topic, etc. Do we plan to add an extra processing
> > > stage for each of them too in the future?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> >
>

Reply via email to