Hi Omnia,

Thanks for the detailed response. I agree that the client ID solution can be tricky (and could even run into the same problem if the client ID is not unique).
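To make that concrete, here is a rough sketch in Java (purely illustrative; the class and field names are made up and not part of KIP-936) of keying the PID tracking on the KafkaPrincipal plus client ID combination, which is one way to isolate applications that share a principal:

import java.util.Objects;

/**
 * Hypothetical example only -- not from KIP-936. Illustrates keying the PID
 * tracking cache on the KafkaPrincipal + client ID combination so that one
 * misbehaving app does not throttle another app sharing the same principal.
 */
final class PidQuotaKey {
    private final String principal; // e.g. "User:app-team-a"
    private final String clientId;  // may be empty if the quota is principal-only

    PidQuotaKey(String principal, String clientId) {
        this.principal = Objects.requireNonNull(principal);
        this.clientId = clientId == null ? "" : clientId;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PidQuotaKey)) return false;
        PidQuotaKey other = (PidQuotaKey) o;
        return principal.equals(other.principal) && clientId.equals(other.clientId);
    }

    @Override
    public int hashCode() {
        return Objects.hash(principal, clientId);
    }
}

The trade-off is more entries to track per broker, and of course it does not help if the client ID itself is reused across applications.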
As for waiting one day -- that was not meant to be an exact value; my point was that there will be some time during which nothing changes as we wait for the old IDs to expire. But I think I was misunderstanding how the rate worked (I was thinking more of a hard limit). I see we will restart the rate counter on each window. I guess then the concern is to make sure that, given our window size and rate, we set a default that ensures we don't OOM. 👍

Justine

On Tue, May 7, 2024 at 1:25 PM Omnia Ibrahim <o.g.h.ibra...@gmail.com> wrote:

> Hi Justine,
> Thanks for the feedback.
>
> > So consider a case where there is a storm for a given principal. We could have a large mass of short-lived producers in addition to some "well-behaved" ones. My understanding is that if the "well-behaved" one doesn't produce as frequently, i.e. less than once per hour, it will also get throttled when a storm of short-lived producers leads the principal to hit the given rate necessary for throttling. The idea of the filter is that we don't throttle existing producers, but in this case, we will.
>
> I believe this would be part of the limitations of the KIP (which it shares to some extent with the rest of Kafka quotas). If the KafkaPrincipal or ClientId is shared between different use cases where we have a few well-behaving clients and some misbehaving ones, we will punish all of them, as these are the only identifiers we have for them.
>
> If I understand the scenario you described correctly, I think I can break it into the following use cases:
> 1. The KafkaPrincipal is shared between multiple applications with different ClientIds. In that case it is not fair to throttle one application because another one configured with the same KafkaPrincipal is misbehaving. This can be solved by basing the throttling on the combination of KafkaPrincipal and ClientId; it will increase the number of entries in the cache but will at least isolate applications. And I still believe it might be tricky for whoever is manning the cluster to list and throttle most client IDs.
> 2. A new version of the application has been misconfigured, and during a rolling upgrade only half the instances of this app have the misconfigured version and keep creating short-lived PIDs, while the other half are well behaved but produce at a slower pace, e.g. every two hours. The well-behaved half will be impacted, and it is a bit tricky to track these.
>
> In both cases I don't believe we need to wait for the 24hr PID expiration to kick in; however, I believe we need to follow one of these solutions:
> 1. Break these use cases out to different KafkaPrincipals (and/or ClientIds if we opt in for that).
> 2. If both cases are using the same KafkaPrincipal then both apps most likely have the same owner, who will need to shut down or fix the misbehaving/misconfigured application that creates all these instances, and we need to wait for the throttle time to pass before the well-behaved client proceeds with the unseen PID.
>
> > Note -- one thing that wasn't totally clear from the KIP was whether we throttle all new produce requests from the client or just the ones with unseen IDs. If we throttle them all, perhaps this point isn't a huge deal.
>
> I'll clarify this. But the KIP is aiming to throttle all unseen PIDs for X amount of time, which is the throttle time.
>
> > The other concern that I brought up is that when we throttle, we will likely continue to throttle until the storm stops.
> > This is because we will have to wait 1 day or so for IDs to expire, and we will likely replace them at a pretty fast rate. This can be acceptable if we believe that it is helpful to getting the behavior to stop, but I just wanted to call out that the user will likely not be able to start clients in the meantime.
>
> I'm not sure the producer needs to wait for 1 day (unless the PID quota is set too high). As we are throttling unseen PIDs per user, any PID above the quota will not be registered on the leader side, and we don't store anything for idempotence at initialisation, so I'm not sure I see the need to wait for 1 day unless I'm missing something.
> Say User-A has `producer_ids_rate` 100 and the broker can store 2,000,000 PIDs before hitting out of memory. Then the leader will only store 100 PIDs for this user in 1 hour and throttle any unseen PID, as these will be considered new PIDs. If we hit the scenario that you described within the 1 hr window then this client should be alerted that it didn't produce and got throttled. Then we can apply one of the solutions I mentioned in the first point: either split the use cases across different KafkaPrincipals, or shut down the misconfigured app and wait for the throttle time to pass. The throttling isn't controlled by the 24hr PID expiration, and we don't even check whether a PID has expired or not.
>
> I think I might be missing something regarding the need to wait 24hr to resume!
>
> Omnia
>
> On 6 May 2024, at 20:46, Justine Olshan <jols...@confluent.io.INVALID> wrote:
>
> > Hi Claude,
> >
> > I can clarify my comments.
> >
> > Just to clarify -- my understanding is that we don't intend to throttle any new producer IDs at the beginning. I believe this amount is specified by `producer_ids_rate`, but you can see this as a number of producer IDs per hour.
> >
> > So consider a case where there is a storm for a given principal. We could have a large mass of short-lived producers in addition to some "well-behaved" ones. My understanding is that if the "well-behaved" one doesn't produce as frequently, i.e. less than once per hour, it will also get throttled when a storm of short-lived producers leads the principal to hit the given rate necessary for throttling. The idea of the filter is that we don't throttle existing producers, but in this case, we will.
> >
> > Note -- one thing that wasn't totally clear from the KIP was whether we throttle all new produce requests from the client or just the ones with unseen IDs. If we throttle them all, perhaps this point isn't a huge deal.
> >
> > The other concern that I brought up is that when we throttle, we will likely continue to throttle until the storm stops. This is because we will have to wait 1 day or so for IDs to expire, and we will likely replace them at a pretty fast rate. This can be acceptable if we believe that it is helpful to getting the behavior to stop, but I just wanted to call out that the user will likely not be able to start clients in the meantime.
> >
> > Justine
> >
> > On Sun, May 5, 2024 at 6:35 AM Claude Warren <cla...@xenei.com> wrote:
> >
> >> Justine,
> >>
> >> I am new here so please excuse the ignorance.
> >>
> >> When you talk about "seen" producers I assume you mean the PIDs that the Bloom filter has seen.
> >> When you say "producer produces every 2 hours" do you mean the producer writes to a topic every 2 hours and uses the same PID?
> >> When you say "hitting the limit", what limit is reached?
> >>
> >> Given the default setup, a producer that produces a PID every 2 hours, regardless of whether or not it is a new PID, will be reported as a new PID being seen. But I would expect the throttling system to accept that as a new PID for the producer, look at the frequency of PIDs, and accept it without throttling.
> >>
> >> If the actual question is "how many PIDs did this Principal produce in the last hour" or "has this Principal produced more than X PIDs in the last hour", there are probably cleaner ways to do this. If this is the question, I would use CPC from Apache DataSketches [1] and keep multiple CPCs (say one every 15 minutes -- to match the KIP-936 proposal) for each Principal. You could then do a quick check on the current CPC to see if it exceeds hour-limit / 4 and, if so, check the hourly rate (by summing the 4 15-minute CPCs). Then the code could simply notify when to throttle and when to stop throttling.
> >>
> >> Claude
> >>
> >> [1] https://datasketches.apache.org/docs/CPC/CpcPerformance.html
> >>
> >> On Fri, May 3, 2024 at 4:21 PM Justine Olshan <jols...@confluent.io.invalid> wrote:
> >>
> >>> Hey folks,
> >>>
> >>> I shared this with Omnia offline: One concern I have is with the length of time we keep "seen" producer IDs. It seems like the default is 1 hour. If a producer produces every 2 hours or so, and we are hitting the limit, it seems like we will throttle it even though we've seen it before and have state for it on the server. Then, it seems like we will have to wait for the natural expiration of producer IDs (via producer.id.expiration.ms) before we allow new or idle producers to join again without throttling. I think this proposal is a step in the right direction when it comes to throttling the "right" clients, but I want to make sure we have reasonable defaults. Keep in mind that idempotent producers are the default, so most folks won't be tuning these values out of the box.
> >>>
> >>> As for Igor's questions about InitProducerId -- I think the main reason we have avoided that solution is that there is no state stored for idempotent producers when grabbing an ID. My concern there is either storing too much state to track this or throttling before we need to.
> >>>
> >>> Justine
> >>>
> >>> On Thu, May 2, 2024 at 2:36 PM Claude Warren, Jr <claude.war...@aiven.io.invalid> wrote:
> >>>
> >>>> There is some question about whether or not we need the configuration options. My take on them is as follows:
> >>>>
> >>>> producer.id.quota.window.num: No opinion. I don't know what this is used for, but I suspect that there is a good reason to have it. It is not used within the Bloom filter caching mechanism.
> >>>> producer.id.quota.window.size.seconds: Leave it, as it is one of the most effective ways to tune the filter and determines how long a PID is recognized.
> >>>> producer.id.quota.cache.cleanup.scheduler.interval.ms: Remove it unless there is another use for it; we can compute a better value internally.
> >>>> producer.id.quota.cache.layer.count: Leave it, as it is one of the most effective ways to tune the filter.
> >>>> producer.id.quota.cache.false.positive.rate: Replace it with a constant; I don't think any other Bloom filter solution provides access to this knob for end users.
> >>>> producer_ids_rate: Leave this one, it is critical for reasonable operation.
> >>
> >> --
> >> LinkedIn: http://www.linkedin.com/in/claudewarren
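A minimal sketch of how the window size, layer count, and false positive rate knobs above could fit together in a layered filter that ages PIDs out of the window. This is illustrative only: it uses Guava's BloomFilter rather than whatever the KIP-936 implementation uses, and all class and parameter names are made up.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch -- not the KIP-936 code. One instance per quota entity
 * (e.g. per KafkaPrincipal). PIDs are remembered for roughly windowSizeMs,
 * with the oldest layer rotated out as time advances.
 */
final class LayeredPidFilter {
    private final Deque<BloomFilter<Long>> layers = new ArrayDeque<>();
    private final int layerCount;            // cf. producer.id.quota.cache.layer.count
    private final long layerSpanMs;          // windowSizeMs / layerCount
    private final long expectedPidsPerLayer; // sized from the PID rate quota
    private final double falsePositiveRate;  // proposed above to become a constant
    private long currentLayerStartMs;

    LayeredPidFilter(int layerCount, long windowSizeMs, long expectedPidsPerLayer,
                     double falsePositiveRate, long nowMs) {
        this.layerCount = layerCount;
        this.layerSpanMs = windowSizeMs / layerCount;
        this.expectedPidsPerLayer = expectedPidsPerLayer;
        this.falsePositiveRate = falsePositiveRate;
        this.currentLayerStartMs = nowMs;
        layers.addLast(newLayer());
    }

    /** Returns true if this PID was not seen within the window, i.e. it counts against the quota. */
    synchronized boolean trackAndCheckIfNew(long producerId, long nowMs) {
        maybeRotate(nowMs);
        for (BloomFilter<Long> layer : layers) {
            if (layer.mightContain(producerId)) {
                return false; // seen recently (modulo the false positive rate)
            }
        }
        layers.peekLast().put(producerId); // record only in the newest layer
        return true;
    }

    private void maybeRotate(long nowMs) {
        if (nowMs - currentLayerStartMs >= layerSpanMs) {
            layers.addLast(newLayer());
            if (layers.size() > layerCount) {
                layers.removeFirst(); // the oldest PIDs fall out of the window
            }
            currentLayerStartMs = nowMs;
        }
    }

    private BloomFilter<Long> newLayer() {
        return BloomFilter.create(Funnels.longFunnel(), expectedPidsPerLayer, falsePositiveRate);
    }
}

Under these assumptions, the window size controls how long a PID stays "seen" and the layer count controls how coarsely old PIDs age out, which matches the view above that those two are the most effective tuning knobs.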