Re: [DISCUSS] solutions for broker OOM caused by many producer IDs

Omnia Ibrahim Mon, 31 Oct 2022 11:16:02 -0700

Hi Luke and Justine,

> For (3), you said:
> > - I have some concerns about the impact of this option on the
> transactional
> producers, for example, what will happen to an ongoing transaction
> associated with an expired PID? Would this leave the transactions in a
> "hanging" state?
>
> - How will we notify the client that the transaction can't continue due to
> an expired PID?
>
> - If PID got marked as `expired` this will mean that
> `admin.DescribeProducers` will not list them which will make
> *`kafka-transactions.sh
> --list`* a bit tricky as we can't identify if there are transactions linked
> to this expired PID or not. The same concern applies to
> *`kafka-transactions.sh
> --find-hanging`*.
>
> --> Yes, you're right. Those are also concerns for this solution.
> Currently, there's no way to notify clients about the expiration.
> Also, the ongoing transactions will be hanging. For the admin cli, we've
> never thought about it. Good point.
> In summary, to adopt this solution, there are many issues needed to get
> fixed.
>


Justin already clarified that if PID is attached to a transaction it will
not expire so identifying the transactions shouldn't be a concern anymore.
The only concern here will be that this solution will not solve the problem
if the rouge producer is a transactional producer with hanging
transactions.
If anyone faced this situation they will need to abort the hanging
transactions manually and then the solution to expire a PID can then work.

--> Yes, I mean KafkaPrinciple (sorry, I didn't make it clear)
> Yes, We were thinking about throttling by KafkaPrinciple. Client Id is also
> workable.
> It's just these 2 attributes are not required.
> That is, it's possible we take all clients as the same one: {default
> KafkaPrinciple + default clientID}, and apply throttling on it.
> Do you have any thoughts about it?
> Maybe skip throttling for {default KafkaPrinciple + default clientID}
>

Throttling for default KafkaPrinciple and default ClientID is useful when
we need to have a hard limit on the whole cluster and whoever is running
the cluster doesn't knowclientsntIDs or if a KafkaPrinciple is reused
between different producer applications.
I usually find it helpful to have a way to apply throttling only on the
rough clients only once I identify them without punishing everyone on the
cluster. However, there are two problems with this
- There's no easy way at the moment to link PIDs to clientId or
KafkaPrinciple. This need to be addressed first.
- Is Justin's comment on the throttling, and the fact that will mean we
either block all requests or have to store the request in memory which in
both cases has side downs on the producer experince.

I recently had another discussion with my team and it does seem like there
> should be a way to make it more clear to the clients what is going on. A
> lot of this protocol is implicit. I'm wondering if maybe there is a way to
> improve the story for newer clients. (Ie if we choose to expire based on a
> size limit, we should include a response indicating the ID has expired.) We
> also discussed ways to redefine the guarantees so that users who have
> stronger idempotency requirements can ensure them (over availability/memory
> concerns). Let me know if you have any ideas here.
>

It may be easier to improve the experience for new clients. However, if we
improved only the new clients we may need a way to help teams who run Kafka
with rough clients on old versions by at least giving them an easy way to
identify the clientId/ or KafkaPrinciple that generated these PIDs.

For context, it's very tricky to even identify which clientId is creating
all these PIDs that caused OOM, which is a contributing part of the issue
at the moment. So maybe one option here could be adding a new metric that
tracks the number of generated PIDs per clientId. This will help the team
who runs the Kafka cluster to
- contact these rough clients and ask them to fix their clients or upgrade
to a new client if the new client version has a better experience.
- or if ended with a throttling solution this may help identify which
clientId needs to be throttled.

Maybe we can start with a solution for identifying the rough clients first
and keep looking for a solution to limit them, what do you think?

Thanks

On Tue, Oct 18, 2022 at 5:24 PM Justine Olshan <jols...@confluent.io.invalid>
wrote:

> Oops.  I realized I just replied to Omnia 🤦‍♀️
>
> Here was my response for the mailing thread:
>
> Hey Omnia,
> Sorry to hear this is a problem for you as well. :(
> > * I have some concerns about the impact of this option on the
> transactional producers, for example, what will happen to an ongoing
> transaction associated with an expired PID? Would this leave the
> transactions in a "hanging" state?*
> We currently check if a transaction is ongoing and do not expire the
> producer ID if it has an ongoing transaction. I suspect we will continue to
> do this with any solution we pick.
>
> My team members and I looked a bit into the throttling case and it can get
> a bit tricky since it means we need to throttle the produce request before
> it is processed. This means we either block all requests or have to store
> the request in memory (which is not great if we are trying to save memory).
>
> I recently had another discussion with my team and it does seem like there
> should be a way to make it more clear to the clients what is going on. A
> lot of this protocol is implicit. I'm wondering if maybe there is a way to
> improve the story for newer clients. (Ie if we choose to expire based on a
> size limit, we should include a response indicating the ID has expired.) We
> also discussed ways to redefine the guarantees so that users who have
> stronger idempotency requirements can ensure them (over availability/memory
> concerns). Let me know if you have any ideas here.
>
> Thanks again for commenting here, hopefully we can come to a good solution.
>
> On Tue, Oct 18, 2022 at 1:11 AM Luke Chen <show...@gmail.com> wrote:
>
> > Hi Omnia,
> >
> > Thanks for your reply.
> >
> > For (3), you said:
> > > - I have some concerns about the impact of this option on the
> > transactional
> > producers, for example, what will happen to an ongoing transaction
> > associated with an expired PID? Would this leave the transactions in a
> > "hanging" state?
> >
> > - How will we notify the client that the transaction can't continue due
> to
> > an expired PID?
> >
> > - If PID got marked as `expired` this will mean that
> > `admin.DescribeProducers` will not list them which will make
> > *`kafka-transactions.sh
> > --list`* a bit tricky as we can't identify if there are transactions
> linked
> > to this expired PID or not. The same concern applies to
> > *`kafka-transactions.sh
> > --find-hanging`*.
> >
> > --> Yes, you're right. Those are also concerns for this solution.
> > Currently, there's no way to notify clients about the expiration.
> > Also, the ongoing transactions will be hanging. For the admin cli, we've
> > never thought about it. Good point.
> > In summary, to adopt this solution, there are many issues needed to get
> > fixed.
> >
> >
> > For (5), you said:
> > > I am assuming you mean KafkaPrincipal here! If so is your concern here
> > that
> > those good clients that use the same principal as a rogue one will get
> > throttled?
> >
> > If this is the case, then I believe it should be okay as other throttling
> > in Kafka on *`/config/users/<user>`* has the same behaviour.
> >
> >
> > What about applying limit/throttling to
> > *`/config/users/<user>/clients/<client-id>`
> > *similar to what we have with client quota? This should reduce the
> concern
> > about throttling good clients, right?
> >
> > --> Yes, I mean KafkaPrinciple (sorry, I didn't make it clear)
> > Yes, We were thinking about throttling by KafkaPrinciple. Client Id is
> > also workable.
> > It's just these 2 attributes are not required.
> > That is, it's possible we take all clients as the same one: {default
> > KafkaPrinciple + default clientID}, and apply throttling on it.
> > Do you have any thoughts about it?
> > Maybe skip throttling for {default KafkaPrinciple + default clientID} ?
> >
> > Luke
> >
> >
> >
> > On Sat, Oct 15, 2022 at 2:33 AM Omnia Ibrahim <o.g.h.ibra...@gmail.com>
> > wrote:
> >
> >> Hi Luke & Justine,
> >> Thanks for looking into this issue, we have been experiencing this
> because
> >> of rouge clients as well.
> >>
> >> > 3. Having a limit to the number of active producer IDs (sort of like
> an
> >> LRU
> >> >cache)
> >> >-> The idea here is that if we hit a misconfigured client, we will
> expire
> >> >the older entries. The concern here is we have risks to lose
> idempotency
> >> >guarantees, and currently, we don't have a way to notify clients about
> >> >losing idempotency guarantees. Besides, the least  recently used
> entries
> >> >got removed are not always from the "bad" clients.
> >>
> >> - I have some concerns about the impact of this option on the
> >> transactional
> >> producers, for example, what will happen to an ongoing transaction
> >> associated with an expired PID? Would this leave the transactions in a
> >> "hanging" state?
> >>
> >> - How will we notify the client that the transaction can't continue due
> to
> >> an expired PID?
> >>
> >> - If PID got marked as `expired` this will mean that
> >> `admin.DescribeProducers` will not list them which will make
> >> *`kafka-transactions.sh
> >> --list`* a bit tricky as we can't identify if there are transactions
> >> linked
> >> to this expired PID or not. The same concern applies to
> >> *`kafka-transactions.sh
> >> --find-hanging`*.
> >>
> >>
> >> >5. limit/throttling the producer id based on the principle
> >> >-> Although we can limit the impact to a certain principle with this
> >> idea,
> >> >same concern still exists as solution #1 #2.
> >>
> >> I am assuming you mean KafkaPrincipal here! If so is your concern here
> >> that
> >> those good clients that use the same principal as a rogue one will get
> >> throttled?
> >>
> >> If this is the case, then I believe it should be okay as other
> throttling
> >> in Kafka on *`/config/users/<user>`* has the same behaviour.
> >>
> >>
> >> What about applying limit/throttling to
> >> *`/config/users/<user>/clients/<client-id>`
> >> *similar to what we have with client quota? This should reduce the
> concern
> >> about throttling good clients, right?
> >>
> >> best,
> >>
> >> Omnia
> >>
> >> On Tue, Oct 11, 2022 at 4:18 AM Luke Chen <show...@gmail.com> wrote:
> >>
> >> > Bump this thread to see if there are any comments/thoughts.
> >> > Thanks.
> >> >
> >> > Luke
> >> >
> >> > On Mon, Sep 26, 2022 at 11:06 AM Luke Chen <show...@gmail.com> wrote:
> >> >
> >> > > Hi devs,
> >> > >
> >> > > As stated in the motivation section in KIP-854
> >> > > <
> >> >
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-854+Separate+configuration+for+producer+ID+expiry
> >> > >:
> >> > >
> >> > > With idempotent producers becoming the default in Kafka, this means
> >> that
> >> > > unless otherwise specified, all new producers will be given producer
> >> IDs.
> >> > > Some (inefficient) applications may now create many
> non-transactional
> >> > > idempotent producers. Each of these producers will be assigned a
> >> producer
> >> > > ID and these IDs and their metadata are stored in the broker memory,
> >> > which
> >> > > might cause brokers out of memory.
> >> > >
> >> > > Justine (in cc.) and I and some other team members are working on
> the
> >> > > solutions for this issue. But none of them solves it completely
> >> without
> >> > > side effects. Among them, "availability" VS "idempotency guarantees"
> >> is
> >> > > what we can't decide which to sacrifice. Some of these solutions
> >> > sacrifice
> >> > > availability of produce (1,2,5) and others sacrifice idempotency
> >> > guarantees
> >> > > (3). It could be useful to know if people generally have a
> preference
> >> one
> >> > > way or the other. Or what other better solutions there might be.
> >> > >
> >> > > Here are the proposals we came up with:
> >> > >
> >> > > 1. Limit the total active producer ID allocation number.
> >> > > -> This is the simplest solution. But since the OOM issue is usually
> >> > > caused by a rogue or misconfigured client, and this solution might
> >> > "punish"
> >> > > the good client from sending messages.
> >> > >
> >> > > 2. Throttling the producer ID allocation rate
> >> > > -> Same concern as the solution #1.
> >> > >
> >> > > 3. Having a limit to the number of active producer IDs (sort of like
> >> an
> >> > > LRU cache)
> >> > > -> The idea here is that if we hit a misconfigured client, we will
> >> expire
> >> > > the older entries. The concern here is we have risks to lose
> >> idempotency
> >> > > guarantees, and currently, we don't have a way to notify clients
> about
> >> > > losing idempotency guarantees. Besides, the least  recently used
> >> entries
> >> > > got removed are not always from the "bad" clients.
> >> > >
> >> > > 4. allow clients to "close" the producer ID usage
> >> > > -> We can provide a way for producer to "close" producerID usage.
> >> > > Currently, we only have a way to INIT_PRODUCER_ID requested to
> >> allocate
> >> > > one. After that, we'll keep the producer ID metadata in broker even
> if
> >> > the
> >> > > producer is "closed". Having a closed API (ex: END_PRODUCER_ID), we
> >> can
> >> > > remove the entry from broker side. In client side, we can send it
> when
> >> > > producer closing. The concern is, the old clients (including
> non-java
> >> > > clients) will still suffer from the OOM issue.
> >> > >
> >> > > 5. limit/throttling the producer id based on the principle
> >> > > -> Although we can limit the impact to a certain principle with this
> >> > idea,
> >> > > same concern still exists as solution #1 #2.
> >> > >
> >> > > Any thoughts/feedback are welcomed.
> >> > >
> >> > > Thank you.
> >> > > Luke
> >> > >
> >> >
> >>
> >
>

Re: [DISCUSS] solutions for broker OOM caused by many producer IDs

Reply via email to