Hi Justine,

Thanks for the KIP! I can see that this is a pragmatic attempt to address a
nasty problem. I have a few questions:

1. The KIP makes the problem significantly harder to trigger, but doesn't
eliminate it entirely. How confident are you that it will be sufficient in
practice? We can point to applications which are creating idempotent
producers at a high rate and say they're broken, but that doesn't do
anything to defend the broker from an interaction pattern that differs only
in rate from a "good application". Did you consider a new quota to limit
the rate at which a (principal, clientId) can allocate new PIDs?

2. The KIP contains this sentence: "when an idempotent producer’s ID
expires, it silently loses its idempotency guarantees." That's at odds with
my reading of "PID expiration" in the KIP-98 design[1], but it does seem
consistent with a (brief!) look at the code. I accept that the risk should
be minimal so long as the expiration time is > the producer's delivery
timeout, but it would still be nice if we could detect this situation and
return an error to the client. Is there a reason for the apparent deviation
from KIP-98 (or am I misreading the code?)

3. Could the KIP be explicit on whether the new config will be dynamically
changeable?

4. The description of producer.id.expiration.ms mentions the
ProducerStateManager, which will mean nothing to a normal user. We could
probably change it to "a topic partition leader" without loss of meaning.

5. The description also says "Producer IDs will not expire while a
transaction associated to them is still ongoing." To me this suggests that
a more intuitive name for this config (from the user PoV) would include
"idempotent", since it does not cover the transactional case. (I would
suggest "idempotent.pid.expiration.ms" (c.f. transactional.id.expiration.ms),
but the distinction between "id" and "pid" is easily missed–even if it's
technically correct–so I'm not sure it's better than what you're
proposing). I only mention it in case it prompts someone else to find a
better name.

Thanks again,

Tom

[1]:
https://docs.google.com/document/d/11Jqy_GjUGtdXJK94XGsEIK7CP1SnQGdp2eF0wSw9ra8/edit#heading=h.loujdamc9ptj

On Tue, 2 Aug 2022 at 22:00, Justine Olshan <jols...@confluent.io.invalid>
wrote:

> I've updated the KIP to make the new minimum value 1 and remove the -1
> configuration.
> I've also added the low priority to the configuration and edited the
> description as Ismael mentioned.
>
> I'm thinking about bringing this KIP to a vote soon! Let me know if there
> are any other comments or questions.
>
> Thanks,
> Justine
>
> On Tue, Aug 2, 2022 at 9:02 AM Jason Gustafson <ja...@confluent.io.invalid
> >
> wrote:
>
> > I agree with Ismael that we should remove -1. I think we tend to view the
> > coupling of these behaviors into a single configuration as a mistake, so
> > it's a little odd to keep it (even if in a weakened form).
> >
> > -Jason
> >
> > On Sat, Jul 30, 2022 at 7:37 AM Ismael Juma <ism...@juma.me.uk> wrote:
> >
> > > I would remove -1 altogether. Two more comments:
> > >
> > > 1. The current description kind of leads people towards aligning the
> > config
> > > with delivery.timeout.ms. Is that what we want? We could say it should
> > be
> > > higher than delivery.timeout.ms but indicate that the default is
> usually
> > > fine. The main reason to reduce it would be to save memory, I guess.
> > > 2. Each config has a priority, we should specify it for this one. I'm
> > > assuming it will be "low".
> > >
> > > Ismael
> > >
> > > On Fri, Jul 29, 2022 at 2:38 PM Justine Olshan
> > > <jols...@confluent.io.invalid>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I've updated the KIP to include the new default of 1 day and
> > information
> > > > about -1 in the description of the config.
> > > > I wonder though if including -1 makes sense now that it is not the
> > > default
> > > > value. Is there a benefit for manually setting -1 vs manually setting
> > the
> > > > value that transactional.id.expiration.ms has?
> > > >
> > > > Let me know your thoughts.
> > > > Thanks,
> > > > Justine
> > > >
> > > > On Fri, Jul 29, 2022 at 10:54 AM Ismael Juma <ism...@juma.me.uk>
> > wrote:
> > > >
> > > > > +1 for having 1 day as the default and for including this change in
> > the
> > > > > release notes.
> > > > >
> > > > > Ismael
> > > > >
> > > > > On Wed, Jul 27, 2022 at 9:16 AM Jason Gustafson
> > > > <ja...@confluent.io.invalid
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > I don't think a major release is a requirement for a default
> change
> > > for
> > > > > > what it's worth. I do appreciate that there is a preference for
> not
> > > > > rocking
> > > > > > the boat though. For a little bit of background here, the problem
> > we
> > > > > > have encountered in production since the idempotent producer
> became
> > > the
> > > > > > default is OOM errors due to huge numbers of producerIds that had
> > to
> > > be
> > > > > > retained in the cache for 7 days. It is hard to prevent use cases
> > > from
> > > > > > emerging where producers are used and discarded rapidly. We will
> be
> > > > > using a
> > > > > > lower value for sure, but it would also be nice to reduce the
> > > > likelihood
> > > > > > for the community to see this problem. The benefit of the caching
> > > > > > diminishes quickly over time since it is primarily meant to
> handle
> > > > > producer
> > > > > > retry windows. I do not think there is much difference between 1
> > days
> > > > > and 7
> > > > > > days from an application perspective, but it is a huge difference
> > for
> > > > the
> > > > > > broker's memory usage.
> > > > > >
> > > > > > Best,
> > > > > > Jason
> > > > > >
> > > > > > On Wed, Jul 27, 2022 at 2:57 AM Sagar <sagarmeansoc...@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > > Thanks Justine for the KIP. I think it might be better to
> > document
> > > > the
> > > > > > > correlation between the new config and delivery.timeout.ms in
> > the
> > > > > Public
> > > > > > > Interfaces Description.
> > > > > > >
> > > > > > > Also, I agree with Luke that for now setting a default to -1
> > should
> > > > be
> > > > > > > good. We can look to switch to 1 day with major release.
> > > > > > >
> > > > > > > Thanks!
> > > > > > > Sagar.
> > > > > > >
> > > > > > > On Wed, Jul 27, 2022 at 9:05 AM Luke Chen <show...@gmail.com>
> > > wrote:
> > > > > > >
> > > > > > > > Hi Justine,
> > > > > > > >
> > > > > > > > Thanks for the KIP.
> > > > > > > > I agree with you that we should try our best to keep backward
> > > > > > > > compatibility, although our intention is to have lower
> producer
> > > id
> > > > > > > > expiration timeout.
> > > > > > > > So, I think we should keep default to -1 IMO.
> > > > > > > > Maybe we change the default to 1 day in next major release
> > (4.0)?
> > > > > > > >
> > > > > > > > Thank you.
> > > > > > > > Luke
> > > > > > > >
> > > > > > > > On Wed, Jul 27, 2022 at 7:13 AM Justine Olshan
> > > > > > > > <jols...@confluent.io.invalid>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks for taking a look Jason!
> > > > > > > > >
> > > > > > > > > I wondered if we wanted to have a smaller default but
> wasn't
> > > sure
> > > > > > about
> > > > > > > > the
> > > > > > > > > compatibility story -- especially since there is the chance
> > for
> > > > > > > producer
> > > > > > > > > IDs to expire silently.
> > > > > > > > > I do think that 1 day is fairly reasonable. If I don't hear
> > any
> > > > > > > > conflicting
> > > > > > > > > opinions I can go ahead and update the default.
> > > > > > > > >
> > > > > > > > > Justine
> > > > > > > > >
> > > > > > > > > On Tue, Jul 26, 2022 at 12:23 PM Jason Gustafson
> > > > > > > > > <ja...@confluent.io.invalid>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Justine,
> > > > > > > > > >
> > > > > > > > > > Thanks for the KIP. Although I hate seeing new
> > > configurations,
> > > > I
> > > > > > > think
> > > > > > > > > this
> > > > > > > > > > is a good change. Combining these timeout behaviors into
> a
> > > > single
> > > > > > > > > > configuration was definitely a mistake, but we didn't
> > > > anticipate
> > > > > > the
> > > > > > > > > > problem with the producer id cache. I do wonder if we can
> > > make
> > > > > the
> > > > > > > > > default
> > > > > > > > > > a bit lower to reduce the chances that users will hit the
> > > same
> > > > > > memory
> > > > > > > > > > issues we have seen. After decoupling this configuration
> > from
> > > > > > > > > > transactional.id.expiration.ms, the new timeout just
> needs
> > > to
> > > > > > cover
> > > > > > > > the
> > > > > > > > > > longest duration that a producer might be retrying the
> same
> > > > > Produce
> > > > > > > > > > request. 7 days seems too high. Although I think it could
> > go
> > > a
> > > > > fair
> > > > > > > > even
> > > > > > > > > > lower, perhaps 1 day is a reasonable place to start?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Jason
> > > > > > > > > >
> > > > > > > > > > On Mon, Jul 25, 2022 at 9:25 AM Justine Olshan
> > > > > > > > > > <jols...@confluent.io.invalid>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey Bill,
> > > > > > > > > > > Thanks! I was just going to say that hopefully
> > > > > > > > > > > transactional.id.expiration.ms would also be over the
> > > > delivery
> > > > > > > > > timeout.
> > > > > > > > > > :)
> > > > > > > > > > > Thanks for the +1!
> > > > > > > > > > >
> > > > > > > > > > > Justine
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Jul 25, 2022 at 9:17 AM Bill Bejeck <
> > > > bbej...@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Justine,
> > > > > > > > > > > >
> > > > > > > > > > > > I just took another look at the KIP, and I realize my
> > > > > > > > > > question/suggestion
> > > > > > > > > > > > about default values has already been addressed in
> the
> > > > > > > > > `Compatibility`
> > > > > > > > > > > > section.
> > > > > > > > > > > >
> > > > > > > > > > > > I'm +1 on the KIP.
> > > > > > > > > > > >
> > > > > > > > > > > > -Bill
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Jul 21, 2022 at 6:20 PM Bill Bejeck <
> > > > > bbej...@gmail.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Justine,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the well written KIP, this looks like it
> > > will
> > > > > be a
> > > > > > > > > useful
> > > > > > > > > > > > > addition.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Overall the KIP looks good to me, I have one
> > > > > > question/comment.
> > > > > > > > > > > > >
> > > > > > > > > > > > > You mentioned that setting the `
> > > > producer.id.expiration.ms`
> > > > > > > less
> > > > > > > > > than
> > > > > > > > > > > the
> > > > > > > > > > > > > delivery timeout could lead to duplicates, which
> > makes
> > > > > sense.
> > > > > > > To
> > > > > > > > > > help
> > > > > > > > > > > > > avoid this situation, do we want to consider a
> > default
> > > > > value
> > > > > > > that
> > > > > > > > > is
> > > > > > > > > > > the
> > > > > > > > > > > > > same as the delivery timeout?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks again for the KIP.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Bill
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Jul 21, 2022 at 4:54 PM Justine Olshan
> > > > > > > > > > > > > <jols...@confluent.io.invalid> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >> Hey all!
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> I'd like to start a discussion on my proposal to
> > > > separate
> > > > > > > > > time-based
> > > > > > > > > > > > >> producer ID expiration from transactional ID
> > > expiration
> > > > by
> > > > > > > > > > > introducing a
> > > > > > > > > > > > >> new configuration.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> The KIP Is pretty small and simple, but will be
> > > helpful
> > > > in
> > > > > > > > > > controlling
> > > > > > > > > > > > >> memory usage in brokers -- especially now that by
> > > > default
> > > > > > > > > producers
> > > > > > > > > > > are
> > > > > > > > > > > > >> idempotent and create producer ID state.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> Please take a look and leave any comments you may
> > > have!
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> KIP:
> > > > > > > > > > > > >>
> > > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-854+Separate+configuration+for+producer+ID+expiry
> > > > > > > > > > > > >> JIRA:
> > > https://issues.apache.org/jira/browse/KAFKA-14097
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> Thanks!
> > > > > > > > > > > > >> Justine
> > > > > > > > > > > > >>
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to