Why does this have to be an external config? We can provide an internal mechanism to configure this, no?
Ismael On Wed, Aug 17, 2022 at 9:22 AM Justine Olshan <jols...@confluent.io.invalid> wrote: > Hey all, > Quick update to this KIP. While working on the PR and tests I realized that > we have a hardcoded value for how often we clean up producer IDs. Currently > this value is 10 minutes and is defined in LogManager. > I thought for better testing and ease of use of the new configuration, we > should also be able to configure the cleanup interval. > > Here is the new configuration I'm hoping to add. I also added it to the > KIP. > name: producer.id.expiration.check.interval.ms > description: The interval at which to remove producer IDs that have expired > due to producer.id.expiration.ms passing > default: 600000 (10 minutes) > valid values: [1,...] > priority: low > update-mode: read-only > > I left the default as the current hardcoded value to avoid disruptions. If > there are any issues with this change let me know. > Thanks, > Justine > > On Fri, Aug 5, 2022 at 1:40 PM Justine Olshan <jols...@confluent.io> > wrote: > > > Awesome. Thanks Tom! > > > > I plan to open this KIP vote at the start of next week. > > Thanks all for the discussion! Let me know if there is anything else. :) > > > > Justine > > > > On Wed, Aug 3, 2022 at 11:32 AM Tom Bentley <tbent...@redhat.com> wrote: > > > >> Hi Justine, > >> > >> That all seems reasonable to me, thanks! > >> > >> On Wed, 3 Aug 2022 at 19:14, Justine Olshan > <jols...@confluent.io.invalid > >> > > >> wrote: > >> > >> > Hi Tom and Ismael, > >> > > >> > 1. Yes, there are definitely many ways to improve this issue and I > plan > >> to > >> > write followup KIPs to address some of the larger changes. > >> > Just wanted to get this simple fix in as a short term measure to > prevent > >> > issues with too many producer IDs in the cache. Stay tuned :) > >> > > >> > 2. I did have some offline discussion about informing the client. I > >> think > >> > for this specific KIP the default behavior in practice should not > change > >> > enough to require this information to go back to the client. In other > >> > words, a reasonable configuration should not regress behavior. > However, > >> > with the further changes I mention in 1, perhaps this is something we > >> want > >> > to do. And yes -- unfortunately the current state of Kafka is no > longer > >> > totally consistent with KIP-98. This is something we probably want to > >> > clarify in the future. > >> > > >> > 3. I will update the config to mention it is not dynamic. I think > since > >> the > >> > transactional id configuration is read-only, this should be too. > >> > > >> > 4. I can update this wording. > >> > > >> > 5. I think there are definitely benefits to the name ` > >> > idempotent.pid.expiration.ms` but there are other ways this could > cause > >> > confusion. And to be clear -- the configuration can expire a producer > ID > >> > for a transactional producer as long as there isn't an ongoing > >> transaction. > >> > > >> > Let me know if you have any questions and thanks for taking a look! > >> > > >> > Justine > >> > > >> > On Wed, Aug 3, 2022 at 9:30 AM Ismael Juma <ism...@juma.me.uk> wrote: > >> > > >> > > Regarding 1, more can certainly be done, but I think it would be > >> > > complementary. As such, I think this KIP stands on its own and > >> additional > >> > > improvements can be handled via future KIPs (unless Justine wants to > >> > > combine things, of course). > >> > > > >> > > Ismael > >> > > > >> > > On Wed, Aug 3, 2022 at 9:12 AM Tom Bentley <tbent...@redhat.com> > >> wrote: > >> > > > >> > > > Hi Justine, > >> > > > > >> > > > Thanks for the KIP! I can see that this is a pragmatic attempt to > >> > > address a > >> > > > nasty problem. I have a few questions: > >> > > > > >> > > > 1. The KIP makes the problem significantly harder to trigger, but > >> > doesn't > >> > > > eliminate it entirely. How confident are you that it will be > >> sufficient > >> > > in > >> > > > practice? We can point to applications which are creating > idempotent > >> > > > producers at a high rate and say they're broken, but that doesn't > do > >> > > > anything to defend the broker from an interaction pattern that > >> differs > >> > > only > >> > > > in rate from a "good application". Did you consider a new quota to > >> > limit > >> > > > the rate at which a (principal, clientId) can allocate new PIDs? > >> > > > > >> > > > 2. The KIP contains this sentence: "when an idempotent producer’s > ID > >> > > > expires, it silently loses its idempotency guarantees." That's at > >> odds > >> > > with > >> > > > my reading of "PID expiration" in the KIP-98 design[1], but it > does > >> > seem > >> > > > consistent with a (brief!) look at the code. I accept that the > risk > >> > > should > >> > > > be minimal so long as the expiration time is > the producer's > >> delivery > >> > > > timeout, but it would still be nice if we could detect this > >> situation > >> > and > >> > > > return an error to the client. Is there a reason for the apparent > >> > > deviation > >> > > > from KIP-98 (or am I misreading the code?) > >> > > > > >> > > > 3. Could the KIP be explicit on whether the new config will be > >> > > dynamically > >> > > > changeable? > >> > > > > >> > > > 4. The description of producer.id.expiration.ms mentions the > >> > > > ProducerStateManager, which will mean nothing to a normal user. We > >> > could > >> > > > probably change it to "a topic partition leader" without loss of > >> > meaning. > >> > > > > >> > > > 5. The description also says "Producer IDs will not expire while a > >> > > > transaction associated to them is still ongoing." To me this > >> suggests > >> > > that > >> > > > a more intuitive name for this config (from the user PoV) would > >> include > >> > > > "idempotent", since it does not cover the transactional case. (I > >> would > >> > > > suggest "idempotent.pid.expiration.ms" (c.f. > >> > > > transactional.id.expiration.ms), > >> > > > but the distinction between "id" and "pid" is easily missed–even > if > >> > it's > >> > > > technically correct–so I'm not sure it's better than what you're > >> > > > proposing). I only mention it in case it prompts someone else to > >> find a > >> > > > better name. > >> > > > > >> > > > Thanks again, > >> > > > > >> > > > Tom > >> > > > > >> > > > [1]: > >> > > > > >> > > > > >> > > > >> > > >> > https://docs.google.com/document/d/11Jqy_GjUGtdXJK94XGsEIK7CP1SnQGdp2eF0wSw9ra8/edit#heading=h.loujdamc9ptj > >> > > > > >> > > > On Tue, 2 Aug 2022 at 22:00, Justine Olshan > >> > <jols...@confluent.io.invalid > >> > > > > >> > > > wrote: > >> > > > > >> > > > > I've updated the KIP to make the new minimum value 1 and remove > >> the > >> > -1 > >> > > > > configuration. > >> > > > > I've also added the low priority to the configuration and edited > >> the > >> > > > > description as Ismael mentioned. > >> > > > > > >> > > > > I'm thinking about bringing this KIP to a vote soon! Let me know > >> if > >> > > there > >> > > > > are any other comments or questions. > >> > > > > > >> > > > > Thanks, > >> > > > > Justine > >> > > > > > >> > > > > On Tue, Aug 2, 2022 at 9:02 AM Jason Gustafson > >> > > > <ja...@confluent.io.invalid > >> > > > > > > >> > > > > wrote: > >> > > > > > >> > > > > > I agree with Ismael that we should remove -1. I think we tend > to > >> > view > >> > > > the > >> > > > > > coupling of these behaviors into a single configuration as a > >> > mistake, > >> > > > so > >> > > > > > it's a little odd to keep it (even if in a weakened form). > >> > > > > > > >> > > > > > -Jason > >> > > > > > > >> > > > > > On Sat, Jul 30, 2022 at 7:37 AM Ismael Juma < > ism...@juma.me.uk> > >> > > wrote: > >> > > > > > > >> > > > > > > I would remove -1 altogether. Two more comments: > >> > > > > > > > >> > > > > > > 1. The current description kind of leads people towards > >> aligning > >> > > the > >> > > > > > config > >> > > > > > > with delivery.timeout.ms. Is that what we want? We could > say > >> it > >> > > > should > >> > > > > > be > >> > > > > > > higher than delivery.timeout.ms but indicate that the > >> default is > >> > > > > usually > >> > > > > > > fine. The main reason to reduce it would be to save memory, > I > >> > > guess. > >> > > > > > > 2. Each config has a priority, we should specify it for this > >> one. > >> > > I'm > >> > > > > > > assuming it will be "low". > >> > > > > > > > >> > > > > > > Ismael > >> > > > > > > > >> > > > > > > On Fri, Jul 29, 2022 at 2:38 PM Justine Olshan > >> > > > > > > <jols...@confluent.io.invalid> > >> > > > > > > wrote: > >> > > > > > > > >> > > > > > > > Hi all, > >> > > > > > > > > >> > > > > > > > I've updated the KIP to include the new default of 1 day > and > >> > > > > > information > >> > > > > > > > about -1 in the description of the config. > >> > > > > > > > I wonder though if including -1 makes sense now that it is > >> not > >> > > the > >> > > > > > > default > >> > > > > > > > value. Is there a benefit for manually setting -1 vs > >> manually > >> > > > setting > >> > > > > > the > >> > > > > > > > value that transactional.id.expiration.ms has? > >> > > > > > > > > >> > > > > > > > Let me know your thoughts. > >> > > > > > > > Thanks, > >> > > > > > > > Justine > >> > > > > > > > > >> > > > > > > > On Fri, Jul 29, 2022 at 10:54 AM Ismael Juma < > >> > ism...@juma.me.uk> > >> > > > > > wrote: > >> > > > > > > > > >> > > > > > > > > +1 for having 1 day as the default and for including > this > >> > > change > >> > > > in > >> > > > > > the > >> > > > > > > > > release notes. > >> > > > > > > > > > >> > > > > > > > > Ismael > >> > > > > > > > > > >> > > > > > > > > On Wed, Jul 27, 2022 at 9:16 AM Jason Gustafson > >> > > > > > > > <ja...@confluent.io.invalid > >> > > > > > > > > > > >> > > > > > > > > wrote: > >> > > > > > > > > > >> > > > > > > > > > I don't think a major release is a requirement for a > >> > default > >> > > > > change > >> > > > > > > for > >> > > > > > > > > > what it's worth. I do appreciate that there is a > >> preference > >> > > for > >> > > > > not > >> > > > > > > > > rocking > >> > > > > > > > > > the boat though. For a little bit of background here, > >> the > >> > > > problem > >> > > > > > we > >> > > > > > > > > > have encountered in production since the idempotent > >> > producer > >> > > > > became > >> > > > > > > the > >> > > > > > > > > > default is OOM errors due to huge numbers of > producerIds > >> > that > >> > > > had > >> > > > > > to > >> > > > > > > be > >> > > > > > > > > > retained in the cache for 7 days. It is hard to > prevent > >> use > >> > > > cases > >> > > > > > > from > >> > > > > > > > > > emerging where producers are used and discarded > >> rapidly. We > >> > > > will > >> > > > > be > >> > > > > > > > > using a > >> > > > > > > > > > lower value for sure, but it would also be nice to > >> reduce > >> > the > >> > > > > > > > likelihood > >> > > > > > > > > > for the community to see this problem. The benefit of > >> the > >> > > > caching > >> > > > > > > > > > diminishes quickly over time since it is primarily > >> meant to > >> > > > > handle > >> > > > > > > > > producer > >> > > > > > > > > > retry windows. I do not think there is much difference > >> > > between > >> > > > 1 > >> > > > > > days > >> > > > > > > > > and 7 > >> > > > > > > > > > days from an application perspective, but it is a huge > >> > > > difference > >> > > > > > for > >> > > > > > > > the > >> > > > > > > > > > broker's memory usage. > >> > > > > > > > > > > >> > > > > > > > > > Best, > >> > > > > > > > > > Jason > >> > > > > > > > > > > >> > > > > > > > > > On Wed, Jul 27, 2022 at 2:57 AM Sagar < > >> > > > sagarmeansoc...@gmail.com > >> > > > > > > >> > > > > > > > wrote: > >> > > > > > > > > > > >> > > > > > > > > > > Thanks Justine for the KIP. I think it might be > >> better to > >> > > > > > document > >> > > > > > > > the > >> > > > > > > > > > > correlation between the new config and > >> > delivery.timeout.ms > >> > > > in > >> > > > > > the > >> > > > > > > > > Public > >> > > > > > > > > > > Interfaces Description. > >> > > > > > > > > > > > >> > > > > > > > > > > Also, I agree with Luke that for now setting a > >> default to > >> > > -1 > >> > > > > > should > >> > > > > > > > be > >> > > > > > > > > > > good. We can look to switch to 1 day with major > >> release. > >> > > > > > > > > > > > >> > > > > > > > > > > Thanks! > >> > > > > > > > > > > Sagar. > >> > > > > > > > > > > > >> > > > > > > > > > > On Wed, Jul 27, 2022 at 9:05 AM Luke Chen < > >> > > show...@gmail.com > >> > > > > > >> > > > > > > wrote: > >> > > > > > > > > > > > >> > > > > > > > > > > > Hi Justine, > >> > > > > > > > > > > > > >> > > > > > > > > > > > Thanks for the KIP. > >> > > > > > > > > > > > I agree with you that we should try our best to > keep > >> > > > backward > >> > > > > > > > > > > > compatibility, although our intention is to have > >> lower > >> > > > > producer > >> > > > > > > id > >> > > > > > > > > > > > expiration timeout. > >> > > > > > > > > > > > So, I think we should keep default to -1 IMO. > >> > > > > > > > > > > > Maybe we change the default to 1 day in next major > >> > > release > >> > > > > > (4.0)? > >> > > > > > > > > > > > > >> > > > > > > > > > > > Thank you. > >> > > > > > > > > > > > Luke > >> > > > > > > > > > > > > >> > > > > > > > > > > > On Wed, Jul 27, 2022 at 7:13 AM Justine Olshan > >> > > > > > > > > > > > <jols...@confluent.io.invalid> > >> > > > > > > > > > > > wrote: > >> > > > > > > > > > > > > >> > > > > > > > > > > > > Thanks for taking a look Jason! > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > I wondered if we wanted to have a smaller > default > >> but > >> > > > > wasn't > >> > > > > > > sure > >> > > > > > > > > > about > >> > > > > > > > > > > > the > >> > > > > > > > > > > > > compatibility story -- especially since there is > >> the > >> > > > chance > >> > > > > > for > >> > > > > > > > > > > producer > >> > > > > > > > > > > > > IDs to expire silently. > >> > > > > > > > > > > > > I do think that 1 day is fairly reasonable. If I > >> > don't > >> > > > hear > >> > > > > > any > >> > > > > > > > > > > > conflicting > >> > > > > > > > > > > > > opinions I can go ahead and update the default. > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > Justine > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > On Tue, Jul 26, 2022 at 12:23 PM Jason Gustafson > >> > > > > > > > > > > > > <ja...@confluent.io.invalid> > >> > > > > > > > > > > > > wrote: > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > Hi Justine, > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > Thanks for the KIP. Although I hate seeing new > >> > > > > > > configurations, > >> > > > > > > > I > >> > > > > > > > > > > think > >> > > > > > > > > > > > > this > >> > > > > > > > > > > > > > is a good change. Combining these timeout > >> behaviors > >> > > > into > >> > > > > a > >> > > > > > > > single > >> > > > > > > > > > > > > > configuration was definitely a mistake, but we > >> > didn't > >> > > > > > > > anticipate > >> > > > > > > > > > the > >> > > > > > > > > > > > > > problem with the producer id cache. I do > wonder > >> if > >> > we > >> > > > can > >> > > > > > > make > >> > > > > > > > > the > >> > > > > > > > > > > > > default > >> > > > > > > > > > > > > > a bit lower to reduce the chances that users > >> will > >> > hit > >> > > > the > >> > > > > > > same > >> > > > > > > > > > memory > >> > > > > > > > > > > > > > issues we have seen. After decoupling this > >> > > > configuration > >> > > > > > from > >> > > > > > > > > > > > > > transactional.id.expiration.ms, the new > timeout > >> > just > >> > > > > needs > >> > > > > > > to > >> > > > > > > > > > cover > >> > > > > > > > > > > > the > >> > > > > > > > > > > > > > longest duration that a producer might be > >> retrying > >> > > the > >> > > > > same > >> > > > > > > > > Produce > >> > > > > > > > > > > > > > request. 7 days seems too high. Although I > >> think it > >> > > > could > >> > > > > > go > >> > > > > > > a > >> > > > > > > > > fair > >> > > > > > > > > > > > even > >> > > > > > > > > > > > > > lower, perhaps 1 day is a reasonable place to > >> > start? > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > Thanks, > >> > > > > > > > > > > > > > Jason > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > On Mon, Jul 25, 2022 at 9:25 AM Justine Olshan > >> > > > > > > > > > > > > > <jols...@confluent.io.invalid> > >> > > > > > > > > > > > > > wrote: > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Hey Bill, > >> > > > > > > > > > > > > > > Thanks! I was just going to say that > hopefully > >> > > > > > > > > > > > > > > transactional.id.expiration.ms would also > be > >> > over > >> > > > the > >> > > > > > > > delivery > >> > > > > > > > > > > > > timeout. > >> > > > > > > > > > > > > > :) > >> > > > > > > > > > > > > > > Thanks for the +1! > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Justine > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > On Mon, Jul 25, 2022 at 9:17 AM Bill Bejeck > < > >> > > > > > > > bbej...@gmail.com > >> > > > > > > > > > > >> > > > > > > > > > > > wrote: > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > Hi Justine, > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > I just took another look at the KIP, and I > >> > > realize > >> > > > my > >> > > > > > > > > > > > > > question/suggestion > >> > > > > > > > > > > > > > > > about default values has already been > >> addressed > >> > > in > >> > > > > the > >> > > > > > > > > > > > > `Compatibility` > >> > > > > > > > > > > > > > > > section. > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > I'm +1 on the KIP. > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > -Bill > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > On Thu, Jul 21, 2022 at 6:20 PM Bill > Bejeck > >> < > >> > > > > > > > > bbej...@gmail.com > >> > > > > > > > > > > > >> > > > > > > > > > > > > wrote: > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > Hi Justine, > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > Thanks for the well written KIP, this > >> looks > >> > > like > >> > > > it > >> > > > > > > will > >> > > > > > > > > be a > >> > > > > > > > > > > > > useful > >> > > > > > > > > > > > > > > > > addition. > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > Overall the KIP looks good to me, I have > >> one > >> > > > > > > > > > question/comment. > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > You mentioned that setting the ` > >> > > > > > > > producer.id.expiration.ms` > >> > > > > > > > > > > less > >> > > > > > > > > > > > > than > >> > > > > > > > > > > > > > > the > >> > > > > > > > > > > > > > > > > delivery timeout could lead to > duplicates, > >> > > which > >> > > > > > makes > >> > > > > > > > > sense. > >> > > > > > > > > > > To > >> > > > > > > > > > > > > > help > >> > > > > > > > > > > > > > > > > avoid this situation, do we want to > >> consider > >> > a > >> > > > > > default > >> > > > > > > > > value > >> > > > > > > > > > > that > >> > > > > > > > > > > > > is > >> > > > > > > > > > > > > > > the > >> > > > > > > > > > > > > > > > > same as the delivery timeout? > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > Thanks again for the KIP. > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > Bill > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > On Thu, Jul 21, 2022 at 4:54 PM Justine > >> > Olshan > >> > > > > > > > > > > > > > > > > <jols...@confluent.io.invalid> wrote: > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> Hey all! > >> > > > > > > > > > > > > > > > >> > >> > > > > > > > > > > > > > > > >> I'd like to start a discussion on my > >> > proposal > >> > > to > >> > > > > > > > separate > >> > > > > > > > > > > > > time-based > >> > > > > > > > > > > > > > > > >> producer ID expiration from > >> transactional ID > >> > > > > > > expiration > >> > > > > > > > by > >> > > > > > > > > > > > > > > introducing a > >> > > > > > > > > > > > > > > > >> new configuration. > >> > > > > > > > > > > > > > > > >> > >> > > > > > > > > > > > > > > > >> The KIP Is pretty small and simple, but > >> will > >> > > be > >> > > > > > > helpful > >> > > > > > > > in > >> > > > > > > > > > > > > > controlling > >> > > > > > > > > > > > > > > > >> memory usage in brokers -- especially > now > >> > that > >> > > > by > >> > > > > > > > default > >> > > > > > > > > > > > > producers > >> > > > > > > > > > > > > > > are > >> > > > > > > > > > > > > > > > >> idempotent and create producer ID > state. > >> > > > > > > > > > > > > > > > >> > >> > > > > > > > > > > > > > > > >> Please take a look and leave any > comments > >> > you > >> > > > may > >> > > > > > > have! > >> > > > > > > > > > > > > > > > >> > >> > > > > > > > > > > > > > > > >> KIP: > >> > > > > > > > > > > > > > > > >> > >> > > > > > > > > > > > > > > > >> > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-854+Separate+configuration+for+producer+ID+expiry > >> > > > > > > > > > > > > > > > >> JIRA: > >> > > > > > > https://issues.apache.org/jira/browse/KAFKA-14097 > >> > > > > > > > > > > > > > > > >> > >> > > > > > > > > > > > > > > > >> Thanks! > >> > > > > > > > > > > > > > > > >> Justine > >> > > > > > > > > > > > > > > > >> > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > > >