I agree with Justine, especially considering that the number of producers on a Kafka cluster is usually very limited. This makes me think that both my KIP-936 and this KIP focus on the symptom, which is the memory issue, instead of addressing the root cause: the idempotent session protocol tolerates short-lived session metadata (PID and state) crowding the cluster.
So, similar to Justine, I am more in favour of considering a new protocol that offers true idempotent producer capabilities across sessions and that can also reliably identify the producer client. However, if we can address the memory issue in the meantime with a simple solution that protects against the anti-patterns/misconfigured use-cases of the idempotent session protocol, this would be a win until we come up with a new protocol. So far, both KIP-936 and this KIP propose somewhat complicated solutions to this symptom. Maybe we should divide the focus into:

• Finding a simple way to protect against OOM caused by short-lived idempotent sessions. This might be a bit complicated, as identifying short-lived producers is tricky with the current protocol. For instance, we could revisit the rejected alternative in KIP-936 to throttle INIT_PID requests (a rough sketch follows below). It is not perfect, but it is the simplest option.
• Developing Idempotent Protocol V2, which addresses this issue as well as the client-side issues with idempotent producers.
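To make the throttling idea concrete, here is a minimal sketch of a per-principal token bucket that a broker could consult before allocating a new PID for an INIT_PID request. The class name, the keying by client principal, and the rate parameters are all illustrative assumptions on my part, not existing Kafka APIs or configs:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch, not an existing Kafka class: rate-limits how
    // often each client principal may allocate a new producer ID.
    public class InitPidThrottler {

        private static final class Bucket {
            double tokens;
            long lastRefillNanos;
            Bucket(double tokens, long now) {
                this.tokens = tokens;
                this.lastRefillNanos = now;
            }
        }

        private final double capacity;        // max burst of new PIDs per principal
        private final double refillPerSecond; // sustained rate of new PIDs per principal
        private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

        public InitPidThrottler(double capacity, double refillPerSecond) {
            this.capacity = capacity;
            this.refillPerSecond = refillPerSecond;
        }

        // Returns true if the principal may allocate a new PID now;
        // false means the broker should delay or reject the request.
        public boolean tryAcquire(String principal) {
            long now = System.nanoTime();
            Bucket b = buckets.computeIfAbsent(principal, p -> new Bucket(capacity, now));
            synchronized (b) {
                double elapsed = (now - b.lastRefillNanos) / 1_000_000_000.0;
                b.tokens = Math.min(capacity, b.tokens + elapsed * refillPerSecond);
                b.lastRefillNanos = now;
                if (b.tokens >= 1.0) {
                    b.tokens -= 1.0;
                    return true;
                }
                return false;
            }
        }
    }

The point is that a misbehaving producer that requests a new PID per message drains its bucket almost immediately and gets throttled, while well-behaved producers never notice, so PID growth is capped per principal without having to guess which sessions are short-lived.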
@Justine, are you aware of anyone looking into such a protocol in detail at the moment?

Omnia

> On 17 May 2024, at 16:26, Justine Olshan <jols...@confluent.io.INVALID> wrote:
> 
> Respectfully, I don't agree. Why should we persist useless information
> for clients that are long gone and will never use it?
> This is why I'm suggesting we do something smarter when it comes to storing
> data: only store data we actually need and have a use for.
> 
> This is why I suggest the heartbeat. It gives us clear information (up to
> the heartbeat interval) about which producers are worth keeping and which
> are not.
> I'm not in favor of building a new and complicated system to try to guess
> which information is needed. In my mind, if we have a ton of legitimately
> active producers, we should scale up memory. If we don't, there is no
> reason to have high memory usage.
> 
> Fixing the client also allows us to fix some of the other issues we have
> with idempotent producers.
> 
> Justine
> 
> On Fri, May 17, 2024 at 12:46 AM Claude Warren <cla...@xenei.com> wrote:
> 
>> I think that the point here is that a design which assumes you can keep
>> all the PIDs in memory for all server configurations, all usages, and
>> all client implementations is fraught with danger.
>> 
>> Yes, there are solutions already in place (KIP-854) that attempt to
>> address this problem, and other proposed solutions have undesirable side
>> effects (e.g. a heartbeat interrupted by an IP failure for a slow
>> producer with a long delay between posts). KAFKA-16229 (Slow expiration
>> of Producer IDs leading to high CPU usage) dealt with how to expire data
>> from the cache with minimal lag time.
>> 
>> But the net issue is still the underlying design/architecture.
>> 
>> There are a couple of salient points here:
>> 
>>    - The state of a state machine is only a view on its transactions.
>>    This is the classic stream/table dichotomy.
>>    - What the "cache" is trying to do is create that view.
>>    - In some cases the size of the state exceeds the storage of the
>>    cache and the system fails.
>>    - The current solutions have attempted to place limits on the size of
>>    the state.
>>    - Errors in implementation and/or configuration will eventually lead
>>    to "problem producers".
>>    - Under the adopted fixes and the current slate of proposals, the
>>    solutions for "problem producers" have cascading side effects on
>>    properly behaved producers (e.g. dropping long-running,
>>    slow-producing producers).
>> 
>> For decades (at least since the 1980s, and anecdotally since the 1960s)
>> there has been a solution for processing state whose size exceeds the
>> available memory. It is the solution that drove the idea that you could
>> have tables in Kafka. The idea that we can store the hot PIDs in memory
>> using an LRU and write data to storage so that we can quickly find
>> things not in the cache is not new. It has been proven.
>> 
>> I am arguing that we should not throw away state data because we are
>> running out of memory. We should persist that data to disk and treat the
>> disk as the source of truth for state.
>> 
>> Claude
>> 
>> 
>> On Wed, May 15, 2024 at 7:42 PM Justine Olshan
>> <jols...@confluent.io.invalid>
>> wrote:
>> 
>>> +1 to the comment.
>>> 
>>>> I still feel we are doing all of this only because of a few
>>>> anti-pattern or misconfigured producers, and not because we have "too
>>>> many producers". I believe that implementing a producer heartbeat and
>>>> removing short-lived PIDs from the cache if we don't receive a
>>>> heartbeat would be simpler and a step in the right direction to
>>>> improve the idempotent logic, and maybe we could try to make PIDs get
>>>> reused between sessions, which would give us a truly idempotent
>>>> producer instead of an idempotent session. I admit this wouldn't help
>>>> with old clients, but it would put us on the right path.
>>> 
>>> This issue is very complicated and I appreciate the attention on it.
>>> Hopefully we can find a good solution working together :)
>>> 
>>> Justine
>>> 
>>> On Wed, May 15, 2024 at 8:36 AM Omnia Ibrahim <o.g.h.ibra...@gmail.com>
>>> wrote:
>>> 
>>>> Also, in the rejected alternatives you listed an approved KIP, which
>>>> is a bit confusing; can you move it to the motivation section instead?
>>>> 
>>>>> On 15 May 2024, at 14:35, Claude Warren <cla...@apache.org> wrote:
>>>>> 
>>>>> This is a proposal that should solve the OOM problem on the servers
>>>>> without some of the other proposed KIPs being active.
>>>>> 
>>>>> Full details in
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1044%3A+A+proposal+to+change+idempotent+producer+--+server+implementation
>>>> 
>>> 
>> 
>> 
>> -- 
>> LinkedIn: http://www.linkedin.com/in/claudewarren