Hi Claude
Thanks for raising this KIP. It is an interesting idea. I had a quick look at
the KIP and have a few notes:
10. 
> The issue is that the number of PIDs that need to be tracked has exploded and 
> has resulted in OOM failures that cause the entire cluster to crash.  There 
> are multiple efforts underway to mitigate the OOM problem through cache 
> cleanup and throttling of clients.

I think we should clarify here that this has only happened when the cluster has
an abusive or misconfigured client that initialises too many PIDs. For example,
I have seen this issue three times:
1. The first was an application that kept re-initializing its producer on
every error it received from Kafka instead of retrying or skipping the
records. This one took longer to fill the memory, about 24 hours (this was
before the 24hr expiration), but eventually it did.
2. The second was a producer deployment stuck in a crash loop, which created
more than 500,000 PIDs in a few hours. A misconfiguration caused the
application to crash after sending the first batch.
3. The third was a producer initialising a new PID for each record, which is
an anti-pattern and led to the creation of 1M PIDs in a few hours.
So technically this OOM has only happened when we had a small number of
misconfigured producers or an anti-pattern design.
11. It may also be worth pointing out KIP-936 here, as throttling is the other
option we are weighing this KIP against.
12. I feel the motivation isn’t clear enough for people who aren’t familiar
with this OOM issue, especially since not many people have experienced it.
13. I still feel we are doing all of this only because of a few anti-pattern or
misconfigured producers, and not because we have “too many producers”. I believe
that implementing a producer heartbeat, and removing short-lived PIDs from the
cache when we don’t receive a heartbeat, would be simpler and a step in the
right direction for improving the idempotence logic. We could then try to make
PIDs reusable between sessions, which would give us a real idempotent producer
instead of an idempotent session. I admit this wouldn’t help with old clients,
but it would put us on the right path.
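
To make the heartbeat idea in point 13 more concrete, here is a minimal sketch
of a PID cache that evicts entries once heartbeats stop arriving. It is only an
illustration under my own assumptions: the class HeartbeatPidCache, its methods
and the timeout parameter are hypothetical names, not existing Kafka broker
code, and a real implementation would also have to handle sequence state and
old clients that never send heartbeats.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only: this is not Kafka's producer state handling.
class HeartbeatPidCache {
    // Assumed setting: how long a PID may stay silent before it is evicted.
    private final long heartbeatTimeoutMs;
    // Producer id -> timestamp (ms) of the last heartbeat seen for it.
    private final Map<Long, Long> lastSeenMs = new ConcurrentHashMap<>();

    HeartbeatPidCache(long heartbeatTimeoutMs) {
        this.heartbeatTimeoutMs = heartbeatTimeoutMs;
    }

    // Record a heartbeat for a PID; this also registers a PID seen for the first time.
    void onHeartbeat(long producerId, long nowMs) {
        lastSeenMs.put(producerId, nowMs);
    }

    // Remove every PID whose last heartbeat is older than the timeout.
    // Returns the number of PIDs evicted.
    int evictExpired(long nowMs) {
        int evicted = 0;
        Iterator<Map.Entry<Long, Long>> it = lastSeenMs.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMs - it.next().getValue() > heartbeatTimeoutMs) {
                it.remove();
                evicted++;
            }
        }
        return evicted;
    }

    boolean contains(long producerId) {
        return lastSeenMs.containsKey(producerId);
    }

    int size() {
        return lastSeenMs.size();
    }
}
```

With something like this, the PIDs left behind by a crash-looping producer
would age out after one heartbeat timeout, while a healthy long-lived producer
that keeps sending heartbeats stays in the cache.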


Omnia



> On 15 May 2024, at 14:35, Claude Warren <cla...@apache.org> wrote:
> 
> This is a proposal that should solve the OOM problem on the servers without
> some of the other proposed KIPs being active.
> 
> Full details in
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1044%3A+A+proposal+to+change+idempotent+producer+--+server+implementation
