I agree with Justine, especially considering that the number of producers on a 
Kafka cluster is usually very limited. 
This makes me think that both my KIP-936 and this KIP focus on the symptom, 
which is the memory issue, instead of addressing the root cause.
The root cause is that the idempotent session protocol allows 
short-lived session metadata (PID and state) to crowd the cluster.

So, similar to Justine, I am more in favour of considering a new protocol that 
offers true idempotent producer capabilities across sessions, that can also 
truly identify the producer client.
However, if we can address the memory issue in the meantime with a simple 
solution that protects against the anti-patterns and misconfigured use cases 
of the idempotent session protocol, this would be a win until we come up with 
a new protocol.
So far, both KIP-936 and this KIP propose somewhat complicated solutions to 
this symptom.

Maybe we should divide the focus into:
    • Finding a simple way to protect against OOM caused by short-lived 
      idempotent sessions. This might be a bit complicated, as identifying 
      short-lived producers is tricky with the current protocol. 
      For instance, we could revisit the rejected alternative in KIP-936 to 
      throttle INIT_PID requests. It is not perfect, but it is the simplest.
    • Developing Idempotent Protocol V2 that addresses this issue and the 
      client-side issues with idempotent producers. @Justine, are you aware of 
      anyone looking into such a protocol in detail at the moment?
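
To make the throttling idea in the first bullet a bit more concrete, here is a 
rough sketch in Java. It is only an illustration under my own assumptions: a 
fixed-window limit keyed by principal, with the clock passed in by the caller. 
The class InitPidThrottle, its parameters, and the choice to key on the 
principal are hypothetical and are not taken from KIP-936 or the Kafka codebase.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch, not part of Kafka: limits INIT_PID requests per principal. */
class InitPidThrottle {
    private final int maxPerWindow;   // assumed limit of requests per window
    private final long windowMs;      // assumed window length in milliseconds
    // principal -> {windowStartMs, requestsSeenInWindow}
    private final Map<String, long[]> windows = new HashMap<>();

    InitPidThrottle(int maxPerWindow, long windowMs) {
        this.maxPerWindow = maxPerWindow;
        this.windowMs = windowMs;
    }

    /** Returns true if the request is admitted, false if it should be throttled. */
    synchronized boolean tryAcquire(String principal, long nowMs) {
        long[] w = windows.computeIfAbsent(principal, p -> new long[] {nowMs, 0});
        if (nowMs - w[0] >= windowMs) {
            // The previous window has elapsed: start a new one for this principal.
            w[0] = nowMs;
            w[1] = 0;
        }
        if (w[1] >= maxPerWindow) {
            return false;
        }
        w[1]++;
        return true;
    }
}
```

A well-behaved producer calls INIT_PID once per session and would never hit 
such a limit, while a client that creates a new producer per message would be 
slowed down. It does not bound memory by itself, which is why it is not perfect.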

Omnia

> On 17 May 2024, at 16:26, Justine Olshan <jols...@confluent.io.INVALID> wrote:
> 
> Respectfully, I don't agree. Why should we persist useless information
> for clients that are long gone and will never use it?
> This is why I'm suggesting we do something smarter when it comes to storing
> data and only store data we actually need and have a use for.
> 
> This is why I suggest the heartbeat. It gives us clear information (up to
> the heartbeat interval) of which producers are worth keeping and which
> are not.
> I'm not in favor of building a new and complicated system to try to guess
> which information is needed. In my mind, if we have a ton of legitimately
> active producers, we should scale up memory. If we don't, there is no reason
> to have high memory usage.
> 
> Fixing the client also allows us to fix some of the other issues we have
> with idempotent producers.
> 
> Justine
> 
> On Fri, May 17, 2024 at 12:46 AM Claude Warren <cla...@xenei.com> wrote:
> 
>> I think that the point here is that the design that assumes that you can
>> keep all the PIDs in memory for all server configurations and all usages
>> and all client implementations is fraught with danger.
>> 
>> Yes, there are solutions already in place (KIP-854) that attempt to address
>> this problem, and other proposed solutions for removing PIDs have undesirable
>> side effects (e.g. Heartbeat interrupted by IP failure for a slow producer
>> with a long delay between posts).  KAFKA-16229 (Slow expiration of Producer
>> IDs leading to high CPU usage) dealt with how to expire data from the cache
>> so that there was minimal lag time.
>> 
>> But the net issue is still the underlying design/architecture.
>> 
>> There are a couple of salient points here:
>> 
>>   - The state of a state machine is only a view on its transactions.  This
>>   is the classic stream / table dichotomy.
>>   - What the "cache" is trying to do is create that view.
>>   - In some cases the size of the state exceeds the storage of the cache
>>   and the systems fail.
>>   - The current solutions have attempted to place limits on the size of
>>   the state.
>>   - Errors in implementation and/or configuration will eventually lead to
>>   "problem producers".
>>   - Under the adopted fixes and current slate of proposals, the "problem
>>   producers" solutions have cascading side effects on properly behaved
>>   producers (e.g. dropping long-running, slow-producing producers).
>> 
>> For decades (at least since the 1980's and anecdotally since the 1960's)
>> there has been a solution to processing state where the size of the state
>> exceeded the memory available.  It is the solution that drove the idea that
>> you could have tables in Kafka.  The idea that we can store the hot PIDs in
>> memory using an LRU and write data to storage so that we can quickly find
>> things not in the cache is not new.  It has been proven.
>> 
>> I am arguing that we should not throw away state data because we are
>> running out of memory.  We should persist that data to disk and consider
>> the disk as the source of truth for state.
>> 
>> Claude
>> 
>> 
>> On Wed, May 15, 2024 at 7:42 PM Justine Olshan <jols...@confluent.io.invalid> wrote:
>> 
>>> +1 to the comment.
>>> 
>>>> I still feel we are doing all of this only because of a few anti-pattern
>>>> or misconfigured producers and not because we have “too many producers”.
>>>> I believe that implementing a producer heartbeat and removing short-lived
>>>> PIDs from the cache if we didn’t receive a heartbeat will be simpler and a
>>>> step in the right direction to improve the idempotent logic, and maybe try
>>>> to make PIDs get reused between sessions, which will implement a real
>>>> idempotent producer instead of an idempotent session. I admit this wouldn’t
>>>> help with old clients, but it will put us on the right path.
>>> 
>>> This issue is very complicated and I appreciate the attention on it.
>>> Hopefully we can find a good solution working together :)
>>> 
>>> Justine
>>> 
>>> On Wed, May 15, 2024 at 8:36 AM Omnia Ibrahim <o.g.h.ibra...@gmail.com> wrote:
>>> 
>>>> Also, in the rejected alternatives you listed an approved KIP, which is a
>>>> bit confusing. Can you move this to the motivation instead?
>>>> 
>>>>> On 15 May 2024, at 14:35, Claude Warren <cla...@apache.org> wrote:
>>>>> 
>>>>> This is a proposal that should solve the OOM problem on the servers
>>>>> without some of the other proposed KIPs being active.
>>>>> 
>>>>> Full details in
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1044%3A+A+proposal+to+change+idempotent+producer+--+server+implementation
>>>> 
>>>> 
>>> 
>> 
>> 
>> --
>> LinkedIn: http://www.linkedin.com/in/claudewarren
>> 
