apoorvmittal10 commented on PR #17474:
URL: https://github.com/apache/kafka/pull/17474#issuecomment-2408639782

   Thanks for reviewing and feedback. Here are my thoughts.
   
   > I assume the 'short span' is caused by an unexpected connection issue 
rather than a malicious client, as preventing malicious clients is a different 
issue. The root cause of hitting the cache limit is that ClientMetricsManager 
creates many meaningless clientInstanceIds for the 'short span' because their 
GetTelemetrySubscriptionsRequest doesn't include a valid UUID.
   
   Though not neccessarily. There happens to be workloads where application 
creates short lived producer/consumers which does specific read/write and 
closes. The TTL for client metrics is 3x of push interval which makes the 
instances to live longer in cache. For a push interval of say 5 minutes 
(default push interval), the instances can live for 15 minutes in cache. Hence 
if just 50 connections are created and closed per sec then cache can get full 
in 6 minutes. 
   
   So there could be 3 questions:
   1. Why short lived clients are created? This depends on application but this 
is seen in numerous cases where short lived client connections are created.
   2. Is 16384 limit enough? Though the limit seems to be a good enough but it 
will depend on the running infrastructure and what broker can support.
   3. Is 5 minutes default push interval too large? It could be lower or higher 
depends how much cost an operator wants to have to store the client metrics as 
the clients can be numerous hence cardinality will be high. So a higher push 
interval makes sense, but it can be dynamically changed is some client seems to 
be problematic and operator decided to fetch metrics aggressively to determinw 
the issue.
   
   So, I think a clear way to clean up resources is missing in Kafka where 
decision can be made on client disconnection. Rather waiting for a request from 
client which specifies the client is being disconnected.
   
   > Perhaps we could update the protocol to require that 
GetTelemetrySubscriptionsRequest must include a client instance ID? In 
KIP-1082, we agreed that clients are capable of generating UUIDs. With this 
approach, ClientMetricsManager would no longer create meaningless 
clientInstanceIds for short spans, as they would always use the same UUID 
(acting as a kind of incarnation).
   
   This might not solve the problem specified in 1 above. As when new client is 
created we should have a unique client instance id to determine uniquene 
instance. As the client metrics are tied to specific instance which is being 
created. 
   
   In general, the requirement for connection disconnet listener can also 
address use cases where closing request from client is never reached to broker. 
Say for Share Groups where the records fetched by a client instance and not yet 
acknowledged waits for the lock timeout to be re-delivered, can use this 
functionality to release such records where disconnect request is missed.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to