apoorvmittal10 commented on PR #17474: URL: https://github.com/apache/kafka/pull/17474#issuecomment-2408639782
Thanks for reviewing and feedback. Here are my thoughts. > I assume the 'short span' is caused by an unexpected connection issue rather than a malicious client, as preventing malicious clients is a different issue. The root cause of hitting the cache limit is that ClientMetricsManager creates many meaningless clientInstanceIds for the 'short span' because their GetTelemetrySubscriptionsRequest doesn't include a valid UUID. Though not neccessarily. There happens to be workloads where application creates short lived producer/consumers which does specific read/write and closes. The TTL for client metrics is 3x of push interval which makes the instances to live longer in cache. For a push interval of say 5 minutes (default push interval), the instances can live for 15 minutes in cache. Hence if just 50 connections are created and closed per sec then cache can get full in 6 minutes. So there could be 3 questions: 1. Why short lived clients are created? This depends on application but this is seen in numerous cases where short lived client connections are created. 2. Is 16384 limit enough? Though the limit seems to be a good enough but it will depend on the running infrastructure and what broker can support. 3. Is 5 minutes default push interval too large? It could be lower or higher depends how much cost an operator wants to have to store the client metrics as the clients can be numerous hence cardinality will be high. So a higher push interval makes sense, but it can be dynamically changed is some client seems to be problematic and operator decided to fetch metrics aggressively to determinw the issue. So, I think a clear way to clean up resources is missing in Kafka where decision can be made on client disconnection. Rather waiting for a request from client which specifies the client is being disconnected. > Perhaps we could update the protocol to require that GetTelemetrySubscriptionsRequest must include a client instance ID? In KIP-1082, we agreed that clients are capable of generating UUIDs. With this approach, ClientMetricsManager would no longer create meaningless clientInstanceIds for short spans, as they would always use the same UUID (acting as a kind of incarnation). This might not solve the problem specified in 1 above. As when new client is created we should have a unique client instance id to determine uniquene instance. As the client metrics are tied to specific instance which is being created. In general, the requirement for connection disconnet listener can also address use cases where closing request from client is never reached to broker. Say for Share Groups where the records fetched by a client instance and not yet acknowledged waits for the lock timeout to be re-delivered, can use this functionality to release such records where disconnect request is missed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
