Dear Flink Community,

I've encountered an issue while testing my Flink autoscaler. Flink appears to lose track of specific Kafka partitions, leading to a persistent increase in lag on those partitions. The logs show a "kafka connector metricGroup name collision exception." Notably, consumption on the affected partitions returns to normal after restarting the Kafka broker.

For context, I have enabled in-place rescaling support with 'jobmanager.scheduler: Adaptive'.
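To make the suspected mechanism concrete, here is a toy sketch in plain Python (not Flink's actual metric API, and the metric name is purely illustrative) of how re-registering a metric after a task restart, without the old group being unregistered first, produces a name collision and leaves a stale gauge in place:

```python
# Toy illustration of the suspected failure mode: a registry that
# rejects duplicate metric names. If a restarted task re-registers its
# metrics before the old group is cleaned up, the second registration
# collides and the stale gauge keeps being reported.

class MetricGroup:
    def __init__(self):
        self._metrics = {}

    def register(self, name, gauge):
        if name in self._metrics:
            # Mirrors a "name collision" error: the new gauge is
            # rejected, so the old (stale) value keeps being exported.
            raise ValueError(f"name collision: metric '{name}' already registered")
        self._metrics[name] = gauge

    def unregister(self, name):
        self._metrics.pop(name, None)


group = MetricGroup()
# Illustrative metric name; not the exact Flink metric identifier.
group.register("partition-0.pendingRecords", lambda: 0)

# Simulated in-place rescale: the new reader registers its metrics
# before the old group was unregistered.
try:
    group.register("partition-0.pendingRecords", lambda: 42)
except ValueError as e:
    print(e)
```

If something like this happens inside the connector's metric registration, it would explain both the collision in the logs and the lag metric no longer reflecting actual consumption.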
I suspect the problem may stem from one of the following:

1. The in-place rescaling support triggers restarts of some TaskManagers. This process might not correctly re-register the metric groups created by the Kafka source connector, leading to a name collision exception and preventing Flink from accurately reporting Kafka consumption metrics.

2. A potential edge case in the pending-records metric when different partitions exhibit varying lags. This discrepancy might be causing the pending-records metric to malfunction.

I would appreciate your insights on these observations.

Best regards,
Yang LI