Dear Flink Community,

I've encountered an issue while testing my Flink autoscaler. Flink appears
to lose track of specific Kafka partitions, leading to persistently
increasing lag on those partitions. The logs show a Kafka connector
metric-group name-collision exception. Notably, consumption on the affected
partitions returns to normal after the Kafka broker is restarted. For
context, I have enabled in-place rescaling support with
'jobmanager.scheduler: Adaptive.'
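For reference, here is a minimal sketch of the relevant configuration (the values are illustrative; my autoscaler itself is custom and not shown):

```yaml
# Illustrative flink-conf.yaml excerpt, not my full production config.
# 'adaptive' enables the Adaptive Scheduler, which supports in-place
# rescaling without a full job restart.
jobmanager.scheduler: adaptive
```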

I suspect the problem may stem from one of the following:

1. The in-place rescaling support triggering restarts of some
TaskManagers. This process might not correctly re-register the metric
groups created by the Kafka source connector, leading to the name-collision
exception and preventing Flink from accurately reporting Kafka consumption
metrics.
2. A potential edge case in the pending-records metric when different
partitions exhibit varying lags; this discrepancy might be causing the
metric to malfunction.
I would appreciate your insights on these observations.

Best regards,
Yang LI
