Yes, https://issues.apache.org/jira/browse/FLINK-34063 does match the issues I have encountered quite well. I'll leave a comment in that ticket then.
Thanks,
Yang

On Fri, 12 Jan 2024 at 15:39, Gyula Fóra <gyula.f...@gmail.com> wrote:

> Could this be related to the issue reported here?
> https://issues.apache.org/jira/browse/FLINK-34063
>
> Gyula
>
> On Wed, Jan 10, 2024 at 4:04 PM Yang LI <yang.hunter...@gmail.com> wrote:
>
> > Just to give more context, my setup uses Apache Flink 1.18 with the
> > adaptive scheduler enabled, and the issues happen randomly, particularly
> > after a restart.
> >
> > After each restart, the logs show "Adding split(s) to reader:", which
> > indicates the reassignment of partitions across different TaskManagers.
> > An anomaly arises with specific partitions, for example partition-10:
> > this partition does not appear in the logs immediately after the restart
> > and remains unlogged for several hours, during which no data is consumed
> > from it. Only when the logs eventually show "Discovered new partitions:"
> > does consumption from partition-10 resume.
> >
> > Could you provide any insights or hypotheses regarding the underlying
> > cause of this delayed recognition and processing of certain partitions?
> >
> > Best regards,
> > Yang
> >
> > On Mon, 8 Jan 2024 at 16:24, Yang LI <yang.hunter...@gmail.com> wrote:
> >
> > > Dear Flink Community,
> > >
> > > I've encountered an issue while testing my Flink autoscaler. It appears
> > > that Flink is losing track of specific Kafka partitions, leading to a
> > > persistent increase in lag on those partitions. The logs indicate a
> > > 'kafka connector metricGroup name collision exception'. Notably,
> > > consumption from these Kafka partitions returns to normal after
> > > restarting the Kafka broker. For context, I have enabled in-place
> > > rescaling support with 'jobmanager.scheduler: Adaptive'.
> > >
> > > I suspect the problem may stem from:
> > >
> > > 1. The in-place rescaling support triggering restarts of some
> > > TaskManagers. This process might not re-register the metric groups
> > > created by the Kafka source connector correctly, leading to a name
> > > collision exception and preventing Flink from accurately reporting
> > > metrics related to Kafka consumption.
> > > 2. A potential edge case in the pending-records metric, especially when
> > > different partitions exhibit varying lags. This discrepancy might be
> > > causing the pending-records metric to malfunction.
> > >
> > > I would appreciate your insights on these observations.
> > >
> > > Best regards,
> > > Yang LI
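
P.S. For anyone trying to reproduce this, below is a minimal sketch of the
kind of KafkaSource setup involved, with an explicit partition discovery
interval. The broker address, topic, group id, and the 60s value are
placeholders and not taken from this thread; only the builder API and the
"partition.discovery.interval.ms" property key are the real
flink-connector-kafka ones, and the adaptive scheduler itself is enabled
separately via 'jobmanager.scheduler: adaptive' in the Flink configuration.
This does not address the root cause tracked in FLINK-34063, but it makes
the enumerator's re-discovery cadence explicit, which is what eventually
produces the "Discovered new partitions:" log line mentioned above.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder broker/topic/group names; only the property key is real.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("events")
                .setGroupId("flink-consumer")
                .setStartingOffsets(OffsetsInitializer.committedOffsets())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                // Ask the source enumerator to re-scan topic metadata every 60s;
                // partitions it does not currently track are logged as
                // "Discovered new partitions:" and assigned to readers.
                .setProperty("partition.discovery.interval.ms", "60000")
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source").print();
        env.execute("kafka-source-sketch");
    }
}

With a short, explicit interval, a partition whose assignment was lost during
a rescale should, in principle, be treated as newly discovered at the next
scan rather than only after a broker restart, though that is a workaround
rather than a fix for the scheduler-side issue.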