Yes, https://issues.apache.org/jira/browse/FLINK-34063 does match the issues I have encountered quite well. I'll leave a comment in that ticket then.
Thanks,
Yang

On Fri, 12 Jan 2024 at 15:39, Gyula Fóra <gyula.f...@gmail.com> wrote:

> Could this be related to the issue reported here?
> https://issues.apache.org/jira/browse/FLINK-34063
>
> Gyula
>
> On Wed, Jan 10, 2024 at 4:04 PM Yang LI <yang.hunter...@gmail.com> wrote:
>
> > Just to give more context, my setup uses Apache Flink 1.18 with the
> > adaptive scheduler enabled, and the issues happen randomly, particularly
> > after a restart.
> >
> > After each restart, the logs show "Adding split(s) to reader:", which
> > indicates the reassignment of partitions across different TaskManagers.
> > An anomaly arises with specific partitions, for example partition-10:
> > this partition does not appear in the logs immediately after the restart
> > and remains unlogged for several hours, during which no data is consumed
> > from it. Only when the logs eventually show "Discovered new partitions:"
> > does consumption from partition-10 resume.
> >
> > Could you provide any insights or hypotheses regarding the underlying
> > cause of this delayed recognition and processing of certain partitions?
> >
> > Best regards,
> > Yang
> >
> > On Mon, 8 Jan 2024 at 16:24, Yang LI <yang.hunter...@gmail.com> wrote:
> >
> > > Dear Flink Community,
> > >
> > > I've encountered an issue while testing my Flink autoscaler. It appears
> > > that Flink is losing track of specific Kafka partitions, leading to a
> > > persistent increase in lag on those partitions. The logs indicate a
> > > 'kafka connector metricGroup name collision exception'. Notably,
> > > consumption from these Kafka partitions returns to normal after
> > > restarting the Kafka broker. For context, I have enabled in-place
> > > rescaling support with 'jobmanager.scheduler: Adaptive'.
> > >
> > > I suspect the problem may stem from:
> > >
> > > 1. The in-place rescaling support triggering restarts of some
> > > TaskManagers. This process might not re-register the metric groups
> > > created by the Kafka source connector correctly, leading to a name
> > > collision exception and preventing Flink from accurately reporting
> > > metrics related to Kafka consumption.
> > > 2. A potential edge case in the pending-records metric, especially when
> > > different partitions exhibit varying lags. This discrepancy might be
> > > causing the pending-records metric to malfunction.
> > >
> > > I would appreciate your insights on these observations.
> > >
> > > Best regards,
> > > Yang LI
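
P.S. For anyone trying to reproduce this, below is a minimal sketch of the
kind of KafkaSource setup involved, with an explicit partition discovery
interval. The broker address, topic, group id, and the 60s value are
placeholders and not taken from this thread; only the builder API and the
"partition.discovery.interval.ms" property key are the real
flink-connector-kafka ones, and the adaptive scheduler itself is enabled
separately via 'jobmanager.scheduler: adaptive' in the Flink configuration.
This does not address the root cause tracked in FLINK-34063, but it makes
the enumerator's re-discovery cadence explicit, which is what eventually
produces the "Discovered new partitions:" log line mentioned above.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder broker/topic/group names; only the property key is real.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("events")
                .setGroupId("flink-consumer")
                .setStartingOffsets(OffsetsInitializer.committedOffsets())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                // Ask the source enumerator to re-scan topic metadata every 60s;
                // partitions it does not currently track are logged as
                // "Discovered new partitions:" and assigned to readers.
                .setProperty("partition.discovery.interval.ms", "60000")
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source").print();
        env.execute("kafka-source-sketch");
    }
}

With a short, explicit interval, a partition whose assignment was lost during
a rescale should, in principle, be treated as newly discovered at the next
scan rather than only after a broker restart, though that is a workaround
rather than a fix for the scheduler-side issue.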