Hi Everyone, can someone please shade some light when the Checkpoint Coordinator is suspending Error comes and what should I do to avoid this? it is impacting the production pipeline after the version upgrade. It is related to resource crunch in the pipeline? Thank You
On Thu, May 11, 2023 at 10:35 AM neha goyal <nehagoy...@gmail.com> wrote: > I have recently migrated from 1.13.6 to 1.16.1, I can see there is a > performance degradation for the Flink pipeline which is using Flink's > managed state ListState, MapState, etc. Pipelines are frequently failing > with the Exception: > > 06:59:42.021 [Checkpoint Timer] WARN o.a.f.r.c.CheckpointFailureManager - > Failed to trigger or complete checkpoint 36755 for job > d0e1a940adab2981dbe0423efe83f140. (0 consecutive failed attempts so far) > org.apache.flink.runtime.checkpoint.CheckpointFailureManager > org.apache.flink.runtime.checkpoint.CheckpointFailureManagerorg.apache.flink.runtime.checkpoint.CheckpointException: > Checkpoint expired before completing. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2165) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:750) > 07:18:15.257 [flink-akka.actor.default-dispatcher-31] WARN > a.remote.ReliableDeliverySupervisor - Association with remote system > [akka.tcp://fl...@ip-172-31-73-135.ap-southeast-1.compute.internal:43367] > has failed, address is now gated for [50] ms. Reason: [Disassociated] > akka.event.slf4j.Slf4jLogger$$anonfun$receive$1 > akka.remote.ReliableDeliverySupervisor07:18:15.257 [flink-metrics-23] WARN > a.remote.ReliableDeliverySupervisor - Association with remote system > [akka.tcp://flink-metr...@ip-172-31-73-135.ap-southeast-1.compute.internal:33639] > has failed, address is now gated for [50] ms. Reason: [Disassociated] > akka.event.slf4j.Slf4jLogger$$anonfun$receive$1 > akka.remote.ReliableDeliverySupervisor07:18:15.331 > [flink-akka.actor.default-dispatcher-31] WARN > o.a.f.r.c.CheckpointFailureManager - Failed to trigger or complete > checkpoint 36756 for job d0e1a940adab2981dbe0423efe83f140. (0 consecutive > failed attempts so far) > org.apache.flink.runtime.checkpoint.CheckpointFailureManager > org.apache.flink.runtime.checkpoint.CheckpointFailureManagerorg.apache.flink.runtime.checkpoint.CheckpointException: > Checkpoint Coordinator is suspending. > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1926) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:46) > at > org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.notifyJobStatusChange(DefaultExecutionGraph.java:1566) > at > org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.transitionState(DefaultExecutionGraph.java:1161) > > Is there any issue with this Flink version or the new RocksDB version? > What should be the action item for this Exception? > The maximum savepoint size is 80.2 GB and we periodically(every 20 > minutes) take the savepoint for the job. > Checkpoint Type: aligned checkpoint >