Hi Everyone, can someone please shade some light when the Checkpoint
Coordinator is suspending Error comes and what should I do to avoid this?
it is impacting the production pipeline after the version upgrade. It is
related to resource crunch in the pipeline?
Thank You

On Thu, May 11, 2023 at 10:35 AM neha goyal <nehagoy...@gmail.com> wrote:

> I have recently migrated from 1.13.6 to 1.16.1, I can see there is a
> performance degradation for the Flink pipeline which is using Flink's
> managed state ListState, MapState, etc. Pipelines are frequently failing
> with the Exception:
>
> 06:59:42.021 [Checkpoint Timer] WARN  o.a.f.r.c.CheckpointFailureManager -
> Failed to trigger or complete checkpoint 36755 for job
> d0e1a940adab2981dbe0423efe83f140. (0 consecutive failed attempts so far)
>  org.apache.flink.runtime.checkpoint.CheckpointFailureManager
> org.apache.flink.runtime.checkpoint.CheckpointFailureManagerorg.apache.flink.runtime.checkpoint.CheckpointException:
> Checkpoint expired before completing.
> at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2165)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:750)
> 07:18:15.257 [flink-akka.actor.default-dispatcher-31] WARN
>  a.remote.ReliableDeliverySupervisor - Association with remote system
> [akka.tcp://fl...@ip-172-31-73-135.ap-southeast-1.compute.internal:43367]
> has failed, address is now gated for [50] ms. Reason: [Disassociated]
>  akka.event.slf4j.Slf4jLogger$$anonfun$receive$1
> akka.remote.ReliableDeliverySupervisor07:18:15.257 [flink-metrics-23] WARN
>  a.remote.ReliableDeliverySupervisor - Association with remote system
> [akka.tcp://flink-metr...@ip-172-31-73-135.ap-southeast-1.compute.internal:33639]
> has failed, address is now gated for [50] ms. Reason: [Disassociated]
>  akka.event.slf4j.Slf4jLogger$$anonfun$receive$1
> akka.remote.ReliableDeliverySupervisor07:18:15.331
> [flink-akka.actor.default-dispatcher-31] WARN
>  o.a.f.r.c.CheckpointFailureManager - Failed to trigger or complete
> checkpoint 36756 for job d0e1a940adab2981dbe0423efe83f140. (0 consecutive
> failed attempts so far)
>  org.apache.flink.runtime.checkpoint.CheckpointFailureManager
> org.apache.flink.runtime.checkpoint.CheckpointFailureManagerorg.apache.flink.runtime.checkpoint.CheckpointException:
> Checkpoint Coordinator is suspending.
> at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1926)
> at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:46)
> at
> org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.notifyJobStatusChange(DefaultExecutionGraph.java:1566)
> at
> org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.transitionState(DefaultExecutionGraph.java:1161)
>
> Is there any issue with this Flink version or the new RocksDB version?
> What should be the action item for this Exception?
> The maximum savepoint size is 80.2 GB and we periodically(every 20
> minutes) take the savepoint for the job.
> Checkpoint Type: aligned checkpoint
>

Reply via email to