Flink Job Failure for version 1.16

neha goyal Wed, 10 May 2023 22:06:08 -0700

I recently migrated from Flink 1.13.6 to 1.16.1 and am seeing a
performance degradation in a pipeline that uses Flink's managed state
(ListState, MapState, etc.). Pipelines are frequently failing with the
following exception:


06:59:42.021 [Checkpoint Timer] WARN  o.a.f.r.c.CheckpointFailureManager - Failed to trigger or complete checkpoint 36755 for job d0e1a940adab2981dbe0423efe83f140. (0 consecutive failed attempts so far) org.apache.flink.runtime.checkpoint.CheckpointFailureManager org.apache.flink.runtime.checkpoint.CheckpointFailureManager
org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint expired before completing.
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2165)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
07:18:15.257 [flink-akka.actor.default-dispatcher-31] WARN  a.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@ip-172-31-73-135.ap-southeast-1.compute.internal:43367] has failed, address is now gated for [50] ms. Reason: [Disassociated] akka.event.slf4j.Slf4jLogger$$anonfun$receive$1 akka.remote.ReliableDeliverySupervisor
07:18:15.257 [flink-metrics-23] WARN  a.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink-metr...@ip-172-31-73-135.ap-southeast-1.compute.internal:33639] has failed, address is now gated for [50] ms. Reason: [Disassociated] akka.event.slf4j.Slf4jLogger$$anonfun$receive$1 akka.remote.ReliableDeliverySupervisor
07:18:15.331 [flink-akka.actor.default-dispatcher-31] WARN  o.a.f.r.c.CheckpointFailureManager - Failed to trigger or complete checkpoint 36756 for job d0e1a940adab2981dbe0423efe83f140. (0 consecutive failed attempts so far) org.apache.flink.runtime.checkpoint.CheckpointFailureManager org.apache.flink.runtime.checkpoint.CheckpointFailureManager
org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1926)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:46)
    at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.notifyJobStatusChange(DefaultExecutionGraph.java:1566)
    at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.transitionState(DefaultExecutionGraph.java:1161)
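
The excerpt above shows two different failure reasons ("Checkpoint
expired before completing." and "Checkpoint Coordinator is suspending.").
To see which one dominates over a longer log, something like the
following stdlib-only sketch could tally them. The class name
CheckpointFailureTally and the regular expression are assumptions derived
only from the log format shown here; they are not part of Flink.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Flink class): counts checkpoint failure reasons in a log.
public class CheckpointFailureTally {

    // Group 1: checkpoint id, group 2: job id, group 3: first line of the
    // CheckpointException message. \s+ between words tolerates the hard
    // line wraps that mail clients add to pasted logs.
    private static final Pattern FAILURE = Pattern.compile(
        "Failed\\s+to\\s+trigger\\s+or\\s+complete\\s+checkpoint\\s+(\\d+)"
            + "\\s+for\\s+job\\s+([0-9a-f]+)\\."
            + ".*?CheckpointException:\\s*([^\\r\\n]+)",
        Pattern.DOTALL);

    // Returns reason -> number of occurrences, in order of first appearance.
    public static Map<String, Integer> tally(String log) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = FAILURE.matcher(log);
        while (m.find()) {
            counts.merge(m.group(3).trim(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Shortened copy of the two failures quoted in this message.
        String sample =
            "06:59:42.021 [Checkpoint Timer] WARN Failed to trigger or complete\n"
                + "checkpoint 36755 for job\n"
                + "d0e1a940adab2981dbe0423efe83f140. (0 consecutive failed attempts so far)\n"
                + "org.apache.flink.runtime.checkpoint.CheckpointException:\n"
                + "Checkpoint expired before completing.\n"
                + "07:18:15.331 WARN Failed to trigger or complete\n"
                + "checkpoint 36756 for job d0e1a940adab2981dbe0423efe83f140. (0 consecutive\n"
                + "failed attempts so far)\n"
                + "org.apache.flink.runtime.checkpoint.CheckpointException:\n"
                + "Checkpoint Coordinator is suspending.\n";
        tally(sample).forEach((reason, n) -> System.out.println(n + " x " + reason));
    }
}
```

In practice the input would be the full JobManager log read into a
String. Only the first line of each exception message is captured, and a
warning that is not followed by a CheckpointException would be attributed
to the next exception in the log.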

Is there a known issue with this Flink version or with the new RocksDB
version? What action should we take for this exception?
The maximum savepoint size is 80.2 GB, and we take a savepoint for the job
periodically (every 20 minutes).
Checkpoint type: aligned checkpoint
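
"Checkpoint expired before completing" is reported when a checkpoint does
not finish within the configured checkpoint timeout. As a rough sanity
check on the numbers above, the sketch below estimates the sustained
aggregate write rate needed to persist a snapshot of a given size within
a given timeout. Everything except the 80.2 GB figure is an assumption:
the timeouts tried are hypothetical (10 minutes is Flink's documented
default for execution.checkpointing.timeout, not necessarily what this
job uses), GB is taken as 10^9 bytes, and treating a checkpoint as being
as large as the biggest savepoint is a worst case.

```java
// Hypothetical helper for a back-of-the-envelope estimate; not a Flink API.
public class SnapshotThroughput {

    // Minimum sustained write rate in MB/s (10^6 bytes per second) needed to
    // write sizeGb (10^9 bytes each) within timeoutMinutes.
    public static double requiredMBPerSecond(double sizeGb, double timeoutMinutes) {
        double bytes = sizeGb * 1e9;
        double seconds = timeoutMinutes * 60.0;
        return bytes / seconds / 1e6;
    }

    public static void main(String[] args) {
        double sizeGb = 80.2;                  // maximum savepoint size from this message
        double[] timeoutsMin = {10, 20, 30};   // assumed timeouts, not the job's real setting
        for (double t : timeoutsMin) {
            System.out.printf("timeout %.0f min -> %.1f MB/s%n",
                t, requiredMBPerSecond(sizeGb, t));
        }
    }
}
```

If the rate actually observed towards the checkpoint storage is well
below the figure for the configured timeout, the expiry would be
expected regardless of Flink version.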
