Hello, I see the following error in my jobmanager log (Flink on EMR). Checking the cluster logs I see:

```
2021-08-21 17:17:30,489 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1 (type=CHECKPOINT) @ 1629566250303 for job c513e9ebbea4ab72d80b1338896ca5c2.
2021-08-21 17:17:33,572 [jobmanager-future-thread-5] INFO com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream - close closed:false s3://***/_metadata
2021-08-21 17:17:33,800 [jobmanager-future-thread-5] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1 for job c513e9ebbea4ab72d80b1338896ca5c2 (737859873 bytes in 3496 ms).
2021-08-21 17:27:30,474 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 2 (type=CHECKPOINT) @ 1629566850302 for job c513e9ebbea4ab72d80b1338896ca5c2.
2021-08-21 17:27:46,012 [jobmanager-future-thread-3] INFO com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream - close closed:false s3://***/_metadata
2021-08-21 17:27:46,158 [jobmanager-future-thread-3] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 2 for job c513e9ebbea4ab72d80b1338896ca5c2 (1210889410 bytes in 15856 ms).
2021-08-21 17:37:30,468 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 3 (type=CHECKPOINT) @ 1629567450302 for job c513e9ebbea4ab72d80b1338896ca5c2.
2021-08-21 17:47:30,469 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 3 of job c513e9ebbea4ab72d80b1338896ca5c2 expired before completing.
2021-08-21 17:47:30,476 [flink-akka.actor.default-dispatcher-34] INFO org.apache.flink.runtime.jobmaster.JobMaster - Trying to recover from a global failure.
org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
    at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:66)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1673)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1650)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:91)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:1783)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2021-08-21 17:47:30,478 [flink-akka.actor.default-dispatcher-34] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job session-aggregation (c513e9ebbea4ab72d80b1338896ca5c2) switched from state RUNNING to RESTARTING.
```
The configuration is:

```
-yD "execution.checkpointing.timeout=10 min" \
-yD "restart-strategy=failure-rate" \
-yD "restart-strategy.failure-rate.max-failures-per-interval=70" \
-yD "restart-strategy.failure-rate.delay=1 min" \
-yD "restart-strategy.failure-rate.failure-rate-interval=60 min" \
```

I'm not sure whether https://issues.apache.org/jira/browse/FLINK-21215 is related, but it looks like that issue has already been resolved. I know I can increase the checkpoint timeout, but the checkpoint size is relatively small and most of the time a checkpoint completes in a few seconds, so 10 minutes should be more than enough. So the main question is: why was "Exceeded checkpoint tolerable failure threshold" triggered? Thanks!
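For reference, here is roughly how the same checkpoint settings would look if expressed in the job code. This is only a sketch: the 10-minute checkpoint interval is inferred from the trigger timestamps in the log above, and the `setTolerableCheckpointFailureNumber` call is shown purely for illustration; my job does not set it, so I assume the default (0 tolerated failures) applies.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSettingsSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 10 minutes: interval inferred from the trigger
        // timestamps in the log (17:17, 17:27, 17:37), not from my -yD flags.
        env.enableCheckpointing(10 * 60 * 1000L);

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();

        // Equivalent to -yD "execution.checkpointing.timeout=10 min"
        checkpointConfig.setCheckpointTimeout(10 * 60 * 1000L);

        // Illustrative only: my job does NOT set this, so the default of
        // 0 tolerated failures should apply, meaning a single expired
        // checkpoint already exceeds the tolerable failure threshold.
        checkpointConfig.setTolerableCheckpointFailureNumber(3);
    }
}
```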