Hello,

I'm seeing the following error in my JobManager log (Flink on EMR). Checking the cluster logs, I see:
2021-08-21 17:17:30,489 [Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Triggering
checkpoint 1 (type=CHECKPOINT) @ 1629566250303 for job
c513e9ebbea4ab72d80b1338896ca5c2.
2021-08-21 17:17:33,572 [jobmanager-future-thread-5] INFO
com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream
- close closed:false s3://***/_metadata
2021-08-21 17:17:33,800 [jobmanager-future-thread-5] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed
checkpoint 1 for job c513e9ebbea4ab72d80b1338896ca5c2 (737859873 bytes in
3496 ms).
2021-08-21 17:27:30,474 [Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Triggering
checkpoint 2 (type=CHECKPOINT) @ 1629566850302 for job
c513e9ebbea4ab72d80b1338896ca5c2.
2021-08-21 17:27:46,012 [jobmanager-future-thread-3] INFO
com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream
- close closed:false s3://***/_metadata
2021-08-21 17:27:46,158 [jobmanager-future-thread-3] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed
checkpoint 2 for job c513e9ebbea4ab72d80b1338896ca5c2 (1210889410 bytes in
15856 ms).
2021-08-21 17:37:30,468 [Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Triggering
checkpoint 3 (type=CHECKPOINT) @ 1629567450302 for job
c513e9ebbea4ab72d80b1338896ca5c2.
2021-08-21 17:47:30,469 [Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Checkpoint 3
of job c513e9ebbea4ab72d80b1338896ca5c2 expired before completing.
2021-08-21 17:47:30,476 [flink-akka.actor.default-dispatcher-34]
INFO org.apache.flink.runtime.jobmaster.JobMaster - Trying to recover from
a global failure.
org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
    at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:66)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1673)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1650)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:91)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:1783)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2021-08-21 17:47:30,478 [flink-akka.actor.default-dispatcher-34]
INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job
session-aggregation (c513e9ebbea4ab72d80b1338896ca5c2) switched from state
RUNNING to RESTARTING.

My configuration is:

-yD "execution.checkpointing.timeout=10 min"\
-yD "restart-strategy=failure-rate"\
-yD "restart-strategy.failure-rate.max-failures-per-interval=70"\
-yD "restart-strategy.failure-rate.delay=1 min"\
-yD "restart-strategy.failure-rate.failure-rate-interval=60 min"\

I'm not sure whether https://issues.apache.org/jira/browse/FLINK-21215 is related, but it looks like it has already been resolved.

I know I can increase the checkpoint timeout, but the checkpoint size is
relatively small and checkpoints usually complete within a few seconds,
so 10 minutes should be more than enough. So the main question is: why was
"Exceeded checkpoint tolerable failure threshold" triggered? For context,
I've sketched below where I believe that threshold is configured.
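
(As far as I understand, the number of tolerated checkpoint failures is a
separate setting, execution.checkpointing.tolerable-failed-checkpoints, which
I believe defaults to 0, so a single expired checkpoint would already hit the
threshold. I have not set it explicitly; the value 3 below is only an example:)

-yD "execution.checkpointing.tolerable-failed-checkpoints=3"\

or programmatically, using the env from the sketch above:

env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);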

Thanks!
