Random checkpoint failures with timeouts

Dineth Kariyawasam Tue, 23 Nov 2021 01:25:24 -0800

Our checkpoints fail intermittently with a timeout. Most of our incoming
data arrives during the daytime, and at night there are usually no events.
Many of the failures have happened at night, when nothing is coming into
Flink. We initially set a checkpoint timeout of 2 minutes. We increased it
to 5 minutes, and the frequency of failures has decreased since then.
However, a successful checkpoint almost never takes more than 100 seconds;
the one exception was a checkpoint that took 118 seconds about a month ago.
When a checkpoint fails, it fails after waiting for the full 5 minutes.


Exception log:


org.apache.flink.runtime.checkpoint.CheckpointCoordinator INFO 2021-10-22 18:22:57 +0000 line:1867 "Checkpoint 34 of job ec563be081b87033f7e5f9a94c86fd78 expired before completing."
org.apache.flink.runtime.checkpoint.CheckpointCoordinator INFO 2021-10-22 18:22:57 +0000 line:710 "Triggering checkpoint 35 (type=CHECKPOINT) @ 1634926977313 for job ec563be081b87033f7e5f9a94c86fd78."
org.apache.flink.runtime.jobmaster.JobMaster INFO 2021-10-22 18:22:57 +0000 line:239 "Trying to recover from a global failure."
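
To line the log up against the configured timeout, the epoch value after
"@" in the "Triggering checkpoint 35" entry can be decoded and the timeout
subtracted from it. The sketch below is only an illustration: the class and
method names are made up for this example, and it assumes that checkpoint 34
expired at roughly the moment checkpoint 35 was triggered (both entries carry
the same 18:22:57 timestamp) and that the expiry happened exactly one timeout
after checkpoint 34 was triggered. Both timeouts mentioned in this message
(2 and 5 minutes) are tried, since the log alone does not say which one was
active.

```java
import java.time.Duration;
import java.time.Instant;

public class CheckpointLogTimes {

    /** Converts an epoch-millisecond value (the number after "@" in the log) to a UTC ISO-8601 string. */
    public static String toUtc(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    /**
     * Estimates when a checkpoint was triggered, assuming it expired at
     * expiryEpochMillis exactly one timeout after its trigger.
     */
    public static String estimatedTriggerUtc(long expiryEpochMillis, Duration timeout) {
        return Instant.ofEpochMilli(expiryEpochMillis).minus(timeout).toString();
    }

    public static void main(String[] args) {
        // Copied from the "Triggering checkpoint 35" log entry.
        long checkpoint35TriggerMillis = 1634926977313L;

        System.out.println("Checkpoint 35 triggered at: " + toUtc(checkpoint35TriggerMillis));

        // Used as an approximation of the expiry time of checkpoint 34 (assumption).
        System.out.println("Checkpoint 34 trigger, if timeout was 5 min: "
                + estimatedTriggerUtc(checkpoint35TriggerMillis, Duration.ofMinutes(5)));
        System.out.println("Checkpoint 34 trigger, if timeout was 2 min: "
                + estimatedTriggerUtc(checkpoint35TriggerMillis, Duration.ofMinutes(2)));
    }
}
```

The decoded value is 2021-10-22T18:22:57.313Z, which agrees with the
timestamp on the log line. Searching the JobManager log around the two
estimated times for "Triggering checkpoint 34" would show which timeout was
in effect for this failure.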

Flink version: 1.12.5
Setup: 1 Job manager and 1 task manager.
Checkpoint setup: RocksDB, once every 30 seconds, 2 minute timeout, 30
seconds between checkpoints
