Checkpoints fail randomly with a timeout. This often happens at night, when no other events are coming into Flink: most of our incoming data arrives during the daytime, and at night there are usually no events. We initially set a checkpoint timeout of 2 minutes. We then increased it to 5 minutes, and the failures have become less frequent since. However, a checkpoint never takes more than 100 seconds when it succeeds; the only exception was one checkpoint that took 118 seconds about a month ago. When a checkpoint fails, it fails after waiting the full 5 minutes.
Exception log:

*org.apache.flink.runtime.checkpoint.CheckpointCoordinator INFO 2021-10-22 18:22:57 +0000 line:1867 "Checkpoint 34 of job ec563be081b87033f7e5f9a94c86fd78 expired before completing."*

*org.apache.flink.runtime.checkpoint.CheckpointCoordinator INFO 2021-10-22 18:22:57 +0000 line:710 "Triggering checkpoint 35 (type=CHECKPOINT) @ 1634926977313 for job ec563be081b87033f7e5f9a94c86fd78."*

*org.apache.flink.runtime.jobmaster.JobMaster INFO 2021-10-22 18:22:57 +0000 line:239 "Trying to recover from a global failure."*

Flink version: 1.12.5

Setup: 1 JobManager and 1 TaskManager.

Checkpoint setup: RocksDB state backend, one checkpoint every 30 seconds, 2-minute timeout (the initial value; later raised to 5 minutes), 30 seconds between checkpoints.
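
To line the log up against the configured timeout, it can help to convert the epoch-millisecond value after `@` in the "Triggering checkpoint 35" line to UTC and then subtract the timeout to estimate when the expired checkpoint 34 was triggered. The following is only a rough sketch using plain `java.time`; the class and method names (`CheckpointTimeline`, `toUtc`, `estimatedTrigger`) are hypothetical helpers, not Flink APIs. It assumes that the timeout is counted from the moment a checkpoint is triggered and that checkpoint 34 expired at roughly the instant checkpoint 35 was triggered (both log lines carry the timestamp 18:22:57).

```java
import java.time.Duration;
import java.time.Instant;

public class CheckpointTimeline {

    /** Converts the epoch-millisecond value printed after '@' in the log to a UTC instant. */
    public static Instant toUtc(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis);
    }

    /** Estimates the trigger time of an expired checkpoint: expiry time minus the timeout. */
    public static Instant estimatedTrigger(Instant expiredAt, Duration timeout) {
        return expiredAt.minus(timeout);
    }

    public static void main(String[] args) {
        // Value taken from the "Triggering checkpoint 35 ... @ 1634926977313" log line.
        Instant checkpoint35 = toUtc(1634926977313L);
        System.out.println("Checkpoint 35 triggered at: " + checkpoint35);

        // Assumption: checkpoint 34 expired at about the same instant.
        // Both timeouts mentioned in the question are tried, since it is
        // not stated which one was active for this particular run.
        System.out.println("Checkpoint 34 trigger (2 min timeout): "
                + estimatedTrigger(checkpoint35, Duration.ofMinutes(2)));
        System.out.println("Checkpoint 34 trigger (5 min timeout): "
                + estimatedTrigger(checkpoint35, Duration.ofMinutes(5)));
    }
}
```

Searching the JobManager log for a "Triggering checkpoint 34" line near one of the two estimated times would show which timeout was actually in effect when this failure was recorded.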