Hi Dineth,

The Flink web UI has pages with details for each checkpoint [1]. Could you 
have a look at them to see which part of the checkpoint took a long time?

Best,
Yun



[1] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/monitoring/checkpoint_monitoring/


------------------------------------------------------------------
From:Dineth Kariyawasam <din...@zilingo.com>
Send Time:2021 Nov. 23 (Tue.) 17:32
To:user <user@flink.apache.org>
Subject:Random checkpoint failures with timeouts

Checkpoint fails randomly with a timeout. Many times this happens when there 
are no other events coming into flink (at night). Most of our incoming data is 
during the daytime, and at night there are usually no events. Many of these 
failures have been at night. We had set a checkpoint timeout of 2 minutes 
initially. We increased it to 5 minutes, and the frequency of failures has 
decreased since then. However, checkpointing never takes more than 100 seconds 
when it succeeds. There was one occurrence of it taking 118 seconds about a 
month ago. When it fails, it fails after waiting for 5 minutes.

Exception log:
org.apache.flink.runtime.checkpoint.CheckpointCoordinator INFO 2021-10-22 
18:22:57 +0000 line:1867 "Checkpoint 34 of job ec563be081b87033f7e5f9a94c86fd78 
expired before completing."
org.apache.flink.runtime.checkpoint.CheckpointCoordinator INFO 2021-10-22 
18:22:57 +0000 line:710 "Triggering checkpoint 35 (type=CHECKPOINT) @ 
1634926977313 for job ec563be081b87033f7e5f9a94c86fd78."
org.apache.flink.runtime.jobmaster.JobMaster INFO 2021-10-22 18:22:57 +0000 
line:239 "Trying to recover from a global failure."
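
To line these log entries up with the checkpoint history in the web UI, it may help to decode the trigger timestamp and estimate when the expired checkpoint was started. The sketch below is illustrative only: the helper name `epoch_ms_to_utc` is hypothetical, and the 5-minute timeout is an assumption taken from the description above (use 2 minutes if the job was still running with the original setting).

```python
from datetime import datetime, timedelta, timezone

def epoch_ms_to_utc(ms: int) -> datetime:
    """Convert an epoch-milliseconds value (as printed after '@' in the log) to UTC."""
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)

# Trigger timestamp of checkpoint 35, copied from the log line above.
trigger_35 = epoch_ms_to_utc(1634926977313)

# Time at which checkpoint 34 was reported as expired, copied from the log.
expired_34 = datetime(2021, 10, 22, 18, 22, 57, tzinfo=timezone.utc)

# Assumption: the checkpoint timeout in effect was 5 minutes.
assumed_timeout = timedelta(minutes=5)

# A checkpoint expires one timeout after it was triggered, so the trigger
# time of checkpoint 34 can be estimated by subtracting the timeout.
estimated_trigger_34 = expired_34 - assumed_timeout

print("Checkpoint 35 triggered at:", trigger_35.isoformat())
print("Checkpoint 34 triggered at approximately:", estimated_trigger_34.isoformat())
```

The estimated trigger time gives a window to look at in the checkpoint history and in the task manager logs.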

Flink version: 1.12.5
Setup: 1 Job manager and 1 task manager.
Checkpoint setup: RocksDB, once every 30 seconds, 2 minute timeout, 30 seconds 
between checkpoints
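
For reference, a setup like the one described could be written as the following flink-conf.yaml fragment. This is only a sketch: the job may well configure checkpointing in code instead, and the 5-minute timeout reflects the increased value mentioned earlier rather than the original 2 minutes.

```yaml
# Sketch of the described checkpoint setup (values taken from this message).
state.backend: rocksdb
execution.checkpointing.interval: 30s
execution.checkpointing.min-pause: 30s
# Assumption: the increased timeout; the original value was 2min.
execution.checkpointing.timeout: 5min
```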
