Hi Paul, where are you storing your checkpoints, and what's their size?
IIRC, Flink won't trigger a new checkpoint before the old ones haven't been cleaned up, and if your checkpoints are large and stored on S3, it can take a while to clean them up (especially with the Hadoop S3 plugin, using presto s3 is faster). On Thu, Jan 27, 2022 at 10:56 AM Paul Lam <paullin3...@gmail.com> wrote: > Hi Yun, > > Sorry for the late reply. I finally found some time to investigate this > problem further. I upgraded the job to 1.14.0, but it’s still the same. > > I’ve checked the debug logs, and I found that Zookeeper notifies watched > event of checkpoint id changes very late [1]. Each time a checkpoint > finished, it would take minutes before the Zookeeper client notices the > checkpoint ID is changed. > > I suspect the checkpoint coordinator is blocking on incrementing > checkpoint ID on Zookeeper [2]. But with no luck, there’s no many relevant > logs can help me prove that. > > What do you think of this? Thanks a lot! > > [1] https://gist.github.com/link3280/5072a054a43b40ba28891837a8fdf995 > [2] > https://github.com/apache/flink/blob/90e850301e672fc0da293abc55eb446f7ec68ffa/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L743 > > Best, > Paul Lam > > 2021年11月23日 16:49,Paul Lam <paullin3...@gmail.com> 写道: > > Hi Yun, > > Thanks a lot for your pointers! I’ll try it out as you suggested and then > get back to you. > > Best, > Paul Lam > > 2021年11月23日 16:32,Yun Tang <myas...@live.com> 写道: > > Hi Paul, > > This is really weird, from what I know, flink-1.11.0 has a problem of > handling min-pause time [1] and this should be resolved in flink-1.12.1. > > Could you open the debug log level for org.apache.flink.runtime.checkpoint > and use jmap or byteman to get the field value > of CheckpointCoordinator#lastCheckpointCompletionRelativeTime, > CheckpointRequestDecider#minPauseBetweenCheckpoints > and SystemClock#relativeTimeMillis in method > CheckpointRequestDecider#nextTriggerDelayMillis > [2] to see any unexpected behavior. > > > [1] https://issues.apache.org/jira/browse/FLINK-18856 > [2] > https://github.com/apache/flink/blob/90e850301e672fc0da293abc55eb446f7ec68ffa/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointRequestDecider.java#L182 > > > Best > Yun Tang > > ------------------------------ > *From:* Paul Lam <paullin3...@gmail.com> > *Sent:* Tuesday, November 23, 2021 14:35 > *To:* user <user@flink.apache.org> > *Subject:* Inaccurate checkpoint trigger time > > Hi, > > Recently I’ve noticed a job has nondeterministic checkpoint trigger time. > > The jobs is using Flink 1.12.1 with FsStateBackend and is of 650 > parallelism. It was configured to trigger checkpoint every 150 seconds with > 0 pause time and no concurrent checkpoints. However there’re obvious errors > in the checkpoint trigger times, as the actual interval may vary from 30 > seconds to 6 minutes. > > The jobmanager logs are good, and no error logs is found. Some of the > output are as follow: > > 2021-11-23 13:51:46,438 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 1446 for job f432b8d90859db54f7a79ff29a563ee4 (47142264825 bytes > in 22166 ms). > 2021-11-23 13:57:21,021 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 1447 (type=CHECKPOINT) @ 1637647040653 for job > f432b8d90859db54f7a79ff29a563ee4. > 2021-11-23 13:57:43,761 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 1447 for job f432b8d90859db54f7a79ff29a563ee4 (46563195101 bytes > in 21813 ms). > 2021-11-23 13:59:09,387 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 1448 (type=CHECKPOINT) @ 1637647149157 for job > f432b8d90859db54f7a79ff29a563ee4. > 2021-11-23 13:59:31,370 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 1448 for job f432b8d90859db54f7a79ff29a563ee4 (45543757702 bytes > in 20354 ms). > 2021-11-23 14:06:37,916 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 1449 (type=CHECKPOINT) @ 1637647597704 for job > f432b8d90859db54f7a79ff29a563ee4. > 2021-11-23 14:07:03,157 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 1449 for job f432b8d90859db54f7a79ff29a563ee4 (45662471025 bytes > in 23779 ms). > 2021-11-23 14:07:05,838 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 1450 (type=CHECKPOINT) @ 1637647625640 for job > f432b8d90859db54f7a79ff29a563ee4. > 2021-11-23 14:07:30,748 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 1450 for job f432b8d90859db54f7a79ff29a563ee4 (46916136024 bytes > in 22998 ms). > 2021-11-23 14:13:09,089 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 1451 (type=CHECKPOINT) @ 1637647988831 for job > f432b8d90859db54f7a79ff29a563ee4. > 2021-11-23 14:13:38,411 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 1451 for job f432b8d90859db54f7a79ff29a563ee4 (47439074367 bytes > in 27616 ms). > 2021-11-23 14:13:38,676 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 1452 (type=CHECKPOINT) @ 1637648018481 for job > f432b8d90859db54f7a79ff29a563ee4. > 2021-11-23 14:14:01,937 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 1452 for job f432b8d90859db54f7a79ff29a563ee4 (47046200711 bytes > in 21869 ms). > 2021-11-23 14:20:04,923 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 1453 (type=CHECKPOINT) @ 1637648404722 for job > f432b8d90859db54f7a79ff29a563ee4. > 2021-11-23 14:20:26,592 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 1453 for job f432b8d90859db54f7a79ff29a563ee4 (47481503566 bytes > in 20172 ms). > 2021-11-23 14:21:54,879 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 1454 (type=CHECKPOINT) @ 1637648514668 for job > f432b8d90859db54f7a79ff29a563ee4. > 2021-11-23 14:22:19,392 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 1454 for job f432b8d90859db54f7a79ff29a563ee4 (47106414948 bytes > in 22930 ms). > > It looks pretty weird to me. Please help me narrow down the problem if you > have any idea. > > Best, > Paul Lam > > > >