Hi Yun, Thanks a lot for your pointers! I’ll try it out as you suggested and then get back to you.
Best, Paul Lam > 2021年11月23日 16:32,Yun Tang <myas...@live.com> 写道: > > Hi Paul, > > This is really weird, from what I know, flink-1.11.0 has a problem of > handling min-pause time [1] and this should be resolved in flink-1.12.1. > > Could you open the debug log level for org.apache.flink.runtime.checkpoint > and use jmap or byteman to get the field value of > CheckpointCoordinator#lastCheckpointCompletionRelativeTime, > CheckpointRequestDecider#minPauseBetweenCheckpoints and > SystemClock#relativeTimeMillis in method > CheckpointRequestDecider#nextTriggerDelayMillis [2] to see any unexpected > behavior. > > > [1] https://issues.apache.org/jira/browse/FLINK-18856 > <https://issues.apache.org/jira/browse/FLINK-18856> > [2] > https://github.com/apache/flink/blob/90e850301e672fc0da293abc55eb446f7ec68ffa/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointRequestDecider.java#L182 > > <https://github.com/apache/flink/blob/90e850301e672fc0da293abc55eb446f7ec68ffa/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointRequestDecider.java#L182> > > > Best > Yun Tang > > From: Paul Lam <paullin3...@gmail.com> > Sent: Tuesday, November 23, 2021 14:35 > To: user <user@flink.apache.org> > Subject: Inaccurate checkpoint trigger time > > Hi, > > Recently I’ve noticed a job has nondeterministic checkpoint trigger time. > > The jobs is using Flink 1.12.1 with FsStateBackend and is of 650 parallelism. > It was configured to trigger checkpoint every 150 seconds with 0 pause time > and no concurrent checkpoints. However there’re obvious errors in the > checkpoint trigger times, as the actual interval may vary from 30 seconds to > 6 minutes. > > The jobmanager logs are good, and no error logs is found. Some of the output > are as follow: > > > 2021-11-23 13:51:46,438 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 1446 for job f432b8d90859db54f7a79ff29a563ee4 (47142264825 bytes > in 22166 ms). > > 2021-11-23 13:57:21,021 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 1447 (type=CHECKPOINT) @ 1637647040653 for job > f432b8d90859db54f7a79ff29a563ee4. > > 2021-11-23 13:57:43,761 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 1447 for job f432b8d90859db54f7a79ff29a563ee4 (46563195101 bytes > in 21813 ms). > > 2021-11-23 13:59:09,387 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 1448 (type=CHECKPOINT) @ 1637647149157 for job > f432b8d90859db54f7a79ff29a563ee4. > > 2021-11-23 13:59:31,370 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 1448 for job f432b8d90859db54f7a79ff29a563ee4 (45543757702 bytes > in 20354 ms). > > 2021-11-23 14:06:37,916 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 1449 (type=CHECKPOINT) @ 1637647597704 for job > f432b8d90859db54f7a79ff29a563ee4. > > 2021-11-23 14:07:03,157 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 1449 for job f432b8d90859db54f7a79ff29a563ee4 (45662471025 bytes > in 23779 ms). > > 2021-11-23 14:07:05,838 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 1450 (type=CHECKPOINT) @ 1637647625640 for job > f432b8d90859db54f7a79ff29a563ee4. > > 2021-11-23 14:07:30,748 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 1450 for job f432b8d90859db54f7a79ff29a563ee4 (46916136024 bytes > in 22998 ms). > > 2021-11-23 14:13:09,089 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 1451 (type=CHECKPOINT) @ 1637647988831 for job > f432b8d90859db54f7a79ff29a563ee4. > > 2021-11-23 14:13:38,411 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 1451 for job f432b8d90859db54f7a79ff29a563ee4 (47439074367 bytes > in 27616 ms). > > 2021-11-23 14:13:38,676 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 1452 (type=CHECKPOINT) @ 1637648018481 for job > f432b8d90859db54f7a79ff29a563ee4. > > 2021-11-23 14:14:01,937 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 1452 for job f432b8d90859db54f7a79ff29a563ee4 (47046200711 bytes > in 21869 ms). > > 2021-11-23 14:20:04,923 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 1453 (type=CHECKPOINT) @ 1637648404722 for job > f432b8d90859db54f7a79ff29a563ee4. > > 2021-11-23 14:20:26,592 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 1453 for job f432b8d90859db54f7a79ff29a563ee4 (47481503566 bytes > in 20172 ms). > > 2021-11-23 14:21:54,879 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 1454 (type=CHECKPOINT) @ 1637648514668 for job > f432b8d90859db54f7a79ff29a563ee4. > > 2021-11-23 14:22:19,392 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 1454 for job f432b8d90859db54f7a79ff29a563ee4 (47106414948 bytes > in 22930 ms). > > > It looks pretty weird to me. Please help me narrow down the problem if you > have any idea. > > Best, > Paul Lam