Hi Yun, Sorry for the late reply. I finally found some time to investigate this problem further. I upgraded the job to 1.14.0, but it’s still the same.
I’ve checked the debug logs, and I found that Zookeeper notifies watched event of checkpoint id changes very late [1]. Each time a checkpoint finished, it would take minutes before the Zookeeper client notices the checkpoint ID is changed. I suspect the checkpoint coordinator is blocking on incrementing checkpoint ID on Zookeeper [2]. But with no luck, there’s no many relevant logs can help me prove that. What do you think of this? Thanks a lot! [1] https://gist.github.com/link3280/5072a054a43b40ba28891837a8fdf995 [2] https://github.com/apache/flink/blob/90e850301e672fc0da293abc55eb446f7ec68ffa/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L743 Best, Paul Lam > 2021年11月23日 16:49,Paul Lam <paullin3...@gmail.com> 写道: > > Hi Yun, > > Thanks a lot for your pointers! I’ll try it out as you suggested and then get > back to you. > > Best, > Paul Lam > >> 2021年11月23日 16:32,Yun Tang <myas...@live.com <mailto:myas...@live.com>> 写道: >> >> Hi Paul, >> >> This is really weird, from what I know, flink-1.11.0 has a problem of >> handling min-pause time [1] and this should be resolved in flink-1.12.1. >> >> Could you open the debug log level for org.apache.flink.runtime.checkpoint >> and use jmap or byteman to get the field value of >> CheckpointCoordinator#lastCheckpointCompletionRelativeTime, >> CheckpointRequestDecider#minPauseBetweenCheckpoints and >> SystemClock#relativeTimeMillis in method >> CheckpointRequestDecider#nextTriggerDelayMillis [2] to see any unexpected >> behavior. >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-18856 >> <https://issues.apache.org/jira/browse/FLINK-18856> >> [2] >> https://github.com/apache/flink/blob/90e850301e672fc0da293abc55eb446f7ec68ffa/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointRequestDecider.java#L182 >> >> <https://github.com/apache/flink/blob/90e850301e672fc0da293abc55eb446f7ec68ffa/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointRequestDecider.java#L182> >> >> >> Best >> Yun Tang >> >> From: Paul Lam <paullin3...@gmail.com <mailto:paullin3...@gmail.com>> >> Sent: Tuesday, November 23, 2021 14:35 >> To: user <user@flink.apache.org <mailto:user@flink.apache.org>> >> Subject: Inaccurate checkpoint trigger time >> >> Hi, >> >> Recently I’ve noticed a job has nondeterministic checkpoint trigger time. >> >> The jobs is using Flink 1.12.1 with FsStateBackend and is of 650 >> parallelism. It was configured to trigger checkpoint every 150 seconds with >> 0 pause time and no concurrent checkpoints. However there’re obvious errors >> in the checkpoint trigger times, as the actual interval may vary from 30 >> seconds to 6 minutes. >> >> The jobmanager logs are good, and no error logs is found. Some of the output >> are as follow: >> >> >> 2021-11-23 13:51:46,438 INFO >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed >> checkpoint 1446 for job f432b8d90859db54f7a79ff29a563ee4 (47142264825 bytes >> in 22166 ms). >> >> 2021-11-23 13:57:21,021 INFO >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering >> checkpoint 1447 (type=CHECKPOINT) @ 1637647040653 for job >> f432b8d90859db54f7a79ff29a563ee4. >> >> 2021-11-23 13:57:43,761 INFO >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed >> checkpoint 1447 for job f432b8d90859db54f7a79ff29a563ee4 (46563195101 bytes >> in 21813 ms). >> >> 2021-11-23 13:59:09,387 INFO >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering >> checkpoint 1448 (type=CHECKPOINT) @ 1637647149157 for job >> f432b8d90859db54f7a79ff29a563ee4. >> >> 2021-11-23 13:59:31,370 INFO >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed >> checkpoint 1448 for job f432b8d90859db54f7a79ff29a563ee4 (45543757702 bytes >> in 20354 ms). >> >> 2021-11-23 14:06:37,916 INFO >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering >> checkpoint 1449 (type=CHECKPOINT) @ 1637647597704 for job >> f432b8d90859db54f7a79ff29a563ee4. >> >> 2021-11-23 14:07:03,157 INFO >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed >> checkpoint 1449 for job f432b8d90859db54f7a79ff29a563ee4 (45662471025 bytes >> in 23779 ms). >> >> 2021-11-23 14:07:05,838 INFO >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering >> checkpoint 1450 (type=CHECKPOINT) @ 1637647625640 for job >> f432b8d90859db54f7a79ff29a563ee4. >> >> 2021-11-23 14:07:30,748 INFO >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed >> checkpoint 1450 for job f432b8d90859db54f7a79ff29a563ee4 (46916136024 bytes >> in 22998 ms). >> >> 2021-11-23 14:13:09,089 INFO >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering >> checkpoint 1451 (type=CHECKPOINT) @ 1637647988831 for job >> f432b8d90859db54f7a79ff29a563ee4. >> >> 2021-11-23 14:13:38,411 INFO >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed >> checkpoint 1451 for job f432b8d90859db54f7a79ff29a563ee4 (47439074367 bytes >> in 27616 ms). >> >> 2021-11-23 14:13:38,676 INFO >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering >> checkpoint 1452 (type=CHECKPOINT) @ 1637648018481 for job >> f432b8d90859db54f7a79ff29a563ee4. >> >> 2021-11-23 14:14:01,937 INFO >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed >> checkpoint 1452 for job f432b8d90859db54f7a79ff29a563ee4 (47046200711 bytes >> in 21869 ms). >> >> 2021-11-23 14:20:04,923 INFO >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering >> checkpoint 1453 (type=CHECKPOINT) @ 1637648404722 for job >> f432b8d90859db54f7a79ff29a563ee4. >> >> 2021-11-23 14:20:26,592 INFO >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed >> checkpoint 1453 for job f432b8d90859db54f7a79ff29a563ee4 (47481503566 bytes >> in 20172 ms). >> >> 2021-11-23 14:21:54,879 INFO >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering >> checkpoint 1454 (type=CHECKPOINT) @ 1637648514668 for job >> f432b8d90859db54f7a79ff29a563ee4. >> >> 2021-11-23 14:22:19,392 INFO >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed >> checkpoint 1454 for job f432b8d90859db54f7a79ff29a563ee4 (47106414948 bytes >> in 22930 ms). >> >> >> It looks pretty weird to me. Please help me narrow down the problem if you >> have any idea. >> >> Best, >> Paul Lam >