Hi Yun,

Sorry for the late reply. I finally found some time to investigate this problem 
further. I upgraded the job to 1.14.0, but it’s still the same.

I’ve checked the debug logs, and I found that Zookeeper notifies watched event 
of checkpoint id changes very late [1]. Each time a checkpoint finished, it 
would take minutes before the Zookeeper client notices the checkpoint ID is 
changed.

I suspect the checkpoint coordinator is blocking on incrementing checkpoint ID 
on Zookeeper [2]. But with no luck, there’s no many relevant logs can help me 
prove that. 

What do you think of this? Thanks a lot!

[1] https://gist.github.com/link3280/5072a054a43b40ba28891837a8fdf995
[2] 
https://github.com/apache/flink/blob/90e850301e672fc0da293abc55eb446f7ec68ffa/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L743

Best,
Paul Lam

> 2021年11月23日 16:49,Paul Lam <paullin3...@gmail.com> 写道:
> 
> Hi Yun,
> 
> Thanks a lot for your pointers! I’ll try it out as you suggested and then get 
> back to you.
> 
> Best,
> Paul Lam
> 
>> 2021年11月23日 16:32,Yun Tang <myas...@live.com <mailto:myas...@live.com>> 写道:
>> 
>> Hi Paul,
>> 
>> This is really weird, from what I know, flink-1.11.0 has a problem of 
>> handling min-pause time [1] and this should be resolved in flink-1.12.1.
>> 
>> Could you open the debug log level for org.apache.flink.runtime.checkpoint 
>> and use jmap or byteman to get the field value of 
>> CheckpointCoordinator#lastCheckpointCompletionRelativeTime, 
>> CheckpointRequestDecider#minPauseBetweenCheckpoints and 
>> SystemClock#relativeTimeMillis in method 
>> CheckpointRequestDecider#nextTriggerDelayMillis [2] to see any unexpected 
>> behavior.
>> 
>> 
>> [1] https://issues.apache.org/jira/browse/FLINK-18856 
>> <https://issues.apache.org/jira/browse/FLINK-18856>
>> [2] 
>> https://github.com/apache/flink/blob/90e850301e672fc0da293abc55eb446f7ec68ffa/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointRequestDecider.java#L182
>>  
>> <https://github.com/apache/flink/blob/90e850301e672fc0da293abc55eb446f7ec68ffa/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointRequestDecider.java#L182>
>> 
>> 
>> Best
>> Yun Tang
>> 
>> From: Paul Lam <paullin3...@gmail.com <mailto:paullin3...@gmail.com>>
>> Sent: Tuesday, November 23, 2021 14:35
>> To: user <user@flink.apache.org <mailto:user@flink.apache.org>>
>> Subject: Inaccurate checkpoint trigger time
>>  
>> Hi,
>> 
>> Recently I’ve noticed a job has nondeterministic checkpoint trigger time. 
>> 
>> The jobs is using Flink 1.12.1 with FsStateBackend and is of 650 
>> parallelism. It was configured to trigger checkpoint every 150 seconds with 
>> 0 pause time and no concurrent checkpoints. However there’re obvious errors 
>> in the checkpoint trigger times, as the actual interval may vary from 30 
>> seconds to 6 minutes.
>> 
>> The jobmanager logs are good, and no error logs is found. Some of the output 
>> are as follow: 
>> 
>> 
>> 2021-11-23 13:51:46,438 INFO 
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
>> checkpoint 1446 for job f432b8d90859db54f7a79ff29a563ee4 (47142264825 bytes 
>> in 22166 ms).
>> 
>> 2021-11-23 13:57:21,021 INFO 
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
>> checkpoint 1447 (type=CHECKPOINT) @ 1637647040653 for job 
>> f432b8d90859db54f7a79ff29a563ee4.
>> 
>> 2021-11-23 13:57:43,761 INFO 
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
>> checkpoint 1447 for job f432b8d90859db54f7a79ff29a563ee4 (46563195101 bytes 
>> in 21813 ms).
>> 
>> 2021-11-23 13:59:09,387 INFO 
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
>> checkpoint 1448 (type=CHECKPOINT) @ 1637647149157 for job 
>> f432b8d90859db54f7a79ff29a563ee4.
>> 
>> 2021-11-23 13:59:31,370 INFO 
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
>> checkpoint 1448 for job f432b8d90859db54f7a79ff29a563ee4 (45543757702 bytes 
>> in 20354 ms).
>> 
>> 2021-11-23 14:06:37,916 INFO 
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
>> checkpoint 1449 (type=CHECKPOINT) @ 1637647597704 for job 
>> f432b8d90859db54f7a79ff29a563ee4.
>> 
>> 2021-11-23 14:07:03,157 INFO 
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
>> checkpoint 1449 for job f432b8d90859db54f7a79ff29a563ee4 (45662471025 bytes 
>> in 23779 ms).
>> 
>> 2021-11-23 14:07:05,838 INFO 
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
>> checkpoint 1450 (type=CHECKPOINT) @ 1637647625640 for job 
>> f432b8d90859db54f7a79ff29a563ee4.
>> 
>> 2021-11-23 14:07:30,748 INFO 
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
>> checkpoint 1450 for job f432b8d90859db54f7a79ff29a563ee4 (46916136024 bytes 
>> in 22998 ms).
>> 
>> 2021-11-23 14:13:09,089 INFO 
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
>> checkpoint 1451 (type=CHECKPOINT) @ 1637647988831 for job 
>> f432b8d90859db54f7a79ff29a563ee4.
>> 
>> 2021-11-23 14:13:38,411 INFO 
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
>> checkpoint 1451 for job f432b8d90859db54f7a79ff29a563ee4 (47439074367 bytes 
>> in 27616 ms).
>> 
>> 2021-11-23 14:13:38,676 INFO 
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
>> checkpoint 1452 (type=CHECKPOINT) @ 1637648018481 for job 
>> f432b8d90859db54f7a79ff29a563ee4.
>> 
>> 2021-11-23 14:14:01,937 INFO 
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
>> checkpoint 1452 for job f432b8d90859db54f7a79ff29a563ee4 (47046200711 bytes 
>> in 21869 ms).
>> 
>> 2021-11-23 14:20:04,923 INFO 
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
>> checkpoint 1453 (type=CHECKPOINT) @ 1637648404722 for job 
>> f432b8d90859db54f7a79ff29a563ee4.
>> 
>> 2021-11-23 14:20:26,592 INFO 
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
>> checkpoint 1453 for job f432b8d90859db54f7a79ff29a563ee4 (47481503566 bytes 
>> in 20172 ms).
>> 
>> 2021-11-23 14:21:54,879 INFO 
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
>> checkpoint 1454 (type=CHECKPOINT) @ 1637648514668 for job 
>> f432b8d90859db54f7a79ff29a563ee4.
>> 
>> 2021-11-23 14:22:19,392 INFO 
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
>> checkpoint 1454 for job f432b8d90859db54f7a79ff29a563ee4 (47106414948 bytes 
>> in 22930 ms).
>> 
>> 
>> It looks pretty weird to me. Please help me narrow down the problem if you 
>> have any idea.
>> 
>> Best,
>> Paul Lam
> 

Reply via email to