Hi team! I came across strange behavior in Flink 1.17.1. If the S3 storage becomes unavailable while a checkpoint is being built, the current checkpoint expires by timeout and no new checkpoints are triggered. Triggering resumes only after S3 is restored, which can be a long time later.
I can reproduce it: wait for a checkpoint to start, then disconnect the S3 storage.

2023-10-27 09:48:11,866 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 2504 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1698400091851 for job 00000000000000000000000000000000.
2023-10-27 09:58:12,873 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint 2504 of job 00000000000000000000000000000000 expired before completing.
2023-10-27 09:58:12,874 WARN org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger or complete checkpoint 2504 for job 00000000000000000000000000000000. (0 consecutive failed attempts so far)

After the current checkpoint expires (our timeout is 10 min), there are no new triggering attempts in the logs until the S3 storage is restored:

2023-10-27 10:42:09,530 WARN org.apache.flink.runtime.state.IncrementalRemoteKeyedStateHandle [] - Could not properly discard misc file states.
com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out
2023-10-27 10:42:13,305 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 2505 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1698400691875 for job 00000000000000000000000000000000.
2023-10-27 10:42:39,287 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 2505 for job 00000000000000000000000000000000 (10023840497 bytes, checkpointDuration=2666106 ms, finalizationTime=1306 ms).
2023-10-27 10:44:39,288 INFO org.apache.flink.runtime.checkpoint.CheckpointRequestDecider [] - checkpoint request time in queue: 1887436
2023-10-27 10:44:39,300 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 2506 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1698403479288 for job 00000000000000000000000000000000.
2023-10-27 10:44:50,924 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 2506 for job 00000000000000000000000000000000 (10085877149 bytes, checkpointDuration=11011 ms, finalizationTime=625 ms).
2023-10-27 10:46:50,924 INFO org.apache.flink.runtime.checkpoint.CheckpointRequestDecider [] - checkpoint request time in queue: 1119073

TaskManager logs at the moment the S3 storage is restored:

2023-10-27 10:42:13,302 DEBUG org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - Cleanup AsyncCheckpointRunnable for checkpoint 2504 of Process ...
2023-10-27 10:42:13,302 DEBUG org.apache.flink.streaming.runtime.tasks.StreamTask [] - Notify checkpoint 2503 complete on task ...
2023-10-27 10:42:13,302 DEBUG org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - Notification of checkpoint ABORT 2504 for task ...

It looks like everything hangs on requests for the state of objects in the S3 storage (repeated HEAD requests with the full object path). Sometimes we also observed that the job stops working completely (no consuming and no producing) until the S3 storage is restored.

Is this expected behavior?

P.S. If the storage failure occurs before checkpoint assembly starts, everything works as expected: new checkpoints are triggered at every configured interval and expire after 10 min.
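In case it helps with the analysis, here is a rough sketch for measuring the stall from the JobManager log. It compares the wall-clock time of each "Triggering checkpoint" log line with the epoch-millisecond value after the "@" in the same line. The function name trigger_delays_ms is made up for this example, and the script assumes the log timestamps are in UTC, which I have not verified beyond this excerpt.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches CheckpointCoordinator "Triggering checkpoint" lines as quoted in this thread.
TRIGGER_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"Triggering checkpoint (\d+) .*@ (\d+) for job"
)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def trigger_delays_ms(log_lines):
    """Map checkpoint id -> ms between the '@' timestamp and the log line time.

    Assumption: log timestamps are UTC. Adjust the tzinfo otherwise.
    """
    delays = {}
    for line in log_lines:
        m = TRIGGER_RE.match(line)
        if m is None:
            continue  # not a "Triggering checkpoint" line
        logged_at = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        logged_at = logged_at.replace(tzinfo=timezone.utc)
        # Integer milliseconds since the epoch, avoiding float rounding.
        logged_ms = (logged_at - EPOCH) // timedelta(milliseconds=1)
        delays[int(m.group(2))] = logged_ms - int(m.group(3))
    return delays


if __name__ == "__main__":
    import sys

    # Usage: python trigger_delays.py < jobmanager.log
    for checkpoint_id, delay in sorted(trigger_delays_ms(sys.stdin).items()):
        print(f"checkpoint {checkpoint_id}: logged {delay} ms after its trigger timestamp")
```

Under the UTC assumption, the three trigger lines above give 15 ms for checkpoint 2504, 2641430 ms (about 44 min) for 2505, and 12 ms for 2506. That would suggest checkpoint 2505 got its trigger timestamp right after 2504 expired but was only logged once S3 came back, which would also fit its checkpointDuration of 2666106 ms.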