renxiang zhou created FLINK-31249:
-------------------------------------

             Summary: Checkpoint Timer failed to process timeout events when it blocked at writing _metadata to DFS
                 Key: FLINK-31249
                 URL: https://issues.apache.org/jira/browse/FLINK-31249
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Checkpointing
    Affects Versions: 1.16.0, 1.11.6
            Reporter: renxiang zhou
             Fix For: 1.18.0
         Attachments: image-2023-02-28-11-25-03-637.png

The jobmanager-future thread may block while writing metadata to DFS because of a DFS failure, and it holds the CheckpointCoordinator lock for as long as it is blocked. When the next checkpoint is triggered, the Checkpoint Timer thread waits for that lock to be released. If the previous checkpoint times out in the meantime, the Checkpoint Timer cannot execute the timeout event, because it is still blocked waiting for the lock. As a result, the previous checkpoint cannot be cancelled.

!image-2023-02-28-11-25-03-637.png|width=1144,height=248!
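
The contention described in this report can be reproduced in isolation with plain JDK primitives. The following is only a minimal sketch, not Flink code: the class `CheckpointLockDemo`, the `coordinatorLock` object, and the latches are hypothetical names, the hung DFS write is simulated with a latch, and the timer is assumed to be a single-threaded scheduled executor, so that one blocked task stalls every task queued behind it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CheckpointLockDemo {

    /**
     * Returns {handledWhileBlocked, handledAfterRelease}: whether the
     * simulated timeout event ran while the lock was held, and whether it
     * ran once the lock was released.
     */
    public static boolean[] run() throws InterruptedException {
        final Object coordinatorLock = new Object();               // stand-in for the coordinator lock
        final CountDownLatch lockHeld = new CountDownLatch(1);
        final CountDownLatch dfsRecovered = new CountDownLatch(1); // simulated DFS recovery
        final CountDownLatch timeoutRan = new CountDownLatch(1);

        // Simulated "jobmanager-future" thread: takes the lock, then hangs
        // on a write that does not return until the DFS recovers.
        Thread futureThread = new Thread(() -> {
            synchronized (coordinatorLock) {
                lockHeld.countDown();
                try {
                    dfsRecovered.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "jobmanager-future");
        futureThread.start();
        lockHeld.await();

        // Assumption: the timer runs all of its tasks on one thread.
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

        // Triggering the next checkpoint needs the lock, so the timer thread blocks here.
        timer.execute(() -> {
            synchronized (coordinatorLock) {
                // the next checkpoint would be triggered here
            }
        });

        // Timeout event for the previous checkpoint, queued behind the blocked task.
        timer.schedule(timeoutRan::countDown, 50, TimeUnit.MILLISECONDS);

        // The timeout is due after 50 ms, but the only timer thread is stuck on the lock.
        boolean handledWhileBlocked = timeoutRan.await(500, TimeUnit.MILLISECONDS);

        // Release the lock by letting the simulated write finish.
        dfsRecovered.countDown();
        boolean handledAfterRelease = timeoutRan.await(5, TimeUnit.SECONDS);

        timer.shutdown();
        futureThread.join();
        return new boolean[] {handledWhileBlocked, handledAfterRelease};
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] result = run();
        System.out.println("timeout handled while lock held: " + result[0]);
        System.out.println("timeout handled after release:   " + result[1]);
    }
}
```

In this sketch the timeout event is not lost, only delayed until the lock holder returns. That matches the behaviour reported above, where the previous checkpoint cannot be cancelled for as long as the metadata write stays blocked.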