renxiang zhou created FLINK-31249:
-------------------------------------

             Summary: Checkpoint Timer fails to process timeout events when it is 
blocked by a _metadata write to DFS
                 Key: FLINK-31249
                 URL: https://issues.apache.org/jira/browse/FLINK-31249
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Checkpointing
    Affects Versions: 1.16.0, 1.11.6
            Reporter: renxiang zhou
             Fix For: 1.18.0
         Attachments: image-2023-02-28-11-25-03-637.png

The jobmanager-future thread may block while writing _metadata to DFS because 
of a DFS failure, and it holds the CheckpointCoordinator lock while it is blocked. 

When the next checkpoint is triggered, the Checkpoint Timer thread waits for 
the lock to be released. If the previous checkpoint times out in the meantime, 
the Checkpoint Timer cannot execute the timeout event because it is still blocked 
waiting for the lock. As a result, the previous checkpoint cannot be cancelled.

!image-2023-02-28-11-25-03-637.png|width=1144,height=248!
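
The blocking pattern described above can be reproduced outside Flink with a minimal sketch. This is not Flink code: the class name TimerBlockedSketch, the plain Object lock, and the latch that simulates the hung DFS write are all hypothetical stand-ins, and the sketch assumes the timer is a single-threaded scheduled executor, so a task blocked on the lock also delays every task queued behind it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class TimerBlockedSketch {

    /** Returns true if the timeout callback ran while the lock was still held. */
    public static boolean timeoutRanWhileLockHeld() {
        final Object lock = new Object();                    // stand-in for the coordinator lock
        final CountDownLatch lockTaken = new CountDownLatch(1);
        final CountDownLatch dfsRecovers = new CountDownLatch(1);
        final AtomicBoolean timeoutRan = new AtomicBoolean(false);

        // Stand-in for the jobmanager-future thread: it holds the lock
        // for the whole duration of a (simulated) hung metadata write.
        Thread ioThread = new Thread(() -> {
            synchronized (lock) {
                lockTaken.countDown();
                try {
                    dfsRecovers.await();                     // simulated stalled DFS write
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "io-thread-sketch");

        // Assumption: the timer runs all of its tasks on one thread.
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        try {
            ioThread.start();
            lockTaken.await();

            // "Trigger the next checkpoint": needs the lock, so the timer thread blocks here.
            timer.execute(() -> {
                synchronized (lock) {
                    // nothing to do in the sketch
                }
            });

            // "Timeout of the previous checkpoint": queued behind the blocked task.
            timer.schedule(() -> timeoutRan.set(true), 50, TimeUnit.MILLISECONDS);

            Thread.sleep(300);                               // well past the 50 ms timeout
            boolean ranWhileHeld = timeoutRan.get();

            dfsRecovers.countDown();                         // the "DFS" recovers, lock is released
            ioThread.join();
            return ranWhileHeld;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            timer.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("timeout ran while lock held: " + timeoutRanWhileLockHeld());
    }
}
```

In this sketch the timeout task only gets a chance to run after the simulated write completes and the lock is released, which mirrors the behavior reported here. Whether the remedy is to move the blocking write out of the locked section or to handle timeouts on a separate thread is a design decision for the actual fix and is not implied by the sketch.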



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
