[ https://issues.apache.org/jira/browse/FLINK-13497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898775#comment-16898775 ]
Biao Liu commented on FLINK-13497:
----------------------------------

The threading model of {{CheckpointCoordinator}} seems a bit messy. Two necessary synchronizations are missing here.
# Synchronization between different checkpoints. This is why other checkpoints can still succeed after {{CheckpointFailureManager}} has already decided to {{failGlobal}}. We might need to rethink the threading model here. [~yunta] proposed a workaround.
# Synchronization between {{CheckpointCoordinator}} and {{ExecutionGraph}}. This gap is caused by {{failGlobal}} being asynchronous. So I suggest a workaround: cancel the task with its {{ExecutionAttemptID}} instead. That is a kind of weak synchronization.

> Checkpoints can complete after CheckpointFailureManager fails job
> -----------------------------------------------------------------
>
>                 Key: FLINK-13497
>                 URL: https://issues.apache.org/jira/browse/FLINK-13497
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.9.0
>
>
> I think that with FLINK-12364 we introduced an inconsistency with respect to job
> termination and checkpointing. In FLINK-9900 it was discovered that checkpoints
> can complete even after the {{CheckpointFailureManager}} decided to fail a
> job. I think the expected behaviour should be that we fail all pending
> checkpoints once the {{CheckpointFailureManager}} decides to fail the job.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
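
The expected behaviour described in the issue (once the job is failed, all pending checkpoints are failed and no later acknowledgement can complete one) can be illustrated with a minimal sketch. This is a toy model, not Flink code: the class {{CheckpointSyncSketch}} and its methods {{trigger}}, {{acknowledge}} and {{failJob}} are hypothetical names, and the single lock stands in for whatever synchronization the real {{CheckpointCoordinator}} would use.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the race in this issue. All names are hypothetical, not Flink APIs. */
final class CheckpointSyncSketch {

    // One lock guards both the pending set and the "job failed" decision,
    // so completing a checkpoint and failing the job cannot interleave.
    private final Object lock = new Object();
    private final Set<Long> pending = new LinkedHashSet<>();
    private final List<Long> completed = new ArrayList<>();
    private final List<Long> aborted = new ArrayList<>();
    private boolean jobFailed = false;

    /** Registers a new pending checkpoint unless the job has already been failed. */
    void trigger(long checkpointId) {
        synchronized (lock) {
            if (!jobFailed) {
                pending.add(checkpointId);
            }
        }
    }

    /** Completes a pending checkpoint; returns false if the job failed first or the id is unknown. */
    boolean acknowledge(long checkpointId) {
        synchronized (lock) {
            if (jobFailed || !pending.remove(checkpointId)) {
                return false;
            }
            completed.add(checkpointId);
            return true;
        }
    }

    /** Marks the job as failed and aborts every checkpoint that is still pending. */
    void failJob() {
        synchronized (lock) {
            jobFailed = true;
            aborted.addAll(pending);
            pending.clear();
        }
    }

    List<Long> completedIds() {
        synchronized (lock) {
            return new ArrayList<>(completed);
        }
    }

    List<Long> abortedIds() {
        synchronized (lock) {
            return new ArrayList<>(aborted);
        }
    }

    public static void main(String[] args) {
        CheckpointSyncSketch coordinator = new CheckpointSyncSketch();
        coordinator.trigger(1L);
        coordinator.trigger(2L);
        coordinator.acknowledge(1L); // completes before the failure decision
        coordinator.failJob();       // aborts checkpoint 2
        coordinator.acknowledge(2L); // late acknowledgement is rejected
        System.out.println("completed=" + coordinator.completedIds());
        System.out.println("aborted=" + coordinator.abortedIds());
    }
}
```

The point of the sketch is only that the failure decision and checkpoint completion are checked under the same lock. It does not model the second gap from the comment, where {{failGlobal}} runs asynchronously with respect to {{ExecutionGraph}}.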