[ 
https://issues.apache.org/jira/browse/FLINK-13497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896988#comment-16896988
 ] 

vinoyang commented on FLINK-13497:
----------------------------------

On further thought, the issue described by [~till.rohrmann] exists even 
without FLINK-12364. When users call {{setFailOnCheckpointingErrors(true)}} 
and a checkpoint fails on a TM, the decline message is sent to the JM and 
triggers the failure of the job. Meanwhile, the pending checkpoints and the 
checkpoint coordinator keep working normally, so a pending checkpoint can 
still complete. What's more, some of the failed instances in FLINK-9900 
happened before FLINK-12364 was merged.

That said, the number of failures did increase significantly after 
FLINK-12364 was merged.
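
To make the race concrete, here is a minimal toy model of it. This is not 
Flink code: {{ToyCheckpointCoordinator}}, its methods, and the 
{{abortPendingOnJobFailure}} flag are hypothetical names used only for 
illustration. It assumes that a decline marks the job as failed and that, 
unless pending checkpoints are aborted at that moment, a late acknowledgement 
can still complete one of them.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model only: this is NOT Flink code, and every name below is hypothetical.
class ToyCheckpointCoordinator {
    private final Set<Long> pending = new LinkedHashSet<>();
    private final List<Long> completed = new ArrayList<>();
    private final boolean abortPendingOnJobFailure;
    private boolean jobFailed = false;

    ToyCheckpointCoordinator(boolean abortPendingOnJobFailure) {
        this.abortPendingOnJobFailure = abortPendingOnJobFailure;
    }

    // Start tracking a new checkpoint, unless the job has already failed.
    void trigger(long checkpointId) {
        if (!jobFailed) {
            pending.add(checkpointId);
        }
    }

    // A task declined the checkpoint: drop it and fail the job.
    void decline(long checkpointId) {
        pending.remove(checkpointId);
        jobFailed = true;
        if (abortPendingOnJobFailure) {
            // Proposed behaviour: failing the job also fails all pending checkpoints.
            pending.clear();
        }
    }

    // A (possibly late) acknowledgement; returns true if the checkpoint completed.
    boolean acknowledge(long checkpointId) {
        if (pending.remove(checkpointId)) {
            completed.add(checkpointId);
            return true;
        }
        return false;
    }

    boolean isJobFailed() {
        return jobFailed;
    }

    List<Long> completedCheckpoints() {
        return completed;
    }

    public static void main(String[] args) {
        for (boolean abort : new boolean[] {false, true}) {
            ToyCheckpointCoordinator c = new ToyCheckpointCoordinator(abort);
            c.trigger(1L);
            c.trigger(2L);
            c.decline(1L); // checkpoint 1 fails on a task, so the job is failed
            boolean lateCompletion = c.acknowledge(2L);
            System.out.println("abortPendingOnJobFailure=" + abort
                    + ", checkpoint 2 completed after job failure: " + lateCompletion);
        }
    }
}
```

In this model, checkpoint 2 completes after the job has failed when the flag 
is off, and is rejected when the flag is on, which is the behaviour the issue 
description asks for.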

> Checkpoints can complete after CheckpointFailureManager fails job
> -----------------------------------------------------------------
>
>                 Key: FLINK-13497
>                 URL: https://issues.apache.org/jira/browse/FLINK-13497
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.9.0
>
>
> I think that with FLINK-12364 we introduced an inconsistency with respect 
> to job termination and checkpointing. In FLINK-9900 it was discovered that 
> checkpoints can complete even after the {{CheckpointFailureManager}} has 
> decided to fail a job. I think the expected behaviour should be that we fail 
> all pending checkpoints once the {{CheckpointFailureManager}} decides to 
> fail the job.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
