[jira] [Commented] (FLINK-13497) Checkpoints can complete after CheckpointFailureManager fails job

vinoyang (JIRA) Tue, 30 Jul 2019 20:45:39 -0700


    [ https://issues.apache.org/jira/browse/FLINK-13497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896752#comment-16896752 ]


vinoyang commented on FLINK-13497:
----------------------------------

Currently, the {{CheckpointFailureManager}} uses a simple counting mechanism 
to decide when to fail the job, so the scenario [~till.rohrmann] described can 
happen. There is also a separate issue (FLINK-12514) tracking a better counting 
mechanism.
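To make the race concrete, here is a minimal Java sketch. All class and method 
names in it are hypothetical and heavily simplified; they are not Flink's 
actual {{CheckpointCoordinator}} or {{CheckpointFailureManager}} APIs. It 
assumes a tolerance of 0 failed checkpoints: one declined checkpoint fails the 
job, but a second checkpoint that is still pending can complete afterwards 
because nothing aborts it.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model; not Flink's actual classes. */
public class CountingRaceSketch {

    /** Fails the job once more than tolerableFailures consecutive checkpoints have failed. */
    static final class CountingFailureManager {
        private final int tolerableFailures;
        private int continuousFailures = 0;
        boolean jobFailed = false;

        CountingFailureManager(int tolerableFailures) {
            this.tolerableFailures = tolerableFailures;
        }

        void onCheckpointFailure() {
            continuousFailures++;
            if (continuousFailures > tolerableFailures) {
                jobFailed = true; // a real system would trigger job failover here
            }
        }

        void onCheckpointSuccess() {
            continuousFailures = 0; // a successful checkpoint resets the counter
        }
    }

    /** Tracks pending checkpoints; nothing here aborts them when the job fails. */
    static final class NaiveCoordinator {
        final Set<Long> pending = new HashSet<>();
        final Set<Long> completed = new HashSet<>();
        final CountingFailureManager failureManager;

        NaiveCoordinator(CountingFailureManager failureManager) {
            this.failureManager = failureManager;
        }

        void trigger(long id) {
            pending.add(id);
        }

        void decline(long id) {
            if (pending.remove(id)) {
                failureManager.onCheckpointFailure();
            }
        }

        void acknowledge(long id) {
            // The problem being illustrated: no check whether the job was already failed.
            if (pending.remove(id)) {
                completed.add(id);
                failureManager.onCheckpointSuccess();
            }
        }
    }

    public static void main(String[] args) {
        NaiveCoordinator coordinator = new NaiveCoordinator(new CountingFailureManager(0));
        coordinator.trigger(1);
        coordinator.trigger(2);
        coordinator.decline(1);     // exceeds the tolerance of 0, so the job is failed
        coordinator.acknowledge(2); // ...yet checkpoint 2 still completes
        // prints "jobFailed=true completed=[2]"
        System.out.println("jobFailed=" + coordinator.failureManager.jobFailed
                + " completed=" + coordinator.completed);
    }
}
```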

The solution proposed by [~yunta] may fix this issue temporarily, but it 
introduces another risk: {{stopCheckpointScheduler}} also calls 
{{CheckpointFailureManager#handleCheckpointException}}, which will make the 
counting more complex in the future.

Maybe, when failing the job, we should call a pure method that only fails all 
pending checkpoints?
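A minimal Java sketch of that idea, assuming the failure manager notifies the 
coordinator through a callback when it decides to fail the job. Every name 
here ({{Coordinator}}, {{abortAllPendingWithoutCounting}}, the callback) is 
hypothetical and only illustrates the separation; it is not Flink's API. The 
"pure" abort drops all pending checkpoints without reporting them to the 
failure counter, and a late acknowledgement for an aborted checkpoint is ignored.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model; not Flink's actual classes. */
public class PureAbortSketch {

    /** Counts consecutive checkpoint failures and invokes a callback once the tolerance is exceeded. */
    static final class CountingFailureManager {
        private final int tolerableFailures;
        private final Runnable failJobCallback;
        private int continuousFailures = 0;

        CountingFailureManager(int tolerableFailures, Runnable failJobCallback) {
            this.tolerableFailures = tolerableFailures;
            this.failJobCallback = failJobCallback;
        }

        void onCheckpointFailure() {
            continuousFailures++;
            if (continuousFailures > tolerableFailures) {
                failJobCallback.run();
            }
        }

        void onCheckpointSuccess() {
            continuousFailures = 0;
        }

        int continuousFailures() {
            return continuousFailures;
        }
    }

    static final class Coordinator {
        final Set<Long> pending = new HashSet<>();
        final Set<Long> completed = new HashSet<>();
        final Set<Long> aborted = new HashSet<>();
        final CountingFailureManager failureManager;
        boolean jobFailed = false;

        Coordinator(int tolerableFailures) {
            // When the failure manager decides to fail the job, failJob() runs.
            this.failureManager = new CountingFailureManager(tolerableFailures, this::failJob);
        }

        void failJob() {
            jobFailed = true;
            abortAllPendingWithoutCounting();
        }

        /** Hypothetical "pure" abort: drops every pending checkpoint and never touches the counter. */
        void abortAllPendingWithoutCounting() {
            aborted.addAll(pending);
            pending.clear();
        }

        void trigger(long id) {
            if (!jobFailed) {
                pending.add(id);
            }
        }

        void decline(long id) {
            if (pending.remove(id)) {
                aborted.add(id);
                failureManager.onCheckpointFailure(); // a genuine failure is counted
            }
        }

        void acknowledge(long id) {
            // Only checkpoints that are still pending can complete; late acks are ignored.
            if (pending.remove(id)) {
                completed.add(id);
                failureManager.onCheckpointSuccess();
            }
        }
    }

    public static void main(String[] args) {
        Coordinator coordinator = new Coordinator(0);
        coordinator.trigger(1);
        coordinator.trigger(2);
        coordinator.decline(1);     // tolerance exceeded: job fails, checkpoint 2 is aborted
        coordinator.acknowledge(2); // arrives too late and cannot complete checkpoint 2
        // prints "jobFailed=true completed=[] continuousFailures=1"
        System.out.println("jobFailed=" + coordinator.jobFailed
                + " completed=" + coordinator.completed
                + " continuousFailures=" + coordinator.failureManager.continuousFailures());
    }
}
```

The design choice is that the abort path and the counting path stay separate: 
aborting checkpoint 2 leaves the counter at 1, so failing the job (or stopping 
the scheduler) does not feed back into the failure counting.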

> Checkpoints can complete after CheckpointFailureManager fails job
> -----------------------------------------------------------------
>
>                 Key: FLINK-13497
>                 URL: https://issues.apache.org/jira/browse/FLINK-13497
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.9.0
>
>
> I think that with FLINK-12364 we introduced an inconsistency wrt job 
> termination and checkpointing. In FLINK-9900 it was discovered that checkpoints 
> can complete even after the {{CheckpointFailureManager}} decided to fail a 
> job. I think the expected behaviour should be that we fail all pending 
> checkpoints once the {{CheckpointFailureManager}} decides to fail the job.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
