[jira] [Commented] (FLINK-6625) Flink removes HA job data when reaching JobStatus.FAILED

Robert Metzger (JIRA) Thu, 11 Oct 2018 04:55:44 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646333#comment-16646333
 ]


Robert Metzger commented on FLINK-6625:
---------------------------------------

Can we guarantee that checkpoints are consistent when a job has finished with 
FAILED?

Can't a failed checkpoint (unable to commit something, write somewhere, etc.) 
fail the job? Such an incomplete checkpoint would make HA-recovery impossible.

> Flink removes HA job data when reaching JobStatus.FAILED
> --------------------------------------------------------
>
>                 Key: FLINK-6625
>                 URL: https://issues.apache.org/jira/browse/FLINK-6625
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Coordination
>    Affects Versions: 1.3.0, 1.4.0
>            Reporter: Till Rohrmann
>            Priority: Major
>
> Currently, Flink removes all job related data (submitted {{JobGraph}} as well 
> as checkpoints) when it reaches a globally terminal state (including 
> {{JobStatus.FAILED}}). In high availability mode, this entails that all data 
> is removed from ZooKeeper and there is no way to recover the job by 
> restarting the cluster with the same cluster id.
> I think this is problematic, since an application might just have failed 
> because it has depleted its numbers of restart attempts. Also the last 
> checkpoint information could be helpful when trying to find out why the job 
> has actually failed. I propose that we only remove job data when reaching the 
> state {{JobStatus.SUCCESS}} or {{JobStatus.CANCELED}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (FLINK-6625) Flink removes HA job data when reaching JobStatus.FAILED

Reply via email to