[ https://issues.apache.org/jira/browse/FLINK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646333#comment-16646333 ]
Robert Metzger commented on FLINK-6625: --------------------------------------- Can we guarantee that checkpoints are consistent when a job has finished with FAILED? Can't a failed checkpoint (unable to commit something, write somewhere, etc.) fail the job? Such an incomplete checkpoint would make HA-recovery impossible. > Flink removes HA job data when reaching JobStatus.FAILED > -------------------------------------------------------- > > Key: FLINK-6625 > URL: https://issues.apache.org/jira/browse/FLINK-6625 > Project: Flink > Issue Type: Improvement > Components: Distributed Coordination > Affects Versions: 1.3.0, 1.4.0 > Reporter: Till Rohrmann > Priority: Major > > Currently, Flink removes all job related data (submitted {{JobGraph}} as well > as checkpoints) when it reaches a globally terminal state (including > {{JobStatus.FAILED}}). In high availability mode, this entails that all data > is removed from ZooKeeper and there is no way to recover the job by > restarting the cluster with the same cluster id. > I think this is problematic, since an application might just have failed > because it has depleted its numbers of restart attempts. Also the last > checkpoint information could be helpful when trying to find out why the job > has actually failed. I propose that we only remove job data when reaching the > state {{JobStatus.SUCCESS}} or {{JobStatus.CANCELED}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)