[ 
https://issues.apache.org/jira/browse/FLINK-10751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10751:
----------------------------------
    Fix Version/s: 1.8.0

> Checkpoints should be retained when job reaches suspended state
> ---------------------------------------------------------------
>
>                 Key: FLINK-10751
>                 URL: https://issues.apache.org/jira/browse/FLINK-10751
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Ufuk Celebi
>            Assignee: Ufuk Celebi
>            Priority: Minor
>             Fix For: 1.7.0
>
>
> {{CheckpointProperties}} define in which terminal job status a checkpoint 
> should be disposed.
> I've noticed that the properties {{CHECKPOINT_NEVER_RETAINED}} and 
> {{CHECKPOINT_RETAINED_ON_FAILURE}} prescribe checkpoint disposal in the 
> (locally) terminal job status {{SUSPENDED}}.
> Since a job reaches the {{SUSPENDED}} state when its {{JobMaster}} loses 
> leadership, the checkpoint would be cleaned up and would not be available 
> for recovery by the new leader. Therefore, we should rather retain 
> checkpoints when reaching job status {{SUSPENDED}}.
> *BUT:* Because we special-case this terminal state in the only highly 
> available {{CompletedCheckpointStore}} implementation (see 
> [ZooKeeperCompletedCheckpointStore|https://github.com/apache/flink/blob/e7ac3ba/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L315])
>  and don't use regular checkpoint disposal, this issue has not surfaced yet.
> I think we should proactively fix the properties so that they retain 
> checkpoints in {{SUSPENDED}} state. We might even remove this case 
> completely, since with this change all properties will retain checkpoints 
> on suspension.
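
The disposal rule described above can be illustrated with a minimal sketch. This is a simplified model and not Flink's actual {{CheckpointProperties}} class: the names {{SuspendedRetentionSketch}}, {{TerminalStatus}}, {{Policy}}, {{shouldDiscard}} and {{retainingOnSuspension}} are hypothetical, and the disposal behaviour for the states other than {{SUSPENDED}} is assumed from the property names rather than taken from the issue.

```java
import java.util.EnumSet;
import java.util.Set;

class SuspendedRetentionSketch {

    // Simplified stand-in for the terminal job states; not Flink's JobStatus.
    enum TerminalStatus { FINISHED, CANCELED, FAILED, SUSPENDED }

    // Hypothetical model: a policy is the set of terminal states in which
    // a checkpoint is disposed.
    static final class Policy {
        final String name;
        final Set<TerminalStatus> discardOn;

        Policy(String name, Set<TerminalStatus> discardOn) {
            this.name = name;
            this.discardOn = discardOn;
        }

        boolean shouldDiscard(TerminalStatus status) {
            return discardOn.contains(status);
        }

        // The proposed fix: the same policy, but never dispose on SUSPENDED.
        Policy retainingOnSuspension() {
            EnumSet<TerminalStatus> fixed = EnumSet.copyOf(discardOn);
            fixed.remove(TerminalStatus.SUSPENDED);
            return new Policy(name, fixed);
        }
    }

    // Behaviour reported in the issue: both policies dispose on SUSPENDED.
    // Membership of the other states is an assumption based on the names.
    static final Policy NEVER_RETAINED = new Policy(
            "CHECKPOINT_NEVER_RETAINED",
            EnumSet.allOf(TerminalStatus.class));

    static final Policy RETAINED_ON_FAILURE = new Policy(
            "CHECKPOINT_RETAINED_ON_FAILURE",
            EnumSet.of(TerminalStatus.FINISHED,
                       TerminalStatus.CANCELED,
                       TerminalStatus.SUSPENDED));

    public static void main(String[] args) {
        for (Policy p : new Policy[] {NEVER_RETAINED, RETAINED_ON_FAILURE}) {
            boolean before = p.shouldDiscard(TerminalStatus.SUSPENDED);
            boolean after = p.retainingOnSuspension()
                    .shouldDiscard(TerminalStatus.SUSPENDED);
            System.out.println(p.name + ": discard on SUSPENDED before fix = "
                    + before + ", after fix = " + after);
        }
    }
}
```

In this model the fix only removes {{SUSPENDED}} from each disposal set, so the handling of the other terminal states is unchanged. Once no policy disposes on suspension, the per-policy suspension case carries no information, which is the reason the issue suggests it could be removed.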



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
