[ https://issues.apache.org/jira/browse/FLINK-10751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Rohrmann updated FLINK-10751:
----------------------------------
    Fix Version/s: 1.8.0

> Checkpoints should be retained when job reaches suspended state
> ---------------------------------------------------------------
>
>                 Key: FLINK-10751
>                 URL: https://issues.apache.org/jira/browse/FLINK-10751
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Ufuk Celebi
>            Assignee: Ufuk Celebi
>            Priority: Minor
>             Fix For: 1.7.0
>
>
> {{CheckpointProperties}} define in which terminal job status a checkpoint
> should be disposed.
> I've noticed that the properties for {{CHECKPOINT_NEVER_RETAINED}} and
> {{CHECKPOINT_RETAINED_ON_FAILURE}} prescribe checkpoint disposal in the
> (locally) terminal job status {{SUSPENDED}}.
> Since a job reaches the {{SUSPENDED}} state when its {{JobMaster}} loses
> leadership, this would cause the checkpoint to be cleaned up and become
> unavailable for recovery by the new leader. Therefore, we should instead
> retain checkpoints when reaching job status {{SUSPENDED}}.
> *BUT:* Because we special-case this terminal state in the only highly
> available {{CompletedCheckpointStore}} implementation (see
> [ZooKeeperCompletedCheckpointStore|https://github.com/apache/flink/blob/e7ac3ba/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L315])
> and don't use regular checkpoint disposal, this issue has not surfaced yet.
> I think we should proactively fix the properties so that they indicate
> retaining checkpoints in the {{SUSPENDED}} state. We might even remove this
> case entirely since, with this change, all properties will indicate
> retention on suspension.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
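
For illustration only, the proposed behavior could be sketched roughly as below. This is not Flink's actual {{CheckpointProperties}} code: the enum {{RetentionPolicy}}, the method {{shouldDiscard}}, and the class {{RetentionSketch}} are hypothetical names, and the discard sets for {{FINISHED}}, {{CANCELED}}, and {{FAILED}} are assumptions inferred from the property names. The only point taken from the issue is that no property discards on {{SUSPENDED}}.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical stand-in for the job statuses discussed in the issue. */
enum JobStatus { FINISHED, CANCELED, FAILED, SUSPENDED }

/**
 * Hypothetical retention policies named after the properties in the issue.
 * Each policy lists the job statuses in which a checkpoint is discarded.
 * The FINISHED/CANCELED/FAILED entries are assumptions; the proposal itself
 * only requires that SUSPENDED never appears in any discard set.
 */
enum RetentionPolicy {
    CHECKPOINT_NEVER_RETAINED(
            EnumSet.of(JobStatus.FINISHED, JobStatus.CANCELED, JobStatus.FAILED)),
    CHECKPOINT_RETAINED_ON_FAILURE(
            EnumSet.of(JobStatus.FINISHED, JobStatus.CANCELED));

    private final Set<JobStatus> discardOn;

    RetentionPolicy(Set<JobStatus> discardOn) {
        this.discardOn = discardOn;
    }

    /** Returns true if a checkpoint should be disposed in the given job status. */
    boolean shouldDiscard(JobStatus status) {
        return discardOn.contains(status);
    }
}

class RetentionSketch {
    public static void main(String[] args) {
        // With the proposed fix, every policy retains checkpoints on suspension,
        // so a new leader can still recover from them.
        for (RetentionPolicy policy : RetentionPolicy.values()) {
            System.out.println(policy + " discards on SUSPENDED: "
                    + policy.shouldDiscard(JobStatus.SUSPENDED));
        }
    }
}
```

Because every policy answers the {{SUSPENDED}} question the same way in this sketch, the per-policy case becomes redundant, which matches the issue's remark that the case might be removed altogether.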