[ https://issues.apache.org/jira/browse/FLINK-20872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260942#comment-17260942 ]
Yun Tang commented on FLINK-20872:
----------------------------------

As we all know, savepoints and checkpoints are used for fault tolerance. If you start a job from a previous savepoint/checkpoint, the correct behavior is for the job to recover from that previous savepoint/checkpoint on failover, so that no data is lost. I don't see why we need to add a specific warning to describe this expected behavior. Moreover, I think the current documentation already describes it well enough:
["Checkpoints allow Flink to recover state and positions in the streams to give the application the same semantics as a failure-free execution."|https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/state/checkpointing.html]

I will close this ticket, as it conflicts with the basic idea of fault tolerance in Apache Flink.

> Job resume from history savepoint when failover if checkpoint is disabled
> -------------------------------------------------------------------------
>
>                 Key: FLINK-20872
>                 URL: https://issues.apache.org/jira/browse/FLINK-20872
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 1.10.0, 1.12.0
>            Reporter: Liu
>            Priority: Minor
>
> I have a long-running job. Checkpointing is disabled and a restartStrategy is
> set. At one point I upgraded the job through a savepoint. One day later, the
> job failed and restarted automatically, but it resumed from that earlier
> savepoint, so the job lagged heavily.
>
> I checked the code and found that the job first tries to resume from a
> checkpoint and then from a savepoint.
> {code:java}
> if (checkpointCoordinator != null) {
>     // check whether we find a valid checkpoint
>     if (!checkpointCoordinator.restoreInitialCheckpointIfPresent(
>             new HashSet<>(newExecutionGraph.getAllVertices().values()))) {
>         // check whether we can restore from a savepoint
>         tryRestoreExecutionGraphFromSavepoint(
>                 newExecutionGraph, jobGraph.getSavepointRestoreSettings());
>     }
> }
> {code}
> For a job whose checkpointing is disabled, internal failover should not resume
> from a previous savepoint, especially when the savepoint was taken long ago. In
> this situation, state loss is acceptable but lag is not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
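
The restore order discussed in this thread can be illustrated with a small standalone sketch. It is not Flink code: the class `RestoreOrderSketch`, the enum `RestoreSource`, and the method `chooseRestoreSource` are hypothetical names invented for illustration, and the model only assumes what the quoted snippet shows, namely that a completed checkpoint is preferred and the savepoint from the restore settings is used as a fallback.

```java
import java.util.Optional;

/** Hypothetical model of the restore order quoted in the issue; not Flink's actual classes. */
class RestoreOrderSketch {

    /** Where the job state would be restored from after a restart (illustrative only). */
    enum RestoreSource { LATEST_CHECKPOINT, INITIAL_SAVEPOINT, EMPTY_STATE }

    /**
     * Mirrors the quoted logic: prefer a completed checkpoint if one exists,
     * otherwise fall back to the savepoint the job was originally started from.
     */
    static RestoreSource chooseRestoreSource(
            Optional<String> latestCheckpointPath, Optional<String> savepointPath) {
        if (latestCheckpointPath.isPresent()) {
            return RestoreSource.LATEST_CHECKPOINT;
        }
        if (savepointPath.isPresent()) {
            return RestoreSource.INITIAL_SAVEPOINT;
        }
        return RestoreSource.EMPTY_STATE;
    }

    public static void main(String[] args) {
        // Reporter's scenario: no checkpoint was ever completed because checkpointing
        // is disabled, but the job was started from a savepoint (placeholder path).
        RestoreSource source =
                chooseRestoreSource(Optional.empty(), Optional.of("savepoint-placeholder"));
        System.out.println(source); // prints INITIAL_SAVEPOINT
    }
}
```

Under these assumptions, a job with no completed checkpoints always falls back to its original savepoint on failover, which is the behavior the reporter observed and the behavior the comment describes as expected.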