[ https://issues.apache.org/jira/browse/FLINK-20872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yun Tang closed FLINK-20872.
----------------------------
    Resolution: Won't Do

> Job resume from history savepoint when failover if checkpoint is disabled
> -------------------------------------------------------------------------
>
>                 Key: FLINK-20872
>                 URL: https://issues.apache.org/jira/browse/FLINK-20872
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 1.10.0, 1.12.0
>            Reporter: Liu
>            Priority: Minor
>
> I have a long-running job with checkpointing disabled and a restartStrategy
> set. I once upgraded the job through a savepoint. One day later, the job
> failed and restarted automatically, but it resumed from that earlier
> savepoint, so the job lagged heavily.
>
> I checked the code and found that the job first tries to resume from a
> checkpoint and then falls back to the savepoint.
> {code:java}
> if (checkpointCoordinator != null) {
>     // check whether we find a valid checkpoint
>     if (!checkpointCoordinator.restoreInitialCheckpointIfPresent(
>             new HashSet<>(newExecutionGraph.getAllVertices().values()))) {
>         // check whether we can restore from a savepoint
>         tryRestoreExecutionGraphFromSavepoint(
>                 newExecutionGraph, jobGraph.getSavepointRestoreSettings());
>     }
> }
> {code}
> For a job with checkpointing disabled, internal failover should not resume
> from the previous savepoint, especially when that savepoint was taken long
> ago. In this situation, state loss is acceptable but lag is not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
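
The restore order described in the report can be modelled as a small decision function. This is a minimal sketch, not Flink code: the class RestoreOrderSketch, the enum RestoreSource, and the method chooseRestoreSource are hypothetical names, and the boolean parameters are assumed stand-ins for the coordinator null check, the result of restoreInitialCheckpointIfPresent, and whether savepoint restore settings are configured. It only mirrors the control flow of the quoted snippet.

```java
public class RestoreOrderSketch {

    /** Hypothetical label for where state would be restored from. */
    public enum RestoreSource { CHECKPOINT, SAVEPOINT, NONE }

    /**
     * Mirrors the quoted control flow: a checkpoint is preferred, and the
     * savepoint is only tried when no checkpoint was restored.
     * All three parameters are illustrative assumptions, not Flink APIs.
     */
    public static RestoreSource chooseRestoreSource(
            boolean hasCheckpointCoordinator,
            boolean checkpointPresent,
            boolean savepointConfigured) {
        if (!hasCheckpointCoordinator) {
            // The quoted snippet does nothing when the coordinator is null.
            return RestoreSource.NONE;
        }
        if (checkpointPresent) {
            return RestoreSource.CHECKPOINT;
        }
        // No checkpoint was restored, so fall back to the savepoint if one is set.
        return savepointConfigured ? RestoreSource.SAVEPOINT : RestoreSource.NONE;
    }

    public static void main(String[] args) {
        // The reporter's scenario as modelled here: no checkpoint exists,
        // but the job was started from a savepoint.
        System.out.println(chooseRestoreSource(true, false, true));
    }
}
```

In this model, a job without any completed checkpoint but with a configured savepoint always lands on the savepoint branch, which corresponds to the behaviour the reporter observed.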