[jira] [Commented] (FLINK-20872) Job resume from history savepoint when failover if checkpoint is disabled

Yun Tang (Jira) Thu, 07 Jan 2021 18:34:06 -0800


    [ https://issues.apache.org/jira/browse/FLINK-20872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260942#comment-17260942 ]


Yun Tang commented on FLINK-20872:
----------------------------------

As we all know, savepoints and checkpoints are used for fault tolerance. If you 
choose to start the job from a previous savepoint/checkpoint, the correct 
behavior is for the job to recover from that savepoint/checkpoint on failover, 
so that no data is lost. I don't see why we need to add a specific warning to 
describe this expected behavior. Moreover, I think the current documentation 
already describes this adequately: ["Checkpoints allow Flink to recover 
state and positions in the streams to give the application the same semantics 
as a failure-free 
execution."|https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/state/checkpointing.html]

I will close this ticket as this conflicts with the basic idea of fault 
tolerance in Apache Flink.

> Job resume from history savepoint when failover if checkpoint is disabled
> -------------------------------------------------------------------------
>
>                 Key: FLINK-20872
>                 URL: https://issues.apache.org/jira/browse/FLINK-20872
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 1.10.0, 1.12.0
>            Reporter: Liu
>            Priority: Minor
>
> I have a long-running job with checkpointing disabled and a restartStrategy 
> set. At one point I upgraded the job through a savepoint. One day later, the 
> job failed and restarted automatically, but it resumed from that earlier 
> savepoint, so the job lagged heavily.
>  
> I checked the code and found that the job first tries to resume from a 
> checkpoint and then from a savepoint.
> {code:java}
> if (checkpointCoordinator != null) {
>     // check whether we find a valid checkpoint
>     if (!checkpointCoordinator.restoreInitialCheckpointIfPresent(
>             new HashSet<>(newExecutionGraph.getAllVertices().values()))) {
>         // check whether we can restore from a savepoint
>         tryRestoreExecutionGraphFromSavepoint(
>                 newExecutionGraph, jobGraph.getSavepointRestoreSettings());
>     }
> }
> {code}
> For a job with checkpointing disabled, internal failover should not resume 
> from the previous savepoint, especially when that savepoint was taken long 
> ago. In this situation, state loss is acceptable but lag is not.
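
The restore order described in the quoted issue can be illustrated with a minimal, self-contained sketch. This is a simplified model of the decision only, not Flink's implementation: the class RestoreOrderSketch, the enum RestoreSource, and the method chooseRestoreSource are hypothetical names introduced for illustration, and the sketch assumes that a job with no completed checkpoint falls through to the savepoint it was originally started from.

```java
import java.util.Optional;

/** Hypothetical, simplified model of the restore order; not Flink's actual classes. */
final class RestoreOrderSketch {

    /** Where the job state would come from after a restart (illustrative names). */
    enum RestoreSource { LATEST_CHECKPOINT, INITIAL_SAVEPOINT, EMPTY_STATE }

    /**
     * Mirrors the shape of the quoted snippet: nothing is restored without a
     * coordinator; otherwise a completed checkpoint wins, and the savepoint
     * from the job's restore settings is only a fallback.
     */
    static RestoreSource chooseRestoreSource(
            boolean hasCheckpointCoordinator,
            Optional<String> latestCompletedCheckpoint,
            Optional<String> initialSavepointPath) {
        if (!hasCheckpointCoordinator) {
            return RestoreSource.EMPTY_STATE;
        }
        if (latestCompletedCheckpoint.isPresent()) {
            return RestoreSource.LATEST_CHECKPOINT;
        }
        if (initialSavepointPath.isPresent()) {
            return RestoreSource.INITIAL_SAVEPOINT;
        }
        return RestoreSource.EMPTY_STATE;
    }

    public static void main(String[] args) {
        // Assumed scenario from the issue: no checkpoint was ever completed,
        // so a restart falls back to the savepoint used for the upgrade.
        System.out.println(chooseRestoreSource(
                true, Optional.empty(), Optional.of("/savepoints/upgrade")));

        // With a completed checkpoint available, the savepoint is not consulted.
        System.out.println(chooseRestoreSource(
                true, Optional.of("chk-42"), Optional.of("/savepoints/upgrade")));
    }
}
```

Under these assumptions, the old savepoint is only reached when no completed checkpoint exists, which matches the lag the reporter describes: without newer checkpoints, the savepoint taken at upgrade time is the most recent state available.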



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
