[jira] [Commented] (FLINK-3397) Failed streaming jobs should fall back to the most recent checkpoint/savepoint

Ufuk Celebi (JIRA) Fri, 15 Jul 2016 10:46:58 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-3397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15379767#comment-15379767
 ]


Ufuk Celebi commented on FLINK-3397:
------------------------------------

Thanks for the reminder! :-)

The description of checkpoints and savepoints are mostly correct. Minor changes:

"Every time a new checkpoint is taken the older ones are discarded and only the 
latest is considered for any restoration on failure."
=> This is also configurable, you can keep around multiple completed 
checkpoints.

"These checkpointed state are never cleared unless the user wants to delete a 
savepoint and create a new one."
=> I would remove the last part "and create a new one" as this is independent 
of when savepoints are cleared. The important thing is that they are not 
automatically cleared.

The rest of the description is not correct:

"Any job submitted checks if there was a savepoint already available in the 
back end store."
This is not checked automatically, but the user provides the savepoint path to 
resume from.

Regarding resuming jobs: if a job was submitted with a savepoint path to 
recover from, it will always fall back to that state in the worst case. What 
does not happen is that it is falling back to any newer savepoints (even if 
some were triggered). This is what you describe on page 2. In general though I 
would refrain from any time consideration when talking about this, the 
checkpoint ID description is good though.

All in all it's great to see that you looked into the code before doing this. I 
fear though that these changes require some more  consideration about how 
savepoints are stored/accessed. They are currently mostly independent of the 
job from which they were created.

> Failed streaming jobs should fall back to the most recent checkpoint/savepoint
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-3397
>                 URL: https://issues.apache.org/jira/browse/FLINK-3397
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing, Streaming
>    Affects Versions: 1.0.0
>            Reporter: Gyula Fora
>            Priority: Minor
>         Attachments: FLINK-3397.pdf
>
>
> The current fallback behaviour in case of a streaming job failure is slightly 
> counterintuitive:
> If a job fails it will fall back to the most recent checkpoint (if any) even 
> if there were more recent savepoint taken. This means that savepoints are not 
> regarded as checkpoints by the system only points from where a job can be 
> manually restarted.
> I suggest to change this so that savepoints are also regarded as checkpoints 
> in case of a failure and they will also be used to automatically restore the 
> streaming job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-3397) Failed streaming jobs should fall back to the most recent checkpoint/savepoint

Reply via email to