[ https://issues.apache.org/jira/browse/FLINK-3397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15379767#comment-15379767 ]
Ufuk Celebi commented on FLINK-3397:
------------------------------------

Thanks for the reminder! :-)

The description of checkpoints and savepoints is mostly correct. Minor changes:

"Every time a new checkpoint is taken the older ones are discarded and only the latest is considered for any restoration on failure." => This is also configurable; you can keep around multiple completed checkpoints.

"These checkpointed state are never cleared unless the user wants to delete a savepoint and create a new one." => I would remove the last part, "and create a new one", as this is independent of when savepoints are cleared. The important thing is that they are not automatically cleared.

The rest of the description is not correct:

"Any job submitted checks if there was a savepoint already available in the back end store." This is not checked automatically; instead, the user provides the savepoint path to resume from.

Regarding resuming jobs: if a job was submitted with a savepoint path to recover from, it will always fall back to that state in the worst case. What does not happen is that it falls back to any newer savepoints (even if some were triggered). This is what you describe on page 2. In general, though, I would refrain from any time considerations when talking about this; the checkpoint ID description is good, though.

All in all, it's great to see that you looked into the code before doing this. I fear, though, that these changes require some more consideration of how savepoints are stored and accessed. They are currently mostly independent of the job from which they were created.
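To make the current fallback behaviour described above concrete, here is a minimal toy model in Python. It is only a sketch and not Flink code: the class `CompletedCheckpointStore`, its methods, and the savepoint paths are all hypothetical names chosen for illustration. It assumes that a configurable number of completed checkpoints is retained, that a failed job restores from the latest retained checkpoint, that it falls back to the savepoint supplied at submission only when no checkpoint exists, and that savepoints triggered later are never used for automatic recovery.

```python
from collections import deque
from typing import List, Optional, Tuple


class CompletedCheckpointStore:
    """Toy model (hypothetical, not a Flink API) of the fallback rules."""

    def __init__(self, max_retained: int = 1,
                 submitted_savepoint: Optional[str] = None) -> None:
        if max_retained < 1:
            raise ValueError("max_retained must be at least 1")
        # Only the newest `max_retained` checkpoint IDs are kept.
        self._checkpoints = deque(maxlen=max_retained)
        # Savepoint path the user passed when submitting the job, if any.
        self._submitted_savepoint = submitted_savepoint
        # Savepoints triggered while the job runs; recorded, never restored from.
        self._triggered_savepoints: List[str] = []

    def add_checkpoint(self, checkpoint_id: int) -> None:
        # Appending to a full deque discards the oldest checkpoint.
        self._checkpoints.append(checkpoint_id)

    def trigger_savepoint(self, path: str) -> None:
        # Savepoints are not cleared automatically and do not affect recovery.
        self._triggered_savepoints.append(path)

    def retained(self) -> List[int]:
        return list(self._checkpoints)

    def restore_point(self) -> Optional[Tuple[str, object]]:
        # Prefer the latest retained checkpoint.
        if self._checkpoints:
            return ("checkpoint", self._checkpoints[-1])
        # Worst case: the savepoint the job was submitted with.
        if self._submitted_savepoint is not None:
            return ("savepoint", self._submitted_savepoint)
        return None
```

In this model, a job submitted with a savepoint path and two retained checkpoints would restore from that savepoint before any checkpoint completes, and from the newest checkpoint afterwards, regardless of any savepoints triggered in between. The change proposed in this issue would amount to letting `restore_point` also consider the triggered savepoints.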
> Failed streaming jobs should fall back to the most recent checkpoint/savepoint
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-3397
>                 URL: https://issues.apache.org/jira/browse/FLINK-3397
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing, Streaming
>    Affects Versions: 1.0.0
>            Reporter: Gyula Fora
>            Priority: Minor
>         Attachments: FLINK-3397.pdf
>
>
> The current fallback behaviour in case of a streaming job failure is slightly
> counterintuitive:
> If a job fails it will fall back to the most recent checkpoint (if any) even
> if there were more recent savepoint taken. This means that savepoints are not
> regarded as checkpoints by the system only points from where a job can be
> manually restarted.
> I suggest to change this so that savepoints are also regarded as checkpoints
> in case of a failure and they will also be used to automatically restore the
> streaming job.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)