Hi all,

Currently, savepoints are simply completed checkpoints, and Flink provides commands (save/run) for saving and restoring jobs. In the near future, however, savepoints will differ significantly from checkpoints: they will use a common serialization format and allow recovery across major updates. As a result, saving and restoring based on savepoints will become more costly.
To provide efficient saving and restoring of jobs, we propose adding two more commands to Flink, SUSPEND and RESUME, both based on checkpoints. Because the implementation of checkpoints depends on the backends (and many other components in Flink), suspending and resuming may not work when there are major changes to the job or to Flink (e.g., a different backend). But because these commands build on checkpoints rather than savepoints, they are expected to be more efficient.

The details of the design can be found in the Google Doc:
Support Resuming and Suspending of Flink Jobs <https://docs.google.com/document/d/1c3vUOTrNlCu2uhfi5ZNYpAguoFR03NgQWZpDTkSxVjg/edit?usp=sharing>

Looking forward to your comments. Any feedback is appreciated. :)

Thanks,
Xiaogang