Hello,
I am confused about the use of savepoints and checkpoints in different
scenarios.
I understand that the main purpose of checkpoints is fault tolerance, that
they are more lightweight, and that they don't support changing the job
graph, parallelism or state backend when restoring from them, as mentioned
in the latest 1.13 docs:
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/state/checkpoints/#difference-to-savepoints

At the same time:
1) Reactive Mode (in 1.13) uses checkpoints for exactly that purpose:
rescaling.
2) There are use cases like here:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/What-happens-when-a-job-is-rescaled-td39462.html
where people seem to be using retained checkpoints instead of savepoints to
do manual job restarts with rescaling.
3) There are claims like here:
https://lists.apache.org/thread.html/4299518f4da2810aa88fe6b21f841880b619f3f8ac264084a318c034%40%3Cuser.flink.apache.org%3E
that in an HA setup the JobManager is able to restart from a checkpoint even
if operators are added/removed or parallelism is changed (in this case I'm
not sure whether the checkpoints used by the HA JM in
`high-availability.storageDir` are the same thing as the usual checkpoints).

So I guess the questions are:
1) Can retained checkpoints be safely used for manually restarting and
rescaling a job?
2) Are checkpoints made by the HA JM structurally different from the usual
ones? Can they be used to restore a job with a changed job graph?

Thank you,
Igor
