Github user StephanEwen commented on the issue:
https://github.com/apache/flink/pull/4828
I think this approach is not yet sufficient. A failure in DEPLOY can happen
for various reasons, and a failed checkpoint restore is only one of them.
This also couples execution graph state to the checkpoint coordinator (via
the last restored checkpoint ID), which breaks the design's separation of
responsibilities.
A proper solution here is probably more comprehensive and needs more
thought, probably a bigger design document. My first thought would be to
report a proper RestoreException from the TaskManager, keep a history of the
exceptions that triggered recovery, use that history to evaluate fallback, etc.
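To make that first thought a bit more concrete, here is a rough sketch in Java. Everything in it is an assumption for illustration only: `RestoreException` as written here, `RecoveryHistory`, `shouldFallBack`, and the threshold rule are hypothetical names and logic, not existing Flink classes or a worked-out design.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch only; none of these types exist in Flink.
public class RecoveryHistorySketch {

    /** Hypothetical exception a TaskManager could report when restoring state fails. */
    static class RestoreException extends Exception {
        final long checkpointId;

        RestoreException(long checkpointId, Throwable cause) {
            super("Restore from checkpoint " + checkpointId + " failed", cause);
            this.checkpointId = checkpointId;
        }
    }

    /** Bounded history of the failures that triggered recovery, newest first. */
    static class RecoveryHistory {
        private final Deque<Throwable> failures = new ArrayDeque<>();
        private final int capacity;

        RecoveryHistory(int capacity) {
            if (capacity < 1) {
                throw new IllegalArgumentException("capacity must be >= 1");
            }
            this.capacity = capacity;
        }

        void record(Throwable failure) {
            if (failures.size() == capacity) {
                failures.removeLast(); // drop the oldest entry
            }
            failures.addFirst(failure);
        }

        /**
         * Assumed rule: fall back to an earlier checkpoint only if the most recent
         * {@code threshold} failures were all restore failures of the same checkpoint.
         * Any other failure type (e.g. a network error) resets the evaluation.
         */
        boolean shouldFallBack(long checkpointId, int threshold) {
            int seen = 0;
            for (Throwable t : failures) { // iterates newest to oldest
                if (!(t instanceof RestoreException)
                        || ((RestoreException) t).checkpointId != checkpointId) {
                    return false;
                }
                if (++seen == threshold) {
                    return true;
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        RecoveryHistory history = new RecoveryHistory(10);
        history.record(new RestoreException(7L, new RuntimeException("corrupt state")));
        history.record(new RestoreException(7L, new RuntimeException("corrupt state")));
        System.out.println(history.shouldFallBack(7L, 2));
    }
}
```

The point of the sketch is that the fallback decision lives next to the failure history rather than in the execution graph, and that failures unrelated to restore do not count toward falling back.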
---