Hi Flinkers,

While operating various Flink jobs, I've found that restarting a job with fairly large state (100GB+ in my case) takes quite a long time. For example, when a job is restarted to deploy an update, a savepoint is created. With savepoints, all the state seems to be pushed to the distributed store (HDFS in my case) when the job is stopped, and pulled back when the new version of the job starts.

What I've found so far while trying to speed up job restarts:

- using externalized (retained) checkpoints [1]; the drawback is that the job cannot be rescaled during the restart
- using external state storage with stateless jobs; the drawback is the additional network hops to this storage

So I'm wondering whether there are any best practices that the community knows and uses to cope with cases like this?

[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints
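
To put rough numbers on the savepoint round trip described above, here is a back-of-the-envelope sketch. It is not a measurement: the parallelism and the per-task write/read throughput values are purely hypothetical assumptions, and the model simply assumes the state is split evenly across tasks and is written once on stop and read once on start.

```python
def transfer_seconds(state_gb: float, parallelism: int, mb_per_sec_per_task: float) -> float:
    """Seconds to move the whole state once, assuming an even split across tasks."""
    state_mb = state_gb * 1024
    return state_mb / (parallelism * mb_per_sec_per_task)


def restart_seconds(state_gb: float, parallelism: int,
                    write_mb_s: float, read_mb_s: float) -> float:
    """Savepoint round trip: write the state on stop, read it back on start."""
    return (transfer_seconds(state_gb, parallelism, write_mb_s)
            + transfer_seconds(state_gb, parallelism, read_mb_s))


if __name__ == "__main__":
    # Hypothetical inputs: 100 GB of state, parallelism 8,
    # 50 MB/s write and 80 MB/s read per task (assumed, not measured).
    estimate = restart_seconds(100, 8, 50.0, 80.0)
    print(f"estimated state transfer time: {estimate / 60:.1f} min")
```

Even under these optimistic assumptions the transfer alone takes minutes, which is why avoiding the full upload/download (or making it incremental) is what I am looking for.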