Hi,

Have you tried task-local recovery [1]?
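
In case it is useful, here is a minimal sketch of how it could be enabled in flink-conf.yaml. I'm assuming a Flink version that supports task-local recovery (1.5 or later, as far as I know), and the directory path is a made-up placeholder, so please check the keys against the docs for your version.

```yaml
# flink-conf.yaml (sketch; verify the keys against the docs for your Flink version)

# Keep a secondary copy of task state on the task manager's local disk,
# so a task rescheduled to the same task manager can restore from it
# instead of reading everything back from the distributed store.
state.backend.local-recovery: true

# Optional: where the local state copies are stored.
# The path below is a hypothetical placeholder.
taskmanager.state.local.root-dirs: /data/flink/local-state
```

As far as I understand, this targets recovery after a failure, and checkpoints are still written to the distributed store, so it may not help much with a savepoint-and-upgrade restart.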

[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints

Best,
Paul Lam

> On April 17, 2019, at 17:46, Sergey Zhemzhitsky <szh.s...@gmail.com> wrote:
>
> Hi Flinkers,
>
> While operating different Flink jobs, I've discovered that restarting
> a job with a fairly large state (up to 100GB+ in my case) takes quite
> a lot of time. For example, to restart a job (e.g. to update it) a
> savepoint is created, and with savepoints all the state seems to be
> pushed into the distributed store (HDFS in my case) when stopping the
> job and pulled back when starting the new version of the job.
>
> What I've found so far while trying to speed up job restarts is:
> - using external retained checkpoints [1]; the drawback is that the
>   job cannot be rescaled during restart
> - using external state and storage with stateless jobs; the
>   drawback is the need for additional network hops to this storage.
>
> So I'm wondering whether there are any best practices the community
> knows and uses to cope with cases like this?
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints