
Have you tried task local recovery [1]?
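
For reference, a minimal sketch of how task-local recovery could be switched on in flink-conf.yaml, assuming Flink 1.5 or later. The directory path is only a placeholder, not something taken from this thread. As far as I know, local recovery mainly helps when a job recovers from a failure using a checkpoint, so it may not speed up a stop-with-savepoint followed by a resubmit.

```yaml
# flink-conf.yaml (sketch; adjust to your setup)

# Keep a secondary copy of task state on the TaskManager's local disk,
# so recovery can read it locally instead of pulling it from HDFS.
state.backend.local-recovery: true

# Optional: where the local copies are stored.
# /data/flink/local-state is a hypothetical path used for illustration.
taskmanager.state.local.root-dirs: /data/flink/local-state
```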


Paul Lam

> On April 17, 2019, at 17:46, Sergey Zhemzhitsky <szh.s...@gmail.com> wrote:
> Hi Flinkers,
> While operating different Flink jobs, I've discovered that restarting
> a job with fairly large state (in my case 100GB+) takes quite a lot
> of time. For example, to restart a job (e.g. to update it) a savepoint
> is created, and with savepoints all the state seems to be pushed to
> the distributed store (HDFS in my case) when stopping the job and
> pulled back when starting the new version of the job.
> What I've found so far while trying to speed up job restarts is:
> - using external retained checkpoints [1]; the drawback is that the
> job cannot be rescaled during restart
> - using external state storage with stateless jobs; the drawback is
> the additional network hops to this storage.
> So I'm wondering whether there are any best practices the community
> knows and uses to cope with cases like this?
> [1] 
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints
