Fast restart of a job with a large state

Sergey Zhemzhitsky Wed, 17 Apr 2019 02:47:30 -0700

Hi Flinkers,

While operating various Flink jobs, I've noticed that restarting a job
with fairly large state (up to 100GB+ in my case) takes quite a lot of
time. For example, to restart a job (e.g. to update it), a savepoint is
created. With savepoints, all the state seems to be pushed to the
distributed store (HDFS in my case) when the job is stopped, and pulled
back when the new version of the job starts.
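
To make the cost concrete, here is a rough back-of-envelope sketch of
the write-then-read round trip described above. The throughput and
parallelism figures are assumptions chosen purely for illustration, not
measurements from my cluster, and the function names are hypothetical.

```python
# Rough estimate of a savepoint-based restart: the full state is written
# to the distributed store on stop and read back on start.
# All throughput numbers below are assumed, not measured.

def transfer_seconds(state_gb: float, throughput_mb_s: float, parallelism: int = 1) -> float:
    """Seconds to move state_gb of state when each of `parallelism`
    tasks moves its share at throughput_mb_s (MB per second)."""
    total_mb = state_gb * 1024
    return total_mb / (throughput_mb_s * parallelism)


def restart_seconds(state_gb: float, write_mb_s: float, read_mb_s: float,
                    parallelism: int = 1) -> float:
    """Savepoint write time plus state restore time, ignoring scheduling
    and any other overhead."""
    write = transfer_seconds(state_gb, write_mb_s, parallelism)
    read = transfer_seconds(state_gb, read_mb_s, parallelism)
    return write + read


if __name__ == "__main__":
    # Assumed: 100 GB of state, 8 parallel tasks,
    # 50 MB/s write and 100 MB/s read per task.
    est = restart_seconds(100, write_mb_s=50, read_mb_s=100, parallelism=8)
    print(f"{est / 60:.1f} minutes")  # prints "6.4 minutes"
```

Even under these fairly optimistic assumptions, the job is down for
several minutes on every update, and the time grows linearly with the
state size.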


What I've found so far while trying to speed up job restarts:
- using external retained checkpoints [1]; the drawback is that the
job cannot be rescaled during restart
- keeping the state in external storage and making the jobs stateless;
the drawback is the additional network hops to this storage.

So I'm wondering whether there are any best practices the community
knows and uses to cope with cases like this.

[1] 
https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints
