Hi Stefan, Paul,

Thanks for the tips! So far I have tried neither rescaling from checkpoints nor task-local recovery, so both are now on my list to test.

If I need not only to rescale a job but also to change its DAG, is there a way to have something like, let's call it, "local savepoints" or "incremental savepoints", so that the whole state does not have to be transferred to and from distributed storage?

Kind Regards,
Sergey

On Thu, Apr 18, 2019, 13:22 Stefan Richter <s.rich...@ververica.com> wrote:

> Hi,
>
> If rescaling is the problem, let me clarify that you can currently rescale
> from savepoints and all types of checkpoints (including incremental). If
> that was the only problem, then there is nothing to worry about - the
> documentation is only a bit conservative about this because we will not
> commit to an API guarantee that all future types of checkpoints will be
> rescalable. But currently they all are, and this is also very unlikely to
> change anytime soon.
>
> Paul, just to comment on your suggestion as well, local recovery would
> only help with failover. 1) It does not help for restarts by the user and
> 2) it also does not work for rescaling (2 is a consequence of 1, because
> failover never rescales, only restarts).
>
> Best,
> Stefan
>
> On 18. Apr 2019, at 12:07, Paul Lam <paullin3...@gmail.com> wrote:
>
> The URL in my previous mail is wrong, and it should be:
>
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#task-local-recovery
>
> Best,
> Paul Lam
>
> On Apr 18, 2019, at 18:04, Paul Lam <paullin3...@gmail.com> wrote:
>
> Hi,
>
> Have you tried task local recovery [1]?
>
> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints
>
> Best,
> Paul Lam
>
> On Apr 17, 2019, at 17:46, Sergey Zhemzhitsky <szh.s...@gmail.com> wrote:
>
> Hi Flinkers,
>
> While operating various Flink jobs I've discovered that restarting a job
> with a pretty large state (in my case up to 100GB+) takes quite a lot of
> time. For example, to restart a job (e.g. to update it) a savepoint is
> created, and in the case of savepoints all the state seems to be pushed
> into the distributed store (HDFS in my case) when stopping the job and
> pulled back when starting the new version of the job.
>
> What I've found so far while trying to speed up job restarts is:
> - using external retained checkpoints [1]; the drawback is that the
>   job cannot be rescaled during restart
> - using external state and storage with stateless jobs; the
>   drawback is the necessity of additional network hops to this storage.
>
> So I'm wondering whether there are any best practices the community knows
> and uses to cope with cases like this?
>
> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints
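
For anyone who wants to try what Stefan describes (resuming from a retained checkpoint with a different parallelism), here is a minimal sketch of a helper that assembles the `flink run` command line. It relies only on the CLI's `-s` (restore path) and `-p` (parallelism) options. The helper name `build_restore_command`, the checkpoint path and the jar name are hypothetical and not taken from this thread; adjust them to your setup.

```python
from typing import List, Optional


def build_restore_command(
    snapshot_path: str,
    jar_path: str,
    parallelism: Optional[int] = None,
) -> List[str]:
    """Build the argument list for `flink run`, restoring from a savepoint
    or a retained checkpoint, optionally with a new parallelism."""
    if not snapshot_path:
        raise ValueError("snapshot_path must not be empty")
    if parallelism is not None and parallelism < 1:
        raise ValueError("parallelism must be a positive integer")

    # -s points at the savepoint or retained checkpoint to restore from.
    cmd = ["flink", "run", "-s", snapshot_path]
    if parallelism is not None:
        # -p overrides the job parallelism, i.e. rescales on restore.
        cmd += ["-p", str(parallelism)]
    cmd.append(jar_path)
    return cmd


# Hypothetical values, for illustration only.
example = build_restore_command(
    "hdfs:///flink/checkpoints/JOB_ID/chk-42", "my-job.jar", parallelism=8
)
```

The resulting list can be passed to `subprocess.run(example, check=True)` on a machine where the `flink` CLI is on the PATH. Note that this only covers rescaling; it says nothing about whether a changed DAG is compatible with the stored state.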