Hi Gyula,

> I assumed it would only download at most 10GB and just start reading from
> remote, and the job would start up "immediately".


It won't start up immediately; instead, it clips the state before running.
The clipping is performed primarily on the remote side. It may involve
writing new state files, which can be cached on the local disk, but the
cache should not exceed the 10GB limit.

May I ask what checkpoint storage you are using? Could you also try starting
the job without rescaling and see whether it starts running immediately? It
would also be great if you could share some TaskManager logs from the
restore. I suspect the state clipping may involve too much file rewriting,
which would slow things down. I'll run a similar experiment on my side.
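
For reference, here is roughly the kind of configuration I have in mind. The
checkpoint directory value below is only a placeholder, and the primary-dir
key is what I believe ForSt uses for its remote working directory (it should
default to the checkpoint directory), so please correct me with your actual
settings:

"state.backend.type": "forst"
"state.backend.forst.cache.size-based-limit": "10GB"
# checkpoint storage -- placeholder path, please share yours
"state.checkpoints.dir": "s3://<bucket>/checkpoints"
# ForSt remote working directory; I believe it defaults to the checkpoint dir
"state.backend.forst.primary-dir": "checkpoint-dir"

In particular, knowing whether the ForSt working directory resolves to your
remote checkpoint storage or to the local disk would help narrow this down.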


Best,
Zakelly

On Fri, Apr 4, 2025 at 4:28 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hi All!
>
> I am experimenting with the ForSt state backend on 2.0.0 and noticed the
> following.
>
> If I have a job with a large state, let's say 500GB, and I now want to
> start the job with a lower parallelism on a single TaskManager, the job
> simply will not start, as the ForStIncrementalRestoreOperation tries to
> download all state locally (there is not enough disk space).
>
> I have these configs:
>
> "state.backend.type": "forst"
> "state.backend.forst.cache.size-based-limit": "10GB"
>
> I assumed it would only download at most 10GB and just start reading from
> remote, and the job would start up "immediately".
>
> What am I missing?
>
> Gyula
>
