Hi Gyula,

It seems ForSt is downloading state even on a no-rescale start.
It occurred to me that there is a limitation: ForSt won't store state files remotely if the synchronous state APIs are used. So is this a DataStream job using the old state APIs (not State V2), or a SQL job without asynchronous state support (see the list in [1])? Would you please check the TaskManager log and see whether 'ForStSync' shows up? That would mean ForSt is running in sync mode with purely local state.

[1] https://nightlies.apache.org/flink/flink-docs-release-2.0/docs/ops/state/disaggregated_state/#for-sql-jobs

Best,
Zakelly

On Fri, Apr 4, 2025 at 6:41 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> This is the flamegraph during the no-rescale restart. I couldn't attach it
> for the mailing list
>
> On Fri, Apr 4, 2025 at 12:24 PM Zakelly Lan <zakelly....@gmail.com> wrote:
>
>> Hi Gyula,
>>
>>> I assumed it will only download at most 10GB and just start reading from
>>> remote and the job should start up "immediately".
>>
>>
>> It won't start up immediately, instead it clips the state before running.
>> This clipping process is primarily performed on the remote side. This may
>> involve writing new state files, which could be cached on the local disk,
>> but it should not exceed the 10GB limit.
>>
>> May I ask what checkpoint storage are you using? And would you please try
>> to start the job without a rescale and see if it could start
>> running immediately? And it would be great if you could provide some logs
>> from the taskmanager during the restore. I suspect that state clipping may
>> involve too much file rewriting affecting the speed. I'll do a similar
>> experiment.
>>
>>
>> Best,
>> Zakelly
>>
>> On Fri, Apr 4, 2025 at 4:28 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>
>>> Hi All!
>>>
>>> I am experimenting with the ForSt state backend on 2.0.0 and I noticed
>>> the following thing.
>>>
>>> If I have a job with a larger state, let's say 500GB and now I want to
>>> start the job with a lower parallelism on a single TaskManager, the job
>>> will simply not start as the ForStIncrementalRestoreOperation tries to
>>> download all states locally (there is not enough disk space)
>>>
>>> I have these configs:
>>>
>>> "state.backend.type": "forst"
>>> "state.backend.forst.cache.size-based-limit": "10GB"
>>>
>>> I assumed it will only download at most 10GB and just start reading from
>>> remote and the job should start up "immediately".
>>>
>>> What am I missing?
>>>
>>> Gyula
>>>
>>
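
P.S. For the log check mentioned at the top of this message, here is a minimal sketch in Python. It only looks for the literal string 'ForStSync' in a log file; the function names and the log path in the usage note are hypothetical, and the sketch assumes the TaskManager log is a plain-text file you can read locally.

```python
from typing import Iterable, List

# Marker string to look for; taken from the reply above.
MARKER = "ForStSync"


def find_sync_mode_lines(lines: Iterable[str], marker: str = MARKER) -> List[str]:
    """Return every line that contains the marker, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if marker in line]


def scan_log_file(path: str, marker: str = MARKER) -> List[str]:
    """Scan a log file on disk and return the lines that contain the marker."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, encoding="utf-8", errors="replace") as log_file:
        return find_sync_mode_lines(log_file, marker)
```

Calling scan_log_file("/path/to/taskmanager.log") (placeholder path) returns the matching lines. A non-empty result would suggest ForSt is running in sync mode with local-only state; an empty result only means the string was not found in that particular file.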