Hi Zakelly!

The checkpoint storage is S3.

I have performed a simpler experiment: a single TaskManager with parallelism 6.
I let it accumulate about 40 GB of state, then restarted it with a smaller
local disk (15 GB). The job takes a very long time trying to download
everything, but eventually the TM crashes before the job starts.

No rescaling happened in this case (parallelism 6 -> 6).

Cheers,
Gyula

On Fri, Apr 4, 2025 at 12:24 PM Zakelly Lan <zakelly....@gmail.com> wrote:

> Hi Gyula,
>
>> I assumed it will only download at most 10GB and just start reading from
>> remote and the job should start up "immediately".
>
> It won't start up immediately, instead it clips the state before running.
> This clipping process is primarily performed on the remote side. This may
> involve writing new state files, which could be cached on the local disk,
> but it should not exceed the 10GB limit.
>
> May I ask what checkpoint storage are you using? And would you please try
> to start the job without a rescale and see if it could start
> running immediately? And it would be great if you could provide some logs
> from the taskmanager during the restore. I suspect that state clipping may
> involve too much file rewriting affecting the speed. I'll do a similar
> experiment.
>
> Best,
> Zakelly
>
> On Fri, Apr 4, 2025 at 4:28 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> Hi All!
>>
>> I am experimenting with the ForSt state backend on 2.0.0 and I noticed
>> the following thing.
>>
>> If I have a job with a larger state, let's say 500GB and now I want to
>> start the job with a lower parallelism on a single TaskManager, the job
>> will simply not start as the ForStIncrementalRestoreOperation tries to
>> download all states locally (there is not enough disk space)
>>
>> I have these configs:
>>
>> "state.backend.type": "forst"
>> "state.backend.forst.cache.size-based-limit": "10GB"
>>
>> I assumed it will only download at most 10GB and just start reading from
>> remote and the job should start up "immediately".
>>
>> What am I missing?
>>
>> Gyula