Re: ForSt State backend seem to try to download all state locally

Gyula Fóra Sat, 05 Apr 2025 10:24:28 -0700

Hi Zakelly!

The checkpoint storage is S3.


I ran a simpler experiment: a single TaskManager with parallelism 6. I
let it accumulate about 40 GB of state, then restarted it with a smaller
local disk (15 GB).
The job spends a very long time trying to download everything, and eventually
the TM crashes before the job starts.

No rescaling happened in this case: parallelism stayed the same (6 -> 6).
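
To make the numbers in this experiment concrete, here is a minimal sketch of
the disk arithmetic. It is only an illustration: the helper `fits_on_disk` is
hypothetical, and it assumes the 10GB cache limit from the original message
was still configured. It does not describe ForSt internals, only what the
observed behavior suggests.

```python
GB = 1024 ** 3

def fits_on_disk(bytes_needed: int, disk_bytes: int) -> bool:
    """Return True if the requested number of bytes fits on the local disk."""
    return bytes_needed <= disk_bytes

state_size = 40 * GB    # state accumulated before the restart
local_disk = 15 * GB    # local disk size after the restart
cache_limit = 10 * GB   # state.backend.forst.cache.size-based-limit (assumed still set)

# Expected behavior: at most the cache limit is stored locally, which fits.
expected_ok = fits_on_disk(min(state_size, cache_limit), local_disk)

# Observed behavior: the restore appears to download everything, which cannot fit.
observed_ok = fits_on_disk(state_size, local_disk)

print(expected_ok, observed_ok)
```

With these numbers, a restore that respects the cache limit fits on the 15 GB
disk, while a full download of the 40 GB state does not.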

Cheers,
Gyula

On Fri, Apr 4, 2025 at 12:24 PM Zakelly Lan <[email protected]> wrote:

> Hi Gyula,
>
>> I assumed it will only download at most 10GB and just start reading from
>> remote and the job should start up "immediately".
>
>
> It won't start up immediately; instead, it clips the state before running.
> This clipping process is primarily performed on the remote side. This may
> involve writing new state files, which could be cached on the local disk,
> but it should not exceed the 10GB limit.
>
> May I ask which checkpoint storage you are using? Would you please try
> to start the job without a rescale and see if it starts running
> immediately? It would also be great if you could provide some logs
> from the taskmanager during the restore. I suspect that state clipping may
> involve too much file rewriting, which affects the speed. I'll do a similar
> experiment.
>
>
> Best,
> Zakelly
>
> On Fri, Apr 4, 2025 at 4:28 PM Gyula Fóra <[email protected]> wrote:
>
>> Hi All!
>>
>> I am experimenting with the ForSt state backend on 2.0.0 and I noticed
>> the following thing.
>>
>> If I have a job with a larger state, let's say 500GB, and I want to
>> start it with a lower parallelism on a single TaskManager, the job
>> will simply not start, because the ForStIncrementalRestoreOperation tries to
>> download all state locally (and there is not enough disk space).
>>
>> I have these configs:
>>
>> "state.backend.type": "forst"
>> "state.backend.forst.cache.size-based-limit": "10GB"
>>
>> I assumed it will only download at most 10GB and just start reading from
>> remote and the job should start up "immediately".
>>
>> What am I missing?
>>
>> Gyula
>>
>
