Hi Gyula,

It seems ForSt is downloading state even on a no-rescale start.

It occurred to me that there is a limitation: ForSt won't store state files
remotely if the synchronous state APIs are used. So is this a DataStream
job using the old state APIs (not State V2), or a SQL job whose operators
lack asynchronous state support (see the list in [1])? Would you please
check the taskmanager log for 'ForStSync'? If it shows up, ForSt is running
in sync mode with purely local state.
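
In case it helps, here is a minimal Python sketch of that log check. It only
relies on the marker string 'ForStSync' mentioned above; the function names
and the log path passed on the command line are hypothetical, and nothing is
assumed about the exact format of the log lines.

```python
import sys
from pathlib import Path

# Marker taken from the message above; its presence in the taskmanager log
# suggests ForSt is running in sync mode with local state only.
MARKER = "ForStSync"


def matching_lines(log_text: str, marker: str = MARKER) -> list[str]:
    """Return every log line that contains the marker substring."""
    return [line for line in log_text.splitlines() if marker in line]


def runs_in_sync_mode(log_text: str, marker: str = MARKER) -> bool:
    """Return True if at least one log line mentions the marker."""
    return len(matching_lines(log_text, marker)) > 0


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_forst.py /path/to/taskmanager.log
    text = Path(sys.argv[1]).read_text(errors="replace")
    hits = matching_lines(text)
    if hits:
        print(f"Found {len(hits)} line(s) mentioning {MARKER}; first one:")
        print(hits[0])
    else:
        print(f"No line mentions {MARKER}.")
```

A plain grep for the same string on the taskmanager log would do just as well.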


[1]
https://nightlies.apache.org/flink/flink-docs-release-2.0/docs/ops/state/disaggregated_state/#for-sql-jobs

Best,
Zakelly

On Fri, Apr 4, 2025 at 6:41 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> This is the flamegraph during the no-rescale restart. I couldn't attach it
> to the mailing list.
>
> On Fri, Apr 4, 2025 at 12:24 PM Zakelly Lan <zakelly....@gmail.com> wrote:
>
>> Hi Gyula,
>>
>>> I assumed it will only download at most 10GB and just start reading from
>>> remote and the job should start up "immediately".
>>
>>
>> It won't start up immediately; instead, it clips the state before running.
>> This clipping process is primarily performed on the remote side. This may
>> involve writing new state files, which could be cached on the local disk,
>> but it should not exceed the 10GB limit.
>>
>> May I ask which checkpoint storage you are using? And would you please try
>> to start the job without a rescale and see if it can start
>> running immediately? It would also be great if you could provide some logs
>> from the taskmanager during the restore. I suspect that state clipping may
>> involve too much file rewriting, which slows down the restore. I'll run a
>> similar experiment.
>>
>>
>> Best,
>> Zakelly
>>
>> On Fri, Apr 4, 2025 at 4:28 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>
>>> Hi All!
>>>
>>> I am experimenting with the ForSt state backend on 2.0.0 and I noticed
>>> the following thing.
>>>
>>> If I have a job with large state, say 500GB, and I want to start it
>>> with a lower parallelism on a single TaskManager, the job will simply
>>> not start, as the ForStIncrementalRestoreOperation tries to download
>>> all state locally (there is not enough disk space).
>>>
>>> I have these configs:
>>>
>>> "state.backend.type": "forst"
>>> "state.backend.forst.cache.size-based-limit": "10GB"
>>>
>>> I assumed it will only download at most 10GB and just start reading from
>>> remote and the job should start up "immediately".
>>>
>>> What am I missing?
>>>
>>> Gyula
>>>
>>
