Re: Savepoint incomplete when job was killed after a cancel timeout

Paul Lam Tue, 29 Sep 2020 10:28:40 -0700

Hi Till,

Thanks a lot for the pointer! I tried restoring the job from the
savepoint in a dry run, and it worked!

I guess I misunderstood the configuration option and was confused by the
non-existent paths that the metadata contains.
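
For anyone who runs into the same confusion, here is a rough sketch of the arithmetic behind Till's estimate. It assumes the simplified rule from this thread (per-task state smaller than `state.backend.fs.memory-threshold` ends up in the _metadata file). The 20 KB threshold is only an illustrative value, not taken from our config, and the helper names are made up for this example.

```python
# Rough per-task state estimate, using the numbers quoted in this thread.
# ASSUMPTION: the threshold below is illustrative only; check the actual
# `state.backend.fs.memory-threshold` value in your own configuration.

KB = 1024


def avg_state_bytes(total_bytes: int, n: int) -> float:
    """Average state size if the total were spread evenly over n tasks."""
    return total_bytes / n


def stored_inline(state_bytes: float, threshold_bytes: int) -> bool:
    """Simplified rule from the thread: state smaller than the threshold
    is kept in the _metadata file rather than in a separate state file."""
    return state_bytes < threshold_bytes


total = 335 * KB       # total savepoint size reported by Till
threshold = 20 * KB    # ASSUMED value, not taken from the thread

per_task_120 = avg_state_bytes(total, 120)  # total of 120 tasks
per_task_60 = avg_state_bytes(total, 60)    # parallelism of 60

for label, size in [("120 tasks", per_task_120), ("60 subtasks", per_task_60)]:
    inline = stored_inline(size, threshold)
    print(f"{label}: {size / KB:.1f} KB per task, inline={inline}")

# Comparing the *total* size against the threshold gives the opposite answer,
# which is the mistake described above.
print(f"whole savepoint: inline={stored_inline(total, threshold)}")
```

With these numbers the per-task state is only a few KB and falls under the assumed threshold, whereas comparing the total savepoint size against the threshold would suggest that separate state files should exist.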

Best,
Paul Lam

On Tue, Sep 29, 2020 at 10:30 PM Till Rohrmann <[email protected]> wrote:

> Thanks for sharing the logs with me. It looks as if the total size of the
> savepoint is 335 KB for a job with a parallelism of 60 and a total of 120
> tasks. Hence, the average state size per task is roughly between 2.5 KB
> and 5 KB. I think that the state size threshold refers to the size of the
> per-task state. Hence, I believe that the _metadata file should contain all
> of your state. Have you tried restoring from this savepoint?
>
> Cheers,
> Till
>
> On Tue, Sep 29, 2020 at 3:47 PM Paul Lam <[email protected]> wrote:
>
>> Hi Till,
>>
>> Thanks for your quick reply.
>>
>> The checkpoint/savepoint size would be around 2 MB, which is larger than
>> `state.backend.fs.memory-threshold`.
>>
>> The JobManager logs are attached; they look normal to me.
>>
>> Thanks again!
>>
>> Best,
>> Paul Lam
>>
>> On Tue, Sep 29, 2020 at 8:32 PM Till Rohrmann <[email protected]> wrote:
>>
>>> Hi Paul,
>>>
>>> Could you share the JobManager logs with us? They might help us to
>>> better understand the order in which the operations occurred.
>>>
>>> How big do you expect the state to be? If it is smaller than
>>> state.backend.fs.memory-threshold, then the state data will be stored
>>> in the _metadata file.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Tue, Sep 29, 2020 at 1:52 PM Paul Lam <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> We have a Flink job that was stopped erroneously with no available
>>>> checkpoint/savepoint to restore from, and we are looking for some help
>>>> to narrow down the problem.
>>>>
>>>> How we ran into this problem:
>>>>
>>>> We stopped the job using the cancel-with-savepoint command (for
>>>> compatibility reasons), but the command timed out after 1 min because
>>>> there was some backpressure, so we force-killed the job with the YARN
>>>> kill command. Usually this would not cause trouble, because we can
>>>> still use the last checkpoint to restore the job.
>>>>
>>>> But this time, the last checkpoint dir had been cleaned up and was empty
>>>> (the number of retained checkpoints was 1). According to ZooKeeper and
>>>> the logs, the savepoint finished (the job master logged “Savepoint
>>>> stored in …”) right after the cancel timeout. However, the savepoint
>>>> directory contains only the _metadata file, and the other state files
>>>> referenced by the metadata are absent.
>>>>
>>>> Environment & Config:
>>>> - Flink 1.11.0
>>>> - YARN job cluster
>>>> - HA via ZooKeeper
>>>> - FsStateBackend
>>>> - Aligned, non-incremental checkpoints
>>>>
>>>> Any comments and suggestions are appreciated! Thanks!
>>>>
>>>> Best,
>>>> Paul Lam
>>>>
>>>>
