Hi Paul,

Could you share the JobManager logs with us? They might help us better understand the order in which the operations occurred.
How big are you expecting the state to be? If it is smaller than state.backend.fs.memory-threshold, the state data will be stored directly in the _metadata file (see the flink-conf.yaml sketch at the end of this mail).

Cheers,
Till

On Tue, Sep 29, 2020 at 1:52 PM Paul Lam <paullin3...@gmail.com> wrote:
> Hi,
>
> We have a Flink job that was stopped erroneously with no available
> checkpoint/savepoint to restore, and we are looking for some help to
> narrow down the problem.
>
> How we ran into this problem:
>
> We stopped the job using the cancel-with-savepoint command (due to a
> compatibility issue), but the command timed out after 1 min because
> there was some backpressure. So we force-killed the job with the yarn
> kill command. Usually this would not cause trouble, because we can
> still use the last checkpoint to restore the job.
>
> But this time, the last checkpoint dir was cleaned up and empty (the
> retained checkpoint number was 1). According to ZooKeeper and the
> logs, the savepoint finished (the job master logged “Savepoint stored
> in …”) right after the cancel timeout. However, the savepoint
> directory contains only the _metadata file, and the other state files
> referred to by the metadata are absent.
>
> Environment & Config:
> - Flink 1.11.0
> - YARN job cluster
> - HA via ZooKeeper
> - FsStateBackend
> - Aligned non-incremental checkpoint
>
> Any comments and suggestions are appreciated! Thanks!
>
> Best,
> Paul Lam
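
For reference, this is roughly what I mean by the threshold. It is only a minimal flink-conf.yaml sketch: the directories below are made up for illustration and the default value is from memory, so please check it against your own configuration:

    # FsStateBackend writes each state handle as a separate file in the
    # checkpoint/savepoint directory, EXCEPT handles smaller than the
    # threshold below, which are inlined into the _metadata file.
    state.backend: filesystem
    state.checkpoints.dir: hdfs:///flink/checkpoints   # hypothetical path
    state.savepoints.dir: hdfs:///flink/savepoints     # hypothetical path
    state.backend.fs.memory-threshold: 1024            # bytes; the 1.11 default, if I recall correctly

If every operator's state fits under that threshold, a savepoint consisting of nothing but a _metadata file can still be complete and restorable, so the directory you describe is not necessarily broken.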