Hi Paul,

Could you share the JobManager logs with us? They might help us better understand the order in which the operations occurred.
How big are you expecting the state to be? If it is smaller than state.backend.fs.memory-threshold, the state data will be stored directly in the _metadata file (see the flink-conf.yaml sketch at the end of this mail).

Cheers,
Till

On Tue, Sep 29, 2020 at 1:52 PM Paul Lam <paullin3...@gmail.com> wrote:
> Hi,
>
> We have a Flink job that was stopped erroneously with no available
> checkpoint/savepoint to restore, and we are looking for some help to
> narrow down the problem.
>
> How we ran into this problem:
>
> We stopped the job using the cancel-with-savepoint command (due to a
> compatibility issue), but the command timed out after 1 min because
> there was some backpressure. So we force-killed the job with the yarn
> kill command. Usually this would not cause trouble, because we can
> still use the last checkpoint to restore the job.
>
> But this time, the last checkpoint dir was cleaned up and empty (the
> retained checkpoint number was 1). According to ZooKeeper and the
> logs, the savepoint finished (the job master logged “Savepoint stored
> in …”) right after the cancel timeout. However, the savepoint
> directory contains only the _metadata file, and the other state files
> referred to by the metadata are absent.
>
> Environment & Config:
> - Flink 1.11.0
> - YARN job cluster
> - HA via ZooKeeper
> - FsStateBackend
> - Aligned non-incremental checkpoint
>
> Any comments and suggestions are appreciated! Thanks!
>
> Best,
> Paul Lam
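
For reference, this is roughly what I mean by the threshold. It is only a minimal flink-conf.yaml sketch: the directories below are made up for illustration and the default value is from memory, so please check it against your own configuration:

    # FsStateBackend writes each state handle as a separate file in the
    # checkpoint/savepoint directory, EXCEPT handles smaller than the
    # threshold below, which are inlined into the _metadata file.
    state.backend: filesystem
    state.checkpoints.dir: hdfs:///flink/checkpoints   # hypothetical path
    state.savepoints.dir: hdfs:///flink/savepoints     # hypothetical path
    state.backend.fs.memory-threshold: 1024            # bytes; the 1.11 default, if I recall correctly

If every operator's state fits under that threshold, a savepoint consisting of nothing but a _metadata file can still be complete and restorable, so the directory you describe is not necessarily broken.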