Hi Till,

Thanks a lot for the pointer! I tried to restore the job using the savepoint in a dry run, and it worked!
I guess I misunderstood the configuration option and was confused by the non-existent paths that the metadata contains.

Best,
Paul Lam

Till Rohrmann <[email protected]> wrote on Tue, Sep 29, 2020 at 10:30 PM:

> Thanks for sharing the logs with me. It looks as if the total size of the
> savepoint is 335 KB for a job with a parallelism of 60 and a total of 120
> tasks. Hence, the average state size per task is between 2.5 KB and 5 KB.
> I think that the state size threshold refers to the size of the per-task
> state. Hence, I believe that the _metadata file should contain all of your
> state. Have you tried restoring from this savepoint?
>
> Cheers,
> Till
>
> On Tue, Sep 29, 2020 at 3:47 PM Paul Lam <[email protected]> wrote:
>
>> Hi Till,
>>
>> Thanks for your quick reply.
>>
>> The checkpoint/savepoint size would be around 2 MB, which is larger than
>> `state.backend.fs.memory-threshold`.
>>
>> The JobManager logs are attached; they look normal to me.
>>
>> Thanks again!
>>
>> Best,
>> Paul Lam
>>
>> Till Rohrmann <[email protected]> wrote on Tue, Sep 29, 2020 at 8:32 PM:
>>
>>> Hi Paul,
>>>
>>> could you share with us the logs of the JobManager? They might help to
>>> better understand in which order each operation occurred.
>>>
>>> How big are you expecting the size of the state to be? If it is smaller
>>> than state.backend.fs.memory-threshold, then the state data will be stored
>>> in the _metadata file.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Tue, Sep 29, 2020 at 1:52 PM Paul Lam <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> We have a Flink job that was stopped erroneously with no available
>>>> checkpoint/savepoint to restore from,
>>>> and we are looking for some help to narrow down the problem.
>>>>
>>>> How we ran into this problem:
>>>>
>>>> We stopped the job using the cancel-with-savepoint command (for
>>>> compatibility reasons), but the command
>>>> timed out after 1 min because there was some backpressure. So we
>>>> force-killed the job with the YARN kill command.
>>>> Usually, this would not cause trouble because we can still use the
>>>> last checkpoint to restore the job.
>>>>
>>>> But this time, the last checkpoint dir was cleaned up and empty (the
>>>> number of retained checkpoints was 1).
>>>> According to ZooKeeper and the logs, the savepoint finished (the job master
>>>> logged “Savepoint stored in …”)
>>>> right after the cancel timeout. However, the savepoint directory
>>>> contains only the _metadata file, and the other
>>>> state files referred to by the metadata are absent.
>>>>
>>>> Environment & Config:
>>>> - Flink 1.11.0
>>>> - YARN job cluster
>>>> - HA via ZooKeeper
>>>> - FsStateBackend
>>>> - Aligned non-incremental checkpoint
>>>>
>>>> Any comments and suggestions are appreciated! Thanks!
>>>>
>>>> Best,
>>>> Paul Lam
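
The size estimate discussed in this thread can be reproduced with a rough back-of-the-envelope calculation. The sketch below is only illustrative: the function names are hypothetical, the 20 KB threshold is an assumed example value rather than the setting used in this job, and comparing an average per-task size against `state.backend.fs.memory-threshold` is a simplification of the reasoning above, not a description of how Flink decides this internally.

```python
def avg_state_kb(total_kb: float, num_units: int) -> float:
    """Average state size in KB when total_kb is spread evenly over num_units."""
    return total_kb / num_units


def likely_inlined(avg_kb: float, threshold_kb: float) -> bool:
    """Rough check: state below the threshold is expected to end up in _metadata."""
    return avg_kb < threshold_kb


# Numbers taken from the thread: 335 KB savepoint, 120 tasks, parallelism 60.
TOTAL_KB = 335
per_task_kb = avg_state_kb(TOTAL_KB, 120)  # lower bound of the estimate
per_slot_kb = avg_state_kb(TOTAL_KB, 60)   # upper bound of the estimate

# ASSUMPTION: example threshold only; check the value configured for your job.
ASSUMED_THRESHOLD_KB = 20

print(f"average state per unit: {per_task_kb:.2f} KB to {per_slot_kb:.2f} KB")
print("likely stored in _metadata:", likely_inlined(per_slot_kb, ASSUMED_THRESHOLD_KB))
```

If the average comes out well below the configured threshold, a savepoint directory that contains only `_metadata` is plausible, and a dry-run restore as described above is the way to confirm it.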
