Re: Savepoint incomplete when job was killed after a cancel timeout

2020-09-29 Thread Till Rohrmann
Glad to hear that your job data was not lost!

Cheers,
Till

Re: Savepoint incomplete when job was killed after a cancel timeout

2020-09-29 Thread Paul Lam
Hi Till,

Thanks a lot for the pointer! I tried to restore the job using the savepoint in a dry run, and it worked!

I guess I had misunderstood the configuration option and was confused by the non-existent paths that the metadata contains.

Best,
Paul Lam
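
For anyone following along, a dry-run restore like the one described above is usually done by resubmitting the job from the savepoint path. A minimal sketch, where the savepoint path and job jar are placeholders and not taken from this thread:

    # resume the job from the savepoint written during cancellation
    flink run -s hdfs:///flink/savepoints/savepoint-xxxx -d my-job.jar

If operators were removed since the savepoint was taken, -n / --allowNonRestoredState can be added to skip their state.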

Re: Savepoint incomplete when job was killed after a cancel timeout

2020-09-29 Thread Till Rohrmann
Thanks for sharing the logs with me. It looks as if the total size of the savepoint is 335 KB for a job with a parallelism of 60 and a total of 120 tasks. Hence, the average size of the state per task is between 2.5 KB and 5 KB. I think that the state size threshold refers to the size of the per-task state.
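
As a rough back-of-envelope check of the numbers above (the exact per-subtask distribution is not given in the thread, so this is only an approximation):

    335 KB / 120 tasks    ≈ 2.8 KB per task
    335 KB /  60 subtasks ≈ 5.6 KB per parallel subtask

Either way, each task's state is small enough to fall under a typical state.backend.fs.memory-threshold value, which would explain why the state ended up inlined in the savepoint's _metadata file instead of in the separate state files whose paths appeared to be missing.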

Re: Savepoint incomplete when job was killed after a cancel timeout

2020-09-29 Thread Till Rohrmann
Hi Paul,

Could you share the JobManager logs with us? They might help us better understand the order in which each operation occurred.

How big do you expect the state to be? If it is smaller than state.backend.fs.memory-threshold, then the state data will be stored in the _metadata file.
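
For readers unfamiliar with the option: it is set in flink-conf.yaml, and state handles smaller than the threshold are embedded in the savepoint/checkpoint metadata instead of being written as separate files. A minimal sketch, with an illustrative value rather than this cluster's actual setting:

    # flink-conf.yaml
    # state handles smaller than this are inlined into the _metadata file
    state.backend.fs.memory-threshold: 20kb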

Savepoint incomplete when job was killed after a cancel timeout

2020-09-29 Thread Paul Lam
Hi,

We have a Flink job that was stopped erroneously with no available checkpoint/savepoint to restore from, and we are looking for some help to narrow down the problem.

How we ran into this problem: we stopped the job using the cancel-with-savepoint command (due to a compatibility issue), but the command timed out.
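
For context, the command in question is cancel-with-savepoint, which newer Flink versions deprecate in favor of stop-with-savepoint. A minimal sketch of both, where the job ID and target directory are placeholders:

    # legacy cancel-with-savepoint, as used above
    flink cancel -s hdfs:///flink/savepoints <jobId>

    # newer stop-with-savepoint; -p/--savepointPath sets the target directory
    flink stop -p hdfs:///flink/savepoints <jobId>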