Re: Restart from checkpoint after program failure

2018-10-18 Thread Paul Lam
Hi, I think you need to specify the directory of a concrete checkpoint, not the root checkpoint directory, to restore the state. The directory name should look like chk-${id}. The job id changes when you re-submit the job, so the jobmanager is not able to recognize the retained checkpo
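
To make the point about chk-${id} concrete, here is a minimal sketch (not from the thread) that picks the newest chk-<id> directory under one job's checkpoint directory and builds a restore command from it. It assumes retained checkpoints on a local filesystem, a layout of <checkpoint root>/<job id>/chk-<id>, and a _metadata file inside each completed checkpoint. The helper names latest_checkpoint and restore_command, and the paths in the usage note, are hypothetical. The -s option of flink run is the one used to resume from a savepoint or retained checkpoint path.

```python
import re
from pathlib import Path
from typing import List, Optional

# Checkpoint directories are named chk-<id>, as described in the reply above.
CHK_PATTERN = re.compile(r"^chk-(\d+)$")


def latest_checkpoint(job_checkpoint_dir: str) -> Optional[Path]:
    """Return the chk-<id> directory with the highest id, or None if none exists.

    Assumption: a completed checkpoint contains a `_metadata` file, so
    directories without one are skipped as incomplete.
    """
    best_id = -1
    best_dir = None
    for entry in Path(job_checkpoint_dir).iterdir():
        match = CHK_PATTERN.match(entry.name)
        if not match or not entry.is_dir():
            continue
        if not (entry / "_metadata").exists():
            continue
        checkpoint_id = int(match.group(1))
        if checkpoint_id > best_id:
            best_id, best_dir = checkpoint_id, entry
    return best_dir


def restore_command(checkpoint_dir: Path, jar_path: str) -> List[str]:
    """Build the argument list for resubmitting the job from a checkpoint."""
    # Pass the concrete chk-<id> directory, not the checkpoint root.
    return ["flink", "run", "-s", str(checkpoint_dir), jar_path]
```

For example, with a hypothetical directory /tmp/checkpoints/<old job id> that contains chk-3 and chk-12, latest_checkpoint returns the chk-12 path, and restore_command(path, "myjob.jar") yields the arguments to run. Because the job id changes on resubmission, the old job's directory has to be supplied by hand.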

Restart from checkpoint after program failure

2018-10-17 Thread chrisr123
Hi Folks, I'm trying to restart my program with restored state from a checkpoint after a program failure (the restart strategies were tried but exhausted), but the restored state is not being picked up. What am I doing wrong here? *Summary* I'm using a very simple app on 1 node just to learn checkpointing. A