I don’t believe checkpoint/restart is supported in OMPI past the 1.6 series. There was some attempt to restore it, but that person graduated prior to fully completing the work.
> On Mar 23, 2016, at 9:14 AM, Meij, Henk <hm...@wesleyan.edu> wrote: > > So I've redone this with openmpi 1.10.2 and another piece of software (lammps > 16feb16) and get same results. > > Upon cr_restart I see the openlava_wrapper process, the mpirun process > reappearing but no orted and no lmp_mpi processes. Not obvious error > anywhere. Using the --save-all feature from BLCR and ignore pids. > > Does BLCR and openmpi work? Anybody have any idea as to where to look? > > -Henk > > From: Meij, Henk > Sent: Monday, March 21, 2016 12:24 PM > To: us...@open-mpi.org <mailto:us...@open-mpi.org> > Subject: RE: BLCR & openmpi > > hmm, I'm not correct. cr_restart starts with no errors, launches some of the > processes, then suspends itself. strace on mpirun on this manual invocation > yields the behavior same as below. > > -Henk > > [hmeij@swallowtail kflaherty]$ ps -u hmeij > PID TTY TIME CMD > 29481 ? 00:00:00 res > 29485 ? 00:00:00 1458575067.384 > 29488 ? 00:00:00 1458575067.384. > 29508 ? 00:00:00 cr_restart > 29509 ? 00:00:00 blcr_watcher > 29512 ? 00:00:02 lava.openmpi.wr > 29514 ? 00:38:35 mpirun > 30313 ? 00:00:01 sshd > 30314 pts/1 00:00:00 bash > 30458 ? 00:00:00 sleep > 30483 ? 00:00:00 sleep > 30650 pts/1 00:00:00 cr_restart > 30652 pts/1 00:00:00 lava.openmpi.wr > 30653 pts/1 00:00:00 mpirun > 30729 pts/1 00:00:00 ps > [hmeij@swallowtail kflaherty]$ jobs > [1]+ Stopped cr_restart --no-restore-pid --no-restore-pgid > --no-restore-sid --relocate /sanscratch/383=/sanscratch/000 > /sanscratch/checkpoints/383/chk.28244 > From: Meij, Henk > Sent: Monday, March 21, 2016 12:04 PM > To: us...@open-mpi.org <mailto:us...@open-mpi.org> > Subject: BLCR & openmpi > > openmpi1.2 (yes, I know old),python 2.6.1 blcr 0.8.5 > > when I attempt to cr_restart (having performed cr_checkpoint --save-all) I > can restart the job manually with blcr on a node. but when I go through my > openlava scheduler, the cr_restart launches mpirun, then nothing. no orted or > the python processes that were running. the new scheduler job performing the > restart puts in place the old machinefile and stderr and stdout files. here > is what I view on an strace of mpirun > > What problem is this pointing at? > Thanks, > > -Henk > > poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=6, events=POLLIN}, > {fd=11, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, > events=POLLIN}, {fd=10, events=POLLIN}], 8, 1000) = 8 ([{fd=5, > revents=POLLNVAL}, {fd=4, revents=POLLNVAL}, {fd=6, revents=POLLNVAL}, > {fd=11, revents=POLLNVAL}, {fd=7, revents=POLLNVAL}, {fd=8, > revents=POLLNVAL}, {fd=9, revents=POLLNVAL}, {fd=10, revents=POLLNVAL}]) > rt_sigprocmask(SIG_BLOCK, [INT USR1 USR2 TERM CHLD], NULL, 8) = 0 > rt_sigaction(SIGCHLD, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > rt_sigaction(SIGTERM, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > rt_sigaction(SIGINT, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > rt_sigaction(SIGUSR1, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > rt_sigaction(SIGUSR2, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > sched_yield() = 0 > rt_sigprocmask(SIG_BLOCK, [INT USR1 USR2 TERM CHLD], NULL, 8) = 0 > rt_sigaction(SIGCHLD, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > rt_sigaction(SIGTERM, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > rt_sigaction(SIGINT, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > rt_sigaction(SIGUSR1, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > rt_sigaction(SIGUSR2, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > > > _______________________________________________ > users mailing list > us...@open-mpi.org <mailto:us...@open-mpi.org> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > <http://www.open-mpi.org/mailman/listinfo.cgi/users> > Link to this post: > http://www.open-mpi.org/community/lists/users/2016/03/28806.php > <http://www.open-mpi.org/community/lists/users/2016/03/28806.php>