Both BLCR and Open MPI work just fine. Independently. Checkpointing and restarting a parallel application is not as simple as mixing 2 tools together (especially when we talk about a communication library, aka. MPI), they have to cooperate in order to achieve the desired goal of being able to continue the execution on another set of resources. Open MPI had support for C/R but this feature has been lost.
1. It is not clear from your email what exactly you checkpoint. Are you checkpointing the mpirun process, or are you checkpointing all the MPI processes ? 2. What are you recovering? Assuming that you checkpoint your MPI processes (and not the mpirun), what you can try to do during the recovery is to spawn a new set of MPI processes (that will give you new orteds) and then let each one of these processes call the corresponding BLCR cr_restart. 3. This will not give you a working MPI environment, as the processes will know each other from the original execution, and will be unable to connect to each other to resume communications. You will have to dig a little more in the code in order to achieve what you want/need. George. On Wed, Mar 23, 2016 at 12:14 PM, Meij, Henk <hm...@wesleyan.edu> wrote: > So I've redone this with openmpi 1.10.2 and another piece of software > (lammps 16feb16) and get same results. > > > > Upon cr_restart I see the openlava_wrapper process, the mpirun process > reappearing but no orted and no lmp_mpi processes. Not obvious error > anywhere. Using the --save-all feature from BLCR and ignore pids. > > > > Does BLCR and openmpi work? Anybody have any idea as to where to look? > > > > -Henk > > > ------------------------------ > *From:* Meij, Henk > *Sent:* Monday, March 21, 2016 12:24 PM > *To:* us...@open-mpi.org > *Subject:* RE: BLCR & openmpi > > hmm, I'm not correct. cr_restart starts with no errors, launches some > of the processes, then suspends itself. strace on mpirun on this manual > invocation yields the behavior same as below. > > > > -Henk > > > > [hmeij@swallowtail kflaherty]$ ps -u hmeij > PID TTY TIME CMD > 29481 ? 00:00:00 res > 29485 ? 00:00:00 1458575067.384 > 29488 ? 00:00:00 1458575067.384. > 29508 ? 00:00:00 cr_restart > 29509 ? 00:00:00 blcr_watcher > 29512 ? 00:00:02 lava.openmpi.wr > 29514 ? 00:38:35 mpirun > 30313 ? 00:00:01 sshd > 30314 pts/1 00:00:00 bash > 30458 ? 00:00:00 sleep > 30483 ? 00:00:00 sleep > 30650 pts/1 00:00:00 cr_restart > 30652 pts/1 00:00:00 lava.openmpi.wr > 30653 pts/1 00:00:00 mpirun > 30729 pts/1 00:00:00 ps > [hmeij@swallowtail kflaherty]$ jobs > [1]+ Stopped cr_restart --no-restore-pid > --no-restore-pgid --no-restore-sid --relocate > /sanscratch/383=/sanscratch/000 /sanscratch/checkpoints/383/chk.28244 > ------------------------------ > *From:* Meij, Henk > *Sent:* Monday, March 21, 2016 12:04 PM > *To:* us...@open-mpi.org > *Subject:* BLCR & openmpi > > openmpi1.2 (yes, I know old),python 2.6.1 blcr 0.8.5 > > > > when I attempt to cr_restart (having performed cr_checkpoint --save-all) I > can restart the job manually with blcr on a node. but when I go through my > openlava scheduler, the cr_restart launches mpirun, then nothing. no orted > or the python processes that were running. the new scheduler job performing > the restart puts in place the old machinefile and stderr and stdout files. > here is what I view on an strace of mpirun > > > > What problem is this pointing at? > > Thanks, > > > > -Henk > > > > poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=6, events=POLLIN}, > {fd=11, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, > {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 8, 1000) = 8 ([{fd=5, > revents=POLLNVAL}, {fd=4, revents=POLLNVAL}, {fd=6, revents=POLLNVAL}, > {fd=11, revents=POLLNVAL}, {fd=7, revents=POLLNVAL}, {fd=8, > revents=POLLNVAL}, {fd=9, revents=POLLNVAL}, {fd=10, revents=POLLNVAL}]) > rt_sigprocmask(SIG_BLOCK, [INT USR1 USR2 TERM CHLD], NULL, 8) = 0 > rt_sigaction(SIGCHLD, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > rt_sigaction(SIGTERM, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > rt_sigaction(SIGINT, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > rt_sigaction(SIGUSR1, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > rt_sigaction(SIGUSR2, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > sched_yield() = 0 > rt_sigprocmask(SIG_BLOCK, [INT USR1 USR2 TERM CHLD], NULL, 8) = 0 > rt_sigaction(SIGCHLD, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > rt_sigaction(SIGTERM, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > rt_sigaction(SIGINT, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > rt_sigaction(SIGUSR1, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > rt_sigaction(SIGUSR2, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], > SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0 > > > > > > _______________________________________________ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2016/03/28806.php >