I don’t believe checkpoint/restart is supported in OMPI past the 1.6 series. 
There was some attempt to restore it, but that person graduated prior to fully 
completing the work.


> On Mar 23, 2016, at 9:14 AM, Meij, Henk <hm...@wesleyan.edu> wrote:
> 
> So I've redone this with openmpi 1.10.2 and another piece of software (lammps 
> 16feb16) and get same results.
>  
> Upon cr_restart I see the openlava_wrapper process, the mpirun process 
> reappearing but no orted and no lmp_mpi processes. Not obvious error 
> anywhere. Using the --save-all feature from BLCR and ignore pids.
>  
> Does BLCR and openmpi work? Anybody have any idea as to where to look?
>  
> -Henk
>  
> From: Meij, Henk
> Sent: Monday, March 21, 2016 12:24 PM
> To: us...@open-mpi.org <mailto:us...@open-mpi.org>
> Subject: RE: BLCR & openmpi
> 
> hmm, I'm not correct. cr_restart starts with no errors, launches some of the 
> processes, then suspends itself. strace on mpirun on this manual invocation 
> yields the behavior same as below.
>  
> -Henk
>  
> [hmeij@swallowtail kflaherty]$ ps -u hmeij
>   PID TTY          TIME CMD
> 29481 ?        00:00:00 res
> 29485 ?        00:00:00 1458575067.384
> 29488 ?        00:00:00 1458575067.384.
> 29508 ?        00:00:00 cr_restart
> 29509 ?        00:00:00 blcr_watcher
> 29512 ?        00:00:02 lava.openmpi.wr
> 29514 ?        00:38:35 mpirun
> 30313 ?        00:00:01 sshd
> 30314 pts/1    00:00:00 bash
> 30458 ?        00:00:00 sleep
> 30483 ?        00:00:00 sleep
> 30650 pts/1    00:00:00 cr_restart
> 30652 pts/1    00:00:00 lava.openmpi.wr
> 30653 pts/1    00:00:00 mpirun
> 30729 pts/1    00:00:00 ps
> [hmeij@swallowtail kflaherty]$ jobs
> [1]+  Stopped                 cr_restart --no-restore-pid --no-restore-pgid 
> --no-restore-sid --relocate /sanscratch/383=/sanscratch/000 
> /sanscratch/checkpoints/383/chk.28244
> From: Meij, Henk
> Sent: Monday, March 21, 2016 12:04 PM
> To: us...@open-mpi.org <mailto:us...@open-mpi.org>
> Subject: BLCR & openmpi
> 
> openmpi1.2 (yes, I know old),python 2.6.1 blcr 0.8.5
>  
> when I attempt to cr_restart (having performed cr_checkpoint --save-all) I 
> can restart the job manually with blcr on a node. but when I go through my 
> openlava scheduler, the cr_restart launches mpirun, then nothing. no orted or 
> the python processes that were running. the new scheduler job performing the 
> restart puts in place the old machinefile and stderr and stdout files. here 
> is what I view on an strace of mpirun
>  
> What problem is this pointing at?
> Thanks,
>  
> -Henk
>  
> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=6, events=POLLIN}, 
> {fd=11, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, 
> events=POLLIN}, {fd=10, events=POLLIN}], 8, 1000) = 8 ([{fd=5, 
> revents=POLLNVAL}, {fd=4, revents=POLLNVAL}, {fd=6, revents=POLLNVAL}, 
> {fd=11, revents=POLLNVAL}, {fd=7, revents=POLLNVAL}, {fd=8, 
> revents=POLLNVAL}, {fd=9, revents=POLLNVAL}, {fd=10, revents=POLLNVAL}])
> rt_sigprocmask(SIG_BLOCK, [INT USR1 USR2 TERM CHLD], NULL, 8) = 0
> rt_sigaction(SIGCHLD, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
> SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
> rt_sigaction(SIGTERM, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
> SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
> rt_sigaction(SIGINT, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
> SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
> rt_sigaction(SIGUSR1, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
> SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
> rt_sigaction(SIGUSR2, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
> SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
> sched_yield()                           = 0
> rt_sigprocmask(SIG_BLOCK, [INT USR1 USR2 TERM CHLD], NULL, 8) = 0
> rt_sigaction(SIGCHLD, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
> SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
> rt_sigaction(SIGTERM, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
> SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
> rt_sigaction(SIGINT, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
> SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
> rt_sigaction(SIGUSR1, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
> SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
> rt_sigaction(SIGUSR2, {0x2b7ca19cb30a, [INT USR1 USR2 TERM CHLD], 
> SA_RESTORER|SA_RESTART, 0x397840f790}, NULL, 8) = 0
>  
>  
> _______________________________________________
> users mailing list
> us...@open-mpi.org <mailto:us...@open-mpi.org>
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
> <http://www.open-mpi.org/mailman/listinfo.cgi/users>
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/03/28806.php 
> <http://www.open-mpi.org/community/lists/users/2016/03/28806.php>

Reply via email to