0:00 lava.openmpi.wr
24157 ? 00:00:28 mpirun
24176 ? 00:00:00 sshd
24177 ? 00:00:00 ps
From: users [users-boun...@open-mpi.org] on behalf of George Bosilca
[bosi...@icl.utk.edu]
Sent: Wednesday, March 23, 2016 12:27 PM
To: Open MPI Users
Both BLCR and Open MPI work just fine. Independently.
Checkpointing and restarting a parallel application is not as simple as
mixing two tools together (especially when we talk about a communication
library, a.k.a. MPI); they have to cooperate in order to achieve the desired
goal of being able to cont
I don’t believe checkpoint/restart is supported in OMPI past the 1.6 series.
There was some attempt to restore it, but that person graduated prior to fully
completing the work.
> On Mar 23, 2016, at 9:14 AM, Meij, Henk wrote:
>
> So I've redone this with openmpi 1.10.2 and another piece of so
So I've redone this with openmpi 1.10.2 and another piece of software (lammps
16feb16) and get the same results.
Upon cr_restart I see the openlava_wrapper process and the mpirun process
reappearing, but no orted and no lmp_mpi processes. No obvious error anywhere.
Using the --save-all feature from
hmm, I'm not correct. cr_restart starts with no errors, launches some of the
processes, then suspends itself. strace on mpirun on this manual invocation
yields the same behavior as below.
-Henk
[hmeij@swallowtail kflaherty]$ ps -u hmeij
PID TTY TIME CMD
29481 ? 00:00:00 re
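
The manual check described in this thread (looking at the ps listing after cr_restart to see whether mpirun, orted and lmp_mpi came back) could be scripted roughly as follows. This is only a sketch: the helper names parse_ps, missing_processes and current_ps are made up for illustration, and it assumes the default `ps -u USER` layout shown above, where the command name is the last column.

```python
import subprocess


def parse_ps(text):
    """Return the command names found in `ps -u USER` style output.

    Lines whose first field is not a numeric PID (for example the
    "PID TTY TIME CMD" header) are skipped.
    """
    commands = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue
        # Assumption: the command name is the last whitespace-separated field.
        commands.append(fields[-1])
    return commands


def missing_processes(ps_text, expected):
    """List the expected command names that do not appear in ps_text."""
    running = set(parse_ps(ps_text))
    return [name for name in expected if name not in running]


def current_ps(user):
    """Capture a live `ps -u USER` listing (hypothetical convenience wrapper)."""
    result = subprocess.run(
        ["ps", "-u", user], capture_output=True, text=True, check=True
    )
    return result.stdout


# Sample listing modelled on the output quoted in this thread.
sample = """PID TTY TIME CMD
24157 ? 00:00:28 mpirun
24176 ? 00:00:00 sshd
24177 ? 00:00:00 ps
"""

# The process names are the ones mentioned in the messages above.
print(missing_processes(sample, ["mpirun", "orted", "lmp_mpi"]))
# prints ['orted', 'lmp_mpi']
```

On a node, `missing_processes(current_ps("hmeij"), ["mpirun", "orted", "lmp_mpi"])` would report which of the three are absent after the restart. It only tells you what is missing, not why the restart stalled.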
openmpi 1.2 (yes, I know, old), python 2.6.1, blcr 0.8.5
When I attempt to cr_restart (having performed cr_checkpoint --save-all) I can
restart the job manually with blcr on a node. But when I go through my openlava
scheduler, cr_restart launches mpirun, then nothing: no orted or the python
p