On Mon, Mar 22, 2010 at 8:20 PM, fengguang tian <ferny...@gmail.com> wrote:
> I set up a cluster of 18 nodes using Open MPI and the BLCR library, and
> the MPI program runs well on the cluster. But how do I checkpoint the
> MPI program on this cluster?
>
> For example, here is what I do for a test:
>
> mpiu@nimbus: /mirror$ mpirun -np 50 --hostfile .mpihostfile -am ft-enable-cr hellompi
>
> The program runs on the cluster. Then I enter:
>
> mpiu@nimbus: /mirror$ ompi-checkpoint -term $(pidof mpirun)
>
> But the MPI program is not terminated, unlike what happened on a single
> machine, although it did create a checkpoint file,
> "ompi_global_snapshot_14030.ckpt", in the home directory on the master node.

Are you using Open MPI 1.4 without a shared file system mounted at $HOME?
If so, take a look here:

http://www.open-mpi.org/community/lists/users/2010/03/12246.php

Regards,
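
P.S. In case it helps with the file system question, here is a minimal sketch for checking what kind of file system backs $HOME on a node. It assumes GNU coreutils `stat` on Linux. The helper names `home_fs_type` and `is_shared_fs_type` are hypothetical (they are not part of Open MPI or BLCR), and the list of network file system types is only a guess at common cases, so adjust it for your site.

```shell
#!/bin/sh
# Hypothetical helper: print the file system type backing a directory.
# Assumes GNU coreutils stat; defaults to $HOME when no argument is given.
home_fs_type() {
    stat -f -c %T "${1:-$HOME}"
}

# Hypothetical helper: succeed if the given type name looks like a
# network (shared) file system. The list is an assumption, not exhaustive.
is_shared_fs_type() {
    case "$1" in
        nfs|nfs4|lustre|gpfs|cifs|ceph) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use (run on the master node and on a compute node, then compare):
#   t=$(home_fs_type)
#   if is_shared_fs_type "$t"; then
#       echo "$HOME looks shared ($t)"
#   else
#       echo "$HOME looks local ($t)"
#   fi
```

If the compute nodes report a local type for $HOME, the situation described in the linked thread probably applies to you.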