On Mon, Mar 22, 2010 at 8:20 PM, fengguang tian <ferny...@gmail.com> wrote:
> I set up a cluster of 18 nodes using Open MPI and the BLCR library, and the MPI
> program runs well on the cluster,
> but how do I checkpoint the MPI program on this cluster?
> For example, here is what I do for a test:
> mpiu@nimbus: /mirror$ mpirun -np 50 --hostfile .mpihostfile -am ft-enable-cr hellompi
> The program runs on the cluster.
> Then I enter:
> mpiu@nimbus: /mirror$ ompi-checkpoint -term $(pidof mpirun)
>
> but the MPI program is not terminated the way it is on a single
> machine, although it created a checkpoint file
> "ompi_global_snapshot_14030.ckpt" in the home directory on the master node.

Are you using Open MPI 1.4 without a shared file system mounted at
$HOME? If yes, then take a look here:

http://www.open-mpi.org/community/lists/users/2010/03/12246.php

Regards,
