I have created the shared file system, but I created /mirror in the root directory, not in the $HOME directory. Is that the problem? Thank you.

cheers,
fengguang

On Tue, Mar 23, 2010 at 10:23 AM, Fernando Lemos <fernando...@gmail.com> wrote:
> On Mon, Mar 22, 2010 at 8:20 PM, fengguang tian <ferny...@gmail.com> wrote:
> > I set up a cluster of 18 nodes using Open MPI and the BLCR library, and
> > the MPI program runs well on the cluster,
> > but how do I checkpoint the MPI program on this cluster?
> > For example, here is what I do for a test:
> >
> > mpiu@nimbus: /mirror$ mpirun -np 50 --hostfile .mpihostfile -am ft-enable-cr hellompi
> >
> > The program runs on the cluster. Then I enter:
> >
> > mpiu@nimbus: /mirror$ ompi-checkpoint -term $(pidof mpirun)
> >
> > But the MPI program is not terminated as it was on a single machine,
> > although it created a checkpoint file "ompi_global_snapshot_14030.ckpt"
> > in the home directory on the master node.
>
> Are you using OpenMPI 1.4 without a shared file system mounted at
> $HOME? If yes, then take a look here:
>
> http://www.open-mpi.org/community/lists/users/2010/03/12246.php
>
> Regards,