I use

    mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir --hostfile .mpihostfile xxxx

to store the global checkpoint snapshot in the shared directory /mirror, but the problems are still there. When I run ompi-checkpoint, mpirun is still not killed; it just hangs. When I run ompi-restart, it shows:

mpiu@nimbus:/mirror$ ompi-restart ompi_global_snapshot_333.ckpt/
--------------------------------------------------------------------------
Error: The filename (ompi_global_snapshot_333.ckpt/) is invalid because either you have not provided a filename or provided an invalid filename.
Please see --help for usage.
--------------------------------------------------------------------------

cheers
fengguang

On Tue, Mar 23, 2010 at 10:34 AM, Fernando Lemos <fernando...@gmail.com> wrote:
> On Tue, Mar 23, 2010 at 12:27 PM, fengguang tian <ferny...@gmail.com> wrote:
> > I have created the shared file system, but I created /mirror in the root
> > directory, not in the $HOME directory. Is that the
> > problem? Thank you.
>
> Others might be able to give you a more accurate explanation. The way
> I understood it, in OpenMPI 1.4, you need to write all checkpoints to
> a single, shared location. That's why you generally want a shared file
> system.
>
> Now I'm pretty sure you can change the directory to which the
> checkpoints are written. If your $HOME isn't a shared directory, you
> can point OpenMPI to write the checkpoints to the shared directory
> instead.
>
> In OpenMPI 1.5 (unstable), some magic allows you to create the
> checkpoints and restore them without a shared directory.
>
> Regards,