On Mar 23, 2010, at 1:00 PM, Fernando Lemos wrote:

On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian <ferny...@gmail.com> wrote:

I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir
--hostfile .mpihostfile xxxx
to store the global checkpoint snapshot in the shared directory /mirror,
but the problems are still there. When I run ompi-checkpoint, mpirun is
still not killed; it hangs there.

So the 'ompi-checkpoint' command does not finish? By default 'ompi-checkpoint' does not terminate the MPI job. If you pass the '--term' option to it, then it will.
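
As a rough sketch of the '--term' usage mentioned above: 'ompi-checkpoint' is given the PID of the running mpirun process. The PID below (12345) is a made-up placeholder, not a value from this thread, and the snippet only prints the command rather than running it.

```shell
# Hypothetical PID of the running mpirun process; replace it with the real
# one (for example, as reported by `ps`).
mpirun_pid=12345

# Checkpoint the job and ask for it to be terminated afterwards.
checkpoint_cmd="ompi-checkpoint --term $mpirun_pid"

# Print the command so it can be reviewed before being run by hand.
echo "$checkpoint_cmd"
```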


When I run ompi-restart, it shows:

mpiu@nimbus:/mirror$ ompi-restart ompi_global_snapshot_333.ckpt/
--------------------------------------------------------------------------
Error: The filename (ompi_global_snapshot_333.ckpt/) is invalid because
either you have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------


Try removing the trailing '/' in the command. The current ompi-restart is not good at differentiating between:
  ompi_global_snapshot_333.ckpt
and
  ompi_global_snapshot_333.ckpt/
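
Since tab completion tends to add the trailing '/', one way to avoid typing the name by hand is to strip it in the shell first. This is only a minimal sketch, assuming a POSIX shell; the variable names are illustrative, and the restart itself is attempted only if 'ompi-restart' is found on the PATH.

```shell
# Snapshot reference as produced by tab completion (note the trailing '/').
snapshot="ompi_global_snapshot_333.ckpt/"

# POSIX parameter expansion: remove one trailing '/' if present.
snapshot_ref="${snapshot%/}"

echo "restarting from: $snapshot_ref"

# Only attempt the restart where Open MPI's ompi-restart is installed.
if command -v ompi-restart >/dev/null 2>&1; then
    ompi-restart "$snapshot_ref"
fi
```

A name that has no trailing '/' passes through the expansion unchanged, so the same lines work for either form.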


Have you tried OpenMPI 1.5? I got it to work with 1.5, but not with
1.4 (but then I didn't try 1.4 with a shared filesystem).

I would also suggest trying v1.4 or 1.5 to see if your problems persist with these versions.

-- Josh



_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
