On Mar 29, 2010, at 11:53 AM, fengguang tian wrote:

Hi,

I have used the --term option, but mpirun is still hanging; it is the same whether I include the '/' or not. I am installing v1.4 to see whether the problems are still there. I tried it, but some problems are still there.

What configure options did you use when building Open MPI?


BTW, my MPI program reads some input files and generates some output files after some computation. It can be checkpointed, but when I restart it, an error occurs. Have you run into this kind of problem?

Try putting 'snapc_base_global_snapshot_dir' in the $HOME/.openmpi/mca-params.conf file instead of just on the command line. Like:
snapc_base_global_snapshot_dir=/shared-dir/

I suspect that ompi-restart is looking in the wrong place for your checkpoint. By default it searches $HOME (since that is the default for snapc_base_global_snapshot_dir). If you put this parameter in the mca-params.conf file, it is set to the specified value in every tool (mpirun/ompi-checkpoint/ompi-restart), so ompi-restart will search the correct location for the checkpoint files.
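As a minimal sketch of that suggestion, the shell function below writes the parameter into an MCA parameter file. The function name set_snapshot_dir is hypothetical and not part of Open MPI; only the file location and the parameter name come from this thread, and /shared-dir/ is a placeholder for your shared directory.

```shell
# Hypothetical helper (not an Open MPI tool): record the snapshot directory
# in an MCA parameter file so mpirun, ompi-checkpoint and ompi-restart all
# read the same value.
set_snapshot_dir() {
    conf_file=$1       # e.g. "$HOME/.openmpi/mca-params.conf"
    snapshot_dir=$2    # e.g. /shared-dir/

    mkdir -p "$(dirname "$conf_file")"
    touch "$conf_file"

    # Drop any earlier setting of the parameter, then append the new one.
    grep -v '^snapc_base_global_snapshot_dir' "$conf_file" > "$conf_file.tmp" || true
    echo "snapc_base_global_snapshot_dir=$snapshot_dir" >> "$conf_file.tmp"
    mv "$conf_file.tmp" "$conf_file"
}

# Example call (left commented out so that sourcing this file changes nothing):
# set_snapshot_dir "$HOME/.openmpi/mca-params.conf" /shared-dir/
```

Rewriting the line instead of blindly appending keeps the file from accumulating conflicting values if you change the directory later.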

-- Josh


cheers
fengguang

On Mon, Mar 29, 2010 at 11:42 AM, Josh Hursey <jjhursey@open-mpi.org> wrote:

On Mar 23, 2010, at 1:00 PM, Fernando Lemos wrote:

On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian <ferny...@gmail.com> wrote:

I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir
--hostfile .mpihostfile xxxx
to store the global checkpoint snapshot in the shared
directory /mirror, but the problems are still there:
when I run ompi-checkpoint, mpirun is still not killed; it just
hangs there.

So the 'ompi-checkpoint' command does not finish? By default 'ompi-checkpoint' does not terminate the MPI job. If you pass the '--term' option to it, then it will.



when doing ompi-restart, it shows:

mpiu@nimbus:/mirror$ ompi-restart ompi_global_snapshot_333.ckpt/
--------------------------------------------------------------------------
Error: The filename (ompi_global_snapshot_333.ckpt/) is invalid because
either you have not provided a filename
      or provided an invalid filename.
      Please see --help for usage.

--------------------------------------------------------------------------


Try removing the trailing '/' in the command. The current ompi-restart is not good at differentiating between:

 ompi_global_snapshot_333.ckpt
and

 ompi_global_snapshot_333.ckpt/
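If the trailing '/' comes from shell tab completion, one way to avoid it is to strip it before calling ompi-restart. This is only a sketch: strip_trailing_slash is a hypothetical helper name, and the final line prints the command rather than running it.

```shell
# Hypothetical helper: remove a single trailing '/' from a snapshot reference
# using POSIX parameter expansion.
strip_trailing_slash() {
    printf '%s\n' "${1%/}"
}

snapshot_ref=$(strip_trailing_slash "ompi_global_snapshot_333.ckpt/")

# Show the command that would be run with the cleaned-up reference.
echo "ompi-restart $snapshot_ref"
```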


Have you tried OpenMPI 1.5? I got it to work with 1.5, but not with
1.4 (but then I didn't try 1.4 with a shared filesystem).

I would also suggest trying v1.4 or 1.5 to see if your problems persist with these versions.

-- Josh




_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
