On Mar 29, 2010, at 11:53 AM, fengguang tian wrote:
Hi,
I have used the --term option, but mpirun is still hanging. It is
the same whether I include the '/' or not. I am installing v1.4
to see whether the problems are still there. I tried, but some
problems are still there.
What configure options did you use when building Open MPI?
BTW, my MPI program reads some input files and generates some
output files after some computation. It can be checkpointed, but
when I restart it, an error occurs. Have you met this kind of
problem?
Try putting 'snapc_base_global_snapshot_dir' in the
$HOME/.openmpi/mca-params.conf file instead of just on the command
line. Like:
snapc_base_global_snapshot_dir=/shared-dir/
I suspect that ompi-restart is looking in the wrong place for your
checkpoint. By default it will search $HOME (since that is the default
for snapc_base_global_snapshot_dir). If you put this parameter in the
mca-params.conf file, then it is always set to the specified value in
every tool (mpirun, ompi-checkpoint, ompi-restart), so ompi-restart
will search the correct location for the checkpoint files.
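A minimal sketch of that step, in case it helps. The function name
set_snapshot_dir and its second argument are hypothetical, made up for
this example; only the parameter name and the
$HOME/.openmpi/mca-params.conf location come from the advice above.

```shell
# Hypothetical helper: record the snapshot directory in an MCA
# parameter file so that every Open MPI tool reads the same value.
#   $1 = snapshot directory reachable from all nodes
#   $2 = directory holding mca-params.conf (defaults to $HOME/.openmpi)
set_snapshot_dir() {
    snap_dir=$1
    conf_dir=${2:-$HOME/.openmpi}
    conf_file=$conf_dir/mca-params.conf

    mkdir -p "$conf_dir" || return 1

    # Drop any earlier setting so the file holds exactly one value.
    if [ -f "$conf_file" ]; then
        grep -v '^snapc_base_global_snapshot_dir=' "$conf_file" \
            > "$conf_file.tmp" || true
        mv "$conf_file.tmp" "$conf_file"
    fi

    echo "snapc_base_global_snapshot_dir=$snap_dir" >> "$conf_file"
}
```

Calling set_snapshot_dir /shared-dir/ once on the node where you run
mpirun, ompi-checkpoint and ompi-restart should be enough if $HOME is
shared; otherwise repeat it wherever those tools are launched.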
-- Josh
cheers
fengguang
On Mon, Mar 29, 2010 at 11:42 AM, Josh Hursey <jjhursey@open-mpi.org> wrote:
On Mar 23, 2010, at 1:00 PM, Fernando Lemos wrote:
On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian
<ferny...@gmail.com> wrote:
I use
mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir --hostfile .mpihostfile xxxx
to store the global checkpoint snapshot in the shared directory
/mirror, but the problems are still there: when I run
ompi-checkpoint, mpirun is not killed; it just hangs.
So the 'ompi-checkpoint' command does not finish? By default
'ompi-checkpoint' does not terminate the MPI job. If you pass the
'--term' option to it, then it will.
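As a rough sketch of what that looks like in practice: the wrapper
name checkpoint_and_stop is hypothetical and exists only for this
example. It assumes ompi-checkpoint is on your PATH and that you pass
it the PID of the mpirun process that launched the job.

```shell
# Hypothetical wrapper: checkpoint a job and stop it afterwards.
#   $1 = PID of the mpirun process that launched the job
checkpoint_and_stop() {
    mpirun_pid=$1
    if [ -z "$mpirun_pid" ]; then
        echo "usage: checkpoint_and_stop <PID of mpirun>" >&2
        return 2
    fi
    # --term asks ompi-checkpoint to terminate the job once the
    # checkpoint has been taken; without it the job keeps running.
    ompi-checkpoint --term "$mpirun_pid"
}
```

If exactly one mpirun is running under your account, something like
checkpoint_and_stop "$(pgrep -x mpirun)" would pick up its PID.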
when doing ompi-restart, it shows:
mpiu@nimbus:/mirror$ ompi-restart ompi_global_snapshot_333.ckpt/
--------------------------------------------------------------------------
Error: The filename (ompi_global_snapshot_333.ckpt/) is invalid
because
either you have not provided a filename
or provided an invalid filename.
Please see --help for usage.
--------------------------------------------------------------------------
Try removing the trailing '/' in the command. The current
ompi-restart is not good about differentiating between:
ompi_global_snapshot_333.ckpt
and
ompi_global_snapshot_333.ckpt/
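Shell tab-completion tends to append that '/' to a directory name, so
it is easy to pick up by accident. A small sketch of stripping it
before calling ompi-restart; the variable name snapshot is just for
illustration:

```shell
# Snapshot reference as typed (or tab-completed) on the command line.
snapshot="ompi_global_snapshot_333.ckpt/"

# ${var%/} removes one trailing '/' if present and leaves the value
# unchanged otherwise.
snapshot=${snapshot%/}

echo "$snapshot"   # prints: ompi_global_snapshot_333.ckpt
```

You would then run ompi-restart "$snapshot" as before.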
Have you tried OpenMPI 1.5? I got it to work with 1.5, but not with
1.4 (but then I didn't try 1.4 with a shared filesystem).
I would also suggest trying v1.4 or 1.5 to see if your problems
persist with these versions.
-- Josh
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users