On Dec 12, 2009, at 10:03 AM, Kritiraj Sajadah wrote:
Dear All,
I am trying to checkpoint an MPI application that has two
processes, each running on a separate host.
I run the application as follows:
raj@sun32:~$ mpirun -am ft-enable-cr -np 2 --hostfile sunhost -mca
btl ^openib -mca snapc_base_global_snapshot_dir /tmp m.
Try setting the 'snapc_base_global_snapshot_dir' in your
$HOME/.openmpi/mca-params.conf file instead of on the command line.
This way it will be properly picked up by the ompi-restart commands.
See the link below for how to do this:
http://www.osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-global
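For reference, the file is a plain list of key = value lines. A minimal
sketch (the /tmp path is just the one from your command line, not a
recommendation):

```
# $HOME/.openmpi/mca-params.conf
# Equivalent to passing "-mca snapc_base_global_snapshot_dir /tmp"
# on the mpirun command line, but also visible to ompi-restart.
snapc_base_global_snapshot_dir = /tmp
```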
and I trigger the checkpoint as follows:
raj@sun32:~$ ompi-checkpoint -v 30010
The following happens, displaying two errors while checkpointing the
application:
##############################################
I am processor no 0 of a total of 2 procs on host sun32
I am processor no 1 of a total of 2 procs on host sun06
I am processorrrrrrrr no 0 of a total of 2 procs on host sun32
I am processorrrrrrrr no 1 of a total of 2 procs on host sun06
[sun32:30010] Error: expected_component: PID information unavailable!
[sun32:30010] Error: expected_component: Component Name information
unavailable!
The only way this error could be generated when checkpointing (versus
restarting) is if the snapshot coordinator failed to propagate the name
of the CRS component used, so that it could be stored in the metadata.
If this continues to happen, try enabling debugging in the snapshot
coordinator:
mpirun -mca snapc_full_verbose 20 ...
I am processssssssssssor no 1 of a total of 2 procs on host sun06
I am processssssssssssor no 0 of a total of 2 procs on host sun32
bye
bye
############################################
When I try to restart the application from the checkpoint, I get the
following:
raj@sun32:~$ ompi-restart ompi_global_snapshot_30010.ckpt
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_1.ckpt) is invalid because either
you have not provided a filename
or provided an invalid filename.
Please see --help for usage.
--------------------------------------------------------------------------
I am processssssssssssor no 0 of a total of 2 procs on host sun32
bye
This usually indicates that either:
1) The local checkpoint directory (opal_snapshot_1.ckpt) is missing.
So the global checkpoint is either corrupted, or the node where rank 1
resided was not able to access the storage location (/tmp in your
example).
2) You moved the ompi_global_snapshot_30010.ckpt directory from /tmp
to somewhere else. Currently, manually moving the global checkpoint
directory is not supported.
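As a quick way to tell which case applies, you can inspect the global
snapshot directory before restarting. The layout sketched below (a numbered
checkpoint-sequence subdirectory holding one opal_snapshot_<rank>.ckpt
directory per MPI rank) is an assumption based on the errors in this
thread, and the mock directory exists only to make the check runnable:

```shell
# Mock up the layout a complete global snapshot is assumed to have;
# the directory names here are illustrative, not from a real checkpoint.
snap=$(mktemp -d)/ompi_global_snapshot_30010.ckpt
mkdir -p "$snap/0/opal_snapshot_0.ckpt" "$snap/0/opal_snapshot_1.ckpt"

# The actual check: every rank's local snapshot directory must be
# present and readable from the node running ompi-restart, otherwise
# restart fails with the "invalid filename" error shown above.
for d in "$snap"/0/opal_snapshot_*.ckpt; do
  [ -d "$d" ] && echo "found $(basename "$d")"
done
```

If one of the per-rank directories is missing, that points at case (1):
the corresponding node wrote its local snapshot somewhere the restart
node cannot see.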
-- Josh
I would very much appreciate it if you could give me some ideas on how
to checkpoint and restart an MPI application running on multiple hosts.
Thank you
Regards,
Raj
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users