On Dec 12, 2009, at 10:03 AM, Kritiraj Sajadah wrote:

Dear All,
I am trying to checkpoint an MPI application which has two processes, each running on a separate host.

I run the application as follows:

raj@sun32:~$ mpirun -am ft-enable-cr -np 2 --hostfile sunhost -mca btl ^openib -mca snapc_base_global_snapshot_dir /tmp m.

Try setting the 'snapc_base_global_snapshot_dir' in your $HOME/.openmpi/mca-params.conf file instead of on the command line. This way it will be properly picked up by the ompi-restart commands.

See the link below for how to do this:
  http://www.osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-global
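As a minimal sketch of what that looks like (the parameter name and the /tmp value are taken from the mpirun command above; adjust the directory to your setup):

```shell
# Append the snapshot-directory setting to the per-user MCA parameter file,
# so both mpirun and ompi-restart pick it up without a command-line flag.
mkdir -p "$HOME/.openmpi"
cat >> "$HOME/.openmpi/mca-params.conf" <<'EOF'
snapc_base_global_snapshot_dir=/tmp
EOF
```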


and I trigger the checkpoint as follows:

raj@sun32:~$ ompi-checkpoint -v 30010


The following happens, displaying two errors while checkpointing the application:


##############################################
I am processor no 0 of a total of 2 procs on host sun32
I am processor no 1 of a total of 2 procs on host sun06
I am processorrrrrrrr no 0 of a total of 2 procs on host sun32
I am processorrrrrrrr no 1 of a total of 2 procs on host sun06

[sun32:30010] Error: expected_component: PID information unavailable!
[sun32:30010] Error: expected_component: Component Name information unavailable!

The only way this error could be generated when checkpointing (versus restarting) is if the Snapshot Coordinator failed to propagate the CRS component used, so that it could be stored in the metadata. If this continues to happen, try enabling debugging in the snapshot coordinator:
 mpirun -mca snapc_full_verbose 20 ...


I am processssssssssssor no 1 of a total of 2 procs on host sun06
I am processssssssssssor no 0 of a total of 2 procs on host sun32
bye
bye
############################################




When I try to restart the application from the checkpoint file, I get the following:

raj@sun32:~$ ompi-restart ompi_global_snapshot_30010.ckpt
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_1.ckpt) is invalid because either you have not provided a filename
      or provided an invalid filename.
      Please see --help for usage.

--------------------------------------------------------------------------
I am processssssssssssor no 0 of a total of 2 procs on host sun32
bye

This usually indicates that either:
1) The local checkpoint directory (opal_snapshot_1.ckpt) is missing. So the global checkpoint is either corrupted, or the node where rank 1 resided was not able to access the storage location (/tmp in your example).
2) You moved the ompi_global_snapshot_30010.ckpt directory from /tmp to somewhere else. Currently, manually moving the global checkpoint directory is not supported.
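One quick way to narrow down which of these happened is to look on each node for the per-rank local snapshot directories under the global snapshot. A rough sketch (the exact on-disk layout varies by Open MPI version, so treat the path and directory names as assumptions to adapt):

```shell
# Check that every rank's local snapshot directory (opal_snapshot_<rank>.ckpt)
# is present somewhere under the global snapshot directory before restarting.
# SNAP and NPROCS are placeholders for your actual snapshot path and job size.
SNAP=${SNAP:-/tmp/ompi_global_snapshot_30010.ckpt}
NPROCS=${NPROCS:-2}
missing=0
for rank in $(seq 0 $((NPROCS - 1))); do
    if ! find "$SNAP" -type d -name "opal_snapshot_${rank}.ckpt" 2>/dev/null | grep -q .; then
        echo "missing local snapshot for rank $rank"
        missing=1
    fi
done
[ "$missing" -eq 0 ] && echo "all local snapshots present" \
    || echo "some local snapshots are missing; check per-node access to $SNAP"
```

Run this on the node you restart from; a missing rank directory points at case 1 (the node could not write to the shared storage location) rather than case 2.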

-- Josh



I would very much appreciate it if you could give me some ideas on how to checkpoint and restart an MPI application running on multiple hosts.

Thank you

Regards,

Raj



_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
