Dear All,
I am trying to checkpoint an MPI application that has two processes,
each running on a separate host.
I run the application as follows:
raj@sun32:~$ mpirun -am ft-enable-cr -np 2 --hostfile sunhost -mca btl ^openib
-mca snapc_base_global_snapshot_dir /tmp m.
I then trigger the checkpoint as follows:
raj@sun32:~$ ompi-checkpoint -v 30010
The following output appears, showing two errors while checkpointing the application:
##############################################
I am processor no 0 of a total of 2 procs on host sun32
I am processor no 1 of a total of 2 procs on host sun06
I am processorrrrrrrr no 0 of a total of 2 procs on host sun32
I am processorrrrrrrr no 1 of a total of 2 procs on host sun06
[sun32:30010] Error: expected_component: PID information unavailable!
[sun32:30010] Error: expected_component: Component Name information unavailable!
I am processssssssssssor no 1 of a total of 2 procs on host sun06
I am processssssssssssor no 0 of a total of 2 procs on host sun32
bye
bye
############################################
When I try to restart the application from the checkpoint file, I get the
following:
raj@sun32:~$ ompi-restart ompi_global_snapshot_30010.ckpt
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_1.ckpt) is invalid because either you have
not provided a filename
or provided an invalid filename.
Please see --help for usage.
--------------------------------------------------------------------------
I am processssssssssssor no 0 of a total of 2 procs on host sun32
bye
I would very much appreciate any ideas on how to checkpoint
and restart an MPI application running on multiple hosts.
Thank you
Regards,
Raj