Hi,

First, my setup: I have two SLES10 machines with Open MPI 1.3rc2
installed. It was configured with ./configure --prefix=/usr/local
--with-ft=cr --enable-ft-thread --enable-mpi-threads. I have also
installed BLCR 0.7.3. The hosts are called dschungsles10-1 and
dschungsles10-2. My MPI applications are located in /srv/mpi/ on
dschungsles10-1, which is also exported via NFS to dschungsles10-2.

I am able to restart an MPI application, a.out, from a checkpoint if I
use only one host (mpirun -np 4 -am ft-enable-cr a.out).

Now I am trying to restart an application that I started across two
hosts. Taking the snapshot works fine:

demo@dschungsles10-1:~> ps aux | grep mpirun
demo      8637 27.8  0.0  33364  2308 pts/2    R+   16:06   0:02 mpirun -np 4 -am ft-enable-cr -host dschungsles10-2 -v a.out
demo      8658  0.0  0.0   2736   480 pts/3    R+   16:07   0:00 grep mpirun
demo@dschungsles10-1:~> ompi-checkpoint -v -s 8637
[dschungsles10-1:08661] orte_checkpoint: Checkpointing...
[dschungsles10-1:08661]          PID 8637
[dschungsles10-1:08661]          Connected to Mpirun [[417,0],0]
[dschungsles10-1:08661] orte_checkpoint: notify_hnp: Contact Head Node Process PID 8637
[dschungsles10-1:08661] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Receive a command message.
[dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Status Update.
[dschungsles10-1:08661]                 Requested - Global Snapshot Reference: (null)
[dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Receive a command message.
[dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Status Update.
[dschungsles10-1:08661]                   Pending - Global Snapshot Reference: (null)
[dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Receive a command message.
[dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Status Update.
[dschungsles10-1:08661]                   Running - Global Snapshot Reference: (null)
[dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Receive a command message.
[dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Status Update.
[dschungsles10-1:08661]             File Transfer - Global Snapshot Reference: (null)
[dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Receive a command message.
[dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Status Update.
[dschungsles10-1:08661]                  Finished - Global Snapshot Reference: ompi_global_snapshot_8637.ckpt
Snapshot Ref.:   0 ompi_global_snapshot_8637.ckpt

But restarting doesn't work:

demo@dschungsles10-1:~> ompi-restart -v ompi_global_snapshot_8637.ckpt
[dschungsles10-1:08687] Checking for the existence of (/home/demo/ompi_global_snapshot_8637.ckpt)
[dschungsles10-1:08687] Restarting from file (ompi_global_snapshot_8637.ckpt)
[dschungsles10-1:08687]          Exec in self
Password:
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_0.ckpt) is invalid because either you have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_1.ckpt) is invalid because either you have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_2.ckpt) is invalid because either you have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_3.ckpt) is invalid because either you have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------

Perhaps somebody has an idea what is going wrong here?

 -Gregor
