Yeah! It's working fine now. I had just forgotten to share the home directories between both hosts, which is where the checkpoint is written.
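In case it helps someone else, here is a rough sketch of a check that would have caught this before calling ompi-restart. It only assumes that the per-process snapshots show up as entries named opal_snapshot_<rank>.ckpt somewhere below the global snapshot directory (the names are taken from the error output quoted below). The function names count_local_snapshots and check_snapshot_visible are made up for this example and are not part of Open MPI.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Open MPI.
# Assumption: each process snapshot is an entry named
# opal_snapshot_<rank>.ckpt somewhere below the global snapshot directory.

# Print how many per-process snapshot entries this host can see.
count_local_snapshots() {
    snapshot_dir=$1
    if [ ! -d "$snapshot_dir" ]; then
        echo 0
        return 0
    fi
    # -prune stops find from descending into a matched snapshot entry.
    find "$snapshot_dir" -name 'opal_snapshot_*.ckpt' -prune -print \
        | wc -l | tr -d ' '
}

# Succeed only if this host sees the expected number of snapshot entries.
check_snapshot_visible() {
    snapshot_dir=$1
    expected=$2
    found=$(count_local_snapshots "$snapshot_dir")
    if [ "$found" -eq "$expected" ]; then
        echo "ok: $found of $expected local snapshots visible in $snapshot_dir"
    else
        echo "missing: only $found of $expected local snapshots visible in $snapshot_dir" >&2
        return 1
    fi
}
```

Run on every host that takes part in the job, for example `check_snapshot_visible /home/demo/ompi_global_snapshot_8637.ckpt 4`. If any host reports fewer entries than the number of processes, the snapshot directory is probably not shared with that host.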
-Gregor

> Hi,
>
> first, my resources: I've two SLES10 machines with Open MPI 1.3rc2
> installed. It's configured with ./configure --prefix=/usr/local
> --with-ft=cr --enable-ft-thread --enable-mpi-threads. I've installed
> BLCR 0.7.3, too. The hosts are called dschungsles10-1 and
> dschungsles10-2. My MPI-Apps are located in /srv/mpi/ on
> dschungsles10-1, which is also exported via NFS to dschungsles10-2.
>
> I'm able to restart a MPI-Application a.out from a checkpoint, if I use
> only one host (mpirun -np 4 -am ft-enable-cr a.out)
>
> Now, I'm trying to restart my application which I started over two
> hosts. Taking the snapshot works fine:
>
> demo@dschungsles10-1:~> ps aux | grep mpirun
> demo 8637 27.8 0.0 33364 2308 pts/2 R+ 16:06 0:02 mpirun -np 4 -am ft-enable-cr -host dschungsles10-2 -v a.out
> demo 8658 0.0 0.0 2736 480 pts/3 R+ 16:07 0:00 grep mpirun
> demo@dschungsles10-1:~> ompi-checkpoint -v -s 8637
> [dschungsles10-1:08661] orte_checkpoint: Checkpointing...
> [dschungsles10-1:08661] PID 8637
> [dschungsles10-1:08661] Connected to Mpirun [[417,0],0]
> [dschungsles10-1:08661] orte_checkpoint: notify_hnp: Contact Head Node Process PID 8637
> [dschungsles10-1:08661] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Receive a command message.
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Status Update.
> [dschungsles10-1:08661] Requested - Global Snapshot Reference: (null)
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Receive a command message.
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Status Update.
> [dschungsles10-1:08661] Pending - Global Snapshot Reference: (null)
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Receive a command message.
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Status Update.
> [dschungsles10-1:08661] Running - Global Snapshot Reference: (null)
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Receive a command message.
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Status Update.
> [dschungsles10-1:08661] File Transfer - Global Snapshot Reference: (null)
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Receive a command message.
> [dschungsles10-1:08661] orte_checkpoint: hnp_receiver: Status Update.
> [dschungsles10-1:08661] Finished - Global Snapshot Reference: ompi_global_snapshot_8637.ckpt
> Snapshot Ref.: 0 ompi_global_snapshot_8637.ckpt
>
> But restarting doesn't work:
>
> demo@dschungsles10-1:~> ompi-restart -v ompi_global_snapshot_8637.ckpt
> [dschungsles10-1:08687] Checking for the existence of (/home/demo/ompi_global_snapshot_8637.ckpt)
> [dschungsles10-1:08687] Restarting from file (ompi_global_snapshot_8637.ckpt)
> [dschungsles10-1:08687] Exec in self
> Password:
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_0.ckpt) is invalid because either you have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_1.ckpt) is invalid because either you have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_2.ckpt) is invalid because either you have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Error: The filename (opal_snapshot_3.ckpt) is invalid because either you have not provided a filename
> or provided an invalid filename.
> Please see --help for usage.
>
> --------------------------------------------------------------------------
>
> Perhaps, somebody has a few ideas...
>
> -Gregor
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

--
Gregor Dschung
System Life Guard, HiWi
Fraunhofer-Institut für Techno- und Wirtschaftsmathematik ITWM
Fraunhofer-Platz 1
D-67663 Kaiserslautern
E-Mail: gregor.dsch...@itwm.fraunhofer.de
Internet: www.itwm.fraunhofer.de