Hi

I am using Open MPI with BLCR on a cluster of 3 machines. Checkpoint and
restart work fine on a single machine, but when I checkpoint in the cluster
environment, ompi-checkpoint hangs.

For example: my cluster is composed of 3 machines and has a directory shared
over NFS. On the master node I run:

mpirun -np 50 -am ft-enable-cr --hostfile (hostfile) hello

The program runs across the cluster and works fine. But when I use

ompi-checkpoint --term $(pidof mpirun)

to checkpoint the program, the mpirun process is not killed; it keeps
running. Although ompi-checkpoint has created a checkpoint file, the mpirun
process hangs and is not terminated by ompi-checkpoint.
When I check the processes, mpirun is still there:
mpiu     31187  0.0  0.0  21636  4512 pts/3    S<s  10:45   0:00 -bash
mpiu     31688  0.0  0.0  65472  3888 pts/3    S<+  10:54   0:00  \_ mpirun -np
mpiu     29635  0.0  0.0  21636  4504 pts/1    S<s  09:08   0:00 -bash
mpiu     32188  0.0  0.0  15168  1064 pts/1    R<+  11:18   0:00  \_ ps auf

When I use ompi-restart to restart the program, it prints:
[nimbus:14545] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_29.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_29.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
[nimbus:14609] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_34.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_34.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
[nimbus:14685] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_39.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_39.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
[nimbus:14737] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_44.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_44.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
[nimbus:14798] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_49.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_49.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
[nimbus:14317] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_4.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_4.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
[nimbus:14331] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_9.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_9.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
[nimbus:14381] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_14.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_14.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
[nimbus:14408] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_19.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_19.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
[nimbus:14483] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_24.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_24.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
NO 26
Hello, world, I am 2 of 50 on nimbus

NO 26
Hello, world, I am 12 of 50 on nimbus

NO 26
Hello, world, I am 10 of 50 on nimbus

NO 26
Hello, world, I am 1 of 50 on nimbus

NO 26
Hello, world, I am 8 of 50 on nimbus

NO 26
Hello, world, I am 3 of 50 on nimbus

NO 26
Hello, world, I am 0 of 50 on nimbus

NO 26
Hello, world, I am 5 of 50 on nimbus

NO 26
Hello, world, I am 11 of 50 on nimbus

NO 26
Hello, world, I am 6 of 50 on nimbus

NO 26
Hello, world, I am 17 of 50 on nimbus

NO 26
Hello, world, I am 15 of 50 on nimbus

NO 26
Hello, world, I am 18 of 50 on nimbus

NO 27
Hello, world, I am 2 of 50 on nimbus

NO 26
Hello, world, I am 13 of 50 on nimbus

NO 27
Hello, world, I am 12 of 50 on nimbus

NO 26
Hello, world, I am 7 of 50 on nimbus

NO 27
Hello, world, I am 10 of 50 on nimbus

NO 27
Hello, world, I am 1 of 50 on nimbus

NO 26
Hello, world, I am 21 of 50 on nimbus

NO 27
Hello, world, I am 8 of 50 on nimbus

NO 26
Hello, world, I am 22 of 50 on nimbus

NO 27
Hello, world, I am 3 of 50 on nimbus

NO 26
Hello, world, I am 20 of 50 on nimbus

NO 27
Hello, world, I am 0 of 50 on nimbus

NO 27
Hello, world, I am 5 of 50 on nimbus

NO 26
Hello, world, I am 16 of 50 on nimbus

NO 26
Hello, world, I am 26 of 50 on nimbus

NO 26
Hello, world, I am 23 of 50 on nimbus

NO 26
Hello, world, I am 27 of 50 on nimbus

NO 26
Hello, world, I am 28 of 50 on nimbus

NO 27
Hello, world, I am 11 of 50 on nimbus

NO 27
Hello, world, I am 6 of 50 on nimbus

NO 26
Hello, world, I am 25 of 50 on nimbus

NO 26
Hello, world, I am 31 of 50 on nimbus

NO 27
Hello, world, I am 17 of 50 on nimbus

NO 26
Hello, world, I am 30 of 50 on nimbus

NO 26
Hello, world, I am 43 of 50 on nimbus

NO 27
Hello, world, I am 15 of 50 on nimbus

NO 27
Hello, world, I am 18 of 50 on nimbus

NO 26
Hello, world, I am 33 of 50 on nimbus

NO 26
Hello, world, I am 32 of 50 on nimbus

NO 26
Hello, world, I am 47 of 50 on nimbus

NO 28
Hello, world, I am 2 of 50 on nimbus

NO 26
Hello, world, I am 36 of 50 on nimbus

NO 26
Hello, world, I am 35 of 50 on nimbus

NO 27
Hello, world, I am 13 of 50 on nimbus

NO 26
Hello, world, I am 40 of 50 on nimbus

NO 26
Hello, world, I am 38 of 50 on nimbus

NO 26
Hello, world, I am 37 of 50 on nimbus

NO 28
Hello, world, I am 12 of 50 on nimbus

NO 27
Hello, world, I am 7 of 50 on nimbus

NO 28
Hello, world, I am 10 of 50 on nimbus

NO 26
Hello, world, I am 48 of 50 on nimbus

NO 26
Hello, world, I am 41 of 50 on nimbus

NO 28
Hello, world, I am 1 of 50 on nimbus

NO 26
Hello, world, I am 45 of 50 on nimbus

NO 27
Hello, world, I am 21 of 50 on nimbus

NO 26
Hello, world, I am 42 of 50 on nimbus

NO 26
Hello, world, I am 46 of 50 on nimbus

[nimbus:14312] [[63351,0],0]-[[63351,1],46] mca_oob_tcp_msg_recv: readv
failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun has exited due to process rank 4 with PID 14317 on
node nimbus exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
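
In case it is useful for diagnosis: the ten paths reported as inaccessible
above are for ranks 4, 9, 14, 19, 24, 29, 34, 39, 44 and 49. Below is a
minimal sketch (plain Python, not part of Open MPI) for checking which
per-rank snapshot entries are visible from a given node. It assumes the layout
<global snapshot dir>/<sequence number>/opal_snapshot_<rank>.ckpt seen in the
error messages; the function name missing_snapshots is made up for
illustration.

```python
import os


def missing_snapshots(global_dir, seq, nranks):
    """Return the ranks whose opal_snapshot_<rank>.ckpt entry is not visible.

    Assumed layout (taken from the error messages, not from Open MPI docs):
        <global_dir>/<seq>/opal_snapshot_<rank>.ckpt
    """
    missing = []
    for rank in range(nranks):
        path = os.path.join(global_dir, str(seq), "opal_snapshot_%d.ckpt" % rank)
        # os.path.exists is true for both files and directories, so this
        # works whichever form the per-rank snapshot entry takes.
        if not os.path.exists(path):
            missing.append(rank)
    return missing
```

Calling missing_snapshots("/home/mpiu/ompi_global_snapshot_14030.ckpt", 0, 50)
on each of the 3 machines would show whether those snapshots are missing
everywhere or only when viewed from some nodes.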

cheers
fengguang
