I set up a cluster of 18 nodes using Open MPI and the BLCR library, and MPI
programs run well on the cluster.
But how do I checkpoint an MPI program on this cluster?
Here is what I did as a test:
mpiu@nimbus: /mirror$ mpirun -np 50 --hostfile .mpihostfile -am ft-enable-cr hellompi
The program runs on the cluster.
Then I enter:
mpiu@nimbus: /mirror$ ompi-checkpoint -term $(pidof mpirun)

But the MPI program is not terminated, as it is on a single machine,
although a checkpoint file "ompi_global_snapshot_14030.ckpt" is created
in the home directory on the master node.

Then I use the ompi-restart command on the master node to restart the MPI
program:
mpiu@nimbus:/mirror$  ompi-restart ompi_global_snapshot_14030.ckpt

The error output looks like this:
[nimbus:14545] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_29.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_29.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
[nimbus:14609] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_34.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_34.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
[nimbus:14685] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_39.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_39.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
[nimbus:14737] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_44.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_44.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
[nimbus:14798] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_49.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_49.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
[nimbus:14317] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_4.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_4.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
[nimbus:14331] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_9.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_9.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
[nimbus:14381] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_14.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_14.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
[nimbus:14408] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_19.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_19.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
[nimbus:14483] Error: Unable to access the path
[/home/mpiu/ompi_global_snapshot_14030.ckpt/0/opal_snapshot_24.ckpt]!
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_24.ckpt) is invalid because either you
have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
NO 26
Hello, world, I am 2 of 50 on nimbus

NO 26
Hello, world, I am 12 of 50 on nimbus

NO 26
Hello, world, I am 10 of 50 on nimbus

NO 26
Hello, world, I am 1 of 50 on nimbus

NO 26
Hello, world, I am 8 of 50 on nimbus

NO 26
Hello, world, I am 3 of 50 on nimbus

NO 26
Hello, world, I am 0 of 50 on nimbus

NO 26
Hello, world, I am 5 of 50 on nimbus

NO 26
Hello, world, I am 11 of 50 on nimbus

NO 26
Hello, world, I am 6 of 50 on nimbus

NO 26
Hello, world, I am 17 of 50 on nimbus

NO 26
Hello, world, I am 15 of 50 on nimbus

NO 26
Hello, world, I am 18 of 50 on nimbus

NO 27
Hello, world, I am 2 of 50 on nimbus

NO 26
Hello, world, I am 13 of 50 on nimbus

NO 27
Hello, world, I am 12 of 50 on nimbus

NO 26
Hello, world, I am 7 of 50 on nimbus

NO 27
Hello, world, I am 10 of 50 on nimbus

NO 27
Hello, world, I am 1 of 50 on nimbus

NO 26
Hello, world, I am 21 of 50 on nimbus

NO 27
Hello, world, I am 8 of 50 on nimbus

NO 26
Hello, world, I am 22 of 50 on nimbus

NO 27
Hello, world, I am 3 of 50 on nimbus

NO 26
Hello, world, I am 20 of 50 on nimbus

NO 27
Hello, world, I am 0 of 50 on nimbus

NO 27
Hello, world, I am 5 of 50 on nimbus

NO 26
Hello, world, I am 16 of 50 on nimbus

NO 26
Hello, world, I am 26 of 50 on nimbus

NO 26
Hello, world, I am 23 of 50 on nimbus

NO 26
Hello, world, I am 27 of 50 on nimbus

NO 26
Hello, world, I am 28 of 50 on nimbus

NO 27
Hello, world, I am 11 of 50 on nimbus

NO 27
Hello, world, I am 6 of 50 on nimbus

NO 26
Hello, world, I am 25 of 50 on nimbus

NO 26
Hello, world, I am 31 of 50 on nimbus

NO 27
Hello, world, I am 17 of 50 on nimbus

NO 26
Hello, world, I am 30 of 50 on nimbus

NO 26
Hello, world, I am 43 of 50 on nimbus

NO 27
Hello, world, I am 15 of 50 on nimbus

NO 27
Hello, world, I am 18 of 50 on nimbus

NO 26
Hello, world, I am 33 of 50 on nimbus

NO 26
Hello, world, I am 32 of 50 on nimbus

NO 26
Hello, world, I am 47 of 50 on nimbus

NO 28
Hello, world, I am 2 of 50 on nimbus

NO 26
Hello, world, I am 36 of 50 on nimbus

NO 26
Hello, world, I am 35 of 50 on nimbus

NO 27
Hello, world, I am 13 of 50 on nimbus

NO 26
Hello, world, I am 40 of 50 on nimbus

NO 26
Hello, world, I am 38 of 50 on nimbus

NO 26
Hello, world, I am 37 of 50 on nimbus

NO 28
Hello, world, I am 12 of 50 on nimbus

NO 27
Hello, world, I am 7 of 50 on nimbus

NO 28
Hello, world, I am 10 of 50 on nimbus

NO 26
Hello, world, I am 48 of 50 on nimbus

NO 26
Hello, world, I am 41 of 50 on nimbus

NO 28
Hello, world, I am 1 of 50 on nimbus

NO 26
Hello, world, I am 45 of 50 on nimbus

NO 27
Hello, world, I am 21 of 50 on nimbus

NO 26
Hello, world, I am 42 of 50 on nimbus

NO 26
Hello, world, I am 46 of 50 on nimbus

[nimbus:14312] [[63351,0],0]-[[63351,1],46] mca_oob_tcp_msg_recv: readv
failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun has exited due to process rank 4 with PID 14317 on
node nimbus exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
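
The paths in the errors above follow the pattern
<global snapshot>/0/opal_snapshot_<rank>.ckpt, and the ranks reported are
4, 9, 14, 19, 24, 29, 34, 39, 44 and 49. A minimal sketch for listing which
ranks have no local snapshot on the master node is shown here. It assumes
that directory layout, and the helper name missing_snapshots is only
illustrative, not part of Open MPI:

```shell
#!/bin/sh
# Print the rank of every local snapshot that is missing from a global
# snapshot directory. Assumes the layout seen in the error messages:
#   <global_snapshot_dir>/<sequence>/opal_snapshot_<rank>.ckpt
# Usage: missing_snapshots <global_snapshot_dir> <sequence> <num_ranks>
missing_snapshots() {
    dir=$1
    seq=$2
    np=$3
    rank=0
    while [ "$rank" -lt "$np" ]; do
        path="$dir/$seq/opal_snapshot_$rank.ckpt"
        # Report the rank if its snapshot path does not exist here.
        if [ ! -e "$path" ]; then
            echo "$rank"
        fi
        rank=$((rank + 1))
    done
}

# Example with the values from this run (50 ranks, sequence 0):
# missing_snapshots "$HOME/ompi_global_snapshot_14030.ckpt" 0 50
```

Running it on the master node would show whether only the ranks named in
the errors are missing, or whether other local snapshots are absent too.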


Cheers
 fengguang
