Hello,
I am studying how OpenMPI reacts to failures. I have a virtual
infrastructure where failures can be emulated by turning off a given VM.
Depending on how the VM is turned off, 'mpirun' is notified either
because it receives a signal or because some timeout is reached.
In both cases the failure is detected after a few minutes. I ran some tests
with the NAS benchmarks and got the following output:
[node-5][[12114,1],5][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
[node-4][[12114,1],4][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
Then, after a few more minutes, I got another message like this:
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp
(--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to
use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
However, 'mpirun' does not terminate (I waited at least 30 minutes). The
execution stays blocked even though a failure has been detected. Is this
normal behavior for 'mpirun'?
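In the meantime, to keep my experiments from hanging forever, I am
considering wrapping the launch in a small watchdog. Below is a minimal
sketch of that idea, not anything provided by Open MPI: the function name
run_with_deadline and the deadline/grace values are my own placeholders.
It assumes a POSIX system and only signals the processes on the node
where 'mpirun' runs; it does not clean up ranks left on other VMs.

```python
import os
import signal
import subprocess


def run_with_deadline(cmd, deadline_s, grace_s=5.0):
    """Run cmd and stop it if it is still alive after deadline_s seconds.

    Returns a tuple (returncode, timed_out).
    Hypothetical helper for illustration; not part of Open MPI.
    """
    # Start the command in its own session so the launcher and its local
    # children form one process group that can be signalled together.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=deadline_s), False
    except subprocess.TimeoutExpired:
        pass

    # Deadline passed: send SIGTERM first, then SIGKILL if it is ignored.
    os.killpg(proc.pid, signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_s), True
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        return proc.wait(), True
```

I would call it as, for example,
run_with_deadline(["mpirun", "-np", "8", "./cg.A.8"], deadline_s=1800),
where the benchmark binary name is only a placeholder. This merely bounds
the hang; it does not explain why 'mpirun' stays blocked.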
OpenMPI version:
root@node-0:~# mpirun --version
mpirun (Open MPI) 1.8.5
I would appreciate your help.