Hello,

I am studying how OpenMPI reacts to failures. I have a virtual infrastructure where failures can be emulated by turning off a given VM.
Depending on how the VM is turned off, 'mpirun' is notified
either because it receives a signal or because some timeout is reached.
In both cases, the failure is detected after some minutes. I ran some tests
with the NAS benchmarks and got the following output:

[node-5][[12114,1],5][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
[node-4][[12114,1],4][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)

Then, after some more minutes, I got another message:

ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using

However, 'mpirun' does not terminate (even after at least 30 minutes). The execution stays blocked even though the failure has been detected. Is this normal behavior for 'mpirun'?

OpenMPI version:

root@node-0:~# mpirun --version
mpirun (Open MPI) 1.8.5


I appreciate your help
