[OMPI users] Failure detection

Cristian RUIZ Sat, 7 Nov 2015 09:41:20 -0500 (EST)

Hello,

I was studying how OpenMPI reacts to failures. I have a virtual infrastructure where failures can be emulated by turning off a given VM.

Depending on how the VM is turned off, 'mpirun' is notified either
because it receives a signal or because some timeout is reached.
In both cases failures are detected only after some minutes. I ran some
tests with the NAS benchmarks and got the following output:

[node-5][[12114,1],5][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
[node-4][[12114,1],4][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)


Then, after some minutes I got another message like this:

ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using

However, 'mpirun' does not terminate (I waited at least 30 minutes). The execution stays blocked even though a failure has been detected. Is this normal behavior for 'mpirun'?
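As a stopgap, the hang could be bounded from the outside with a small watchdog around the launcher. The following is only a minimal sketch, assuming a POSIX system and Python 3; `run_with_deadline` and the deadline value are illustrative names, not part of Open MPI, and the sketch does nothing to speed up failure detection itself.

```python
import os
import signal
import subprocess


def run_with_deadline(cmd, deadline_s):
    """Run cmd; if it exceeds deadline_s seconds, kill its process group.

    Returns a tuple (returncode, timed_out).
    """
    # start_new_session=True makes the child the leader of a new process
    # group, so local helper processes it spawns can be signalled together.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=deadline_s), False
    except subprocess.TimeoutExpired:
        # The launcher is still blocked after the deadline: kill the group.
        os.killpg(proc.pid, signal.SIGKILL)
        return proc.wait(), True


# Hypothetical usage (the benchmark binary name is an assumption):
# rc, timed_out = run_with_deadline(["mpirun", "-np", "8", "./bt.A.8"], 1800)
```

This only cleans up processes on the node where the launcher runs; daemons left on remote nodes would still have to be handled separately.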


OpenMPI version:

root@node-0:~# mpirun --version
mpirun (Open MPI) 1.8.5


I appreciate your help
