No, that certainly isn’t the normal behavior. I suspect it has to do with the nature of the VM TCP connection, though there is something very strange about your output. The BTL message indicates that an MPI job is already running. Yet your subsequent ORTE error message indicates we are still trying to start the daemons, which means we can’t have started the MPI job.
So something is clearly confused.

> On Nov 7, 2015, at 6:41 AM, Cristian RUIZ <cristian.r...@inria.fr> wrote:
>
> Hello,
>
> I was studying how Open MPI reacts to failures. I have a virtual
> infrastructure where failures can be emulated by turning off a given VM.
> Depending on how the VM is turned off, 'mpirun' will be notified either
> because it receives a signal or because some timeout is reached. In both
> cases, failures are detected only after some minutes. I ran some tests
> with the NAS benchmarks and got the following output:
>
> [node-5][[12114,1],5][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
> [node-4][[12114,1],4][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
>
> Then, after some more minutes, I got another message like this:
>
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>
> However, 'mpirun' does not terminate (after at least 30 minutes). The
> execution stays blocked even though a failure was detected. Is this the
> normal behavior of "mpirun"?
>
> Open MPI version:
>
> root@node-0:~# mpirun --version
> mpirun (Open MPI) 1.8.5
>
> I appreciate your help.
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/11/28020.php
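[Editor's note: the "Connection timed out (110)" in the BTL message above is the kernel giving up on TCP retransmissions to the dead VM, which is why detection takes minutes. The sketch below, a Linux-only illustration and not an Open MPI feature, reads the kernel's retransmission limit (net.ipv4.tcp_retries2) and estimates the worst-case detection delay; the commented loop over node-1..node-5 (hypothetical host names taken from the report) shows one way to check the orted/PATH condition named in the ORTE error.]

```shell
#!/bin/sh
# Linux keeps retransmitting on an established TCP connection
# net.ipv4.tcp_retries2 times (default 15) before returning
# ETIMEDOUT (110) to the application. Fall back to 15 if the
# sysctl is not readable.
retries=$(cat /proc/sys/net/ipv4/tcp_retries2 2>/dev/null || echo 15)
echo "tcp_retries2 = $retries"

# Rough worst-case detection time: the retransmission timeout (RTO)
# starts around 200 ms, doubles on each retry, and is capped at
# 120 s (TCP_RTO_MAX). This is an estimate, not an exact figure.
total_ms=0
rto_ms=200
i=0
while [ "$i" -lt "$retries" ]; do
    total_ms=$((total_ms + rto_ms))
    rto_ms=$((rto_ms * 2))
    [ "$rto_ms" -gt 120000 ] && rto_ms=120000
    i=$((i + 1))
done
echo "worst-case failure detection: ~$((total_ms / 1000)) s"

# To rule out the PATH/LD_LIBRARY_PATH cause from the ORTE message,
# confirm orted resolves on every node (hypothetical host names):
# for n in node-1 node-2 node-3 node-4 node-5; do
#     ssh "$n" 'command -v orted || echo "orted missing on $(hostname)"'
# done
```

Lowering tcp_retries2 on the nodes would shorten the first delay, but it cannot explain the confused ORTE state the reply describes.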