Thanks for answering. I tested again, this time on a real cluster where I can reboot machines at will. I ran a test on 32 machines, with one MPI process per machine, and rebooted one of the machines during the execution. I observed the same behavior: OpenMPI detects the failure, but the run blocks and has to be killed manually.
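As far as I understand, the default error handler on MPI_COMM_WORLD is MPI_ERRORS_ARE_FATAL, so the application itself never gets a chance to react to the failure. Just to illustrate what I mean, here is a minimal sketch of my own (not code taken from the benchmark) that installs MPI_ERRORS_RETURN and checks the return code of a collective; even then, my understanding is that with the TCP BTL the call may simply keep blocking rather than return an error:

/* Minimal sketch (my own, not from the benchmark): switch MPI_COMM_WORLD
 * from the default MPI_ERRORS_ARE_FATAL handler to MPI_ERRORS_RETURN so
 * that MPI calls hand an error code back to the application instead of
 * aborting the job. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, rc, len;
    int value = 1, sum = 0;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    rc = MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        /* With MPI_ERRORS_RETURN we at least get a code we can print. */
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "rank %d: Allreduce failed: %s\n", rank, msg);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Finalize();
    return 0;
}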
This is the output of the CG run:

root@graphene-30:~# mpirun --mca btl self,sm,tcp --machinefile machine_file cg.B.32

 NAS Parallel Benchmarks 3.3 -- CG Benchmark

 Size:      75000
 Iterations:    75
 Number of active processes:    32
 Number of nonzeroes per row:        13
 Eigenvalue shift: .600E+02

   iteration           ||r||                 zeta
        1       0.13257071746643E-12    59.9994751578754
        2       0.54021441387552E-15    21.7627846142538
        3       0.57508155930725E-15    22.2876617043225
        4       0.58907101679580E-15    22.5230738188352
        5       0.59342235842271E-15    22.6275390653890
        6       0.59736634325665E-15    22.6740259189537
        7       0.60192883908490E-15    22.6949056826254
        8       0.59984965235397E-15    22.7044023166871
        9       0.60134110898017E-15    22.7087834345616
       10       0.59805179779153E-15    22.7108351397172
       11       0.60025777990273E-15    22.7118107121337
[graphene-108.nancy.grid5000.fr][[1821,1],15][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[graphene-58.nancy.grid5000.fr][[1821,1],9][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[graphene-20.nancy.grid5000.fr][[1821,1],10][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[graphene-20.nancy.grid5000.fr][[1821,1],10][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 172.16.64.67 failed: Connection refused (111)

I see the same behavior with two versions: 1.8.5, which I compiled myself with the default options, and 1.6.5, installed from the Debian packages.
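In case it helps to reproduce this without the NAS suite, here is a minimal sketch I would use as a smaller test case (again my own code, not the benchmark; the iteration count and the sleep are arbitrary). I would expect it to show the same hang when one of the machines is rebooted mid-run:

/* Minimal reproducer sketch (my own, independent of the NAS suite): a loop
 * of collectives with enough time between iterations to reboot one of the
 * machines while the job is running. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size, iter;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (iter = 0; iter < 600; iter++) {
        int in = rank, out = 0;
        MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            printf("iteration %d, sum = %d (%d processes)\n", iter, out, size);
        sleep(1);  /* leave time to reboot one of the machines */
    }

    MPI_Finalize();
    return 0;
}

It can be compiled with mpicc and launched with the same mpirun command line as above (--mca btl self,sm,tcp --machinefile machine_file).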
----- Original Message -----
> From: "Ralph Castain" <r...@open-mpi.org>
> To: "Open MPI Users" <us...@open-mpi.org>
> Sent: Saturday, November 7, 2015 17:22:28
> Subject: Re: [OMPI users] Failure detection
>
> No, that certainly isn’t the normal behavior. I suspect it has to do with the
> nature of the VM TCP connection, though there is something very strange
> about your output. The BTL message indicates that an MPI job is already
> running. Yet your subsequent ORTE error message indicates we are still
> trying to start the daemons, which means we can’t have started the MPI job.
>
> So something is clearly confused.
>
>
> > On Nov 7, 2015, at 6:41 AM, Cristian RUIZ <cristian.r...@inria.fr> wrote:
> >
> > Hello,
> >
> > I was studying how OpenMPI reacts to failures. I have a virtual
> > infrastructure where failures can be emulated by turning off a given VM.
> > Depending on the way the VM is turned off, 'mpirun' will be notified,
> > either because it receives a signal or because some timeout is reached.
> > In both cases failures are detected after some minutes. I did some tests
> > with the NAS benchmarks and I got the following output:
> >
> > [node-5][[12114,1],5][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
> > [node-4][[12114,1],4][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
> >
> > Then, after some minutes I got another message like this:
> >
> > ORTE was unable to reliably start one or more daemons.
> > This usually is caused by:
> >
> > * not finding the required libraries and/or binaries on
> >   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
> >   settings, or configure OMPI with --enable-orterun-prefix-by-default
> >
> > * lack of authority to execute on one or more specified nodes.
> >   Please verify your allocation and authorities.
> >
> > * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
> >   Please check with your sys admin to determine the correct location to use.
> >
> > * compilation of the orted with dynamic libraries when static are required
> >   (e.g., on Cray). Please check your configure cmd line and consider using
> >
> > However, 'mpirun' does not terminate (after at least 30 minutes). The
> > execution is blocked even though a failure is detected. Is this the normal
> > behavior of "mpirun"?
> >
> > OpenMPI version:
> >
> > root@node-0:~# mpirun --version
> > mpirun (Open MPI) 1.8.5
> >
> > I appreciate your help