Thanks for answering. I tested again, this time on a real cluster where I can
reboot the machines at will. I ran a test on 32 machines with one MPI process
per machine, and during the execution I rebooted one of the machines. I observed
the same behavior: Open MPI detects the failure but then blocks, and the process
has to be killed manually. This is the output:
root@graphene-30:~# mpirun --mca btl self,sm,tcp --machinefile machine_file
cg.B.32
NAS Parallel Benchmarks 3.3 -- CG Benchmark
Size: 75000
Iterations: 75
Number of active processes: 32
Number of nonzeroes per row: 13
Eigenvalue shift: .600E+02
iteration ||r|| zeta
1 0.13257071746643E-12 59.9994751578754
2 0.54021441387552E-15 21.7627846142538
3 0.57508155930725E-15 22.2876617043225
4 0.58907101679580E-15 22.5230738188352
5 0.59342235842271E-15 22.6275390653890
6 0.59736634325665E-15 22.6740259189537
7 0.60192883908490E-15 22.6949056826254
8 0.59984965235397E-15 22.7044023166871
9 0.60134110898017E-15 22.7087834345616
10 0.59805179779153E-15 22.7108351397172
11 0.60025777990273E-15 22.7118107121337
[graphene-108.nancy.grid5000.fr][[1821,1],15][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[graphene-58.nancy.grid5000.fr][[1821,1],9][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[graphene-20.nancy.grid5000.fr][[1821,1],10][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[graphene-20.nancy.grid5000.fr][[1821,1],10][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
connect() to 172.16.64.67 failed: Connection refused (111)
I see the same behavior with two versions: 1.8.5, which I compiled myself with
the default options, and 1.6.5, installed from the Debian packages.
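
For illustration, below is a minimal sketch (not the NAS code) of the kind of
check I have in mind: it installs the standard MPI_ERRORS_RETURN handler on
MPI_COMM_WORLD so that, at best, a communication failure comes back as an error
code instead of the default abort. Everything in it is illustrative, and the MPI
standard leaves the state of the application after a process failure undefined,
so this is not fault tolerance; it only shows whether an error is ever reported
to the surviving ranks or whether the collective simply hangs.

/* errcheck.c - illustrative sketch only. It installs MPI_ERRORS_RETURN on
 * MPI_COMM_WORLD so that, if Open MPI reports the failure at all, it comes
 * back as an error code instead of the default abort. No recovery is
 * attempted: the MPI standard does not define the state of the job after
 * a peer dies, and MPI_Finalize itself may hang in that case. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Replace the default MPI_ERRORS_ARE_FATAL handler. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    for (int iter = 0; iter < 600; iter++) {
        double local = (double)rank, sum = 0.0;
        int rc = MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE,
                               MPI_SUM, MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len = 0;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "rank %d: iteration %d failed: %s\n",
                    rank, iter, msg);
            break;                       /* give up, no recovery attempted */
        }
        if (rank == 0 && iter % 60 == 0)
            printf("iteration %d: sum over %d ranks = %.1f\n",
                   iter, size, sum);
        sleep(1);                        /* leave time to reboot a node mid-run */
    }

    MPI_Finalize();
    return 0;
}

If MPI_Allreduce never returns even with MPI_ERRORS_RETURN installed, that
matches the blocking I described above.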
----- Original Message -----
> From: "Ralph Castain" <[email protected]>
> To: "Open MPI Users" <[email protected]>
> Sent: Saturday, November 7, 2015 17:22:28
> Subject: Re: [OMPI users] Failure detection
>
> No, that certainly isn’t the normal behavior. I suspect it has to do with the
> nature of the VM TCP connection, though there is something very strange
> about your output. The BTL message indicates that an MPI job is already
> running. Yet your subsequent ORTE error message indicates we are still
> trying to start the daemons, which means we can’t have started the MPI job.
>
> So something is clearly confused.
>
>
> > On Nov 7, 2015, at 6:41 AM, Cristian RUIZ <[email protected]> wrote:
> >
> > Hello,
> >
> > I was studying how Open MPI reacts to failures. I have a virtual
> > infrastructure where failures can be emulated by turning off a given VM.
> > Depending on the way the VM is turned off, 'mpirun' will be notified
> > either because it receives a signal or because some timeout is reached.
> > In both cases the failure is detected after a few minutes. I ran some
> > tests with the NAS benchmarks and got the following output:
> >
> > [node-5][[12114,1],5][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv]
> > mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
> > [node-4][[12114,1],4][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv]
> > mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
> >
> > Then, after some minutes I got another message like this:
> >
> > ORTE was unable to reliably start one or more daemons.
> > This usually is caused by:
> >
> > * not finding the required libraries and/or binaries on
> > one or more nodes. Please check your PATH and LD_LIBRARY_PATH
> > settings, or configure OMPI with --enable-orterun-prefix-by-default
> >
> > * lack of authority to execute on one or more specified nodes.
> > Please verify your allocation and authorities.
> >
> > * the inability to write startup files into /tmp
> > (--tmpdir/orte_tmpdir_base).
> > Please check with your sys admin to determine the correct location to use.
> >
> > * compilation of the orted with dynamic libraries when static are required
> > (e.g., on Cray). Please check your configure cmd line and consider using
> >
> > However, 'mpirun' does not terminate (even after at least 30 minutes). The
> > execution stays blocked even though a failure was detected. Is this the
> > normal behavior of "mpirun"?
> >
> > OpenMPI version:
> >
> > root@node-0:~# mpirun --version
> > mpirun (Open MPI) 1.8.5
> >
> >
> > I appreciate your help