Thanks for answering. I tested again, this time on a real cluster where I can reboot machines at will. I ran a test on 32 machines, with one MPI process per machine, and rebooted one of the machines during the execution. I observed the same behavior: OpenMPI detects the failure, but the run blocks and has to be killed manually.
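As far as I understand, the default error handler on MPI_COMM_WORLD is MPI_ERRORS_ARE_FATAL, so the application itself never gets a chance to react to the failure. Just to illustrate what I mean, here is a minimal sketch of my own (not code taken from the benchmark) that installs MPI_ERRORS_RETURN and checks the return code of a collective; even then, my understanding is that with the TCP BTL the call may simply keep blocking rather than return an error:

/* Minimal sketch (my own, not from the benchmark): switch MPI_COMM_WORLD
 * from the default MPI_ERRORS_ARE_FATAL handler to MPI_ERRORS_RETURN so
 * that MPI calls hand an error code back to the application instead of
 * aborting the job. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, rc, len;
    int value = 1, sum = 0;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    rc = MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        /* With MPI_ERRORS_RETURN we at least get a code we can print. */
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "rank %d: Allreduce failed: %s\n", rank, msg);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Finalize();
    return 0;
}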
This is the output of the CG run:

root@graphene-30:~# mpirun --mca btl self,sm,tcp --machinefile machine_file cg.B.32

 NAS Parallel Benchmarks 3.3 -- CG Benchmark

 Size:      75000
 Iterations:    75
 Number of active processes:    32
 Number of nonzeroes per row:        13
 Eigenvalue shift: .600E+02

   iteration           ||r||                 zeta
        1       0.13257071746643E-12    59.9994751578754
        2       0.54021441387552E-15    21.7627846142538
        3       0.57508155930725E-15    22.2876617043225
        4       0.58907101679580E-15    22.5230738188352
        5       0.59342235842271E-15    22.6275390653890
        6       0.59736634325665E-15    22.6740259189537
        7       0.60192883908490E-15    22.6949056826254
        8       0.59984965235397E-15    22.7044023166871
        9       0.60134110898017E-15    22.7087834345616
       10       0.59805179779153E-15    22.7108351397172
       11       0.60025777990273E-15    22.7118107121337
[graphene-108.nancy.grid5000.fr][[1821,1],15][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[graphene-58.nancy.grid5000.fr][[1821,1],9][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[graphene-20.nancy.grid5000.fr][[1821,1],10][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[graphene-20.nancy.grid5000.fr][[1821,1],10][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 172.16.64.67 failed: Connection refused (111)

I see the same behavior with two versions: 1.8.5, which I compiled myself with the default options, and 1.6.5, installed from the Debian packages.
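In case it helps to reproduce this without the NAS suite, here is a minimal sketch I would use as a smaller test case (again my own code, not the benchmark; the iteration count and the sleep are arbitrary). I would expect it to show the same hang when one of the machines is rebooted mid-run:

/* Minimal reproducer sketch (my own, independent of the NAS suite): a loop
 * of collectives with enough time between iterations to reboot one of the
 * machines while the job is running. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size, iter;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (iter = 0; iter < 600; iter++) {
        int in = rank, out = 0;
        MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            printf("iteration %d, sum = %d (%d processes)\n", iter, out, size);
        sleep(1);  /* leave time to reboot one of the machines */
    }

    MPI_Finalize();
    return 0;
}

It can be compiled with mpicc and launched with the same mpirun command line as above (--mca btl self,sm,tcp --machinefile machine_file).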
----- Original Message -----
> From: "Ralph Castain" <r...@open-mpi.org>
> To: "Open MPI Users" <us...@open-mpi.org>
> Sent: Saturday, November 7, 2015 17:22:28
> Subject: Re: [OMPI users] Failure detection
>
> No, that certainly isn’t the normal behavior. I suspect it has to do with the
> nature of the VM TCP connection, though there is something very strange
> about your output. The BTL message indicates that an MPI job is already
> running. Yet your subsequent ORTE error message indicates we are still
> trying to start the daemons, which means we can’t have started the MPI job.
>
> So something is clearly confused.
>
>
> > On Nov 7, 2015, at 6:41 AM, Cristian RUIZ <cristian.r...@inria.fr> wrote:
> >
> > Hello,
> >
> > I was studying how OpenMPI reacts to failures. I have a virtual
> > infrastructure where failures can be emulated by turning off a given VM.
> > Depending on the way the VM is turned off, 'mpirun' will be notified,
> > either because it receives a signal or because some timeout is reached.
> > In both cases failures are detected after some minutes. I did some tests
> > with the NAS benchmarks and I got the following output:
> >
> > [node-5][[12114,1],5][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
> > [node-4][[12114,1],4][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
> >
> > Then, after some minutes I got another message like this:
> >
> > ORTE was unable to reliably start one or more daemons.
> > This usually is caused by:
> >
> > * not finding the required libraries and/or binaries on
> >   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
> >   settings, or configure OMPI with --enable-orterun-prefix-by-default
> >
> > * lack of authority to execute on one or more specified nodes.
> >   Please verify your allocation and authorities.
> >
> > * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
> >   Please check with your sys admin to determine the correct location to use.
> >
> > * compilation of the orted with dynamic libraries when static are required
> >   (e.g., on Cray). Please check your configure cmd line and consider using
> >
> > However, 'mpirun' does not terminate (after at least 30 minutes). The
> > execution is blocked even though a failure is detected. Is this the normal
> > behavior of "mpirun"?
> >
> > OpenMPI version:
> >
> > root@node-0:~# mpirun --version
> > mpirun (Open MPI) 1.8.5
> >
> > I appreciate your help