Okay, let’s try doing this: mpirun -mca oob_tcp_if_include br0 …
This will restrict us to the br0 interface that is common to the two nodes. I note that your “node1” has two interfaces on the same subnet (192.168.1), which is usually a “no-no” that can cause trouble. Let’s see if removing that confusion helps.

> On Sep 20, 2015, at 3:45 PM, Jorge D'Elia <jde...@intec.unl.edu.ar> wrote:
>
> Hi Ralph,
>
> Many thanks for your fast answer!
>
> ----- Original message -----
>> From: "Ralph Castain" <r...@open-mpi.org>
>> To: "Open MPI Users" <us...@open-mpi.org>
>> Sent: Sunday, September 20, 2015 18:16:56
>> Subject: Re: [OMPI users] send() to socket 9 failed with the 1.10.0 version
>> but not with 1.8.7 one.
>>
>> Is the connection from node1 to the head node a direct one,
>> or is there a difference in the ethernet subnets between them?
>
> The connection from node1 to the head node is a direct one,
> i.e. from the head node to the switch and from the switch to
> the computing nodes.
>
>> Can you show us the output of ifconfig from each node?
>
> Yes of course! Please see attached tgz file that also
> contains the ompi_info logs.
>
> Thanks.
> Jorge.
>
>>> On Sep 20, 2015, at 12:19 PM, Jorge D'Elia <jde...@intec.unl.edu.ar> wrote:
>>>
>>> Hi all,
>>>
>>> We have used the Open MPI distributions up to the 1.8.7 version
>>> without any problem in a small LINUX cluster built with diskless
>>> nodes (x86_64, Fedora 17, Linux version 4.1.1 (gcc version 4.7.2
>>> 20120921 (Red Hat 4.7.2-2) (GCC))).
>>>
>>> However, from the 1.8.8 version, we have a problem with the
>>> mpirun command.
>>>
>>> For instance, with the 1.10.0 Open MPI version, we can launch MPI
>>> jobs across multiple node hosts and server successfully only if they
>>> are launched from any node but not from the server. In order to
>>> fix it, following the hints given in
>>>
>>> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
>>>
>>> we have tried a simple test:
>>>
>>> [jdelia@coyote ~]$ which mpirun
>>> /usr/beta/openmpi/bin/mpirun
>>> [jdelia@coyote ~]$ mpirun --version
>>> mpirun (Open MPI) 1.10.0
>>> [jdelia@coyote ~]$ hostname
>>> coyote
>>> [jdelia@coyote ~]$ ssh node1
>>> [jdelia@node1 ~]$ mpirun --host coyote hostname
>>> coyote
>>> [jdelia@node1 ~]$ exit
>>> logout
>>> Connection to node1 closed.
>>> [jdelia@coyote ~]$ mpirun --host node1 hostname
>>> [node1:17861] [[8026,0],1] tcp_peer_send_blocking: send() to socket 9
>>> failed: Broken pipe (32)
>>> --------------------------------------------------------------------------
>>> ORTE was unable to reliably start one or more daemons.
>>> This usually is caused by:
>>> ... snip ...
>>> --------------------------------------------------------------------------
>>>
>>> The PATH and LD_LIBRARY_PATH in coyote (server) and node1
>>> were reduced to
>>>
>>> [jdelia@coyote ]$ ssh coyote env | grep -i PATH
>>> LD_LIBRARY_PATH=/usr/beta/openmpi/lib:/usr/beta/gcc-trunk/lib:/usr/beta/gcc-trunk/lib64:/usr/lib:/usr/lib64:/usr/local/lib:/usr/local/lib64
>>> PATH=.:/usr/beta/openmpi/bin:/usr/beta/gcc-trunk/bin:/usr/lib64/ccache:/usr/bin:/usr/sbin/usr/local/bin:/usr/local/sbin
>>> MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
>>> QT_PLUGIN_PATH=/usr/lib64/kde4/plugins:/usr/lib/kde4/plugins
>>>
>>> [jdelia@coyote ]$ ssh node1 env | grep -i PATH
>>> LD_LIBRARY_PATH=/usr/beta/openmpi/lib:/usr/beta/gcc-trunk/lib:/usr/beta/gcc-trunk/lib64:/usr/lib:/usr/lib64:/usr/local/lib:/usr/local/lib64
>>> PATH=.:/usr/beta/openmpi/bin:/usr/beta/gcc-trunk/bin:/usr/lib64/ccache:/usr/bin:/usr/sbin/usr/local/bin:/usr/local/sbin
>>> MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
>>>
>>> Until the 1.8.7 version these tests were all OK. Then, several
>>> openmpi distributions were rebuilt using the gcc compilers,
>>> both with the system version
>>>
>>> gcc (GCC) 4.7.2 20120921 (Red Hat 4.7.2-2)
>>>
>>> as with the experimental one
>>>
>>> $ gcc --version
>>> gcc (GCC) 6.0.0 20150919 (experimental)
>>>
>>> but without luck. Again, if we go back to 1.8.7 version, and
>>> using the same environment variables, all tests are OK.
>>>
>>> Please, any clue in order to fix this trouble?
>>>
>>> We try to attach the configure log files of the 1.8.7
>>> and 1.8.10 versions using the beta gcc compiler.
>>>
>>> Thanks in advance.
>>>
>>> Regards,
>>> Jorge.
>>> --
>>> CIMEC (UNL-CONICET), http://www.cimec.org.ar/
>>> Predio CONICET-Santa Fe, Colec. Ruta Nac. 168,
>>> Paraje El Pozo, S3000GLN, Santa Fe, ARGENTINA
>>> Univ Nac Litoral (UNL). Cons Nac Inv Científ y Técn (CONICET)
>>> <logs.tgz>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2015/09/27633.php
>>
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/09/27636.php
>
> <ifconfig-ompi-info-log.tgz>
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/09/27638.php
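
P.S. The point in my reply about two interfaces sharing a subnet can be checked mechanically. Here is a minimal Python sketch, assuming you copy each node's interface names and addresses out of its ifconfig output by hand. Only br0 and the 192.168.1 subnet come from this thread; the second interface name (em1), the host addresses, and the /24 prefix are made-up placeholders, and the function names are my own.

```python
import ipaddress
from collections import defaultdict


def interfaces_by_subnet(addrs):
    """Group interface names by the IPv4 network they belong to.

    addrs maps an interface name to an "address/prefix" string.
    """
    groups = defaultdict(list)
    for name, cidr in addrs.items():
        # ip_interface() accepts a host address with a prefix length;
        # .network is the enclosing network with the host bits cleared.
        network = ipaddress.ip_interface(cidr).network
        groups[network].append(name)
    return dict(groups)


def shared_subnets(addrs):
    """Return only the networks that carry more than one interface."""
    return {
        net: sorted(names)
        for net, names in interfaces_by_subnet(addrs).items()
        if len(names) > 1
    }


# Hypothetical data for node1: "em1" and all host addresses are
# placeholders, not values taken from the attached ifconfig logs.
node1 = {
    "br0": "192.168.1.2/24",
    "em1": "192.168.1.3/24",
    "lo": "127.0.0.1/8",
}

for net, names in shared_subnets(node1).items():
    print(f"{net}: {', '.join(names)}")
```

Any network reported here has more than one interface attached, which is the situation that restricting the run to a single named interface is meant to sidestep.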