----- Original Message -----
> From: "Ralph Castain" <r...@open-mpi.org>
> To: "Open MPI Users" <us...@open-mpi.org>
> Sent: Monday, September 21, 2015 1:42:08
> Subject: Re: [OMPI users] send() to socket 9 failed with the 1.10.0
> version but not with 1.8.7 one.
>
> Okay, let’s try doing this:
>
>     mpirun -mca oob_tcp_if_include br0 …
>
> This will restrict us to the br0 interface that is common to the two nodes.

It works fine! Here I copy and paste a session using the
hello_usempi_f08.f90 sample:

[jdelia@coyote 1.10.0]$ mpifort --version
GNU Fortran (GCC) 6.0.0 20150919 (experimental)
Copyright (C) 2015 Free Software Foundation, Inc.

[jdelia@coyote 1.10.0]$ mpirun --version
mpirun (Open MPI) 1.10.0
Report bugs to http://www.open-mpi.org/community/help/

[jdelia@coyote 1.10.0]$ mpifort -o hello_usempi_f08.exe hello_usempi_f08.f90

[jdelia@coyote 1.10.0]$ cat ~/machi-openmpi.dat
coyote slots=2 max_slots=2
node1  slots=2 max_slots=6
node2  slots=2 max_slots=8

[jdelia@coyote 1.10.0]$ mpirun --mca btl self,tcp --map-by node --mca oob_tcp_if_include br0 --np 5 --report-bindings --machinefile ~/machi-openmpi.dat hello_usempi_f08.exe
[coyote:28957] MCW rank 3 is not bound (or bound to all available processors)
[coyote:28957] MCW rank 0 is not bound (or bound to all available processors)
[node2:11855] MCW rank 2 is not bound (or bound to all available processors)
[node1:24048] MCW rank 4 is not bound (or bound to all available processors)
[node1:24048] MCW rank 1 is not bound (or bound to all available processors)
Hello, world, I am 0 of 5: Open MPI v1.10, package: Open MPI jdelia@coyote Distribution, ident: 1.10.0, repo rev: v1.10-dev-293-gf694355, Aug 24, 2015
Hello, world, I am 3 of 5: Open MPI v1.10, package: Open MPI jdelia@coyote Distribution, ident: 1.10.0, repo rev: v1.10-dev-293-gf694355, Aug 24, 2015
Hello, world, I am 2 of 5: Open MPI v1.10, package: Open MPI jdelia@coyote Distribution, ident: 1.10.0, repo rev: v1.10-dev-293-gf694355, Aug 24, 2015
Hello, world, I am 1 of 5: Open MPI v1.10, package: Open MPI jdelia@coyote Distribution, ident: 1.10.0, repo rev: v1.10-dev-293-gf694355, Aug 24, 2015
Hello, world, I am 4 of 5: Open MPI v1.10, package: Open MPI jdelia@coyote Distribution, ident: 1.10.0, repo rev: v1.10-dev-293-gf694355, Aug 24, 2015
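
For reference, here is a minimal sketch of what the sample does,
reconstructed from the output above (it may differ in details from the
file shipped in the Open MPI examples/ directory):

    program hello_usempi_f08
      use mpi_f08        ! Fortran 2008 MPI bindings
      implicit none
      integer :: rank, nproc, vlen
      character(len=MPI_MAX_LIBRARY_VERSION_STRING) :: version

      call MPI_Init()    ! with mpi_f08 the ierror argument is optional
      call MPI_Comm_rank(MPI_COMM_WORLD, rank)
      call MPI_Comm_size(MPI_COMM_WORLD, nproc)
      call MPI_Get_library_version(version, vlen)
      write (*, '("Hello, world, I am ",i0," of ",i0,": ",a)') &
            rank, nproc, version(1:vlen)
      call MPI_Finalize()
    end program hello_usempi_f08

As a side note, to avoid typing --mca oob_tcp_if_include br0 on every
run, we could set it once in the per-user MCA parameter file
$HOME/.openmpi/mca-params.conf with the single line

    oob_tcp_if_include = br0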

> I note that your “node1” has two interfaces on the same subnet
> (192.168.1), which is usually a “no-no” that can cause trouble.
> Let’s see if removing that confusion helps.

OK. Thanks for noticing. We will try to remove it and will let you
know.
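
To double-check which addresses each host actually exposes, we will
run something like this (a sketch: it assumes the password-less ssh
already in use above, and ifconfig living in /sbin as on our Fedora
nodes):

    for h in coyote node1 node2; do
        echo "== $h ==" ; ssh $h '/sbin/ifconfig | grep -w inet'
    done

and verify that the 192.168.1.x subnet appears only once per node.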

Regards,
Jorge.

> > On Sep 20, 2015, at 3:45 PM, Jorge D'Elia <jde...@intec.unl.edu.ar> wrote:
> >
> > Hi Ralph,
> >
> > Many thanks for your fast answer!
> >
> > ----- Original Message -----
> >> From: "Ralph Castain" <r...@open-mpi.org>
> >> To: "Open MPI Users" <us...@open-mpi.org>
> >> Sent: Sunday, September 20, 2015 18:16:56
> >> Subject: Re: [OMPI users] send() to socket 9 failed with the 1.10.0
> >> version but not with 1.8.7 one.
> >>
> >> Is the connection from node1 to the head node a direct one,
> >> or is there a difference in the ethernet subnets between them?
> >
> > The connection from node1 to the head node is a direct one,
> > i.e. from the head node to the switch and from the switch to
> > the computing nodes.
> >
> >> Can you show us the output of ifconfig from each node?
> >
> > Yes, of course! Please see the attached tgz file, which also
> > contains the ompi_info logs.
> >
> > Thanks.
> > Jorge.
> >
> >>> On Sep 20, 2015, at 12:19 PM, Jorge D'Elia <jde...@intec.unl.edu.ar> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> We have used the Open MPI distributions up to the 1.8.7 version
> >>> without any problem in a small Linux cluster built with diskless
> >>> nodes (x86_64, Fedora 17, Linux version 4.1.1 (gcc version 4.7.2
> >>> 20120921 (Red Hat 4.7.2-2) (GCC))).
> >>>
> >>> However, since the 1.8.8 version we have had a problem with the
> >>> mpirun command.
> >>>
> >>> For instance, with the 1.10.0 Open MPI version, we can launch MPI
> >>> jobs across multiple node hosts and the server successfully only
> >>> if they are launched from any node, but not from the server. In
> >>> order to fix this, following the hints given in
> >>>
> >>> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
> >>>
> >>> we tried a simple test:
> >>>
> >>> [jdelia@coyote ~]$ which mpirun
> >>> /usr/beta/openmpi/bin/mpirun
> >>> [jdelia@coyote ~]$ mpirun --version
> >>> mpirun (Open MPI) 1.10.0
> >>> [jdelia@coyote ~]$ hostname
> >>> coyote
> >>> [jdelia@coyote ~]$ ssh node1
> >>> [jdelia@node1 ~]$ mpirun --host coyote hostname
> >>> coyote
> >>> [jdelia@node1 ~]$ exit
> >>> logout
> >>> Connection to node1 closed.
> >>> [jdelia@coyote ~]$ mpirun --host node1 hostname
> >>> [node1:17861] [[8026,0],1] tcp_peer_send_blocking: send() to socket 9
> >>> failed: Broken pipe (32)
> >>> --------------------------------------------------------------------------
> >>> ORTE was unable to reliably start one or more daemons.
> >>> This usually is caused by:
> >>> ... snip ...
> >>> --------------------------------------------------------------------------
> >>>
> >>> The PATH and LD_LIBRARY_PATH on coyote (the server) and on node1
> >>> were reduced to:
> >>>
> >>> [jdelia@coyote ]$ ssh coyote env | grep -i PATH
> >>> LD_LIBRARY_PATH=/usr/beta/openmpi/lib:/usr/beta/gcc-trunk/lib:/usr/beta/gcc-trunk/lib64:/usr/lib:/usr/lib64:/usr/local/lib:/usr/local/lib64
> >>> PATH=.:/usr/beta/openmpi/bin:/usr/beta/gcc-trunk/bin:/usr/lib64/ccache:/usr/bin:/usr/sbin/usr/local/bin:/usr/local/sbin
> >>> MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
> >>> QT_PLUGIN_PATH=/usr/lib64/kde4/plugins:/usr/lib/kde4/plugins
> >>>
> >>> [jdelia@coyote ]$ ssh node1 env | grep -i PATH
> >>> LD_LIBRARY_PATH=/usr/beta/openmpi/lib:/usr/beta/gcc-trunk/lib:/usr/beta/gcc-trunk/lib64:/usr/lib:/usr/lib64:/usr/local/lib:/usr/local/lib64
> >>> PATH=.:/usr/beta/openmpi/bin:/usr/beta/gcc-trunk/bin:/usr/lib64/ccache:/usr/bin:/usr/sbin/usr/local/bin:/usr/local/sbin
> >>> MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
> >>>
> >>> Up to the 1.8.7 version these tests were all OK. Then, several
> >>> Open MPI distributions were rebuilt using the gcc compilers,
> >>> both with the system version
> >>>
> >>> gcc (GCC) 4.7.2 20120921 (Red Hat 4.7.2-2)
> >>>
> >>> and with the experimental one
> >>>
> >>> $ gcc --version
> >>> gcc (GCC) 6.0.0 20150919 (experimental)
> >>>
> >>> but without luck. Again, if we go back to the 1.8.7 version and
> >>> use the same environment variables, all tests are OK.
> >>>
> >>> Please, do you have any clue on how to fix this problem?
> >>>
> >>> We attach the configure log files of the 1.8.7 and 1.8.10
> >>> versions, both built with the beta gcc compiler.
> >>>
> >>> Thanks in advance.
> >>>
> >>> Regards,
> >>> Jorge.
> >>> --
> >>> CIMEC (UNL-CONICET), http://www.cimec.org.ar/
> >>> Predio CONICET-Santa Fe, Colec. Ruta Nac. 168,
> >>> Paraje El Pozo, S3000GLN, Santa Fe, ARGENTINA
> >>> Univ Nac Litoral (UNL). Cons Nac Inv Científ y Técn (CONICET)
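
P.S.: For anyone else who hits this "tcp_peer_send_blocking: send()
... failed: Broken pipe" symptom: before suspecting the interface
selection, it is worth confirming that the remote Open MPI daemon can
be found over a non-interactive ssh (the kind of check behind the FAQ
link quoted above), e.g.:

    ssh node1 which orted

In our case that part was fine, and restricting the OOB traffic to
br0 was what solved it.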

_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/09/27641.php