Okay, let’s try doing this:

mpirun -mca oob_tcp_if_include br0 …

This will restrict us to the br0 interface that is common to the two nodes. I 
note that your “node1” has two interfaces on the same subnet (192.168.1), which 
is usually a “no-no” that can cause trouble. Let’s see if removing that 
confusion helps.


> On Sep 20, 2015, at 3:45 PM, Jorge D'Elia <jde...@intec.unl.edu.ar> wrote:
> 
> Hi Ralph,
> 
> Many thanks for your fast answer!
> 
> ----- Original Message -----
>> From: "Ralph Castain" <r...@open-mpi.org>
>> To: "Open MPI Users" <us...@open-mpi.org>
>> Sent: Sunday, September 20, 2015 18:16:56
>> Subject: Re: [OMPI users] send() to socket 9 failed with the 1.10.0 version 
>> but not with 1.8.7 one.
>> 
>> Is the connection from node1 to the head node a direct one, 
>> or is there a difference in the ethernet subnets between them? 
> 
> The connection from node1 to the head node is a direct one, 
> i.e. from the head node to the switch and from the switch to 
> the computing nodes.
> 
>> Can you show us the output of ifconfig from each node?
> 
> Yes of course! Please see attached tgz file that also 
> contains the ompi_info logs.
> 
> Thanks.
> Jorge.
> 
>>> On Sep 20, 2015, at 12:19 PM, Jorge D'Elia <jde...@intec.unl.edu.ar> wrote:
>>> 
>>> Hi all,
>>> 
>>> We have used the Open MPI distributions up to the 1.8.7 version
>>> without any problem in a small LINUX cluster built with diskless
>>> nodes (x86_64, Fedora 17, Linux version 4.1.1 (gcc version 4.7.2
>>> 20120921 (Red Hat 4.7.2-2) (GCC))).
>>> 
>>> However, starting with the 1.8.8 version, we have a problem with
>>> the mpirun command.
>>> 
>>> For instance, with Open MPI 1.10.0, we can successfully launch MPI
>>> jobs across multiple compute nodes and the server only when they
>>> are launched from a node, not from the server. To fix this,
>>> following the hints given in
>>> 
>>> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
>>> 
>>> we have tried a simple test:
>>> 
>>> [jdelia@coyote ~]$ which mpirun
>>> /usr/beta/openmpi/bin/mpirun
>>> [jdelia@coyote ~]$ mpirun --version
>>> mpirun (Open MPI) 1.10.0
>>> [jdelia@coyote ~]$ hostname
>>> coyote
>>> [jdelia@coyote ~]$ ssh node1
>>> [jdelia@node1 ~]$ mpirun --host coyote hostname
>>> coyote
>>> [jdelia@node1 ~]$ exit
>>> logout
>>> Connection to node1 closed.
>>> [jdelia@coyote ~]$ mpirun --host node1 hostname
>>> [node1:17861] [[8026,0],1] tcp_peer_send_blocking: send() to socket 9
>>> failed: Broken pipe (32)
>>> --------------------------------------------------------------------------
>>> ORTE was unable to reliably start one or more daemons.
>>> This usually is caused by:
>>> ... snip ...
>>> --------------------------------------------------------------------------
>>> 
>>> The PATH and LD_LIBRARY_PATH in coyote (server) and node1
>>> were reduced to
>>> 
>>> [jdelia@coyote ]$ ssh coyote env | grep -i PATH
>>> LD_LIBRARY_PATH=/usr/beta/openmpi/lib:/usr/beta/gcc-trunk/lib:/usr/beta/gcc-trunk/lib64:/usr/lib:/usr/lib64:/usr/local/lib:/usr/local/lib64
>>> PATH=.:/usr/beta/openmpi/bin:/usr/beta/gcc-trunk/bin:/usr/lib64/ccache:/usr/bin:/usr/sbin/usr/local/bin:/usr/local/sbin
>>> MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
>>> QT_PLUGIN_PATH=/usr/lib64/kde4/plugins:/usr/lib/kde4/plugins
>>> 
>>> [jdelia@coyote ]$ ssh node1  env | grep -i PATH
>>> LD_LIBRARY_PATH=/usr/beta/openmpi/lib:/usr/beta/gcc-trunk/lib:/usr/beta/gcc-trunk/lib64:/usr/lib:/usr/lib64:/usr/local/lib:/usr/local/lib64
>>> PATH=.:/usr/beta/openmpi/bin:/usr/beta/gcc-trunk/bin:/usr/lib64/ccache:/usr/bin:/usr/sbin/usr/local/bin:/usr/local/sbin
>>> MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
>>> 
>>> Up to version 1.8.7 these tests all passed. We then rebuilt
>>> several Open MPI distributions using the gcc compilers,
>>> both with the system version
>>> 
>>> gcc (GCC) 4.7.2 20120921 (Red Hat 4.7.2-2)
>>> 
>>> and with the experimental one
>>> 
>>> $ gcc --version
>>> gcc (GCC) 6.0.0 20150919 (experimental)
>>> 
>>> but without luck. Again, if we go back to version 1.8.7, using
>>> the same environment variables, all tests pass.
>>> 
>>> Any clue on how to fix this problem?
>>> 
>>> We are attaching the configure log files of the 1.8.7
>>> and 1.8.10 versions, built with the beta gcc compiler.
>>> 
>>> Thanks in advance.
>>> 
>>> Regards,
>>> Jorge.
>>> --
>>> CIMEC (UNL-CONICET), http://www.cimec.org.ar/
>>> Predio CONICET-Santa Fe, Colec. Ruta Nac. 168,
>>> Paraje El Pozo, S3000GLN, Santa Fe, ARGENTINA
>>> Univ Nac Litoral (UNL). Cons Nac Inv Científ y Técn (CONICET)
>>> <logs.tgz>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2015/09/27633.php
>> 
> <ifconfig-ompi-info-log.tgz>
