Nikolay --

Thanks for all the detail!  That helps a tremendous amount.

Open MPI actually uses IP networks in *two* ways:

1. for command and control
2. for MPI communications

Your use of btl_tcp_if_include regulates #2, but not #1 -- you need to add 
another MCA param to regulate #1.  Try this:

    mpirun --mca btl_tcp_if_include venet0:0 --mca oob_tcp_if_include venet0:0 ...

See if that works.
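
If that does the trick, you can make it permanent so you don't have to type the
params on every command line.  As a sketch -- assuming the default per-user MCA
params file location, $HOME/.openmpi/mca-params.conf -- something like this
should work:

    # $HOME/.openmpi/mca-params.conf
    # Restrict MPI traffic (btl) and ORTE command/control (oob) to venet0:0
    btl = self,tcp
    btl_tcp_if_include = venet0:0
    oob_tcp_if_include = venet0:0

Both *_if_include params should also accept CIDR notation (e.g., 10.0.50.0/24),
which can be handy if the interface names differ between the host and the
containers.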


> On Jun 24, 2016, at 5:40 AM, kna...@gmail.com wrote:
> 
> Hi all!
> 
> I am trying to build a cluster for MPI jobs using OpenVZ containers 
> (https://openvz.org/Main_Page).
> I've been using OpenVZ + Open MPI successfully for many years, but I can't make 
> it work with Open MPI 1.10.x.
> So I have a server with OpenVZ support enabled. The output of its ifconfig:
> 
> [root@server]$ ifconfig
> 
> eth0   Link encap:Ethernet  HWaddr **************************
>          inet addr:10.0.50.35  Bcast:10.0.50.255  Mask:255.255.255.0
>          inet6 addr: fe80::ec4:7aff:feb0:cf7e/64 Scope:Link
>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>          RX packets:6117448 errors:103 dropped:0 overruns:0 frame:56
>          TX packets:765411 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:1000
>          RX bytes:3608033195 (3.3 GiB)  TX bytes:70005631 (66.7 MiB)
>          Memory:fb120000-fb13ffff
> 
> lo        Link encap:Local Loopback
>          inet addr:127.0.0.1  Mask:255.0.0.0
>          inet6 addr: ::1/128 Scope:Host
>          UP LOOPBACK RUNNING  MTU:65536  Metric:1
>          RX packets:52 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:52 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:0
>          RX bytes:3788 (3.6 KiB)  TX bytes:3788 (3.6 KiB)
> 
> venet0 Link encap:UNSPEC  HWaddr 
> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>          inet6 addr: fe80::1/128 Scope:Link
>          UP BROADCAST POINTOPOINT RUNNING NOARP  MTU:1500  Metric:1
>          RX packets:486052 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:805540 errors:0 dropped:17 overruns:0 carrier:0
>          collisions:0 txqueuelen:0
>          RX bytes:26815645 (25.5 MiB)  TX bytes:1186438623 (1.1 GiB)
> 
> There are two openvz containers running on that server:
> [root@server ~]# vzlist -a
>      CTID      NPROC STATUS    IP_ADDR         HOSTNAME
>       110         16 running   10.0.50.40      ct110.domain.org
>       111         11 running   10.0.50.41      ct111.domain.org
> 
> On one of the containers I've built Open MPI 1.10.3 with the following commands:
> $ ./configure --prefix=/opt/openmpi/1.10.3 CXX=g++ 
> --with-cuda=/usr/local/cuda CC=gcc CFLAGS=-m64 CXXFLAGS=-m64 2>&1|tee 
> ~/openmpi-1.10.3_v1.log
> 
> $ make -j20
> 
> [root]$ make install
> 
> So Open MPI was installed in /opt/openmpi/1.10.3/. The second container is an 
> exact clone of the first one.
> 
> Passwordless ssh is enabled between the two containers:
> [user@ct110 ~]$ ssh 10.0.50.41
> Last login: Fri Jun 24 16:49:03 2016 from 10.0.50.40
> 
> [user@ct111 ~]$ ssh 10.0.50.40
> Last login: Fri Jun 24 16:37:35 2016 from 10.0.50.41
> 
> But a simple test via MPI does not work:
> mpirun -np 1 -host 10.0.50.41 hostname
> [ct111.domain.org:00899] [[13749,0],1] tcp_peer_send_blocking: send() to 
> socket 9 failed: Broken pipe (32)
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>  settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>  Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>  Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>  (e.g., on Cray). Please check your configure cmd line and consider using
>  one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>  lack of common network interfaces and/or no route found between
>  them. Please check network connectivity (including firewalls
>  and network routing requirements).
> --------------------------------------------------------------------------
> 
> Although the environment on the host with the 10.0.50.41 IP address seems OK:
> [user@ct110 ~]$ ssh 10.0.50.41 env|grep PATH
> LD_LIBRARY_PATH=:/usr/local/cuda/lib64:/opt/openmpi/1.10.3/lib
> PATH=/usr/local/bin:/bin:/usr/bin:/home/user/bin:/usr/local/cuda/bin:/opt/openmpi/1.10.3/bin
> 
> The ifconfig output from inside the containers:
> [root@ct110 /]# ifconfig
> lo        Link encap:Local Loopback
>          inet addr:127.0.0.1  Mask:255.0.0.0
>          inet6 addr: ::1/128 Scope:Host
>          UP LOOPBACK RUNNING  MTU:65536  Metric:1
>          RX packets:38 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:38 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:0
>          RX bytes:4559 (4.4 KiB)  TX bytes:4559 (4.4 KiB)
> 
> venet0    Link encap:UNSPEC  HWaddr 
> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>          inet addr:127.0.0.1  P-t-P:127.0.0.1  Bcast:0.0.0.0  
> Mask:255.255.255.255
>          UP BROADCAST POINTOPOINT RUNNING NOARP  MTU:1500  Metric:1
>          RX packets:772 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:853 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:0
>          RX bytes:112128 (109.5 KiB)  TX bytes:122092 (119.2 KiB)
> 
> venet0:0  Link encap:UNSPEC  HWaddr 
> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>          inet addr:10.0.50.40  P-t-P:10.0.50.40  Bcast:10.0.50.40  
> Mask:255.255.255.255
>          UP BROADCAST POINTOPOINT RUNNING NOARP  MTU:1500  Metric:1
> 
> [root@ct111 /]# ifconfig
> lo        Link encap:Local Loopback
>          inet addr:127.0.0.1  Mask:255.0.0.0
>          inet6 addr: ::1/128 Scope:Host
>          UP LOOPBACK RUNNING  MTU:65536  Metric:1
>          RX packets:24 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:24 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:0
>          RX bytes:1200 (1.1 KiB)  TX bytes:1200 (1.1 KiB)
> 
> venet0    Link encap:UNSPEC  HWaddr 
> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>          inet addr:127.0.0.1  P-t-P:127.0.0.1  Bcast:0.0.0.0  
> Mask:255.255.255.255
>          UP BROADCAST POINTOPOINT RUNNING NOARP  MTU:1500  Metric:1
>          RX packets:855 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:774 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:0
>          RX bytes:122212 (119.3 KiB)  TX bytes:112304 (109.6 KiB)
> 
> venet0:0  Link encap:UNSPEC  HWaddr 
> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>          inet addr:10.0.50.41  P-t-P:10.0.50.41  Bcast:10.0.50.41  
> Mask:255.255.255.255
>          UP BROADCAST POINTOPOINT RUNNING NOARP  MTU:1500  Metric:1
> 
> I get exactly the same error if I try to restrict the network interface to 
> venet0:0:
> [user@ct110 ~]$ /opt/openmpi/1.10.3/bin/mpirun --mca btl self,tcp --mca 
> btl_tcp_if_include venet0:0  -np 1 -host 10.0.50.41 hostname
> [ct111.domain.org:00945] [[13704,0],1] tcp_peer_send_blocking: send() to 
> socket 9 failed: Broken pipe (32)
> [ ... snip....]
> 
> Although I can successfully run hostname and the hello.bin executable on the 
> same container that I submit from:
> [user@ct110 hello]$ /opt/openmpi/1.10.3/bin/mpirun --mca btl self,tcp --mca 
> btl_tcp_if_include venet0:0  -np 1 -host 10.0.50.40 hostname
> ct110.domain.org
> [user@ct110 hello]$ /opt/openmpi/1.10.3/bin/mpirun --mca btl self,tcp --mca 
> btl_tcp_if_include venet0:0  -np 1 -host 10.0.50.40 ./hello.bin
> Hello world! from processor 0 (name=ct110.domain.org ) out of 1
> wall clock time = 0.000002
> 
> Iptables is off on both containers.
> 
> I would assume that I have hit bug #3339 
> (https://svn.open-mpi.org/trac/ompi/ticket/3339), but I have another cluster 
> based on OpenVZ containers with Open MPI 1.6.5 that has worked perfectly for 
> several years.
> 
> I would appreciate any help with this issue.
> 
> Best regards,
> Nikolay.
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/06/29540.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
