Nikolay --

Thanks for all the detail!  That helps a tremendous amount.
Open MPI actually uses IP networks in *two* ways:

1. for command and control
2. for MPI communications

Your use of btl_tcp_if_include regulates #2, but not #1 -- you need to add another MCA param to regulate #1.  Try this:

    mpirun --mca btl_tcp_if_include venet0:0 --mca oob_tcp_if_include venet0:0 ...

See if that works.  (There is also a sketch below the quoted message for making those settings persistent.)

> On Jun 24, 2016, at 5:40 AM, kna...@gmail.com wrote:
>
> Hi all!
>
> I am trying to build a cluster for MPI jobs using OpenVZ containers
> (https://openvz.org/Main_Page).
> I've been using OpenVZ + Open MPI successfully for many years, but I can't
> make it work with Open MPI 1.10.x.
> I have a server with OpenVZ support enabled.  The output of its ifconfig:
>
> [root@server]$ ifconfig
>
> eth0      Link encap:Ethernet  HWaddr **************************
>           inet addr:10.0.50.35  Bcast:10.0.50.255  Mask:255.255.255.0
>           inet6 addr: fe80::ec4:7aff:feb0:cf7e/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:6117448 errors:103 dropped:0 overruns:0 frame:56
>           TX packets:765411 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:3608033195 (3.3 GiB)  TX bytes:70005631 (66.7 MiB)
>           Memory:fb120000-fb13ffff
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:65536  Metric:1
>           RX packets:52 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:52 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:3788 (3.6 KiB)  TX bytes:3788 (3.6 KiB)
>
> venet0    Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>           inet6 addr: fe80::1/128 Scope:Link
>           UP BROADCAST POINTOPOINT RUNNING NOARP  MTU:1500  Metric:1
>           RX packets:486052 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:805540 errors:0 dropped:17 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:26815645 (25.5 MiB)  TX bytes:1186438623 (1.1 GiB)
>
> There are two OpenVZ containers running on that server:
> [root@server ~]# vzlist -a
>       CTID      NPROC STATUS    IP_ADDR         HOSTNAME
>        110         16 running   10.0.50.40      ct110.domain.org
>        111         11 running   10.0.50.41      ct111.domain.org
>
> On one of the containers I built Open MPI 1.10.3 with the following commands:
> $ ./configure --prefix=/opt/openmpi/1.10.3 CXX=g++ --with-cuda=/usr/local/cuda CC=gcc CFLAGS=-m64 CXXFLAGS=-m64 2>&1 | tee ~/openmpi-1.10.3_v1.log
>
> $ make -j20
>
> [root]$ make install
>
> So Open MPI was installed in /opt/openmpi/1.10.3/.  The second container is an
> exact clone of the first one.
>
> Passwordless ssh is enabled between both containers:
> [user@ct110 ~]$ ssh 10.0.50.41
> Last login: Fri Jun 24 16:49:03 2016 from 10.0.50.40
>
> [user@ct111 ~]$ ssh 10.0.50.40
> Last login: Fri Jun 24 16:37:35 2016 from 10.0.50.41
>
> But a simple test via MPI does not work:
> mpirun -np 1 -host 10.0.50.41 hostname
> [ct111.domain.org:00899] [[13749,0],1] tcp_peer_send_blocking: send() to
> socket 9 failed: Broken pipe (32)
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
>
> The environment on the host with the 10.0.50.41 IP address seems OK, though:
> [user@ct110 ~]$ ssh 10.0.50.41 env | grep PATH
> LD_LIBRARY_PATH=:/usr/local/cuda/lib64:/opt/openmpi/1.10.3/lib
> PATH=/usr/local/bin:/bin:/usr/bin:/home/user/bin:/usr/local/cuda/bin:/opt/openmpi/1.10.3/bin
>
> The ifconfig output from inside the containers:
> [root@ct110 /]# ifconfig
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:65536  Metric:1
>           RX packets:38 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:38 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:4559 (4.4 KiB)  TX bytes:4559 (4.4 KiB)
>
> venet0    Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>           inet addr:127.0.0.1  P-t-P:127.0.0.1  Bcast:0.0.0.0  Mask:255.255.255.255
>           UP BROADCAST POINTOPOINT RUNNING NOARP  MTU:1500  Metric:1
>           RX packets:772 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:853 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:112128 (109.5 KiB)  TX bytes:122092 (119.2 KiB)
>
> venet0:0  Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>           inet addr:10.0.50.40  P-t-P:10.0.50.40  Bcast:10.0.50.40  Mask:255.255.255.255
>           UP BROADCAST POINTOPOINT RUNNING NOARP  MTU:1500  Metric:1
>
> [root@ct111 /]# ifconfig
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:65536  Metric:1
>           RX packets:24 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:24 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:1200 (1.1 KiB)  TX bytes:1200 (1.1 KiB)
>
> venet0    Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>           inet addr:127.0.0.1  P-t-P:127.0.0.1  Bcast:0.0.0.0  Mask:255.255.255.255
>           UP BROADCAST POINTOPOINT RUNNING NOARP  MTU:1500  Metric:1
>           RX packets:855 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:774 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:122212 (119.3 KiB)  TX bytes:112304 (109.6 KiB)
>
> venet0:0  Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>           inet addr:10.0.50.41  P-t-P:10.0.50.41  Bcast:10.0.50.41  Mask:255.255.255.255
>           UP BROADCAST POINTOPOINT RUNNING NOARP  MTU:1500  Metric:1
>
> I get exactly the same error if I try to restrict the network interface to venet0:0:
> [user@ct110 ~]$ /opt/openmpi/1.10.3/bin/mpirun --mca btl self,tcp --mca btl_tcp_if_include venet0:0 -np 1 -host 10.0.50.41 hostname
> [ct111.domain.org:00945] [[13704,0],1] tcp_peer_send_blocking: send() to
> socket 9 failed: Broken pipe (32)
> [ ... snip....]
>
> However, I can successfully run hostname and the hello.bin executable on the
> same container that I submit from:
> [user@ct110 hello]$ /opt/openmpi/1.10.3/bin/mpirun --mca btl self,tcp --mca btl_tcp_if_include venet0:0 -np 1 -host 10.0.50.40 hostname
> ct110.domain.org
> [user@ct110 hello]$ /opt/openmpi/1.10.3/bin/mpirun --mca btl self,tcp --mca btl_tcp_if_include venet0:0 -np 1 -host 10.0.50.40 ./hello.bin
> Hello world! from processor 0 (name=ct110.domain.org ) out of 1
> wall clock time = 0.000002
>
> iptables is off on both containers.
>
> I would assume I am hitting bug #3339
> (https://svn.open-mpi.org/trac/ompi/ticket/3339), but I have another cluster
> based on OpenVZ containers with Open MPI 1.6.5 that has worked perfectly for
> several years.
>
> I would appreciate any help with this issue.
>
> Best regards,
> Nikolay.
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/06/29540.php

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
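P.S.  If adding oob_tcp_if_include fixes it, you don't have to type both flags on every mpirun: Open MPI also reads MCA parameters from a per-user file.  A minimal sketch, assuming the default location $HOME/.openmpi/mca-params.conf inside each container:

    # $HOME/.openmpi/mca-params.conf -- read by Open MPI at startup
    # Pin both the out-of-band (command and control) layer and the TCP BTL
    # (MPI communications) to the venet0:0 interface inside the container.
    oob_tcp_if_include = venet0:0
    btl_tcp_if_include = venet0:0

With that file in place on both containers, a plain "mpirun -np 1 -host 10.0.50.41 hostname" should pick up the same settings.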
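Also, since the ORTE help text above calls out PATH / LD_LIBRARY_PATH on the remote side: if the interface settings alone don't do it, one way to take the remote environment out of the picture is to rebuild with the install prefix baked into the launcher.  This is only a sketch; it just adds --enable-orterun-prefix-by-default to the configure line you already used:

    ./configure --prefix=/opt/openmpi/1.10.3 --enable-orterun-prefix-by-default \
        --with-cuda=/usr/local/cuda CC=gcc CXX=g++ CFLAGS=-m64 CXXFLAGS=-m64
    make -j20
    make install    # as root, as before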
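Finally, for anyone following along who wants to reproduce the test: the hello.bin output above looks like it came from a minimal MPI program along these lines.  This is only a hypothetical reconstruction (the real source wasn't posted), but it exercises the same code path:

    /* hello.c -- minimal MPI test; hypothetical reconstruction of hello.bin */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, name_len;
        char name[MPI_MAX_PROCESSOR_NAME];
        double t0, t1;

        MPI_Init(&argc, &argv);
        t0 = MPI_Wtime();

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's rank    */
        MPI_Comm_size(MPI_COMM_WORLD, &size);      /* total number of ranks  */
        MPI_Get_processor_name(name, &name_len);   /* e.g. ct110.domain.org  */

        printf("Hello world! from processor %d (name=%s ) out of %d\n",
               rank, name, size);
        t1 = MPI_Wtime();
        printf("wall clock time = %f\n", t1 - t0);

        MPI_Finalize();
        return 0;
    }

Build it with the wrapper compiler from the same install, e.g. /opt/openmpi/1.10.3/bin/mpicc hello.c -o hello.bin.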