Hi all!
I am trying to build a cluster for MPI jobs using OpenVZ containers
(https://openvz.org/Main_Page).
I've been using OpenVZ + Open MPI successfully for many years, but I can't make it work with Open MPI
1.10.x.
So I have a server with OpenVZ support enabled. The output of its ifconfig:
[root@server]$ ifconfig
eth0 Link encap:Ethernet HWaddr **************************
inet addr:10.0.50.35 Bcast:10.0.50.255 Mask:255.255.255.0
inet6 addr: fe80::ec4:7aff:feb0:cf7e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:6117448 errors:103 dropped:0 overruns:0 frame:56
TX packets:765411 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3608033195 (3.3 GiB) TX bytes:70005631 (66.7 MiB)
Memory:fb120000-fb13ffff
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:52 errors:0 dropped:0 overruns:0 frame:0
TX packets:52 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:3788 (3.6 KiB) TX bytes:3788 (3.6 KiB)
venet0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet6 addr: fe80::1/128 Scope:Link
UP BROADCAST POINTOPOINT RUNNING NOARP MTU:1500 Metric:1
RX packets:486052 errors:0 dropped:0 overruns:0 frame:0
TX packets:805540 errors:0 dropped:17 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:26815645 (25.5 MiB) TX bytes:1186438623 (1.1 GiB)
There are two openvz containers running on that server:
[root@server ~]# vzlist -a
CTID NPROC STATUS IP_ADDR HOSTNAME
110 16 running 10.0.50.40 ct110.domain.org
111 11 running 10.0.50.41 ct111.domain.org
On one of the containers I built Open MPI 1.10.3 with the following commands:
$ ./configure --prefix=/opt/openmpi/1.10.3 CXX=g++ --with-cuda=/usr/local/cuda CC=gcc CFLAGS=-m64 CXXFLAGS=-m64 2>&1 | tee ~/openmpi-1.10.3_v1.log
$ make -j20
[root]$ make install
So Open MPI was installed in /opt/openmpi/1.10.3/. The second container is an exact clone of the first
one.
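For reference, the install on each container can be sanity-checked with ompi_info, e.g. (output omitted here):
$ /opt/openmpi/1.10.3/bin/ompi_info | grep "Open MPI:"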
Passwordless ssh is enabled between the two containers (key setup sketched after the transcript below):
[user@ct110 ~]$ ssh 10.0.50.41
Last login: Fri Jun 24 16:49:03 2016 from 10.0.50.40
[user@ct111 ~]$ ssh 10.0.50.40
Last login: Fri Jun 24 16:37:35 2016 from 10.0.50.41
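The key setup was the usual one, roughly along these lines (exact commands and key type assumed, output omitted):
$ ssh-keygen -t rsa
$ ssh-copy-id user@10.0.50.41
$ ssh-keygen -t rsa
$ ssh-copy-id user@10.0.50.40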
But a simple test via mpirun does not work:
mpirun -np 1 -host 10.0.50.41 hostname
[ct111.domain.org:00899] [[13749,0],1] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
However, the environment on the host with IP address 10.0.50.41 seems OK:
[user@ct110 ~]$ ssh 10.0.50.41 env|grep PATH
LD_LIBRARY_PATH=:/usr/local/cuda/lib64:/opt/openmpi/1.10.3/lib
PATH=/usr/local/bin:/bin:/usr/bin:/home/user/bin:/usr/local/cuda/bin:/opt/openmpi/1.10.3/bin
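These are exported from the shell startup files on both containers, roughly like this (assuming ~/.bashrc):
export PATH=$PATH:/usr/local/cuda/bin:/opt/openmpi/1.10.3/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/opt/openmpi/1.10.3/lib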
The ifconfig output from inside the containers:
[root@ct110 /]# ifconfig
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:38 errors:0 dropped:0 overruns:0 frame:0
TX packets:38 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:4559 (4.4 KiB) TX bytes:4559 (4.4 KiB)
venet0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:127.0.0.1 P-t-P:127.0.0.1 Bcast:0.0.0.0 Mask:255.255.255.255
UP BROADCAST POINTOPOINT RUNNING NOARP MTU:1500 Metric:1
RX packets:772 errors:0 dropped:0 overruns:0 frame:0
TX packets:853 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:112128 (109.5 KiB) TX bytes:122092 (119.2 KiB)
venet0:0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:10.0.50.40 P-t-P:10.0.50.40 Bcast:10.0.50.40 Mask:255.255.255.255
UP BROADCAST POINTOPOINT RUNNING NOARP MTU:1500 Metric:1
[root@ct111 /]# ifconfig
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:24 errors:0 dropped:0 overruns:0 frame:0
TX packets:24 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:1200 (1.1 KiB) TX bytes:1200 (1.1 KiB)
venet0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:127.0.0.1 P-t-P:127.0.0.1 Bcast:0.0.0.0 Mask:255.255.255.255
UP BROADCAST POINTOPOINT RUNNING NOARP MTU:1500 Metric:1
RX packets:855 errors:0 dropped:0 overruns:0 frame:0
TX packets:774 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:122212 (119.3 KiB) TX bytes:112304 (109.6 KiB)
venet0:0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:10.0.50.41 P-t-P:10.0.50.41 Bcast:10.0.50.41 Mask:255.255.255.255
UP BROADCAST POINTOPOINT RUNNING NOARP MTU:1500 Metric:1
I get exactly the same error if I restrict the network interface to venet0:0:
[user@ct110 ~]$ /opt/openmpi/1.10.3/bin/mpirun --mca btl self,tcp --mca btl_tcp_if_include venet0:0 -np 1 -host 10.0.50.41 hostname
[ct111.domain.org:00945] [[13704,0],1] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[ ... snip....]
However, I can successfully run hostname and the hello.bin executable on the same container I submit
from (a sketch of hello.bin follows the output below):
[user@ct110 hello]$ /opt/openmpi/1.10.3/bin/mpirun --mca btl self,tcp --mca btl_tcp_if_include venet0:0 -np 1 -host 10.0.50.40 hostname
ct110.domain.org
[user@ct110 hello]$ /opt/openmpi/1.10.3/bin/mpirun --mca btl self,tcp --mca btl_tcp_if_include venet0:0 -np 1 -host 10.0.50.40 ./hello.bin
Hello world! from processor 0 (name=ct110.domain.org ) out of 1
wall clock time = 0.000002
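For completeness, hello.bin is built from a minimal MPI hello world with the 1.10.3 wrapper compiler; the exact source is not important, but it is roughly the following:
$ cat hello.c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];
    double t0;

    MPI_Init(&argc, &argv);
    t0 = MPI_Wtime();                        /* start of wall-clock measurement */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process' rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of ranks */
    MPI_Get_processor_name(name, &len);      /* node name, e.g. ct110.domain.org */

    printf("Hello world! from processor %d (name=%s) out of %d\n", rank, name, size);
    printf("wall clock time = %f\n", MPI_Wtime() - t0);

    MPI_Finalize();
    return 0;
}
$ /opt/openmpi/1.10.3/bin/mpicc -o hello.bin hello.c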
Iptables is off on both containers.
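This was checked with something like the following on each container (exact commands assumed, output omitted):
# service iptables status
# iptables -L -n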
I would assume that I have run into bug #3339 (https://svn.open-mpi.org/trac/ompi/ticket/3339), but I
have another cluster based on OpenVZ containers with Open MPI 1.6.5 that has worked perfectly for
several years.
I would appreciate any help on that issue.
Best regards,
Nikolay.