You might want to make sure that the Open MPI runtime is working properly before going too much further. E.g., try mpirun'ing hostname (i.e., the Linux command, which is a non-MPI program) across the same hosts and make sure that works.
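Since the symptom reported in this thread is a silent hang, it can help to run that hostname check under a timeout so a hang shows up as a failure instead of a stuck terminal. Here is a minimal Python sketch of that idea. It assumes mpirun is on the PATH and reuses the host.txt hostfile from the original command; the function names and the 60-second timeout are arbitrary choices for illustration, not anything Open MPI provides.

```python
from __future__ import annotations

import subprocess


def hostname_check_cmd(hostfile: str, nprocs: int) -> list[str]:
    """Build an mpirun command that launches the plain Linux 'hostname' program."""
    return ["mpirun", "-np", str(nprocs), "--hostfile", hostfile, "hostname"]


def run_with_timeout(cmd: list[str], timeout_s: float = 60.0) -> tuple[bool, str]:
    """Run cmd and return (ok, output); a timeout is reported as a failure."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the local child process before raising.
        return False, f"timed out after {timeout_s} s (possible hang)"
    except FileNotFoundError:
        return False, f"{cmd[0]} not found on PATH"
    return result.returncode == 0, result.stdout + result.stderr


if __name__ == "__main__":
    # host.txt and -np 4 mirror the command quoted later in this thread.
    ok, output = run_with_timeout(hostname_check_cmd("host.txt", 4))
    print("OK" if ok else "FAILED")
    print(output)
```

If the hostname run itself times out, the problem is most likely in launching processes on the remote hosts rather than in MPI communication.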
If that works, then try mpirun'ing the "hello world" example program that comes in the examples/ directory in the Open MPI tarball. That program just initializes and finalizes MPI; it does no actual MPI communication. If that works, then try mpirun'ing the "ring" example program in the same examples/ directory. That does very simple MPI communication.

> On Jun 23, 2020, at 2:39 PM, Kulshrestha, Vipul <vipul_kulshres...@mentor.com> wrote:
>
> Thanks for the clarification Jeff.
>
> I am using Open MPI 4.0.1
>
> Once fully setup, I intend to run my application in conjunction with grid, so the resources will be allocated by grid. This makes it very difficult to specify IP address for btl_tcp_if_include.
>
> For the named exclude interfaces, it still hanged (with no output) when I specified btl_base_verbose 100.
>
> I will try using the CIDR for the below hosts as an experiment.
>
> Regards,
> Vipul
>
> From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
> Sent: Tuesday, June 23, 2020 1:36 PM
> To: Open MPI User's List <users@lists.open-mpi.org>
> Cc: Kulshrestha, Vipul <vipul_kulshres...@mentor.com>
> Subject: Re: [OMPI users] Question about virtual interface
>
> https://www.open-mpi.org/faq/?category=tcp#ip-virtual-ip-interfaces is referring to interfaces like "eth0:0", where the Linux kernel will have the same index for both "eth0" and "eth0:0". This will cause Open MPI to get confused (because it identifies Ethernet interfaces by their kernel indexes).
>
> If you have non-physical Ethernet interfaces (like vibr0, etc.), those should work just fine with btl_tcp_if_include|exclude.
>
> What version of Open MPI are you using?
>
> You might want to "--mca btl_tcp_if_include CIDR" where CIDR is the representation of the subnet you want to use. This will allow your app to work, even if that network is on different Ethernet interfaces on different hosts. For example:
>
> mpirun --mca btl_tcp_if_include 192.168.10.0/24 ...
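
The CIDR value can be worked out from the address and netmask that ifconfig reports. Here is a minimal Python sketch using the standard ipaddress module and the eno1/eth0 addresses from the ifconfig output quoted further down. The helper name cidr_for is only illustrative.

```python
import ipaddress


def cidr_for(address: str, netmask: str) -> str:
    """Return the subnet in CIDR notation for an interface address and netmask."""
    # strict=False accepts a host address instead of requiring the network address.
    network = ipaddress.ip_network(f"{address}/{netmask}", strict=False)
    return str(network)


# Addresses copied from the quoted ifconfig output (Host1 eno1, Host2 eth0).
host1 = cidr_for("175.148.218.46", "255.255.255.0")
host2 = cidr_for("175.148.218.210", "255.255.255.0")

print(host1)  # 175.148.218.0/24
print(host2)  # 175.148.218.0/24

if host1 == host2:
    # Same subnet on both hosts, even though the interface names differ.
    print(f"mpirun --mca btl_tcp_if_include {host1} ...")
```

Both hosts land in the same /24 even though the interface is named eno1 on one and eth0 on the other. That is the situation where selecting by subnet is more convenient than selecting by interface name.
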
>
> If you're still getting a hang, try with btl_base_verbose value of 100.
>
> On Jun 18, 2020, at 7:39 PM, Kulshrestha, Vipul via users <users@lists.open-mpi.org> wrote:
>
> Hi,
>
> I have read conflicting statements about OMPI support for virtual interfaces.
>
> The Open MPI FAQ mentions that virtual IP interfaces are not supported and this will not be solved by using either btl_tcp_if_include or btl_tcp_if_exclude. (https://www.open-mpi.org/faq/?category=tcp#ip-virtual-ip-interfaces)
>
> However, somewhere else, I read that you can exclude the virtual interfaces by specifying --mca btl_tcp_if_exclude virbr0,lo (https://github.com/open-mpi/ompi/issues/6377)
>
> I am trying this out on different machines and find that it (specifying btl_tcp_if_exclude virbr0,lo) works on one pair of machine but does not work on another pair of machines. I am hoping to get an explanation on why does one work and other does not.
>
> I tried to generate some verbose output (on the pair of machine where it does not work) by specifying --mca btl_base_verbose 30, but it just hangs and does not generate any messages.
>
> $ mpirun -np 4 --mca btl_base_verbose 30 --mca btl_tcp_if_exclude virbr0,virbr1,virbr2,virbr3,lo --hostfile host.txt /home/vipulk/mpitest2 100
> …..
> ….
> <no output and remains stuck forever>
>
> The ifconfig output for the 2 machines in the host list are listed below.
> > Thanks, > Vipul > > > Host1: > > eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 > > inet 175.148.218.46 netmask 255.255.255.0 broadcast 175.148.218.255 > > inet6 fe80::9af2:b3ff:fe2a:3e84 prefixlen 64 scopeid 0x20<link> > > ether 98:f2:b3:2a:3e:84 txqueuelen 1000 (Ethernet) > > RX packets 5938671220 bytes 6033195902625 (5.4 TiB) > > RX errors 0 dropped 534674 overruns 0 frame 0 > > TX packets 3933921252 bytes 3077919856788 (2.7 TiB) > > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > device interrupt 16 > > > eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 > inet 192.168.1.2 netmask 255.255.255.0 broadcast 192.168.1.255 > inet6 fe80::be68:2aa2:8b42:d6d prefixlen 64 scopeid 0x20<link> > ether 98:f2:b3:2a:3e:85 txqueuelen 1000 (Ethernet) > RX packets 2355308 bytes 279699254 (266.7 MiB) > RX errors 0 dropped 350 overruns 0 frame 0 > TX packets 60 bytes 8732 (8.5 KiB) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > device interrupt 17 > > eno3: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 > ether 98:f2:b3:2a:3e:86 txqueuelen 1000 (Ethernet) > RX packets 0 bytes 0 (0.0 B) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 0 bytes 0 (0.0 B) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > device interrupt 16 > > eno4: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 > ether 98:f2:b3:2a:3e:87 txqueuelen 1000 (Ethernet) > RX packets 0 bytes 0 (0.0 B) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 0 bytes 0 (0.0 B) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > device interrupt 17 > > lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536 > inet 127.0.0.1 netmask 255.0.0.0 > inet6 ::1 prefixlen 128 scopeid 0x10<host> > loop txqueuelen 1000 (Local Loopback) > RX packets 3161146200 bytes 225991248912 (210.4 GiB) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 3161146200 bytes 225991248912 (210.4 GiB) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > virbr2: 
flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 > inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255 > ether 52:54:00:0a:cd:21 txqueuelen 1000 (Ethernet) > RX packets 0 bytes 0 (0.0 B) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 0 bytes 0 (0.0 B) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > virbr3: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 > inet 192.168.123.1 netmask 255.255.255.0 broadcast 192.168.123.255 > ether 52:54:00:0a:cd:22 txqueuelen 1000 (Ethernet) > RX packets 0 bytes 0 (0.0 B) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 0 bytes 0 (0.0 B) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > Host2: > eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 > > inet 175.148.218.210 netmask 255.255.255.0 broadcast > 175.148.218.255 > inet6 fe80::9af2:b3ff:fe2a:3e78 prefixlen 64 scopeid 0x20<link> > > ether 98:f2:b3:2a:3e:78 txqueuelen 1000 (Ethernet) > > RX packets 8632800 bytes 3938419917 (3.6 GiB) > > RX errors 0 dropped 350 overruns 0 frame 0 > > TX packets 5504444 bytes 1791707074 (1.6 GiB) > > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > device interrupt 16 > > > eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 > inet 192.168.1.2 netmask 255.255.255.0 broadcast 192.168.1.255 > inet6 fe80::9af2:b3ff:fe2a:3e79 prefixlen 64 scopeid 0x20<link> > ether 98:f2:b3:2a:3e:79 txqueuelen 1000 (Ethernet) > RX packets 2317163 bytes 275220791 (262.4 MiB) > RX errors 0 dropped 350 overruns 0 frame 0 > TX packets 336 bytes 26726 (26.0 KiB) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > device interrupt 17 > > lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536 > inet 127.0.0.1 netmask 255.0.0.0 > inet6 ::1 prefixlen 128 scopeid 0x10<host> > loop txqueuelen 1000 (Local Loopback) > RX packets 32539 bytes 2540603 (2.4 MiB) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 32539 bytes 2540603 (2.4 MiB) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > virbr0: 
flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 > inet 192.168.123.1 netmask 255.255.255.0 broadcast 192.168.123.255 > ether 52:54:00:0a:cd:22 txqueuelen 1000 (Ethernet) > RX packets 0 bytes 0 (0.0 B) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 0 bytes 0 (0.0 B) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > virbr1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 > inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255 > ether 52:54:00:0a:cd:21 txqueuelen 1000 (Ethernet) > RX packets 0 bytes 0 (0.0 B) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 0 bytes 0 (0.0 B) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > > --mca btl_tcp_if_exclude virbr0,lo works on machines with below configuration: > > Host 3: > eno1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 > > ether 80:30:e0:3b:c8:40 txqueuelen 1000 (Ethernet) > > RX packets 0 bytes 0 (0.0 B) > > RX errors 0 dropped 0 overruns 0 frame 0 > > TX packets 0 bytes 0 (0.0 B) > > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > device interrupt 16 > > > eno2: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 > ether 80:30:e0:3b:c8:41 txqueuelen 1000 (Ethernet) > RX packets 0 bytes 0 (0.0 B) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 0 bytes 0 (0.0 B) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > device interrupt 17 > > eno3: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 > ether 80:30:e0:3b:c8:42 txqueuelen 1000 (Ethernet) > RX packets 0 bytes 0 (0.0 B) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 0 bytes 0 (0.0 B) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > device interrupt 16 > > eno4: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 > ether 80:30:e0:3b:c8:43 txqueuelen 1000 (Ethernet) > RX packets 0 bytes 0 (0.0 B) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 0 bytes 0 (0.0 B) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > device interrupt 17 > > eno5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> 
mtu 1500 > inet 65.10.19.30 netmask 255.255.255.192 broadcast 65.10.19.63 > inet6 fe80::8230:e0ff:fe20:96a8 prefixlen 64 scopeid 0x20<link> > ether 80:30:e0:20:96:a8 txqueuelen 1000 (Ethernet) > RX packets 1618138239 bytes 1552281705604 (1.4 TiB) > RX errors 184 dropped 0 overruns 184 frame 0 > TX packets 1500861577 bytes 1593767198059 (1.4 TiB) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > device interrupt 34 memory 0xe8000000-e87fffff > > eno6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 > ether 80:30:e0:20:96:ac txqueuelen 1000 (Ethernet) > RX packets 1299786 bytes 150289059 (143.3 MiB) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 0 bytes 0 (0.0 B) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > device interrupt 77 memory 0xe7000000-e77fffff > > lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536 > inet 127.0.0.1 netmask 255.0.0.0 > inet6 ::1 prefixlen 128 scopeid 0x10<host> > loop txqueuelen 1000 (Local Loopback) > RX packets 20936389 bytes 2632538104 (2.4 GiB) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 20936389 bytes 2632538104 (2.4 GiB) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > virbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 > inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255 > ether 52:54:00:05:7c:dd txqueuelen 1000 (Ethernet) > RX packets 0 bytes 0 (0.0 B) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 0 bytes 0 (0.0 B) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > > HOST 4: > > eno1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 > ether 80:30:e0:3b:b8:5c txqueuelen 1000 (Ethernet) > RX packets 0 bytes 0 (0.0 B) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 0 bytes 0 (0.0 B) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > device interrupt 16 > > eno2: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 > ether 80:30:e0:3b:b8:5d txqueuelen 1000 (Ethernet) > RX packets 0 bytes 0 (0.0 B) > RX errors 0 dropped 0 overruns 0 frame 0 > TX 
packets 0 bytes 0 (0.0 B) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > device interrupt 17 > > eno3: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 > ether 80:30:e0:3b:b8:5e txqueuelen 1000 (Ethernet) > RX packets 0 bytes 0 (0.0 B) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 0 bytes 0 (0.0 B) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > device interrupt 16 > > eno4: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 > ether 80:30:e0:3b:b8:5f txqueuelen 1000 (Ethernet) > RX packets 0 bytes 0 (0.0 B) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 0 bytes 0 (0.0 B) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > device interrupt 17 > > eno5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 > inet 65.10.19.29 netmask 255.255.255.192 broadcast 65.10.19.63 > inet6 fe80::8230:e0ff:fe20:96c0 prefixlen 64 scopeid 0x20<link> > ether 80:30:e0:20:96:c0 txqueuelen 1000 (Ethernet) > RX packets 2904054722 bytes 2656941056010 (2.4 TiB) > RX errors 11 dropped 0 overruns 11 frame 0 > TX packets 5801141892 bytes 7474409123677 (6.7 TiB) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > device interrupt 34 memory 0xe8000000-e87fffff > > eno6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 > ether 80:30:e0:20:96:c4 txqueuelen 1000 (Ethernet) > RX packets 1299694 bytes 150265217 (143.3 MiB) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 0 bytes 0 (0.0 B) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > device interrupt 77 memory 0xe7000000-e77fffff > > > lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536 > inet 127.0.0.1 netmask 255.0.0.0 > inet6 ::1 prefixlen 128 scopeid 0x10<host> > loop txqueuelen 1000 (Local Loopback) > RX packets 19850956 bytes 5578561316 (5.1 GiB) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 19850956 bytes 5578561316 (5.1 GiB) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > virbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 > inet 192.168.122.1 netmask 
255.255.255.0 broadcast 192.168.122.255 > ether 52:54:00:79:33:89 txqueuelen 1000 (Ethernet) > RX packets 0 bytes 0 (0.0 B) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 0 bytes 0 (0.0 B) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > > -- > Jeff Squyres > jsquy...@cisco.com -- Jeff Squyres jsquy...@cisco.com