Thanks for the clarification, Jeff. I am using Open MPI 4.0.1.
Once fully set up, I intend to run my application in conjunction with grid, so the resources will be allocated by grid. This makes it very difficult to specify an IP address for btl_tcp_if_include. With the named interfaces excluded, it still hung (with no output) when I specified btl_base_verbose 100.

I will try using the CIDR for the below hosts as an experiment.

Regards,
Vipul

From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
Sent: Tuesday, June 23, 2020 1:36 PM
To: Open MPI User's List <users@lists.open-mpi.org>
Cc: Kulshrestha, Vipul <vipul_kulshres...@mentor.com>
Subject: Re: [OMPI users] Question about virtual interface

https://www.open-mpi.org/faq/?category=tcp#ip-virtual-ip-interfaces is referring to interfaces like "eth0:0", where the Linux kernel will have the same index for both "eth0" and "eth0:0". This will cause Open MPI to get confused (because it identifies Ethernet interfaces by their kernel indexes).

If you have non-physical Ethernet interfaces (like virbr0, etc.), those should work just fine with btl_tcp_if_include|exclude.

What version of Open MPI are you using?

You might want to "--mca btl_tcp_if_include CIDR" where CIDR is the representation of the subnet you want to use. This will allow your app to work, even if that network is on different Ethernet interfaces on different hosts. For example:

    mpirun --mca btl_tcp_if_include 192.168.10.0/24 ...

If you're still getting a hang, try with btl_base_verbose value of 100.

On Jun 18, 2020, at 7:39 PM, Kulshrestha, Vipul via users <users@lists.open-mpi.org> wrote:

Hi,

I have read conflicting statements about OMPI support for virtual interfaces. The Open MPI FAQ mentions that virtual IP interfaces are not supported and this will not be solved by using either btl_tcp_if_include or btl_tcp_if_exclude.
(https://www.open-mpi.org/faq/?category=tcp#ip-virtual-ip-interfaces)

However, somewhere else, I read that you can exclude the virtual interfaces by specifying --mca btl_tcp_if_exclude virbr0,lo (https://github.com/open-mpi/ompi/issues/6377)

I am trying this out on different machines and find that it (specifying btl_tcp_if_exclude virbr0,lo) works on one pair of machines but does not work on another pair of machines. I am hoping to get an explanation of why one works and the other does not.

I tried to generate some verbose output (on the pair of machines where it does not work) by specifying --mca btl_base_verbose 30, but it just hangs and does not generate any messages.

$ mpirun -np 4 --mca btl_base_verbose 30 --mca btl_tcp_if_exclude virbr0,virbr1,virbr2,virbr3,lo --hostfile host.txt /home/vipulk/mpitest2 100
….. ….
<no output and remains stuck forever>

The ifconfig output for the 2 machines in the host list is listed below.

Thanks,
Vipul

Host1:

eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 175.148.218.46 netmask 255.255.255.0 broadcast 175.148.218.255
        inet6 fe80::9af2:b3ff:fe2a:3e84 prefixlen 64 scopeid 0x20<link>
        ether 98:f2:b3:2a:3e:84 txqueuelen 1000 (Ethernet)
        RX packets 5938671220 bytes 6033195902625 (5.4 TiB)
        RX errors 0 dropped 534674 overruns 0 frame 0
        TX packets 3933921252 bytes 3077919856788 (2.7 TiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 16

eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.1.2 netmask 255.255.255.0 broadcast 192.168.1.255
        inet6 fe80::be68:2aa2:8b42:d6d prefixlen 64 scopeid 0x20<link>
        ether 98:f2:b3:2a:3e:85 txqueuelen 1000 (Ethernet)
        RX packets 2355308 bytes 279699254 (266.7 MiB)
        RX errors 0 dropped 350 overruns 0 frame 0
        TX packets 60 bytes 8732 (8.5 KiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 17

eno3: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        ether 98:f2:b3:2a:3e:86 txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 16

eno4: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        ether 98:f2:b3:2a:3e:87 txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 17

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        inet6 ::1 prefixlen 128 scopeid 0x10<host>
        loop txqueuelen 1000 (Local Loopback)
        RX packets 3161146200 bytes 225991248912 (210.4 GiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 3161146200 bytes 225991248912 (210.4 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

virbr2: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255
        ether 52:54:00:0a:cd:21 txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

virbr3: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        inet 192.168.123.1 netmask 255.255.255.0 broadcast 192.168.123.255
        ether 52:54:00:0a:cd:22 txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Host2:

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 175.148.218.210 netmask 255.255.255.0 broadcast 175.148.218.255
        inet6 fe80::9af2:b3ff:fe2a:3e78 prefixlen 64 scopeid 0x20<link>
        ether 98:f2:b3:2a:3e:78 txqueuelen 1000 (Ethernet)
        RX packets 8632800 bytes 3938419917 (3.6 GiB)
        RX errors 0 dropped 350 overruns 0 frame 0
        TX packets 5504444 bytes 1791707074 (1.6 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 16

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.1.2 netmask 255.255.255.0 broadcast 192.168.1.255
        inet6 fe80::9af2:b3ff:fe2a:3e79 prefixlen 64 scopeid 0x20<link>
        ether 98:f2:b3:2a:3e:79 txqueuelen 1000 (Ethernet)
        RX packets 2317163 bytes 275220791 (262.4 MiB)
        RX errors 0 dropped 350 overruns 0 frame 0
        TX packets 336 bytes 26726 (26.0 KiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 17

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        inet6 ::1 prefixlen 128 scopeid 0x10<host>
        loop txqueuelen 1000 (Local Loopback)
        RX packets 32539 bytes 2540603 (2.4 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 32539 bytes 2540603 (2.4 MiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

virbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        inet 192.168.123.1 netmask 255.255.255.0 broadcast 192.168.123.255
        ether 52:54:00:0a:cd:22 txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

virbr1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255
        ether 52:54:00:0a:cd:21 txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

--mca btl_tcp_if_exclude virbr0,lo works on machines with the configuration below:

Host 3:

eno1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        ether 80:30:e0:3b:c8:40 txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 16

eno2: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        ether 80:30:e0:3b:c8:41 txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 17

eno3: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        ether 80:30:e0:3b:c8:42 txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 16

eno4: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        ether 80:30:e0:3b:c8:43 txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 17

eno5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 65.10.19.30 netmask 255.255.255.192 broadcast 65.10.19.63
        inet6 fe80::8230:e0ff:fe20:96a8 prefixlen 64 scopeid 0x20<link>
        ether 80:30:e0:20:96:a8 txqueuelen 1000 (Ethernet)
        RX packets 1618138239 bytes 1552281705604 (1.4 TiB)
        RX errors 184 dropped 0 overruns 184 frame 0
        TX packets 1500861577 bytes 1593767198059 (1.4 TiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 34 memory 0xe8000000-e87fffff

eno6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 80:30:e0:20:96:ac txqueuelen 1000 (Ethernet)
        RX packets 1299786 bytes 150289059 (143.3 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 77 memory 0xe7000000-e77fffff

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        inet6 ::1 prefixlen 128 scopeid 0x10<host>
        loop txqueuelen 1000 (Local Loopback)
        RX packets 20936389 bytes 2632538104 (2.4 GiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 20936389 bytes 2632538104 (2.4 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

virbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255
        ether 52:54:00:05:7c:dd txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Host 4:

eno1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        ether 80:30:e0:3b:b8:5c txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 16

eno2: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        ether 80:30:e0:3b:b8:5d txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 17

eno3: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        ether 80:30:e0:3b:b8:5e txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 16

eno4: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        ether 80:30:e0:3b:b8:5f txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 17

eno5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 65.10.19.29 netmask 255.255.255.192 broadcast 65.10.19.63
        inet6 fe80::8230:e0ff:fe20:96c0 prefixlen 64 scopeid 0x20<link>
        ether 80:30:e0:20:96:c0 txqueuelen 1000 (Ethernet)
        RX packets 2904054722 bytes 2656941056010 (2.4 TiB)
        RX errors 11 dropped 0 overruns 11 frame 0
        TX packets 5801141892 bytes 7474409123677 (6.7 TiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 34 memory 0xe8000000-e87fffff

eno6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 80:30:e0:20:96:c4 txqueuelen 1000 (Ethernet)
        RX packets 1299694 bytes 150265217 (143.3 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 77 memory 0xe7000000-e77fffff

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        inet6 ::1 prefixlen 128 scopeid 0x10<host>
        loop txqueuelen 1000 (Local Loopback)
        RX packets 19850956 bytes 5578561316 (5.1 GiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 19850956 bytes 5578561316 (5.1 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

virbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255
        ether 52:54:00:79:33:89 txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

--
Jeff Squyres
jsquy...@cisco.com
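
For the CIDR experiment mentioned in this thread, one way to work out which subnet to pass to btl_tcp_if_include is to derive the network from each inet/netmask pair in the ifconfig output and keep the subnets that appear on every host. The following is a minimal Python sketch, not anything provided by Open MPI itself. The HOSTS table (copied from the Host1 and Host2 listings), the function names, and the rule of skipping subnets on which two hosts report the same address are all assumptions made for illustration. It uses only the standard ipaddress module.

```python
import ipaddress

# (address, netmask) pairs copied from the ifconfig output for Host1 and Host2.
# The dictionary layout is an assumption of this sketch.
HOSTS = {
    "Host1": {
        "eno1": ("175.148.218.46", "255.255.255.0"),
        "eno2": ("192.168.1.2", "255.255.255.0"),
        "virbr2": ("192.168.122.1", "255.255.255.0"),
        "virbr3": ("192.168.123.1", "255.255.255.0"),
    },
    "Host2": {
        "eth0": ("175.148.218.210", "255.255.255.0"),
        "eth1": ("192.168.1.2", "255.255.255.0"),
        "virbr0": ("192.168.123.1", "255.255.255.0"),
        "virbr1": ("192.168.122.1", "255.255.255.0"),
    },
}


def subnet(addr, netmask):
    """Return the network (in CIDR form) that an address/netmask pair belongs to."""
    return ipaddress.ip_interface(f"{addr}/{netmask}").network


def shared_subnets(hosts):
    """Map each subnet present on every host to {host: (interface, address)}."""
    per_host = {
        host: {subnet(a, m): (name, a) for name, (a, m) in ifaces.items()}
        for host, ifaces in hosts.items()
    }
    common = set.intersection(*(set(nets) for nets in per_host.values()))
    return {net: {h: per_host[h][net] for h in hosts} for net in sorted(common)}


def candidate_cidrs(hosts):
    """Shared subnets on which every host has a distinct address (sketch's own rule)."""
    result = []
    for net, members in shared_subnets(hosts).items():
        addrs = [addr for _, addr in members.values()]
        if len(set(addrs)) == len(addrs):
            result.append(str(net))
    return result


if __name__ == "__main__":
    for cidr in candidate_cidrs(HOSTS):
        print(f"mpirun --mca btl_tcp_if_include {cidr} ...")
```

With the Host1/Host2 values above, the only candidate it reports is 175.148.218.0/24. The 192.168.1.0/24, 192.168.122.0/24 and 192.168.123.0/24 subnets are skipped because both hosts list the same address on them (192.168.1.2, 192.168.122.1 and 192.168.123.1 respectively). Whether that duplication is related to the hang is not established here.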