Hello Vipul. Could you tell us more about the grid you will be using? Is this the Grid Engine scheduler on a local HPC cluster? Or is it running on Kubernetes, maybe?
You say that it is difficult to specify the IP address for btl_tcp_if_include. I agree with you! But it is quite common to have to write some lines in a batch submission script which parse the HOSTFILE. Once you get those lines right you can reuse them in other scripts. If you tell us a little bit more about your grid, we may be able to help.

On Tue, 23 Jun 2020 at 19:59, Jeff Squyres (jsquyres) via users <users@lists.open-mpi.org> wrote:

> You might want to make sure that the run time is working properly before going too much further. E.g., try mpirun'ing hostname (i.e., the Linux command -- a non-MPI program) and make sure that that works.
>
> If that works, then try mpirun'ing the "hello world" example program that comes in the examples/ directory in the Open MPI tarball. That program just initializes and finalizes MPI; it does no actual MPI communication.
>
> If that works, then try mpirun'ing the "ring" example program in the same examples/ directory. That does very simple MPI communication.
>
> > On Jun 23, 2020, at 2:39 PM, Kulshrestha, Vipul <vipul_kulshres...@mentor.com> wrote:
> >
> > Thanks for the clarification Jeff.
> >
> > I am using Open MPI 4.0.1
> >
> > Once fully set up, I intend to run my application in conjunction with grid, so the resources will be allocated by grid. This makes it very difficult to specify IP address for btl_tcp_if_include.
> >
> > For the named exclude interfaces, it still hung (with no output) when I specified btl_base_verbose 100.
> >
> > I will try using the CIDR for the below hosts as an experiment.
> >
> > Regards,
> > Vipul
> >
> > From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
> > Sent: Tuesday, June 23, 2020 1:36 PM
> > To: Open MPI User's List <users@lists.open-mpi.org>
> > Cc: Kulshrestha, Vipul <vipul_kulshres...@mentor.com>
> > Subject: Re: [OMPI users] Question about virtual interface
> >
> > https://www.open-mpi.org/faq/?category=tcp#ip-virtual-ip-interfaces is referring to interfaces like "eth0:0", where the Linux kernel will have the same index for both "eth0" and "eth0:0". This will cause Open MPI to get confused (because it identifies Ethernet interfaces by their kernel indexes).
> >
> > If you have non-physical Ethernet interfaces (like virbr0, etc.), those should work just fine with btl_tcp_if_include|exclude.
> >
> > What version of Open MPI are you using?
> >
> > You might want to "--mca btl_tcp_if_include CIDR" where CIDR is the representation of the subnet you want to use. This will allow your app to work, even if that network is on different Ethernet interfaces on different hosts. For example:
> >
> > mpirun --mca btl_tcp_if_include 192.168.10.0/24 ...
> >
> > If you're still getting a hang, try with btl_base_verbose value of 100.
> >
> > On Jun 18, 2020, at 7:39 PM, Kulshrestha, Vipul via users <users@lists.open-mpi.org> wrote:
> >
> > Hi,
> >
> > I have read conflicting statements about OMPI support for virtual interfaces.
> >
> > The Open MPI FAQ mentions that virtual IP interfaces are not supported and this will not be solved by using either btl_tcp_if_include or btl_tcp_if_exclude. (https://www.open-mpi.org/faq/?category=tcp#ip-virtual-ip-interfaces)
> >
> > However, somewhere else, I read that you can exclude the virtual interfaces by specifying --mca btl_tcp_if_exclude virbr0,lo (https://github.com/open-mpi/ompi/issues/6377)
> >
> > I am trying this out on different machines and find that it (specifying btl_tcp_if_exclude virbr0,lo) works on one pair of machines but does not work on another pair of machines. I am hoping to get an explanation of why one works and the other does not.
> >
> > I tried to generate some verbose output (on the pair of machines where it does not work) by specifying --mca btl_base_verbose 30, but it just hangs and does not generate any messages.
> >
> > $ mpirun -np 4 --mca btl_base_verbose 30 --mca btl_tcp_if_exclude virbr0,virbr1,virbr2,virbr3,lo --hostfile host.txt /home/vipulk/mpitest2 100
> > …..
> > ….
> > <no output and remains stuck forever>
> >
> > The ifconfig output for the 2 machines in the host list is listed below.
> >
> > Thanks,
> > Vipul
> >
> > Host1:
> >
> > eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
> >         inet 175.148.218.46 netmask 255.255.255.0 broadcast 175.148.218.255
> >         inet6 fe80::9af2:b3ff:fe2a:3e84 prefixlen 64 scopeid 0x20<link>
> >         ether 98:f2:b3:2a:3e:84 txqueuelen 1000 (Ethernet)
> >         RX packets 5938671220 bytes 6033195902625 (5.4 TiB)
> >         RX errors 0 dropped 534674 overruns 0 frame 0
> >         TX packets 3933921252 bytes 3077919856788 (2.7 TiB)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >         device interrupt 16
> >
> > eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
> >         inet 192.168.1.2 netmask 255.255.255.0 broadcast 192.168.1.255
> >         inet6 fe80::be68:2aa2:8b42:d6d prefixlen 64 scopeid 0x20<link>
> >         ether 98:f2:b3:2a:3e:85 txqueuelen 1000 (Ethernet)
> >         RX packets 2355308 bytes 279699254 (266.7 MiB)
> >         RX errors 0 dropped 350 overruns 0 frame 0
> >         TX packets 60 bytes 8732 (8.5 KiB)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >         device interrupt 17
> >
> > eno3: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
> >         ether 98:f2:b3:2a:3e:86 txqueuelen 1000 (Ethernet)
> >         RX packets 0 bytes 0 (0.0 B)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 0 bytes 0 (0.0 B)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >         device interrupt 16
> >
> > eno4: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
> >         ether 98:f2:b3:2a:3e:87 txqueuelen 1000 (Ethernet)
> >         RX packets 0 bytes 0 (0.0 B)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 0 bytes 0 (0.0 B)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >         device interrupt 17
> >
> > lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
> >         inet 127.0.0.1 netmask 255.0.0.0
> >         inet6 ::1 prefixlen 128 scopeid 0x10<host>
> >         loop txqueuelen 1000 (Local Loopback)
> >         RX packets 3161146200 bytes 225991248912 (210.4 GiB)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 3161146200 bytes 225991248912 (210.4 GiB)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >
> > virbr2: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
> >         inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255
> >         ether 52:54:00:0a:cd:21 txqueuelen 1000 (Ethernet)
> >         RX packets 0 bytes 0 (0.0 B)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 0 bytes 0 (0.0 B)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >
> > virbr3: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
> >         inet 192.168.123.1 netmask 255.255.255.0 broadcast 192.168.123.255
> >         ether 52:54:00:0a:cd:22 txqueuelen 1000 (Ethernet)
> >         RX packets 0 bytes 0 (0.0 B)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 0 bytes 0 (0.0 B)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >
> > Host2:
> >
> > eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
> >         inet 175.148.218.210 netmask 255.255.255.0 broadcast 175.148.218.255
> >         inet6 fe80::9af2:b3ff:fe2a:3e78 prefixlen 64 scopeid 0x20<link>
> >         ether 98:f2:b3:2a:3e:78 txqueuelen 1000 (Ethernet)
> >         RX packets 8632800 bytes 3938419917 (3.6 GiB)
> >         RX errors 0 dropped 350 overruns 0 frame 0
> >         TX packets 5504444 bytes 1791707074 (1.6 GiB)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >         device interrupt 16
> >
> > eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
> >         inet 192.168.1.2 netmask 255.255.255.0 broadcast 192.168.1.255
> >         inet6 fe80::9af2:b3ff:fe2a:3e79 prefixlen 64 scopeid 0x20<link>
> >         ether 98:f2:b3:2a:3e:79 txqueuelen 1000 (Ethernet)
> >         RX packets 2317163 bytes 275220791 (262.4 MiB)
> >         RX errors 0 dropped 350 overruns 0 frame 0
> >         TX packets 336 bytes 26726 (26.0 KiB)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >         device interrupt 17
> >
> > lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
> >         inet 127.0.0.1 netmask 255.0.0.0
> >         inet6 ::1 prefixlen 128 scopeid 0x10<host>
> >         loop txqueuelen 1000 (Local Loopback)
> >         RX packets 32539 bytes 2540603 (2.4 MiB)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 32539 bytes 2540603 (2.4 MiB)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >
> > virbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
> >         inet 192.168.123.1 netmask 255.255.255.0 broadcast 192.168.123.255
> >         ether 52:54:00:0a:cd:22 txqueuelen 1000 (Ethernet)
> >         RX packets 0 bytes 0 (0.0 B)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 0 bytes 0 (0.0 B)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >
> > virbr1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
> >         inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255
> >         ether 52:54:00:0a:cd:21 txqueuelen 1000 (Ethernet)
> >         RX packets 0 bytes 0 (0.0 B)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 0 bytes 0 (0.0 B)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >
> > --mca btl_tcp_if_exclude virbr0,lo works on machines with below configuration:
> >
> > Host 3:
> >
> > eno1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
> >         ether 80:30:e0:3b:c8:40 txqueuelen 1000 (Ethernet)
> >         RX packets 0 bytes 0 (0.0 B)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 0 bytes 0 (0.0 B)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >         device interrupt 16
> >
> > eno2: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
> >         ether 80:30:e0:3b:c8:41 txqueuelen 1000 (Ethernet)
> >         RX packets 0 bytes 0 (0.0 B)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 0 bytes 0 (0.0 B)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >         device interrupt 17
> >
> > eno3: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
> >         ether 80:30:e0:3b:c8:42 txqueuelen 1000 (Ethernet)
> >         RX packets 0 bytes 0 (0.0 B)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 0 bytes 0 (0.0 B)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >         device interrupt 16
> >
> > eno4: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
> >         ether 80:30:e0:3b:c8:43 txqueuelen 1000 (Ethernet)
> >         RX packets 0 bytes 0 (0.0 B)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 0 bytes 0 (0.0 B)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >         device interrupt 17
> >
> > eno5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
> >         inet 65.10.19.30 netmask 255.255.255.192 broadcast 65.10.19.63
> >         inet6 fe80::8230:e0ff:fe20:96a8 prefixlen 64 scopeid 0x20<link>
> >         ether 80:30:e0:20:96:a8 txqueuelen 1000 (Ethernet)
> >         RX packets 1618138239 bytes 1552281705604 (1.4 TiB)
> >         RX errors 184 dropped 0 overruns 184 frame 0
> >         TX packets 1500861577 bytes 1593767198059 (1.4 TiB)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >         device interrupt 34 memory 0xe8000000-e87fffff
> >
> > eno6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
> >         ether 80:30:e0:20:96:ac txqueuelen 1000 (Ethernet)
> >         RX packets 1299786 bytes 150289059 (143.3 MiB)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 0 bytes 0 (0.0 B)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >         device interrupt 77 memory 0xe7000000-e77fffff
> >
> > lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
> >         inet 127.0.0.1 netmask 255.0.0.0
> >         inet6 ::1 prefixlen 128 scopeid 0x10<host>
> >         loop txqueuelen 1000 (Local Loopback)
> >         RX packets 20936389 bytes 2632538104 (2.4 GiB)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 20936389 bytes 2632538104 (2.4 GiB)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >
> > virbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
> >         inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255
> >         ether 52:54:00:05:7c:dd txqueuelen 1000 (Ethernet)
> >         RX packets 0 bytes 0 (0.0 B)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 0 bytes 0 (0.0 B)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >
> > HOST 4:
> >
> > eno1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
> >         ether 80:30:e0:3b:b8:5c txqueuelen 1000 (Ethernet)
> >         RX packets 0 bytes 0 (0.0 B)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 0 bytes 0 (0.0 B)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >         device interrupt 16
> >
> > eno2: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
> >         ether 80:30:e0:3b:b8:5d txqueuelen 1000 (Ethernet)
> >         RX packets 0 bytes 0 (0.0 B)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 0 bytes 0 (0.0 B)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >         device interrupt 17
> >
> > eno3: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
> >         ether 80:30:e0:3b:b8:5e txqueuelen 1000 (Ethernet)
> >         RX packets 0 bytes 0 (0.0 B)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 0 bytes 0 (0.0 B)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >         device interrupt 16
> >
> > eno4: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
> >         ether 80:30:e0:3b:b8:5f txqueuelen 1000 (Ethernet)
> >         RX packets 0 bytes 0 (0.0 B)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 0 bytes 0 (0.0 B)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >         device interrupt 17
> >
> > eno5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
> >         inet 65.10.19.29 netmask 255.255.255.192 broadcast 65.10.19.63
> >         inet6 fe80::8230:e0ff:fe20:96c0 prefixlen 64 scopeid 0x20<link>
> >         ether 80:30:e0:20:96:c0 txqueuelen 1000 (Ethernet)
> >         RX packets 2904054722 bytes 2656941056010 (2.4 TiB)
> >         RX errors 11 dropped 0 overruns 11 frame 0
> >         TX packets 5801141892 bytes 7474409123677 (6.7 TiB)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >         device interrupt 34 memory 0xe8000000-e87fffff
> >
> > eno6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
> >         ether 80:30:e0:20:96:c4 txqueuelen 1000 (Ethernet)
> >         RX packets 1299694 bytes 150265217 (143.3 MiB)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 0 bytes 0 (0.0 B)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >         device interrupt 77 memory 0xe7000000-e77fffff
> >
> > lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
> >         inet 127.0.0.1 netmask 255.0.0.0
> >         inet6 ::1 prefixlen 128 scopeid 0x10<host>
> >         loop txqueuelen 1000 (Local Loopback)
> >         RX packets 19850956 bytes 5578561316 (5.1 GiB)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 19850956 bytes 5578561316 (5.1 GiB)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >
> > virbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
> >         inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255
> >         ether 52:54:00:79:33:89 txqueuelen 1000 (Ethernet)
> >         RX packets 0 bytes 0 (0.0 B)
> >         RX errors 0 dropped 0 overruns 0 frame 0
> >         TX packets 0 bytes 0 (0.0 B)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
>
> --
> Jeff Squyres
> jsquy...@cisco.com
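
Coming back to the hostfile parsing mentioned at the top of this reply: the lines below are only a rough Python sketch of the idea, not something taken from Open MPI or from any scheduler. It assumes a Grid Engine style host file (one host per line, host name in the first column, as in the file named by $PE_HOSTFILE) and assumes that all allocated hosts have an address on one common subnet. The function names are hypothetical. The result is a CIDR string of the kind Jeff suggested passing to --mca btl_tcp_if_include.

```python
import ipaddress
import socket


def hosts_from_hostfile(text):
    """Return the host names found in the first column of a host file.

    Assumes one host per line (Grid Engine style: 'host slots queue range');
    blank lines and lines starting with '#' are skipped.
    """
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        if fields and not fields[0].startswith("#"):
            hosts.append(fields[0])
    return hosts


def common_subnet(addresses, prefix_len=24):
    """Return the one /prefix_len network that contains every address.

    Raises ValueError if the addresses fall into different networks (or the
    list is empty), because a single CIDR value would then not cover all hosts.
    """
    networks = {
        ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        for addr in addresses
    }
    if len(networks) != 1:
        found = sorted(str(net) for net in networks)
        raise ValueError(f"expected exactly one subnet, found: {found}")
    return str(networks.pop())


def cidr_for_hostfile(text, prefix_len=24, resolve=socket.gethostbyname):
    """Resolve each host in the host file and return their common subnet.

    'resolve' maps a host name to an IPv4 address string; it defaults to a
    DNS or /etc/hosts lookup and can be replaced for testing.
    """
    addresses = [resolve(host) for host in hosts_from_hostfile(text)]
    return common_subnet(addresses, prefix_len)
```

With the eno1/eth0 addresses of Host1 and Host2 from the ifconfig output above, common_subnet(["175.148.218.46", "175.148.218.210"]) gives "175.148.218.0/24". Which address a host name resolves to depends on your DNS or /etc/hosts setup, so it is worth checking that the lookup returns the address on the network you actually want MPI to use.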