Hello Vipul. Could you tell us more about the grid you will be using?
Is this the Gridengine scheduler on a local HPC cluster? Or is it perhaps
running on Kubernetes?

You say that it is difficult to specify the IP address for
btl_tcp_if_include. I agree with you! But it is quite common to have to
write some lines in a batch submission script which parse the HOSTFILE.
Once you get those lines right you can reuse them in other scripts.
If you tell us a little bit more about your grid, we may be able to help.
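
As a rough illustration of the kind of helper such a script could call,
here is a minimal Python sketch. It assumes the hostfile has one host per
line with the host name in the first column (the exact format depends on
your scheduler), and the function names are made up for this example. The
second function turns an address and netmask, as printed by ifconfig, into
the CIDR form that btl_tcp_if_include accepts.

```python
import ipaddress


def hosts_from_hostfile(text):
    """Return the host names listed in a hostfile.

    Assumption: one host per line, host name in the first
    whitespace-separated column; blank lines and '#' comments are skipped.
    """
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        hosts.append(line.split()[0])
    return hosts


def subnet_cidr(ip, netmask):
    """Convert an IPv4 address and dotted netmask into a CIDR subnet string."""
    # ip_interface accepts "address/netmask"; .network drops the host bits.
    return str(ipaddress.ip_interface(f"{ip}/{netmask}").network)
```

For instance, subnet_cidr("175.148.218.46", "255.255.255.0") gives
175.148.218.0/24, which the script could then pass to mpirun as
--mca btl_tcp_if_include 175.148.218.0/24.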




On Tue, 23 Jun 2020 at 19:59, Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> You might want to make sure that the run time is working properly before
> going too much further.  E.g., try mpirun'ing hostname (i.e., the Linux
> command -- a non-MPI program) and make sure that that works.
>
> If that works, then try mpirun'ing the "hello world" example program that
> comes in the examples/ directory in the Open MPI tarball.  That program
> just initializes and finalizes MPI; it does no actual MPI communication.
>
> If that works, then try mpirun'ing the "ring" example program in the same
> examples/ directory.  That does very simple MPI communication.
>
>
>
> > On Jun 23, 2020, at 2:39 PM, Kulshrestha, Vipul <
> vipul_kulshres...@mentor.com> wrote:
> >
> > Thanks for the clarification Jeff.
> >
> > I am using Open MPI 4.0.1
> >
> > Once fully setup, I intend to run my application in conjunction with
> > grid, so the resources will be allocated by grid. This makes it very
> > difficult to specify IP address for btl_tcp_if_include.
> >
> > For the named exclude interfaces, it still hanged (with no output) when
> > I specified btl_base_verbose 100.
> >
> > I will try using the CIDR for the below hosts as an experiment.
> >
> > Regards,
> > Vipul
> >
> >
> >
> > From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
> > Sent: Tuesday, June 23, 2020 1:36 PM
> > To: Open MPI User's List <users@lists.open-mpi.org>
> > Cc: Kulshrestha, Vipul <vipul_kulshres...@mentor.com>
> > Subject: Re: [OMPI users] Question about virtual interface
> >
> > https://www.open-mpi.org/faq/?category=tcp#ip-virtual-ip-interfaces is
> > referring to interfaces like "eth0:0", where the Linux kernel will have the
> > same index for both "eth0" and "eth0:0".  This will cause Open MPI to get
> > confused (because it identifies Ethernet interfaces by their kernel
> > indexes).
> >
> > If you have non-physical Ethernet interfaces (like vibr0, etc.), those
> > should work just fine with btl_tcp_if_include|exclude.
> >
> > What version of Open MPI are you using?
> >
> > You might want to "--mca btl_tcp_if_include CIDR" where CIDR is the
> > representation of the subnet you want to use.  This will allow your app to
> > work, even if that network is on different Ethernet interfaces on different
> > hosts.  For example:
> >
> >     mpirun --mca btl_tcp_if_include 192.168.10.0/24 ...
> >
> > If you're still getting a hang, try with btl_base_verbose value of 100.
> >
> >
> >
> >
> > On Jun 18, 2020, at 7:39 PM, Kulshrestha, Vipul via users <
> users@lists.open-mpi.org> wrote:
> >
> > Hi,
> >
> > I have read conflicting statements about OMPI support for virtual
> > interfaces.
> >
> > The Open MPI FAQ mentions that virtual IP interfaces are not supported
> > and this will not be solved by using either btl_tcp_if_include or
> > btl_tcp_if_exclude.
> > (https://www.open-mpi.org/faq/?category=tcp#ip-virtual-ip-interfaces)
> >
> > However, somewhere else, I read that you can exclude the virtual
> > interfaces by specifying --mca btl_tcp_if_exclude virbr0,lo
> > (https://github.com/open-mpi/ompi/issues/6377)
> >
> > I am trying this out on different machines and find that it (specifying
> > btl_tcp_if_exclude virbr0,lo) works on one pair of machine but does not
> > work on another pair of machines. I am hoping to get an explanation on why
> > does one work and other does not.
> >
> > I tried to generate some verbose output (on the pair of machine where it
> > does not work) by specifying --mca btl_base_verbose 30, but it just hangs
> > and does not generate any messages.
> >
> > $ mpirun -np 4 --mca btl_base_verbose 30 --mca btl_tcp_if_exclude virbr0,virbr1,virbr2,virbr3,lo --hostfile host.txt /home/vipulk/mpitest2 100
> > …..
> > ….
> > <no output and remains stuck forever>
> >
> > The ifconfig output for the 2 machines in the host list are listed below.
> >
> > Thanks,
> > Vipul
> >
> >
> > Host1:
> >
> > eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> >         inet 175.148.218.46  netmask 255.255.255.0  broadcast 175.148.218.255
> >         inet6 fe80::9af2:b3ff:fe2a:3e84  prefixlen 64  scopeid 0x20<link>
> >         ether 98:f2:b3:2a:3e:84  txqueuelen 1000  (Ethernet)
> >         RX packets 5938671220  bytes 6033195902625 (5.4 TiB)
> >         RX errors 0  dropped 534674  overruns 0  frame 0
> >         TX packets 3933921252  bytes 3077919856788 (2.7 TiB)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >         device interrupt 16
> >
> > eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> >         inet 192.168.1.2  netmask 255.255.255.0  broadcast 192.168.1.255
> >         inet6 fe80::be68:2aa2:8b42:d6d  prefixlen 64  scopeid 0x20<link>
> >         ether 98:f2:b3:2a:3e:85  txqueuelen 1000  (Ethernet)
> >         RX packets 2355308  bytes 279699254 (266.7 MiB)
> >         RX errors 0  dropped 350  overruns 0  frame 0
> >         TX packets 60  bytes 8732 (8.5 KiB)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >         device interrupt 17
> >
> > eno3: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
> >         ether 98:f2:b3:2a:3e:86  txqueuelen 1000  (Ethernet)
> >         RX packets 0  bytes 0 (0.0 B)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 0  bytes 0 (0.0 B)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >         device interrupt 16
> >
> > eno4: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
> >         ether 98:f2:b3:2a:3e:87  txqueuelen 1000  (Ethernet)
> >         RX packets 0  bytes 0 (0.0 B)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 0  bytes 0 (0.0 B)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >         device interrupt 17
> >
> > lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
> >         inet 127.0.0.1  netmask 255.0.0.0
> >         inet6 ::1  prefixlen 128  scopeid 0x10<host>
> >         loop  txqueuelen 1000  (Local Loopback)
> >         RX packets 3161146200  bytes 225991248912 (210.4 GiB)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 3161146200  bytes 225991248912 (210.4 GiB)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >
> > virbr2: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
> >         inet 192.168.122.1  netmask 255.255.255.0  broadcast 192.168.122.255
> >         ether 52:54:00:0a:cd:21  txqueuelen 1000  (Ethernet)
> >         RX packets 0  bytes 0 (0.0 B)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 0  bytes 0 (0.0 B)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >
> > virbr3: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
> >         inet 192.168.123.1  netmask 255.255.255.0  broadcast 192.168.123.255
> >         ether 52:54:00:0a:cd:22  txqueuelen 1000  (Ethernet)
> >         RX packets 0  bytes 0 (0.0 B)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 0  bytes 0 (0.0 B)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >
> > Host2:
> > eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> >         inet 175.148.218.210  netmask 255.255.255.0  broadcast 175.148.218.255
> >         inet6 fe80::9af2:b3ff:fe2a:3e78  prefixlen 64  scopeid 0x20<link>
> >         ether 98:f2:b3:2a:3e:78  txqueuelen 1000  (Ethernet)
> >         RX packets 8632800  bytes 3938419917 (3.6 GiB)
> >         RX errors 0  dropped 350  overruns 0  frame 0
> >         TX packets 5504444  bytes 1791707074 (1.6 GiB)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >         device interrupt 16
> >
> > eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> >         inet 192.168.1.2  netmask 255.255.255.0  broadcast 192.168.1.255
> >         inet6 fe80::9af2:b3ff:fe2a:3e79  prefixlen 64  scopeid 0x20<link>
> >         ether 98:f2:b3:2a:3e:79  txqueuelen 1000  (Ethernet)
> >         RX packets 2317163  bytes 275220791 (262.4 MiB)
> >         RX errors 0  dropped 350  overruns 0  frame 0
> >         TX packets 336  bytes 26726 (26.0 KiB)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >         device interrupt 17
> >
> > lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
> >         inet 127.0.0.1  netmask 255.0.0.0
> >         inet6 ::1  prefixlen 128  scopeid 0x10<host>
> >         loop  txqueuelen 1000  (Local Loopback)
> >         RX packets 32539  bytes 2540603 (2.4 MiB)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 32539  bytes 2540603 (2.4 MiB)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >
> > virbr0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
> >         inet 192.168.123.1  netmask 255.255.255.0  broadcast 192.168.123.255
> >         ether 52:54:00:0a:cd:22  txqueuelen 1000  (Ethernet)
> >         RX packets 0  bytes 0 (0.0 B)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 0  bytes 0 (0.0 B)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >
> > virbr1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
> >         inet 192.168.122.1  netmask 255.255.255.0  broadcast 192.168.122.255
> >         ether 52:54:00:0a:cd:21  txqueuelen 1000  (Ethernet)
> >         RX packets 0  bytes 0 (0.0 B)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 0  bytes 0 (0.0 B)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >
> >
> > --mca btl_tcp_if_exclude virbr0,lo works on machines with below
> > configuration:
> >
> > Host 3:
> > eno1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
> >         ether 80:30:e0:3b:c8:40  txqueuelen 1000  (Ethernet)
> >         RX packets 0  bytes 0 (0.0 B)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 0  bytes 0 (0.0 B)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >         device interrupt 16
> >
> > eno2: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
> >         ether 80:30:e0:3b:c8:41  txqueuelen 1000  (Ethernet)
> >         RX packets 0  bytes 0 (0.0 B)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 0  bytes 0 (0.0 B)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >         device interrupt 17
> >
> > eno3: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
> >         ether 80:30:e0:3b:c8:42  txqueuelen 1000  (Ethernet)
> >         RX packets 0  bytes 0 (0.0 B)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 0  bytes 0 (0.0 B)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >         device interrupt 16
> >
> > eno4: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
> >         ether 80:30:e0:3b:c8:43  txqueuelen 1000  (Ethernet)
> >         RX packets 0  bytes 0 (0.0 B)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 0  bytes 0 (0.0 B)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >         device interrupt 17
> >
> > eno5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> >         inet 65.10.19.30  netmask 255.255.255.192  broadcast 65.10.19.63
> >         inet6 fe80::8230:e0ff:fe20:96a8  prefixlen 64  scopeid 0x20<link>
> >         ether 80:30:e0:20:96:a8  txqueuelen 1000  (Ethernet)
> >         RX packets 1618138239  bytes 1552281705604 (1.4 TiB)
> >         RX errors 184  dropped 0  overruns 184  frame 0
> >         TX packets 1500861577  bytes 1593767198059 (1.4 TiB)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >         device interrupt 34  memory 0xe8000000-e87fffff
> >
> > eno6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> >         ether 80:30:e0:20:96:ac  txqueuelen 1000  (Ethernet)
> >         RX packets 1299786  bytes 150289059 (143.3 MiB)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 0  bytes 0 (0.0 B)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >         device interrupt 77  memory 0xe7000000-e77fffff
> >
> > lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
> >         inet 127.0.0.1  netmask 255.0.0.0
> >         inet6 ::1  prefixlen 128  scopeid 0x10<host>
> >         loop  txqueuelen 1000  (Local Loopback)
> >         RX packets 20936389  bytes 2632538104 (2.4 GiB)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 20936389  bytes 2632538104 (2.4 GiB)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >
> > virbr0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
> >         inet 192.168.122.1  netmask 255.255.255.0  broadcast 192.168.122.255
> >         ether 52:54:00:05:7c:dd  txqueuelen 1000  (Ethernet)
> >         RX packets 0  bytes 0 (0.0 B)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 0  bytes 0 (0.0 B)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >
> >
> > HOST 4:
> >
> > eno1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
> >         ether 80:30:e0:3b:b8:5c  txqueuelen 1000  (Ethernet)
> >         RX packets 0  bytes 0 (0.0 B)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 0  bytes 0 (0.0 B)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >         device interrupt 16
> >
> > eno2: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
> >         ether 80:30:e0:3b:b8:5d  txqueuelen 1000  (Ethernet)
> >         RX packets 0  bytes 0 (0.0 B)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 0  bytes 0 (0.0 B)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >         device interrupt 17
> >
> > eno3: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
> >         ether 80:30:e0:3b:b8:5e  txqueuelen 1000  (Ethernet)
> >         RX packets 0  bytes 0 (0.0 B)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 0  bytes 0 (0.0 B)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >         device interrupt 16
> >
> > eno4: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
> >         ether 80:30:e0:3b:b8:5f  txqueuelen 1000  (Ethernet)
> >         RX packets 0  bytes 0 (0.0 B)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 0  bytes 0 (0.0 B)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >         device interrupt 17
> >
> > eno5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> >         inet 65.10.19.29  netmask 255.255.255.192  broadcast 65.10.19.63
> >         inet6 fe80::8230:e0ff:fe20:96c0  prefixlen 64  scopeid 0x20<link>
> >         ether 80:30:e0:20:96:c0  txqueuelen 1000  (Ethernet)
> >         RX packets 2904054722  bytes 2656941056010 (2.4 TiB)
> >         RX errors 11  dropped 0  overruns 11  frame 0
> >         TX packets 5801141892  bytes 7474409123677 (6.7 TiB)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >         device interrupt 34  memory 0xe8000000-e87fffff
> >
> > eno6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> >         ether 80:30:e0:20:96:c4  txqueuelen 1000  (Ethernet)
> >         RX packets 1299694  bytes 150265217 (143.3 MiB)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 0  bytes 0 (0.0 B)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >         device interrupt 77  memory 0xe7000000-e77fffff
> >
> >
> > lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
> >         inet 127.0.0.1  netmask 255.0.0.0
> >         inet6 ::1  prefixlen 128  scopeid 0x10<host>
> >         loop  txqueuelen 1000  (Local Loopback)
> >         RX packets 19850956  bytes 5578561316 (5.1 GiB)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 19850956  bytes 5578561316 (5.1 GiB)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >
> > virbr0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
> >         inet 192.168.122.1  netmask 255.255.255.0  broadcast 192.168.122.255
> >         ether 52:54:00:79:33:89  txqueuelen 1000  (Ethernet)
> >         RX packets 0  bytes 0 (0.0 B)
> >         RX errors 0  dropped 0  overruns 0  frame 0
> >         TX packets 0  bytes 0 (0.0 B)
> >         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
>
