On Jun 15, 2020, at 3:43 PM, Roberto Herraro via users 
<users@lists.open-mpi.org> wrote:
> 
> We have a small cluster and are running paired HPL to test performance, but 
> are getting poor results. One of our suspicions is that the regular 1GbE 
> interface might be in use rather than the 100G interface. Is there a 
> command, log, or something else that can be used to determine which interface 
> Open MPI is using on a multi-NIC server?


There are two questions here:

1. How to make sure Open MPI is only using the 100G Ethernet interfaces (and 
not the 1G Ethernet interface)?
2. How to make sure Open MPI isn't using TCP (which is among the slowest of 
Ethernet transports for MPI/HPC traffic)?

If you only have TCP as a transport option, then Open MPI will default to using 
all available Ethernet interfaces (including your 1G interface).  This means 
that it will be using the TCP BTL (i.e., plugin) for MPI point-to-point 
communications.  You can therefore use the TCP BTL's include / exclude 
functionality to specify the interfaces that you want to use.  You do this by 
setting the btl_tcp_if_include or btl_tcp_if_exclude MCA parameters (i.e., 
run-time parameters passed to Open MPI):

# Only use the "eth1" interfaces on all nodes
mpirun --mca btl_tcp_if_include eth1 ...

# Only use the 10.20.30.0/24 network on all nodes.
mpirun --mca btl_tcp_if_include 10.20.30.0/24 ...

# Only use the 10.20.30.0/24 and 10.40.50.0/24 networks on all nodes.
mpirun --mca btl_tcp_if_include 10.20.30.0/24,10.40.50.0/24 ...

You can use btl_tcp_if_exclude, too (the "include" and "exclude" options are 
mutually exclusive).  Both options can take a comma-delimited list.
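
For example, to take the exclude route (the interface name here is just a 
placeholder; substitute whatever your 1G NIC is actually called), something 
like the following should work.  Note that setting btl_tcp_if_exclude replaces 
its default value, so you need to keep excluding the loopback interface 
yourself:

# Exclude the loopback interface and a (hypothetical) 1G "eth0" interface
# on all nodes
mpirun --mca btl_tcp_if_exclude lo,eth0 ...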

You can also use two other MCA parameters to show Open MPI's process of 
selecting which PML and BTLs will be used at run time:

mpirun --mca pml_base_verbose 100 --mca btl_base_verbose 100 ...

PML = Open MPI's point-to-point messaging layer.  It's the back-end behind 
MPI_Send, MPI_Recv, etc.
BTL = One possible set of underlying transports for MPI_Send / MPI_Recv / etc.  
BTLs are generally used when the "ob1" PML is used (they're sometimes used in 
other cases, but for the purposes of this conversation, let's just focus on 
"ob1" and MPI_Send / MPI_Recv / etc.).

That being said, as mentioned above, TCP is among the slowest of Ethernet 
transports.  If you have an HPC-class Ethernet NIC (e.g., one that supports 
Cisco usNIC, AWS EFA, RoCE v2, iWARP, etc.), you can use a different 
communication stack across that Ethernet NIC for better performance than the 
standard POSIX sockets API will provide.

In those cases, you need to make sure that Open MPI was built with the 
transport stack that supports your HPC-class Ethernet NIC (e.g., Libfabric, 
UCX, etc.).  You can use the "ompi_info" command to see what plugins your Open 
MPI supports; you'll likely want to see "ofi" for Libfabric support or "ucx" 
for UCX support.  A summary is also displayed at the end of the "configure" 
output when you build Open MPI.

If Open MPI was built with an HPC-class networking stack and it finds HPC-class 
NICs that can use that networking stack at run-time, it generally auto-selects 
them.  However, there are cases where it might miss such an 
opportunity, so you can force the use of a specific stack / specific NICs if 
necessary (or even if you just want to be 10000% sure you're using the right 
network).

The specific MCA parameters that you use will depend on what kind of network 
stack and HPC-class Ethernet NIC you're using.
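
As a rough sketch only (the right values depend entirely on your NIC and 
software stack): with a UCX-capable NIC you could force the "ucx" PML, and 
with a Libfabric-capable NIC you could force the "cm" PML with the "ofi" MTL.  
Combining either with the verbose parameters shown above will confirm what 
actually got selected at run time:

# Force the UCX PML (e.g., for RoCE-capable NICs with UCX installed)
mpirun --mca pml ucx ...

# Force the "cm" PML with the Libfabric ("ofi") MTL (e.g., for AWS EFA)
mpirun --mca pml cm --mca mtl ofi ...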

-- 
Jeff Squyres
jsquy...@cisco.com
