On Jun 15, 2020, at 3:43 PM, Roberto Herraro via users <users@lists.open-mpi.org> wrote:
>
> We have a small cluster and are running paired HPL to test performance, but
> are getting poor results. One of our suspicions is that the regular 1GbE
> interface might be being used, rather than the 100G interface. Is there a
> command, log, or something else that can be used to determine which interface
> OpenMPI is using on a multi-NIC server?
There are two questions here:

1. How to make sure Open MPI is only using the 100G Ethernet interfaces (and not the 1G Ethernet interface)?
2. How to make sure Open MPI isn't using TCP (which is among the slowest of Ethernet transports for MPI/HPC traffic)?

If you only have TCP as a transport option, then Open MPI will default to using all available Ethernet interfaces (including your 1G interface). This means that it will be using the TCP BTL (i.e., plugin) for MPI point-to-point communications. You can therefore use the TCP BTL's include / exclude functionality to specify the interfaces that you want to use. You do this by setting the btl_tcp_if_include or btl_tcp_if_exclude MCA parameters (i.e., run-time parameters passed to Open MPI):

  # Only use the "eth1" interfaces on all nodes.
  mpirun --mca btl_tcp_if_include eth1 ...

  # Only use the 10.20.30.0/24 network on all nodes.
  mpirun --mca btl_tcp_if_include 10.20.30.0/24 ...

  # Only use the 10.20.30.0/24 and 10.40.50.0/24 networks on all nodes.
  mpirun --mca btl_tcp_if_include 10.20.30.0/24,10.40.50.0/24 ...

You can use btl_tcp_if_exclude, too (the "include" and "exclude" options are mutually exclusive). Both options take a comma-delimited list. There's an "exclude" example sketched at the bottom of this mail.

You can also use two other MCA parameters to show Open MPI's process of selecting which PML and BTLs will be used at run time:

  mpirun --mca pml_base_verbose 100 --mca btl_base_verbose 100 ...

PML = Open MPI's point-to-point messaging layer. It's the back-end behind MPI_Send, MPI_Recv, etc.

BTL = one possible set of underlying transports for MPI_Send / MPI_Recv / etc. BTLs are generally used when the "ob1" PML is used (they're sometimes used in other cases, but for the purposes of this conversation, let's just focus on "ob1" and MPI_Send / MPI_Recv / etc.).

That being said, as mentioned above, TCP is among the slowest of Ethernet transports. If you have an HPC-class Ethernet NIC (e.g., one that supports Cisco usNIC, AWS EFA, RoCE v2, iWARP, etc.), you can use a different communication stack across that NIC for better performance than the standard POSIX sockets API will provide.

In those cases, you need to make sure that Open MPI was built with the transport stack that supports your HPC-class Ethernet NIC (e.g., Libfabric, UCX, etc.). You can use the "ompi_info" command to see what plugins your Open MPI supports; you'll likely want to see "ofi" for Libfabric support or "ucx" for UCX support (there's a quick check sketched at the bottom of this mail). A summary is also displayed at the end of the "configure" output when you build Open MPI.

If Open MPI was built with an HPC-class networking stack and it finds HPC-class NICs that can use that stack at run time, it generally auto-selects them. However, it can sometimes miss such an opportunity, so you can force the use of a specific stack / specific NICs if necessary (or even if you just want to be 10000% sure you're using the right network). The specific MCA parameters to use depend on what kind of network stack and HPC-class Ethernet NIC you have; a couple of sketches are at the bottom of this mail.

-- 
Jeff Squyres
jsquy...@cisco.com
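A hedged "exclude" sketch (assuming your 1G interface is named "eth0" on every node -- substitute whatever name it actually has). Note that if you set btl_tcp_if_exclude yourself, you should keep the loopback interface in the exclusion list:

  # Skip the loopback and the (assumed) 1G "eth0" interface on all nodes; use everything else.
  mpirun --mca btl_tcp_if_exclude lo,eth0 ...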
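A quick (hedged) way to check whether your build has Libfabric / UCX plugins -- the exact component names listed will depend on how your Open MPI was configured:

  # Look for "ofi" (Libfabric) and/or "ucx" components in the plugin list.
  ompi_info | grep -i -e ofi -e ucx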
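Two sketches for forcing a specific stack -- these are not drop-in commands; the right incantation depends on your NIC and on how your Open MPI was built:

  # Force the UCX PML (only works if Open MPI was built with UCX support).
  mpirun --mca pml ucx ...

  # Force the "cm" PML with the Libfabric (OFI) MTL (only works if built with Libfabric support).
  mpirun --mca pml cm --mca mtl ofi ...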