Thanks for the very thorough explanation and advice, Jeff!


Using the info you provided, I can confirm that we are using RoCE and that
Open MPI is compiled with UCX support. To make sure we are running RoCE over
the 100G Ethernet interface, we also launched HPL with the UCX device
specified explicitly via "--mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1", but
this results in the same poor performance.
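
For reference, checks along these lines show the UCX build support and the
RoCE link (just a sketch, assuming ompi_info, ibv_devinfo, and ucx_info are
available on the nodes; the device/port names are from our systems):

# Was Open MPI built with UCX support?
ompi_info | grep -i ucx

# Is the RoCE device visible to the RDMA stack? (link_layer should read Ethernet)
ibv_devinfo -d mlx5_0

# Which UCX transports does that device expose?
ucx_info -d | grep -i -A2 mlx5_0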



We also launched with the "--mca pml_base_verbose 100" and "--mca
btl_base_verbose 100" options. HPL starts to launch and the output does
contain the extra information as expected, but it gets stuck during
initialization and never actually starts; memory usage for all xhpl
processes stays at 0. Example execution and output:



...

mpi_options="--mca mpi_leave_pinned 1 --bind-to none --report-bindings
--mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 --mca btl self,vader --mca
pml_base_verbose 100 --map-by ppr:1:l3cache -x OMP_NUM_THREADS=4 -x
OMP_PROC_BIND=TRUE -x OMP_PLACES=cores"
mpirun $mpi_options -app 11.appfile_ccx >> ${outfile} 2>&1

...



Output:



START DATE Wed Jun 17 12:03:35 MST 2020
[n012:71859] MCW rank 32 bound to socket 0[core 0[hwt 0]]:
[B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[n012:71859] mca: base: components_register: registering framework pml components
[n012:71859] mca: base: components_register: found loaded component ucx
[n012:71859] mca: base: components_register: component ucx register function successful
[n012:71859] mca: base: components_open: opening pml components
[n012:71859] mca: base: components_open: found loaded component ucx
[n012:71859] mca: base: components_open: component ucx open function successful
[n012:71859] select: initializing pml component ucx
[n012:71859] select: init returned priority 51
[n012:71859] selected ucx best priority 51
[n012:71859] select: component ucx selected

...

[n011:28937] select: init returned priority 51
[n011:28937] selected ucx best priority 51
[n011:28937] select: component ucx selected
[n011:28931] select: init returned priority 51
[n011:28931] selected ucx best priority 51
[n011:28931] select: component ucx selected

...



Do you have any idea why adding those verbosity options would cause the
processes to hang? If we remove just those options, everything launches as
expected.
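
As a next step on our side, here is a minimal sketch of how we could try to
narrow this down (./hello_c stands for the compiled hello_c example from the
Open MPI source tree, and the --output-filename, --timeout, and
--get-stack-traces options are assumed to be supported by our Open MPI
version):

# Same PML/UCX settings, lower verbosity, per-rank output files, and a hard
# timeout with stack traces, against a trivial MPI hello-world instead of HPL.
mpirun --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 \
       --mca pml_base_verbose 10 --mca btl_base_verbose 10 \
       --output-filename verbose_out \
       --timeout 120 --get-stack-traces \
       --map-by ppr:1:l3cache ./hello_c

If that also hangs, it would suggest the problem is with the verbose output
handling rather than with HPL itself.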



Thanks again!



On Tue, Jun 16, 2020 at 6:18 AM Jeff Squyres (jsquyres) <jsquy...@cisco.com>
wrote:

> On Jun 15, 2020, at 3:43 PM, Roberto Herraro via users <
> users@lists.open-mpi.org> wrote:
> >
> > We have a small cluster and are running paired HPL to test performance,
> but are getting poor results. One of our suspicions is that the regular
> 1GbE interface might be being used, rather than the 100G interface. Is
> there a command, log, or something else that can be used to determine which
> interface OpenMPI is using on a multi-NIC server?
>
>
> There are two questions here:
>
> 1. How to make sure Open MPI is only using the 100G Ethernet interfaces
> (and not the 1G Ethernet interface)?
> 2. How to make sure Open MPI isn't using TCP (which is among the slowest
> of Ethernet transports for MPI/HPC traffic)?
>
> If you only have TCP as a transport option, then Open MPI will default to
> using all available Ethernet interfaces (including your 1G interface).
> This means that it will be using the TCP BTL (i.e., plugin) for MPI
> point-to-point communications.  You can therefore use the TCP BTL's include
> / exclude functionality to specify the interfaces that you want to use.
> You do this by setting the btl_tcp_if_include or btl_tcp_if_exclude MCA
> parameters (i.e., run-time parameters passed to Open MPI):
>
> # Only use the "eth1" interfaces on all nodes
> mpirun --mca btl_tcp_if_include eth1 ...
>
> # Only use the 10.20.30.0/24 network on all nodes.
> mpirun --mca btl_tcp_if_include 10.20.30.0/24 ...
>
> # Only use the 10.20.30.0/24 and 10.40.50.0/24 networks on all nodes.
> mpirun --mca btl_tcp_if_include 10.20.30.0/24,10.40.50.0/24 ...
>
> You can use btl_tcp_if_exclude, too (the "include" and "exclude" options
> are mutually exclusive).  Both options can take a comma-delimited list.
>
> You can also use two other MCA parameters to show Open MPI's process of
> selecting which PML and BTLs will be used at run time:
>
> mpirun --mca pml_base_verbose 100 --mca btl_base_verbose 100 ...
>
> PML = Open MPI's point-to-point messaging layer.  It's the back-end behind
> MPI_Send, MPI_Recv, etc.
> BTL = One possible set of underlying transports for MPI_Send / MPI_Recv /
> etc.  BTLs are generally used when the "ob1" PML is used (they're sometimes
> used in other cases, but for the purposes of this conversation, let's just
> focus on "ob1" and MPI_Send / MPI_Recv / etc.).
>
> That being said, as mentioned above, TCP is among the slowest of Ethernet
> transports.  If you have an HPC-class NIC for Ethernet (e.g., one that
> supports Cisco usNIC, AWS EFA, RoCE v2, iWARP, etc.), you can use a
> different communication stack across that Ethernet NIC for better
> performance than the standard POSIX sockets API will provide.
>
> In those cases, you need to make sure that Open MPI was built with the
> transport stack that supports your HPC-class Ethernet NIC (e.g., Libfabric,
> UCX, etc.).  You can use the "ompi_info" command to see what plugins your
> Open MPI supports; you'll likely want to see "ofi" for Libfabric support or
> "ucx" for UCX support.  A summary is also displayed at the end of when you
> run "configure" when building Open MPI.
>
> If Open MPI was built with an HPC-class networking stack and it finds
> HPC-class NICs that can use that networking stack at run-time, it generally
> auto-selects them.  However, sometimes there are cases where it might miss
> such an opportunity, so you can force the use of a specific stack /
> specific NICs if necessary (or even if you just want to be 10000% sure
> you're using the right network).
>
> The specific MCA parameters that you use will depend on what kind of
> network stack and HPC-class Ethernet NIC you're using.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
>
