John -

Open MPI's OFI implementation does not stripe a single process's messages 
across multiple NICs.  Instead, an Open MPI process will choose the "closest" 
NIC on the system (based on PCI hops and PCI topology, using hwloc).  If there 
is more than one "closest" NIC, as is the case on P4, where each Intel socket 
has two PCI switches, each with 2 GPUs and an EFA NIC behind them, then the 
processes will round-robin between the N closest NICs.  This isn't perfect, 
and the algorithm can get the wrong answer in some situations, but on P4 it 
should essentially always get the right answer.  The reason for this 
implementation is that Open MPI uses OFI's tagged-matching interface, and 
striping messages across multiple tagged-matching interfaces is quite 
complicated.  An OFI provider could choose to stripe messages across devices 
internally, of course, but we believe that, given the topologies involved and 
the limited cross-PCI-switch bandwidth available on platforms like P4, 
round-robin assignment is more beneficial to application performance.
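
If you want to sanity-check the topology that the NIC selection logic works 
from, something like the following should do it on a p4d instance (this 
assumes hwloc's lstopo utility is installed and that the EFA devices show up 
under /sys/class/infiniband, which I believe they do):

    # PCI topology as hwloc sees it (sockets, PCI switches, GPUs, EFA NICs)
    lstopo-no-graphics

    # NUMA node each EFA device is attached to
    for d in /sys/class/infiniband/*; do
        echo "$d -> $(cat $d/device/numa_node)"
    done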

This is why you see differences between osu_bw and osu_mbw_mr - one shows you 
single-process-pair data (meaning you are maxing out the single NIC that was 
chosen) and the other involves multiple processes per instance and is 
therefore using all the NICs.
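
As a rough sketch of the multi-NIC case (reusing the environment variables 
from your commands, and assuming a hostfile that lists just the two 
hostnames), running 8 ranks per instance should exercise all four NICs:

    /opt/amazon/openmpi/bin/mpirun -np 16 --hostfile hosts \
        --map-by ppr:8:node \
        -x FI_PROVIDER=efa -x FI_EFA_USE_DEVICE_RDMA=1 \
        ./osu_mbw_mr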

I think Ralph covered the process mapping / hostfile discussion, so I have 
nothing to add there, other than to point out that all the process / NIC 
mapping happens independently of the hostfile.  The big potential gotcha is 
that the NIC selection algorithm will only give reliable results if you pin 
processes to the socket or smaller.  Pinning to the socket is the default 
behavior for Open MPI, so that should not be a problem.  But if you start 
changing the process pinning behavior, please do remember that it can impact 
how Open MPI does NIC selection.  With the multiple PCI switches and some of 
the behaviors of the Intel root complex, you really don't want to be driving 
traffic from one socket to an EFA device attached to another socket on the P4 
platform.
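
If you do experiment with binding, a quick way to confirm where the ranks 
landed is Open MPI's binding report; a sketch (socket binding is the default, 
so this is mostly for verification):

    /opt/amazon/openmpi/bin/mpirun -np 16 --hostfile hosts \
        --map-by ppr:8:node --bind-to socket --report-bindings \
        ./osu_mbw_mr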

Hope this helps,

Brian

On 6/11/21, 6:43 AM, "users on behalf of Ralph Castain via users" 
<users-boun...@lists.open-mpi.org on behalf of users@lists.open-mpi.org> wrote:

    You can still use "map-by" to get what you want since you know there are 
four interfaces per node - just do "--map-by ppr:8:node". Note that you 
definitely do NOT want to list those multiple IP addresses in your hostfile - 
all you are doing is causing extra work for mpirun as it has to DNS resolve 
those addresses back down to their common host. We then totally ignore the fact 
that you specified those addresses, so it is accomplishing nothing (other than 
creating extra work).
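
    Just to make that concrete, a sketch of a hostfile that uses hostnames 
rather than per-interface IP addresses (host1/host2 are placeholders), paired 
with the ppr mapping:

        host1 slots=8
        host2 slots=8

        mpirun --hostfile hostfile --map-by ppr:8:node -np 16 ./osu_mbw_mr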

    You'll need to talk to AWS about how to drive striping across the 
interfaces. It sounds like they are automatically doing it, but perhaps not 
according to the algorithm you are seeking (i.e., they may not make such a 
linear assignment as you describe).


    > On Jun 8, 2021, at 1:23 PM, John Moore via users 
<users@lists.open-mpi.org> wrote:
    >
    > Hello,
    >
    > I am trying to run Open MPI on AWS's new p4d instances. These instances 
have 4x 100 Gb/s network interfaces, each with their own IPv4 address.
    >
    > I am primarily testing the bandwidth with the osu_micro_benchmarks test 
suite. Specifically I am running the osu_bibw and osu_mbw_mr tests to calculate 
the peak aggregate bandwidth I can achieve between two instances.
    >
    > I have found that running the osu_bibw test only achieves the throughput 
of a single network interface (100 Gb/s).  This is the command I am using:
    > /opt/amazon/openmpi/bin/mpirun -v -x FI_EFA_USE_DEVICE_RDMA=1 -x 
FI_PROVIDER="efa" -np 2 -host host1,host2 --map-by node --mca btl_base_verbose 
30 --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0  ./osu_bw -m 40000000
    >
    > As far as I understand it, Open MPI should be detecting the four 
interfaces and striping data across them, correct?
    >
    > I have found that the osu_mbw_mr test can achieve 4x the bandwidth of a 
single network interface, if the configuration is correct. For example, I am 
using the following command:
    > /opt/amazon/openmpi/bin/mpirun -v -x FI_EFA_USE_DEVICE_RDMA=1 -x 
FI_PROVIDER="efa" -np 8 -hostfile hostfile5 --map-by node --mca 
btl_base_verbose 30 --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0  
./osu_mbw_mr
    > This will run four pairs of send/recv calls between the two instances. 
hostfile5 contains all 8 local IPv4 addresses associated with the two 
instances. I believe this is why I am getting the expected performance.
    >
    > So, now I want to run a real use case, but I can't use --map-by node. I 
want to run two ranks per IPv4 address (interface), with the ranks ordered 
sequentially according to the hostfile (the first 8 ranks will belong to the 
first host, but the ranks will be divided among the four IPv4 addresses to 
utilize the full network bandwidth). But Open MPI won't allow me to assign 
slots=2 to each IPv4 address because they all belong to the same host.
    >
    > Any recommendation would be greatly appreciated.
    >
    > Thanks,
    > John


