John - Open MPI's OFI implementation does not stripe messages across NICs. Instead, an Open MPI process will choose the "closest" NIC on the system (based on PCI hops and PCI topology, using hwloc). If there is more than one "closest" NIC, as is the case on P4, where each Intel socket has two PCI switches, each with 2 GPUs and an EFA NIC behind them, then the processes will round-robin between the N closest NICs. This isn't perfect, and the algorithm can get the wrong answer in some situations, but on P4 it should generally get the right answer. The reason for this implementation is that Open MPI uses OFI's tag-matching interface, and striping messages across multiple tag-matching interfaces is rather complicated. An OFI provider could choose to stripe messages across devices internally, of course, but we believe that, given the topologies involved and the limited cross-PCI-switch bandwidth available on platforms like P4, round-robin assignment is more beneficial to application performance.
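As a quick way to see this in practice (assuming the libfabric command-line utilities are installed on the instance; output details will vary), fi_info can list the EFA devices the provider exposes, and adding --report-bindings to an existing mpirun command shows where each rank was bound, which is what drives the closest-NIC choice:

  fi_info -p efa                                          # list the EFA devices libfabric sees on the instance
  mpirun --report-bindings <existing mpirun arguments>    # print each rank's binding at launch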
This is why you see differences between osu_bw and osu_mbw_mr - one shows you data for a single process pair (meaning you are maxing out the single NIC that was chosen) and the other involves multiple processes per instance and therefore uses all the NICs.

I think Ralph covered the process mapping / hostfile discussion, so I have nothing to add there, other than to point out that all the process / NIC mapping happens independently of the hostfile (there is a sketch of one way to set this up at the very end of this message). The big potential gotcha is that the NIC selection algorithm will only give reliable results if you pin processes to the socket or smaller. Pinning to the socket is the default behavior for Open MPI, so that should not be a problem. But if you start changing the process pinning behavior, please do remember that it can have an impact on how Open MPI does NIC selection. With the multiple PCI switches and some of the behaviors of the Intel root complex, you really don't want to be driving traffic from one socket to an EFA device attached to another socket on the P4 platform.

Hope this helps,

Brian

On 6/11/21, 6:43 AM, "users on behalf of Ralph Castain via users" <users-boun...@lists.open-mpi.org on behalf of users@lists.open-mpi.org> wrote:

    You can still use "map-by" to get what you want since you know there are four interfaces per node - just do "--map-by ppr:8:node".

    Note that you definitely do NOT want to list those multiple IP addresses in your hostfile - all you are doing is causing extra work for mpirun, as it has to DNS-resolve those addresses back down to their common host. We then totally ignore the fact that you specified those addresses, so it is accomplishing nothing (other than creating extra work).

    You'll need to talk to AWS about how to drive striping across the interfaces. It sounds like they are automatically doing it, but perhaps not according to the algorithm you are seeking (i.e., they may not make such a linear assignment as you describe).

    > On Jun 8, 2021, at 1:23 PM, John Moore via users <users@lists.open-mpi.org> wrote:
    >
    > Hello,
    >
    > I am trying to run Open MPI on AWS's new p4d instances. These instances have 4x 100 Gb/s network interfaces, each with their own IPv4 address.
    >
    > I am primarily testing the bandwidth with the osu_micro_benchmarks test suite. Specifically, I am running the osu_bibw and osu_mbw_mr tests to calculate the peak aggregate bandwidth I can achieve between two instances.
    >
    > I have found that the osu_bibw test can only obtain the throughput of one network interface (100 Gb/s). This is the command I am using:
    >
    > /opt/amazon/openmpi/bin/mpirun -v -x FI_EFA_USE_DEVICE_RDMA=1 -x FI_PROVIDER="efa" -np 2 -host host1,host2 --map-by node --mca btl_base_verbose 30 --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 ./osu_bw -m 40000000
    >
    > As far as I understand it, Open MPI should be detecting the four interfaces and striping data across them, correct?
    >
    > I have found that the osu_mbw_mr test can achieve 4x the bandwidth of a single network interface if the configuration is correct. For example, I am using the following command:
    >
    > /opt/amazon/openmpi/bin/mpirun -v -x FI_EFA_USE_DEVICE_RDMA=1 -x FI_PROVIDER="efa" -np 8 -hostfile hostfile5 --map-by node --mca btl_base_verbose 30 --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 ./osu_mbw_mr
    >
    > This will run four pairs of send/recv calls across the different nodes.
    > hostfile5 contains all 8 local IPv4 addresses associated with the four nodes. I believe this is why I am getting the expected performance.
    >
    > So, now I want to run a real use case, but I can't use --map-by node. I want to run two ranks per IPv4 address (interface), with the ranks ordered sequentially according to the hostfile (the first 8 ranks will belong to the first host, but the ranks will be divided among the four IPv4 addresses to utilize the full network bandwidth). But Open MPI won't allow me to assign slots=2 to each IPv4 address, because they all belong to the same host.
    >
    > Any recommendation would be greatly appreciated.
    >
    > Thanks,
    > John
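To make the suggestions above concrete, here is a sketch of what the hostfile and command line could look like for the case John describes (the host names, the hostfile name, and the benchmark path are placeholders; the libfabric environment variables are simply carried over from the commands above, not re-verified here):

  hostfile:
    host1 slots=8
    host2 slots=8

  /opt/amazon/openmpi/bin/mpirun -np 16 -hostfile hostfile \
      --map-by ppr:8:node --bind-to socket --report-bindings \
      -x FI_PROVIDER=efa -x FI_EFA_USE_DEVICE_RDMA=1 \
      ./osu_mbw_mr

With each host listed once and the ranks bound at socket granularity (Open MPI's default), each rank picks its closest EFA device and the ranks sharing a socket round-robin across the devices behind it, so no per-interface entries in the hostfile should be needed.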