Hi Folks,

I'm also having problems reproducing this on one of our OPA clusters:

  libpsm2-11.2.78-1.el7.x86_64
  libpsm2-devel-11.2.78-1.el7.x86_64

The cluster runs RHEL 7.8:

  hca_id: hfi1_0
          transport:              InfiniBand (0)
          fw_ver:                 1.27.0
          node_guid:              0011:7501:0179:e2d7
          sys_image_guid:         0011:7501:0179:e2d7
          vendor_id:              0x1175
          vendor_part_id:         9456
          hw_ver:                 0x11
          board_id:               Intel Omni-Path Host Fabric Interface Adapter 100 Series
          phys_port_cnt:          1
                  port:   1
                          state:          PORT_ACTIVE (4)
                          max_mtu:        4096 (5)
                          active_mtu:     4096 (5)
                          sm_lid:         1
                          port_lid:       99
                          port_lmc:       0x00
                          link_layer:     InfiniBand

I'm using gcc/gfortran 9.3.0 and built Open MPI 4.0.5 without any special configure options.
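In case it helps to compare setups, the summary above is essentially what the standard tools report; roughly (a sketch -- hfi1_0 is the device name from my output above, adjust to your system):

  # Omni-Path HFI attributes as seen through the verbs stack
  ibv_devinfo -d hfi1_0

  # which MTL/BTL components this Open MPI build actually contains
  ompi_info | grep -E 'MCA (mtl|btl)'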
Howard

On 1/27/21, 9:47 AM, "users on behalf of Michael Di Domenico via users" <users-boun...@lists.open-mpi.org on behalf of users@lists.open-mpi.org> wrote:

For whatever it's worth, running the test program on my OPA cluster seems to work. Well, it keeps spitting out [INFO MEMORY] lines; I'm not sure if it's supposed to stop at some point.

I'm running RHEL 7, gcc 10.1, Open MPI 4.0.5rc2, built --with-ofi and --without-{psm,ucx,verbs}.

On Tue, Jan 26, 2021 at 3:44 PM Patrick Begou via users <users@lists.open-mpi.org> wrote:
>
> Hi Michael,
>
> Indeed, I'm a little bit lost with all these parameters in OpenMPI, mainly because for years it has worked just fine out of the box in all my deployments on various architectures, interconnects, and Linux flavors. Some weeks ago I deployed OpenMPI 4.0.5 on CentOS 8 with gcc 10, Slurm, and UCX on an AMD Epyc2 cluster with ConnectX-6, and it just works fine. This is the first time I've had such trouble deploying this library.
>
> If you have my mail posted on 25/01/2021 in this discussion at 18h54 (Paris TZ, maybe), there is a small test case attached that shows the problem. Did you get it, or did the list strip the attachments? I can provide it again.
>
> Many thanks
>
> Patrick
>
> On 26/01/2021 at 19:25, Heinz, Michael William wrote:
> >
> > Patrick, how are you using the original PSM if you're using Omni-Path hardware? The original PSM was written for QLogic DDR and QDR InfiniBand adapters.
> >
> > As far as needing openib goes: the issue is that the PSM2 MTL doesn't support a subset of MPI operations that we previously used the pt2pt BTL for. For recent versions of OMPI, the preferred BTL to use with PSM2 is OFI.
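> > Concretely, something along these lines should exercise that combination (just a sketch -- the hostfile and application name are placeholders):
> >
> >   # cm PML + PSM2 MTL for the normal point-to-point traffic; the OFI BTL
> >   # (plus self) covers the operations the PSM2 MTL doesn't, per the above
> >   mpirun -hostfile <hostfile> --mca pml cm --mca mtl psm2 \
> >          --mca btl ofi,self ./your_app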
> > Is there any chance you can give us a sample MPI app that reproduces the problem? I can't think of another way I can give you more help without being able to see what's going on. It's always possible there's a bug in the PSM2 MTL, but it would be surprising at this point.
> >
> > Sent from my iPad
> >
> > On Jan 26, 2021, at 1:13 PM, Patrick Begou via users <users@lists.open-mpi.org> wrote:
> > >
> > > Hi all,
> > >
> > > I ran many tests today. I saw that an older 4.0.2 version of OpenMPI packaged with Nix was running using openib, so I added the --with-verbs option to set up this module.
> > >
> > > What I can see now is that with
> > >
> > >   mpirun -hostfile $OAR_NODEFILE --mca mtl psm -mca btl_openib_allow_ib true ...
> > >
> > > - the testcase test_layout_array runs without error
> > > - the bandwidth measured with osu_bw is half of what it should be:
> > >
> > > # OSU MPI Bandwidth Test v5.7
> > > # Size       Bandwidth (MB/s)
> > > 1                        0.54
> > > 2                        1.13
> > > 4                        2.26
> > > 8                        4.51
> > > 16                       9.06
> > > 32                      17.93
> > > 64                      33.87
> > > 128                     69.29
> > > 256                    161.24
> > > 512                    333.82
> > > 1024                   682.66
> > > 2048                  1188.63
> > > 4096                  1760.14
> > > 8192                  2166.08
> > > 16384                 2036.95
> > > 32768                 3466.63
> > > 65536                 6296.73
> > > 131072                7509.43
> > > 262144                9104.78
> > > 524288                6908.55
> > > 1048576               5530.37
> > > 2097152               4489.16
> > > 4194304               3498.14
> > >
> > > while with
> > >
> > >   mpirun -hostfile $OAR_NODEFILE --mca mtl psm2 -mca btl_openib_allow_ib true ...
> > >
> > > - the testcase test_layout_array does not give correct results
> > > - the bandwidth measured with osu_bw is the right one:
> > >
> > > # OSU MPI Bandwidth Test v5.7
> > > # Size       Bandwidth (MB/s)
> > > 1                        3.73
> > > 2                        7.96
> > > 4                       15.82
> > > 8                       31.22
> > > 16                      51.52
> > > 32                     107.61
> > > 64                     196.51
> > > 128                    438.66
> > > 256                    817.70
> > > 512                   1593.90
> > > 1024                  2786.09
> > > 2048                  4459.77
> > > 4096                  6658.70
> > > 8192                  8092.95
> > > 16384                 8664.43
> > > 32768                 8495.96
> > > 65536                11458.77
> > > 131072               12094.64
> > > 262144               11781.84
> > > 524288               12297.58
> > > 1048576              12346.92
> > > 2097152              12206.53
> > > 4194304              12167.00
> > >
> > > But yes, I know openib is deprecated too in 4.0.5.
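> > > If it matters, the MTL each run actually selects can be double-checked with the usual verbosity knob, e.g. (a sketch -- --mca pml cm just forces the MTL code path, and the osu_bw path is shortened):
> > >
> > >   # print MTL selection details while running the bandwidth test
> > >   mpirun -np 2 -hostfile $OAR_NODEFILE \
> > >          --mca pml cm --mca mtl psm2 \
> > >          --mca mtl_base_verbose 100 ./osu_bw
> > >
> > > Patrick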