For whatever it's worth, running the test program on my OPA cluster
seems to work. Well, it keeps printing [INFO MEMORY] lines; I'm not
sure whether it's supposed to stop at some point.
I'm running RHEL 7, gcc 10.1, and Open MPI 4.0.5rc2, configured with
--with-ofi and --without-{psm,ucx,verbs}.

On Tue, Jan 26, 2021 at 3:44 PM Patrick Begou via users
<[email protected]> wrote:
>
> Hi Michael
>
> indeed I'm a little bit lost with all these parameters in OpenMPI, mainly
> because for years it has worked just fine out of the box in all my
> deployments on various architectures, interconnects and Linux flavors. Some
> weeks ago I deployed OpenMPI 4.0.5 on CentOS 8 with gcc 10, Slurm and UCX on
> an AMD EPYC2 cluster with ConnectX-6, and it just works fine. This is the
> first time I've had such trouble deploying this library.
>
> If you have my mail posted on 25/01/2021 in this discussion at 18:54
> (possibly Paris time), there is a small test case attached that shows the
> problem. Did you get it, or did the list strip the attachments? I can
> provide it again.
>
> Many thanks
>
> Patrick
>
> Le 26/01/2021 à 19:25, Heinz, Michael William a écrit :
>
> Patrick, how are you using the original PSM if you’re using Omni-Path
> hardware? The original PSM was written for QLogic DDR and QDR InfiniBand
> adapters.
>
> As far as needing openib goes, the issue is that the PSM2 MTL doesn’t
> support a subset of MPI operations that we previously used the pt2pt BTL
> for. For recent versions of OMPI, the preferred BTL to use with PSM2 is OFI.
>
> Is there any chance you can give us a sample MPI app that reproduces the
> problem? I can’t think of another way I can give you more help without being
> able to see what’s going on. It’s always possible there’s a bug in the PSM2
> MTL but it would be surprising at this point.
>
> On Jan 26, 2021, at 1:13 PM, Patrick Begou via users
> <[email protected]> wrote:
>
>
> Hi all,
>
> I ran many tests today. I saw that an older 4.0.2 version of OpenMPI packaged
> with Nix was running using openib. So I added the --with-verbs option to
> build this module.
>
> What I can see now is that, with:
>
> mpirun -hostfile $OAR_NODEFILE --mca mtl psm -mca btl_openib_allow_ib true
> ....
>
> - the testcase test_layout_array is running without error
>
> - the bandwidth measured with osu_bw is half of what it should be:
>
> # OSU MPI Bandwidth Test v5.7
> # Size Bandwidth (MB/s)
> 1 0.54
> 2 1.13
> 4 2.26
> 8 4.51
> 16 9.06
> 32 17.93
> 64 33.87
> 128 69.29
> 256 161.24
> 512 333.82
> 1024 682.66
> 2048 1188.63
> 4096 1760.14
> 8192 2166.08
> 16384 2036.95
> 32768 3466.63
> 65536 6296.73
> 131072 7509.43
> 262144 9104.78
> 524288 6908.55
> 1048576 5530.37
> 2097152 4489.16
> 4194304 3498.14
>
> whereas with:
>
> mpirun -hostfile $OAR_NODEFILE --mca mtl psm2 -mca btl_openib_allow_ib true
> ...
>
> - the testcase test_layout_array is not giving correct results
>
> - the bandwidth measured with osu_bw is the right one:
>
> # OSU MPI Bandwidth Test v5.7
> # Size Bandwidth (MB/s)
> 1 3.73
> 2 7.96
> 4 15.82
> 8 31.22
> 16 51.52
> 32 107.61
> 64 196.51
> 128 438.66
> 256 817.70
> 512 1593.90
> 1024 2786.09
> 2048 4459.77
> 4096 6658.70
> 8192 8092.95
> 16384 8664.43
> 32768 8495.96
> 65536 11458.77
> 131072 12094.64
> 262144 11781.84
> 524288 12297.58
> 1048576 12346.92
> 2097152 12206.53
> 4194304 12167.00
>
> But yes, I know openib is deprecated too in 4.0.5.
>
> Patrick
>
>
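
Editorial sketch: the two osu_bw runs quoted above are easier to compare side by side than by eye. The Python below is only an illustration, not part of the original thread; the function names `parse_osu_bw` and `bandwidth_ratios` are made up for this example, and only the large-message rows (64 KiB and up) are copied from the quoted output.

```python
# Large-message rows copied from the quoted osu_bw output (--mca mtl psm).
PSM_RUN = """
# OSU MPI Bandwidth Test v5.7
# Size Bandwidth (MB/s)
65536 6296.73
131072 7509.43
262144 9104.78
524288 6908.55
1048576 5530.37
2097152 4489.16
4194304 3498.14
"""

# Same message sizes from the second quoted run (--mca mtl psm2).
PSM2_RUN = """
# OSU MPI Bandwidth Test v5.7
# Size Bandwidth (MB/s)
65536 11458.77
131072 12094.64
262144 11781.84
524288 12297.58
1048576 12346.92
2097152 12206.53
4194304 12167.00
"""


def parse_osu_bw(text):
    """Return {message_size_bytes: bandwidth_MB_per_s} from osu_bw output."""
    results = {}
    for line in text.splitlines():
        line = line.lstrip("> ").strip()  # tolerate mail quote markers
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "#" header lines
        size, bandwidth = line.split()
        results[int(size)] = float(bandwidth)
    return results


def bandwidth_ratios(slow, fast):
    """Return {size: slow/fast} for the message sizes present in both runs."""
    return {size: slow[size] / fast[size] for size in sorted(slow) if size in fast}


psm = parse_osu_bw(PSM_RUN)
psm2 = parse_osu_bw(PSM2_RUN)
ratios = bandwidth_ratios(psm, psm2)

print(f"peak psm : {max(psm.values()):.2f} MB/s")
print(f"peak psm2: {max(psm2.values()):.2f} MB/s")
for size, ratio in ratios.items():
    print(f"{size:>8} bytes: psm reaches {ratio:.0%} of psm2")
```

Pasting the full tables into the two strings works the same way. The per-size ratio shows where the psm run falls behind, which a single peak number hides.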