Unfortunately, OPA/PSM support for Debian isn't handled directly by Intel or by Cornelis Networks - but I should point out that you can download the latest official source for PSM2 and the drivers from GitHub.
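If you want to go that route on Debian, building the userspace library from source is roughly the following; I'm quoting the repository location from memory, so treat it as a sketch and double-check the upstream README for the exact install layout:

    # hypothetical sketch: verify the repository URL and install paths
    # against the upstream README before relying on them
    git clone https://github.com/cornelisnetworks/opa-psm2.git
    cd opa-psm2
    make
    sudo make install   # see the README for DESTDIR/LIBDIR overrides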
-----Original Message-----
From: users <users-boun...@lists.open-mpi.org> On Behalf Of Michael Di Domenico via users
Sent: Wednesday, January 27, 2021 3:32 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Michael Di Domenico <mdidomeni...@gmail.com>
Subject: Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path

If you have OPA cards, for Open MPI you only need --with-ofi; you don't need
psm/psm2/verbs/ucx. But this assumes you're running a RHEL-based distro and
have installed the OPA fabric suite of software from Intel/CornelisNetworks,
which is what I have. Perhaps there's something really odd in Debian, or an
incompatibility with the older OFED drivers that may be included with Debian.
Unfortunately I don't have access to a Debian system, so I can't be much more
help.

If I had to guess (totally pulling junk from the air), there's probably
something incompatible between PSM and OPA when running specifically on
Debian, likely due to library versioning. I don't know how common that
combination is, so it's not clear how fleshed out and tested it is.

On Wed, Jan 27, 2021 at 3:07 PM Patrick Begou via users <users@lists.open-mpi.org> wrote:
>
> Hi Howard and Michael
>
> First, many thanks for testing with my short application. Yes, when the
> test code runs fine it just shows the max RSS size of the rank 0 process.
> When it runs wrong, it prints a message for each invalid value found.
>
> As I said, I have also deployed OpenMPI on various clusters (in the DELL
> data center at Austin) when I was testing some architectures a few months
> ago, and neither on AMD/Mellanox IB nor on Intel/Omni-Path did I hit any
> problem. The goal was to run my tests with the same software stack and be
> sure I could deploy my software stack on the selected solution.
> But like your clusters (and my small local clusters), they were all
> running RedHat (or a similar Linux flavor) and a modern GNU compiler (9 or 10).
> The university cluster I have access to is running Debian stretch and
> provides GCC 6 as the default compiler.
>
> I cannot ask for a different OS, but I can deploy a local gcc10 and
> build OpenMPI again. UCX is not available on this cluster; should I
> deploy a local UCX too?
>
> libpsm2 seems good:
> dahu103 : dpkg -l | grep psm
> ii  libfabric-psm           1.10.0-2-1ifs+deb9         amd64  Dynamic PSM provider for user-space Open Fabric Interfaces
> ii  libfabric-psm2          1.10.0-2-1ifs+deb9         amd64  Dynamic PSM2 provider for user-space Open Fabric Interfaces
> ii  libpsm-infinipath1      3.3-19-g67c0807-2ifs+deb9  amd64  PSM Messaging library for Intel Truescale adapters
> ii  libpsm-infinipath1-dev  3.3-19-g67c0807-2ifs+deb9  amd64  Development files for libpsm-infinipath1
> ii  libpsm2-2               11.2.185-1-1ifs+deb9       amd64  Intel PSM2 Libraries
> ii  libpsm2-2-compat        11.2.185-1-1ifs+deb9       amd64  Compat library for Intel PSM2
> ii  libpsm2-dev              11.2.185-1-1ifs+deb9       amd64  Development files for Intel PSM2
> ii  psmisc                  22.21-2.1+b2               amd64  utilities that use the proc file system
>
> This will be my next try to install OpenMPI on this cluster.
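For what it's worth, the build Michael describes (OFI only, with PSM/PSM2/verbs/UCX disabled at configure time) would look roughly like the following; the install prefix is a placeholder and none of this has been tried on Debian stretch:

    # sketch only: prefix is a placeholder; point CC/CXX/FC at the locally
    # deployed gcc10 if the system compiler stays at GCC 6
    ./configure --prefix=$HOME/local/openmpi-4.0.5 \
                --with-ofi \
                --without-psm --without-psm2 --without-verbs --without-ucx
    make -j 8 && make install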
>
> Patrick
>
>
> On 27/01/2021 at 18:09, Pritchard Jr., Howard via users wrote:
> > Hi Folks,
> >
> > I'm also having problems reproducing this on one of our OPA clusters:
> >
> > libpsm2-11.2.78-1.el7.x86_64
> > libpsm2-devel-11.2.78-1.el7.x86_64
> >
> > The cluster runs RHEL 7.8.
> >
> > hca_id: hfi1_0
> >         transport:        InfiniBand (0)
> >         fw_ver:           1.27.0
> >         node_guid:        0011:7501:0179:e2d7
> >         sys_image_guid:   0011:7501:0179:e2d7
> >         vendor_id:        0x1175
> >         vendor_part_id:   9456
> >         hw_ver:           0x11
> >         board_id:         Intel Omni-Path Host Fabric Interface Adapter 100 Series
> >         phys_port_cnt:    1
> >         port:   1
> >                 state:            PORT_ACTIVE (4)
> >                 max_mtu:          4096 (5)
> >                 active_mtu:       4096 (5)
> >                 sm_lid:           1
> >                 port_lid:         99
> >                 port_lmc:         0x00
> >                 link_layer:       InfiniBand
> >
> > using gcc/gfortran 9.3.0
> >
> > Built Open MPI 4.0.5 without any special configure options.
> >
> > Howard
> >
> > On 1/27/21, 9:47 AM, "users on behalf of Michael Di Domenico via users" <users-boun...@lists.open-mpi.org on behalf of users@lists.open-mpi.org> wrote:
> >
> > For whatever it's worth, running the test program on my OPA cluster
> > seems to work. Well, it keeps spitting out [INFO MEMORY] lines; not
> > sure if it's supposed to stop at some point.
> >
> > I'm running RHEL 7, gcc 10.1, openmpi 4.0.5rc2, with-ofi,
> > without-{psm,ucx,verbs}.
> >
> > On Tue, Jan 26, 2021 at 3:44 PM Patrick Begou via users <users@lists.open-mpi.org> wrote:
> > >
> > > Hi Michael
> > >
> > > Indeed I'm a little bit lost with all these parameters in OpenMPI,
> > > mainly because for years it has worked just fine out of the box in all
> > > my deployments on various architectures, interconnects and Linux
> > > flavors. Some weeks ago I deployed OpenMPI 4.0.5 on CentOS 8 with
> > > gcc10, Slurm and UCX on an AMD Epyc2 cluster with ConnectX-6, and it
> > > just worked fine. It is the first time I've had such trouble deploying
> > > this library.
> > >
> > > If you have my mail posted on 25/01/2021 in this discussion at
> > > 18h54 (maybe Paris TZ), there is a small test case attached that shows
> > > the problem. Did you get it, or did the list strip the attachment? I
> > > can provide it again.
> > >
> > > Many thanks
> > >
> > > Patrick
> > >
> > > On 26/01/2021 at 19:25, Heinz, Michael William wrote:
> > >
> > > Patrick, how are you using the original PSM if you're using Omni-Path
> > > hardware? The original PSM was written for QLogic DDR and QDR
> > > InfiniBand adapters.
> > >
> > > As far as needing openib - the issue is that the PSM2 MTL doesn't
> > > support a subset of MPI operations that we previously used the pt2pt
> > > BTL for. For recent versions of OMPI, the preferred BTL to use with
> > > PSM2 is OFI.
> > >
> > > Is there any chance you can give us a sample MPI app that reproduces
> > > the problem? I can't think of another way I can give you more help
> > > without being able to see what's going on. It's always possible
> > > there's a bug in the PSM2 MTL, but it would be surprising at this point.
> > >
> > > Sent from my iPad
> > >
> > > On Jan 26, 2021, at 1:13 PM, Patrick Begou via users <users@lists.open-mpi.org> wrote:
> > >
> > > Hi all,
> > >
> > > I ran many tests today. I saw that an older 4.0.2 version of OpenMPI
> > > packaged with Nix was running using openib. So I added the
> > > --with-verbs option to set up this module.
> > >
> > > What I can see now is that, with:
> > >
> > > mpirun -hostfile $OAR_NODEFILE --mca mtl psm -mca btl_openib_allow_ib true ....
> > >
> > > - the testcase test_layout_array is running without error
> > >
> > > - the bandwidth measured with osu_bw is half of what it should be:
> > >
> > > # OSU MPI Bandwidth Test v5.7
> > > # Size        Bandwidth (MB/s)
> > > 1                         0.54
> > > 2                         1.13
> > > 4                         2.26
> > > 8                         4.51
> > > 16                        9.06
> > > 32                       17.93
> > > 64                       33.87
> > > 128                      69.29
> > > 256                     161.24
> > > 512                     333.82
> > > 1024                    682.66
> > > 2048                   1188.63
> > > 4096                   1760.14
> > > 8192                   2166.08
> > > 16384                  2036.95
> > > 32768                  3466.63
> > > 65536                  6296.73
> > > 131072                 7509.43
> > > 262144                 9104.78
> > > 524288                 6908.55
> > > 1048576                5530.37
> > > 2097152                4489.16
> > > 4194304                3498.14
> > >
> > > mpirun -hostfile $OAR_NODEFILE --mca mtl psm2 -mca btl_openib_allow_ib true ...
> > >
> > > - the testcase test_layout_array is not giving correct results
> > >
> > > - the bandwidth measured with osu_bw is the right one:
> > >
> > > # OSU MPI Bandwidth Test v5.7
> > > # Size        Bandwidth (MB/s)
> > > 1                         3.73
> > > 2                         7.96
> > > 4                        15.82
> > > 8                        31.22
> > > 16                       51.52
> > > 32                      107.61
> > > 64                      196.51
> > > 128                     438.66
> > > 256                     817.70
> > > 512                    1593.90
> > > 1024                   2786.09
> > > 2048                   4459.77
> > > 4096                   6658.70
> > > 8192                   8092.95
> > > 16384                  8664.43
> > > 32768                  8495.96
> > > 65536                 11458.77
> > > 131072                12094.64
> > > 262144                11781.84
> > > 524288                12297.58
> > > 1048576               12346.92
> > > 2097152               12206.53
> > > 4194304               12167.00
> > >
> > > But yes, I know openib is deprecated too in 4.0.5.
> > >
> > > Patrick
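Since Michael's and Howard's working setups both go through libfabric, one more data point worth collecting on the Debian cluster would be the same osu_bw pair forced onto the OFI MTL instead of psm/psm2 directly, with the MTL verbosity turned up so the selected component is printed at startup. A rough sketch, reusing Patrick's hostfile and assuming the osu_bw binary is in the current directory:

    # force the cm PML + OFI MTL instead of psm/psm2, and print which MTL is selected
    mpirun -hostfile $OAR_NODEFILE -np 2 \
           --mca pml cm --mca mtl ofi \
           --mca mtl_base_verbose 100 \
           ./osu_bw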