----- Original Message -----
> From: "Pavel Mezentsev via users" <users@lists.open-mpi.org>
> To: users@lists.open-mpi.org
> CC: "Pavel Mezentsev" <pavel.mezent...@gmail.com>
> Sent: Wednesday, May 19, 2021 10:53:50
> Subject: Re: [OMPI users] unable to launch a job on a system with OmniPath
>
> It took some time, but my colleague was able to build OpenMPI and get it
> working with OmniPath; however, the performance is quite disappointing.
> The configure line used was the following:
>
> ./configure --prefix=$INSTALL_PATH --build=x86_64-pc-linux-gnu
> --host=x86_64-pc-linux-gnu --enable-shared --with-hwloc=$EBROOTHWLOC
> --with-psm2 --with-ofi=$EBROOTLIBFABRIC --with-libevent=$EBROOTLIBEVENT
> --without-orte --disable-oshmem --with-gpfs --with-slurm
> --with-pmix=external --with-libevent=external --with-ompi-pmix-rte
>
> /usr/bin/srun --cpu-bind=none --mpi=pspmix --ntasks-per-node 1 -n 2 xenv -L
> Architecture/KNL -L GCC -L OpenMPI env OMPI_MCA_btl_base_verbose="99"
> OMPI_MCA_mtl_base_verbose="99" numactl --physcpubind=1 ./osu_bw
> ...
> [node:18318] select: init of component ofi returned success
> [node:18318] mca: base: components_register: registering framework mtl components
> [node:18318] mca: base: components_register: found loaded component ofi
> [node:18318] mca: base: components_register: component ofi register function successful
> [node:18318] mca: base: components_open: opening mtl components
> [node:18318] mca: base: components_open: found loaded component ofi
> [node:18318] mca: base: components_open: component ofi open function successful
> [node:18318] mca:base:select: Auto-selecting mtl components
> [node:18318] mca:base:select:( mtl) Querying component [ofi]
> [node:18318] mca:base:select:( mtl) Query of component [ofi] set priority to 25
> [node:18318] mca:base:select:( mtl) Selected component [ofi]
> [node:18318] select: initializing mtl component ofi
> [node:18318] mtl_ofi_component.c:378: mtl:ofi:provider: hfi1_0
> ...
> # OSU MPI Bandwidth Test v5.7
> # Size      Bandwidth (MB/s)
> 1                       0.05
> 2                       0.10
> 4                       0.20
> 8                       0.41
> 16                      0.77
> 32                      1.54
> 64                      3.10
> 128                     6.09
> 256                    12.39
> 512                    24.23
> 1024                   46.85
> 2048                   87.99
> 4096                  100.72
> 8192                  139.91
> 16384                 173.67
> 32768                 197.82
> 65536                 210.15
> 131072                215.76
> 262144                214.39
> 524288                219.23
> 1048576               223.53
> 2097152               226.93
> 4194304               227.62
>
> If I test directly with `ib_write_bw` I get:
>
> #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
> Conflicting CPU frequency values detected: 1498.727000 != 1559.017000. CPU Frequency is not max.
> 65536      5000             2421.04            2064.33              0.033029
>
> I also tried adding `OMPI_MCA_mtl="psm2"`; however, the job crashes in that
> case:
> ```
> Error obtaining unique transport key from ORTE
> (orte_precondition_transports not present in
> the environment).
> ```
> Which is a bit puzzling considering that OpenMPI was built with
> `--witout-orte`
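To put a number on "quite disappointing", here is a small POSIX shell sketch that compares the osu_bw result at 65536 bytes with the ib_write_bw average at the same message size. It only does arithmetic on figures copied from the output above; the variable names are illustrative. The comparison is rough, since ib_write_bw measures verbs-level RDMA writes rather than MPI point-to-point traffic.

```sh
#!/bin/sh
# Rough comparison of MPI-level bandwidth (osu_bw) with the raw verbs
# bandwidth (ib_write_bw) measured on the same link.
# All numbers are copied from the runs quoted above, in MB/s.

# Average bandwidth reported by ib_write_bw for 65536-byte messages.
ib_write_bw_avg=2064.33

# "size bandwidth" rows from the osu_bw output (a subset of the table).
osu_bw_rows='16384 173.67
32768 197.82
65536 210.15
131072 215.76
4194304 227.62'

# osu_bw result at the same message size that ib_write_bw used.
osu_at_64k=$(printf '%s\n' "$osu_bw_rows" | awk '$1 == 65536 { print $2 }')

# Highest bandwidth osu_bw reached across the listed sizes.
osu_peak=$(printf '%s\n' "$osu_bw_rows" | awk 'BEGIN { max = 0 } $2 > max { max = $2 } END { print max }')

# Percentage of the verbs-level bandwidth that MPI achieved at 64 KiB.
pct=$(awk -v a="$osu_at_64k" -v b="$ib_write_bw_avg" 'BEGIN { printf "%.1f", 100 * a / b }')

echo "osu_bw at 65536 bytes: $osu_at_64k MB/s (${pct}% of ib_write_bw)"
echo "osu_bw peak: $osu_peak MB/s"
```

With these numbers the script reports 210.15 MB/s at 64 KiB, about 10.2% of the ib_write_bw average, and a peak of 227.62 MB/s, so MPI is reaching roughly a tenth of what the verbs test measured.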

Dear Pavel,

I can't help you, but just in case, in the text:

> Which is a bit puzzling considering that OpenMPI was built with
> `--witout-orte`

shouldn't it be `--without-orte`?

Regards,
Jorge D' Elia.
--
CIMEC (UNL-CONICET), http://www.cimec.org.ar/
Predio CONICET-Santa Fe, Colec. Ruta Nac. 168,
Paraje El Pozo, 3000, Santa Fe, ARGENTINA.
Tel +54-342-4511594/95 ext 7062, fax: +54-342-4511169

> What am I missing, and how can I improve the performance?
>
> Regards, Pavel Mezentsev.
>
> On Mon, May 10, 2021 at 6:20 PM Heinz, Michael William <
> michael.william.he...@cornelisnetworks.com> wrote:
>
>> That warning is an annoying bit of cruft from the openib / verbs provider
>> that can be ignored. (Actually, I recommend using "--btl ^openib" to
>> suppress the warning.)
>>
>> That said, there is a known issue with selecting PSM2 and OMPI 4.1.0. I'm
>> not sure that's the problem you're hitting, though, because you really
>> haven't provided a lot of information.
>>
>> I would suggest trying the following to see what happens:
>>
>> ${PATH_TO_OMPI}/mpirun -mca mtl psm2 -mca btl ^openib -mca
>> mtl_base_verbose 99 -mca btl_base_verbose 99 -n ${N} -H ${HOSTS}
>> my_application
>>
>> This should give you detailed information on what transports were
>> selected and what happened next.
>>
>> Oh, and check that your fabric is up with an opainfo or opareport
>> command, just to be sure.
>>
>> From: users <users-boun...@lists.open-mpi.org> On Behalf Of Pavel
>> Mezentsev via users
>> Sent: Monday, May 10, 2021 8:41 AM
>> To: users@lists.open-mpi.org
>> Cc: Pavel Mezentsev <pavel.mezent...@gmail.com>
>> Subject: [OMPI users] unable to launch a job on a system with OmniPath
>>
>> Hi!
>>
>> I'm working on a system with KNL and OmniPath, and I'm trying to launch a
>> job, but it fails. Could someone please advise what parameters I need to add
>> to make it work properly?
>> At first I need to make it work within one node; however, later I need to
>> use multiple nodes, and eventually I may need to switch to TCP to run a
>> hybrid job where some nodes are connected via InfiniBand and others via
>> OmniPath.
>>
>> So far, without any extra parameters, I get:
>> ```
>> By default, for Open MPI 4.0 and later, infiniband ports on a device
>> are not used by default. The intent is to use UCX for these devices.
>> You can override this policy by setting the btl_openib_allow_ib MCA parameter
>> to true.
>>
>>   Local host: XXXXXX
>>   Local adapter: hfi1_0
>>   Local port: 1
>> ```
>>
>> If I add `OMPI_MCA_btl_openib_allow_ib="true"`, then I get:
>> ```
>> Error obtaining unique transport key from ORTE
>> (orte_precondition_transports not present in
>> the environment).
>>
>>   Local host: XXXXXX
>> ```
>>
>> Then I tried adding `OMPI_MCA_mtl="psm2"` or `OMPI_MCA_mtl="ofi"` to make it
>> use OmniPath, or `OMPI_MCA_btl="sm,self"` to make it use only shared memory,
>> but these parameters did not make any difference.
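The `-mca` options in the mpirun command suggested earlier in the thread can also be given to an srun launch as environment variables, because Open MPI reads an MCA parameter `<name>` from the variable `OMPI_MCA_<name>`. A minimal sketch, assuming a POSIX shell and Slurm's default behaviour of exporting the caller's environment to the tasks; the commented srun line is only an illustration.

```sh
#!/bin/sh
# The MCA options from the suggested mpirun command, expressed as
# OMPI_MCA_* environment variables for an srun launch.
export OMPI_MCA_mtl=psm2               # same as: -mca mtl psm2
export OMPI_MCA_btl='^openib'          # same as: -mca btl ^openib
export OMPI_MCA_mtl_base_verbose=99    # same as: -mca mtl_base_verbose 99
export OMPI_MCA_btl_base_verbose=99    # same as: -mca btl_base_verbose 99

# Show which MCA settings will be inherited by the launched tasks.
env | grep '^OMPI_MCA_' | sort

# Then launch with srun as before (not executed here), for example:
# /usr/bin/srun --cpu-bind=none --mpi=pspmix --ntasks-per-node 1 -n 2 ./osu_bw
```

In this thread, selecting psm2 this way led to the transport-key error quoted above, so the verbose output is mainly useful for confirming which MTL and BTLs were actually tried.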
>>
>> There does not seem to be much OmniPath-related documentation; at least I
>> was not able to find anything that would help me, but perhaps I missed
>> something:
>> https://www.open-mpi.org/faq/?category=running#opa-support
>> https://www.open-mpi.org/faq/?category=opa
>>
>> This is the `configure` line:
>>
>> ```
>> ./configure --prefix=XXXXX --build=x86_64-pc-linux-gnu
>> --host=x86_64-pc-linux-gnu --enable-shared --with-hwloc=$EBROOTHWLOC
>> --with-psm2 --with-libevent=$EBROOTLIBEVENT --without-orte --disable-oshmem
>> --with-cuda=$EBROOTCUDA --with-gpfs --with-slurm --with-pmix=external
>> --with-libevent=external --with-ompi-pmix-rte
>> ```
>>
>> Which also raises another question: if it was built with `--without-orte`,
>> then why do I get an error about failing to get something from ORTE?
>>
>> The OpenMPI version is `4.1.0rc1`, built with `gcc-9.3.0`.
>>
>> Thank you in advance!
>>
>> Regards, Pavel Mezentsev.
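On the ORTE question: the message appears to come from the PSM2 MTL, which looks up the `orte_precondition_transports` MCA parameter (environment variable `OMPI_MCA_orte_precondition_transports`) as a job-wide key. The name looks historical, and when processes are started without ORTE, nothing may set it. The sketch below is a hypothetical workaround, not a confirmed fix. It assumes that the key is two 16-digit hexadecimal numbers joined by a hyphen (33 characters), that it must be identical for every rank of one job, and that srun forwards exported variables. Please verify all three assumptions against your Open MPI version.

```sh
#!/bin/sh
# Hypothetical workaround sketch: provide the PSM2 job key by hand when the
# launcher does not set it. ASSUMPTION: the expected format is two 16-digit
# hex numbers joined by '-' (33 characters), the same in all ranks of a job.

# 16 random bytes rendered as 32 lowercase hex digits, no spaces/newlines.
hex=$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n')

# Split into two 16-digit halves: "xxxxxxxxxxxxxxxx-xxxxxxxxxxxxxxxx".
key="$(printf '%s' "$hex" | cut -c1-16)-$(printf '%s' "$hex" | cut -c17-32)"

# srun forwards exported variables by default, so every rank of this
# launch sees the same value; a new launch should get a new key.
export OMPI_MCA_orte_precondition_transports="$key"
echo "$OMPI_MCA_orte_precondition_transports"

# Then launch as before (not executed here), for example:
# /usr/bin/srun --mpi=pspmix --ntasks-per-node 1 -n 2 env OMPI_MCA_mtl=psm2 ./osu_bw
```

Generating the key once in the batch script, before srun, is what keeps it consistent across ranks. Generating it inside each task would give every rank a different key.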