----- Original message -----
> From: "Pavel Mezentsev via users" <users@lists.open-mpi.org>
> To: users@lists.open-mpi.org
> CC: "Pavel Mezentsev" <pavel.mezent...@gmail.com>
> Sent: Wednesday, May 19, 2021 10:53:50
> Subject: Re: [OMPI users] unable to launch a job on a system with OmniPath
>
> It took some time, but my colleague was able to build OpenMPI and get it
> working with OmniPath. However, the performance is quite disappointing.
> The configure line used was the following:
>
> ./configure --prefix=$INSTALL_PATH --build=x86_64-pc-linux-gnu
> --host=x86_64-pc-linux-gnu --enable-shared --with-hwloc=$EBROOTHWLOC
> --with-psm2 --with-ofi=$EBROOTLIBFABRIC --with-libevent=$EBROOTLIBEVENT
> --without-orte --disable-oshmem --with-gpfs --with-slurm
> --with-pmix=external --with-libevent=external --with-ompi-pmix-rte
> 
> /usr/bin/srun --cpu-bind=none --mpi=pspmix --ntasks-per-node 1 -n 2 xenv -L
> Architecture/KNL -L GCC -L OpenMPI env OMPI_MCA_btl_base_verbose="99"
> OMPI_MCA_mtl_base_verbose="99" numactl --physcpubind=1 ./osu_bw
> ...
> [node:18318] select: init of component ofi returned success
> [node:18318] mca: base: components_register: registering framework mtl components
> [node:18318] mca: base: components_register: found loaded component ofi
> [node:18318] mca: base: components_register: component ofi register function successful
> [node:18318] mca: base: components_open: opening mtl components
> [node:18318] mca: base: components_open: found loaded component ofi
> [node:18318] mca: base: components_open: component ofi open function successful
> [node:18318] mca:base:select: Auto-selecting mtl components
> [node:18318] mca:base:select:(  mtl) Querying component [ofi]
> [node:18318] mca:base:select:(  mtl) Query of component [ofi] set priority to 25
> [node:18318] mca:base:select:(  mtl) Selected component [ofi]
> [node:18318] select: initializing mtl component ofi
> [node:18318] mtl_ofi_component.c:378: mtl:ofi:provider: hfi1_0
> ...
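
With verbosity set to 99 the interesting lines are easy to lose. A minimal sketch for pulling out which component each framework ended up selecting, assuming the "Selected component" line format shown in the log above; the function name `selected_components` and the variable `sample_log` are hypothetical, not part of Open MPI:

```shell
# Print "framework component" for every "Selected component" line found in
# Open MPI verbose output read from stdin. Assumes the line format seen in
# this thread: mca:base:select:(  mtl) Selected component [ofi]
selected_components() {
    sed -n 's/.*mca:base:select:( *\([a-z0-9_]*\)) Selected component \[\([^]]*\)].*/\1 \2/p'
}

# Example input: two lines copied from the log in this thread.
sample_log='[node:18318] mca:base:select:(  mtl) Querying component [ofi]
[node:18318] mca:base:select:(  mtl) Selected component [ofi]'

printf '%s\n' "$sample_log" | selected_components
```

In practice the job's combined stdout and stderr would be piped into the function instead of the sample text.
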
> # OSU MPI Bandwidth Test v5.7
> # Size      Bandwidth (MB/s)
> 1                       0.05
> 2                       0.10
> 4                       0.20
> 8                       0.41
> 16                      0.77
> 32                      1.54
> 64                      3.10
> 128                     6.09
> 256                    12.39
> 512                    24.23
> 1024                   46.85
> 2048                   87.99
> 4096                  100.72
> 8192                  139.91
> 16384                 173.67
> 32768                 197.82
> 65536                 210.15
> 131072                215.76
> 262144                214.39
> 524288                219.23
> 1048576               223.53
> 2097152               226.93
> 4194304               227.62
> 
> If I test directly with `ib_write_bw` I get:
> #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
> Conflicting CPU frequency values detected: 1498.727000 != 1559.017000. CPU Frequency is not max.
> 65536      5000             2421.04            2064.33            0.033029
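
To put the gap in numbers, here is a small sketch that compares the two results at the same message size (65536 bytes): 210.15 MB/s from `osu_bw` against the 2064.33 MB/s average from `ib_write_bw`. The helper name `bw_percent` is hypothetical, and the two tools exercise different software paths, so the ratio is only a rough indication:

```shell
# Print the first bandwidth as a percentage of the second, one decimal place.
bw_percent() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b * 100 }'
}

osu_bw_64k=210.15       # MB/s, osu_bw at 65536 bytes (from the table above)
ib_write_bw_64k=2064.33 # MB/s, ib_write_bw average at 65536 bytes

bw_percent "$osu_bw_64k" "$ib_write_bw_64k"
```

By this measure the MPI run reaches roughly a tenth of the bandwidth that `ib_write_bw` reports for the same message size.
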
> 
> I also tried adding `OMPI_MCA_mtl="psm2"` however the job crashes in that
> case:
> ```
> Error obtaining unique transport key from ORTE
> (orte_precondition_transports not present in
> the environment).
> ```
> Which is a bit puzzling considering that OpenMPI was built with
> `--witout-orte`

Dear Pavel, I can't help with the problem itself, but just in case: in the text

> Which is a bit puzzling considering that OpenMPI was built with
> `--witout-orte`

shouldn't it be `--without-orte`?


Regards, Jorge D' Elia.
--
CIMEC (UNL-CONICET), http://www.cimec.org.ar/
Predio CONICET-Santa Fe, Colec. Ruta Nac. 168, 
Paraje El Pozo, 3000, Santa Fe, ARGENTINA. 
Tel +54-342-4511594/95 ext 7062, fax: +54-342-4511169


> What am I missing and how can I improve the performance?
> 
> Regards, Pavel Mezentsev.
> 
> On Mon, May 10, 2021 at 6:20 PM Heinz, Michael William <
> michael.william.he...@cornelisnetworks.com> wrote:
> 
>> That warning is an annoying bit of cruft from the openib / verbs provider
>> that can be ignored. (Actually, I recommend using "--btl ^openib" to
>> suppress the warning.)
>>
>> That said, there is a known issue with selecting PSM2 and OMPI 4.1.0. I’m
>> not sure that that’s the problem you’re hitting, though, because you really
>> haven’t provided a lot of information.
>>
>> I would suggest trying the following to see what happens:
>>
>> ${PATH_TO_OMPI}/mpirun -mca mtl psm2 -mca btl ^openib -mca
>> mtl_base_verbose 99 -mca btl_base_verbose 99 -n ${N} -H ${HOSTS}
>> my_application
>>
>> This should give you detailed information on what transports were
>> selected and what happened next.
>>
>> Oh, and make sure your fabric is up with an opainfo or opareport
>> command, just to make sure.
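
When a job is started with `srun` rather than `mpirun`, the same MCA settings can be supplied as `OMPI_MCA_<name>` environment variables, which is the form used elsewhere in this thread. A minimal sketch of that translation for the parameters suggested above; the helper name `mca_to_env` is hypothetical:

```shell
# Turn "name value" argument pairs (as given after -mca on the mpirun
# command line) into OMPI_MCA_name=value assignments, one per line.
mca_to_env() {
    while [ "$#" -ge 2 ]; do
        printf 'OMPI_MCA_%s=%s\n' "$1" "$2"
        shift 2
    done
}

# The four parameters from the suggested mpirun command.
mca_to_env mtl psm2 btl '^openib' mtl_base_verbose 99 btl_base_verbose 99
```

Each resulting assignment could then be exported in the job script or placed in front of the application through `env`.
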
>>
>>
>>
>> From: users <users-boun...@lists.open-mpi.org> On Behalf Of Pavel
>> Mezentsev via users
>> Sent: Monday, May 10, 2021 8:41 AM
>> To: users@lists.open-mpi.org
>> Cc: Pavel Mezentsev <pavel.mezent...@gmail.com>
>> Subject: [OMPI users] unable to launch a job on a system with OmniPath
>>
>>
>>
>> Hi!
>>
>> I'm working on a system with KNL and OmniPath and I'm trying to launch a
>> job but it fails. Could someone please advise what parameters I need to add
>> to make it work properly? At first I need to make it work within one node,
>> however later I need to use multiple nodes and eventually I may need to
>> switch to TCP to run a hybrid job where some nodes are connected via
>> Infiniband and some nodes are connected via OmniPath.
>>
>>
>>
>> So far without any extra parameters I get:
>> ```
>> By default, for Open MPI 4.0 and later, infiniband ports on a device
>> are not used by default.  The intent is to use UCX for these devices.
>> You can override this policy by setting the btl_openib_allow_ib MCA
>> parameter
>> to true.
>>
>>   Local host:              XXXXXX
>>   Local adapter:           hfi1_0
>>   Local port:              1
>> ```
>>
>> If I add `OMPI_MCA_btl_openib_allow_ib="true"` then I get:
>> ```
>> Error obtaining unique transport key from ORTE
>> (orte_precondition_transports not present in
>> the environment).
>>
>>   Local host: XXXXXX
>>
>> ```
>> Then I tried adding OMPI_MCA_mtl="psm2" or OMPI_MCA_mtl="ofi" to make it
>> use omnipath or OMPI_MCA_btl="sm,self" to make it use only shared memory.
>> But these parameters did not make any difference.
>> There does not seem to be much omni-path related documentation, at least I
>> was not able to find anything that would help me but perhaps I missed
>> something:
>> https://www.open-mpi.org/faq/?category=running#opa-support
>> https://www.open-mpi.org/faq/?category=opa
>>
>>
>>
>> This is the `configure` line:
>>
>> ```
>> ./configure --prefix=XXXXX --build=x86_64-pc-linux-gnu
>>  --host=x86_64-pc-linux-gnu --enable-shared --with-hwloc=$EBROOTHWLOC
>> --with-psm2 --with-libevent=$EBROOTLIBEVENT --without-orte --disable-oshmem
>> --with-cuda=$EBROOTCUDA --with-gpfs --with-slurm --with-pmix=external
>> --with-libevent=external --with-ompi-pmix-rte
>>
>> ```
>>
>> Which also raises another question: if it was built with `--without-orte`,
>> then why do I get an error about failing to get something from ORTE?
>>
>> The OpenMPI version is `4.1.0rc1` built with `gcc-9.3.0`.
>>
>>
>>
>> Thank you in advance!
>>
>> Regards, Pavel Mezentsev.
