[OMPI users] unable to launch a job on a system with OmniPath

2021-05-10 Thread Pavel Mezentsev via users
Hi!
I'm working on a system with KNL and OmniPath, and the job I'm trying to
launch fails. Could someone please advise which parameters I need to add to
make it work properly? First I need to make it work within one node; later
I need to use multiple nodes, and eventually I may need to switch to TCP to
run a hybrid job where some nodes are connected via InfiniBand and others
via OmniPath.

So far without any extra parameters I get:
```
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:  XX
  Local adapter:   hfi1_0
  Local port:  1
```

If I add `OMPI_MCA_btl_openib_allow_ib="true"` then I get:
```
Error obtaining unique transport key from ORTE (orte_precondition_transports
not present in the environment).

  Local host: XX

```
Then I tried adding `OMPI_MCA_mtl="psm2"` or `OMPI_MCA_mtl="ofi"` to make it
use OmniPath, or `OMPI_MCA_btl="sm,self"` to make it use only shared memory,
but these parameters did not make any difference.
There does not seem to be much OmniPath-related documentation. At least I
was not able to find anything that would help me, but perhaps I missed
something:
https://www.open-mpi.org/faq/?category=running#opa-support
https://www.open-mpi.org/faq/?category=opa
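
When environment-based MCA settings appear to have no effect, one quick sanity check is to confirm that the variables are actually exported to child processes. The following is a minimal sketch, not part of the original report; the function name `list_ompi_mca_env` is hypothetical, and the two values are the ones mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: print every exported OMPI_MCA_* variable, sorted,
# so it is easy to see what a launched process would inherit.
list_ompi_mca_env() {
    env | grep '^OMPI_MCA_' | sort
}

# Values taken from the message above; exported so child processes see them.
export OMPI_MCA_mtl="psm2"
export OMPI_MCA_btl="sm,self"

list_ompi_mca_env
```

A variable that was assigned without `export` will not show up in this listing and will not reach the MPI processes. Running `ompi_info` and looking for `psm2` among the `mtl` entries shows whether that component was built at all.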

This is the `configure` line:
```
./configure --prefix=X --build=x86_64-pc-linux-gnu \
    --host=x86_64-pc-linux-gnu --enable-shared --with-hwloc=$EBROOTHWLOC \
    --with-psm2 --with-libevent=$EBROOTLIBEVENT --without-orte --disable-oshmem \
    --with-cuda=$EBROOTCUDA --with-gpfs --with-slurm --with-pmix=external \
    --with-libevent=external --with-ompi-pmix-rte
```
This also raises another question: if it was built with `--without-orte`,
why do I get an error about failing to get something from ORTE?
The Open MPI version is `4.1.0rc1`, built with `gcc-9.3.0`.

Thank you in advance!
Regards, Pavel Mezentsev.


Re: [OMPI users] unable to launch a job on a system with OmniPath

2021-05-10 Thread Heinz, Michael William via users
That warning is an annoying bit of cruft from the openib / verbs provider that 
can be ignored. (Actually, I recommend using "-mca btl ^openib" to suppress the 
warning.)

That said, there is a known issue with selecting PSM2 in OMPI 4.1.0. I'm not 
sure that's the problem you're hitting, though, because you haven't provided 
much information.

I would suggest trying the following to see what happens:

${PATH_TO_OMPI}/mpirun -mca mtl psm2 -mca btl ^openib -mca mtl_base_verbose 99 \
    -mca btl_base_verbose 99 -n ${N} -H ${HOSTS} my_application

This should give you detailed information on what transports were selected and 
what happened next.
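
As a small sketch of how the command above could be assembled and checked before running it, the helper below only prints the command line. The function name `print_diag_cmd` and the sample arguments (`/opt/ompi/bin`, `localhost`, `./my_application`) are assumptions for illustration; the MCA options are the ones given above.

```shell
#!/bin/sh
# Hypothetical helper: print the diagnostic mpirun command instead of running it.
# Arguments: <directory containing mpirun> <process count> <host list> <application>
print_diag_cmd() {
    if [ "$#" -ne 4 ]; then
        echo "usage: print_diag_cmd OMPI_BIN N HOSTS APP" >&2
        return 1
    fi
    printf '%s/mpirun -mca mtl psm2 -mca btl ^openib -mca mtl_base_verbose 99 -mca btl_base_verbose 99 -n %s -H %s %s\n' \
        "$1" "$2" "$3" "$4"
}

# Example values only; substitute the real install path, count, and hosts.
print_diag_cmd /opt/ompi/bin 2 localhost ./my_application
```

Once the printed line looks right, it can be pasted into the shell with `2>&1 | tee verbose.log` appended, so the verbose output is kept in a file that can be attached to a follow-up message.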

Oh - and verify that your fabric is up with an opainfo or opareport command 
before testing.
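
Before running either tool, it can help to confirm that one of them is installed on the node. The sketch below is an assumption-laden illustration: the function name `first_available_tool` is hypothetical, and it only checks that a command exists on `PATH`; it does not interpret any fabric state.

```shell
#!/bin/sh
# Hypothetical helper: print the first command from the argument list that
# exists on PATH, or return non-zero if none of them is found.
first_available_tool() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    return 1
}

# Look for the two fabric tools mentioned above.
if tool=$(first_available_tool opainfo opareport); then
    echo "found fabric tool: $tool"
else
    echo "neither opainfo nor opareport is on PATH" >&2
fi
```

If neither command is found, the OmniPath management tools are probably not installed on that node or not on the current `PATH`.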
