You might want to try two things:

1. Upgrade to Open MPI v4.0.1.
2. Use the UCX PML instead of the openib BTL.

You may need to download/install UCX first.
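If UCX isn't already installed on your nodes, a build from source typically looks something like the following (the version number and install prefix here are just placeholder examples; pick whatever current UCX release suits you):

```shell
# Download and build a UCX release from source.
# v1.5.1 and the install prefix are example values -- adjust to taste.
wget https://github.com/openucx/ucx/releases/download/v1.5.1/ucx-1.5.1.tar.gz
tar xzf ucx-1.5.1.tar.gz
cd ucx-1.5.1
./configure --prefix=/path/to/ucx-install
make -j8
make install
```

If you install UCX to a non-default prefix like this, point Open MPI's configure at it with --with-ucx=/path/to/ucx-install instead of the bare --with-ucx shown below.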

Then configure Open MPI:

./configure --with-ucx --without-verbs --enable-mca-no-build=btl-uct ...

This will build the UCX PML, which should then get used by default when you 
mpirun.
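If you want to be sure (rather than relying on the default selection), you can force the UCX PML on the command line; mpirun will abort with an error if the UCX PML is not available. Reusing the hostfile/executable names from your mail:

```shell
# Explicitly request the UCX PML instead of relying on auto-selection.
# This fails loudly if the UCX PML could not be built or loaded.
mpirun --mca pml ucx --hostfile hostfile ./executable_name
```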

Note that the "--enable-mca-no-build..." option is needed because it looks 
like we have a plugin (the BTL UCT plugin, to be specific) in the v4.0.1 
release that does not compile successfully with the latest version of UCX.  
This will be fixed in a subsequent Open MPI v4.0.x release.


> On May 9, 2019, at 10:17 AM, Koutsoukos Dimitrios via users 
> <users@lists.open-mpi.org> wrote:
> 
> Hi all,
> 
> I am trying to run MPI in distributed mode. The cluster setup is an 
> 8-machine cluster with Debian 8 (Jessie), Intel Xeon E5-2609 2.40 GHz and 
> Mellanox QDR InfiniBand HCAs. My Open MPI version is 3.0.4. I can 
> successfully run a simple command on all nodes that doesn’t use the 
> InfiniBand, but when I am running my experiments I am receiving the 
> following error from one of the nodes:
> -------------------------------------------------------------------------
> Failed to modify the attributes of a queue pair (QP):
> 
> Hostname: euler04
> Mask for QP attributes to be modified: 65537
> Error:    Invalid argument
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Open MPI has detected that there are UD-capable Verbs devices on your
> system, but none of them were able to be setup properly.  This may
> indicate a problem on this system.
> 
> Your job will continue, but Open MPI will ignore the "ud" oob component
> in this run.
> 
> Hostname: euler04
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Failed to modify the attributes of a queue pair (QP):
> 
> Hostname: euler04
> Mask for QP attributes to be modified: 65537
> Error:    Invalid argument
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Open MPI has detected that there are UD-capable Verbs devices on your
> system, but none of them were able to be setup properly.  This may
> indicate a problem on this system.
> 
> Your job will continue, but Open MPI will ignore the "ud" oob component
> in this run.
> 
> Hostname: euler04
> --------------------------------------------------------------------------
> [euler04][[29717,1],29][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
>  error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],25][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
>  error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],24][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
>  error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],31][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
>  error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],30][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
>  error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],27][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
>  error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],26][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
>  error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],28][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
>  error modifing QP to RTS errno says Invalid argument; errno=22
> 
> Note that I am compiling MPI from source on a shared NFS using the commands:
> ./configure prefix=/path/to/NFS/
> make
> make install 
> And also that my cluster configuration in all of the nodes is the same. I am 
> running my job using /path/to/NFS/mpirun --hostfile hostfile 
> ./executable_name. I am not receiving any error when I am excluding this 
> host. Is this a hardware error? Should I try a different MPI version? Any 
> help would be appreciated.
> 
> Thanks very much in advance for your help,
> Dimitris
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users


-- 
Jeff Squyres
jsquy...@cisco.com
