You might want to try two things:

1. Upgrade to Open MPI v4.0.1.
2. Use the UCX PML instead of the openib BTL.
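A rough sketch of what that build sequence might look like (the UCX version, download URL, and install prefix below are assumptions — adjust them for your system):

```shell
# Assumed UCX version and paths -- substitute whatever you actually use.
# 1. Build and install UCX into a local prefix.
wget https://github.com/openucx/ucx/releases/download/v1.5.1/ucx-1.5.1.tar.gz
tar xzf ucx-1.5.1.tar.gz
cd ucx-1.5.1
./configure --prefix=$HOME/ucx-install
make -j8 install

# 2. Build Open MPI v4.0.1 against that UCX install
#    (prefix on the shared NFS, as in your current setup).
cd ../openmpi-4.0.1
./configure --prefix=/path/to/NFS/ \
    --with-ucx=$HOME/ucx-install \
    --without-verbs \
    --enable-mca-no-build=btl-uct
make -j8 all install
```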
You may need to download/install UCX first. Then configure Open MPI:

./configure --with-ucx --without-verbs --enable-mca-no-build=btl-uct ...

This will build the UCX PML, which should then be used by default when you mpirun. Note that the "--enable-mca-no-build=btl-uct" option is needed because it looks like we have a plugin (the BTL UCT plugin, to be specific) in the v4.0.1 release that does not compile successfully with the latest version of UCX. This will be fixed in a subsequent Open MPI v4.0.x release.

> On May 9, 2019, at 10:17 AM, Koutsoukos Dimitrios via users
> <users@lists.open-mpi.org> wrote:
>
> Hi all,
>
> I am trying to run MPI in distributed mode. The cluster setup is an
> 8-machine cluster with Debian 8 (Jessie), Intel Xeon E5-2609 2.40 GHz, and
> Mellanox QDR HCA InfiniBand. My MPI version is 3.0.4. I can successfully
> run a simple command on all nodes that doesn’t use the InfiniBand, but
> when I run my experiments I receive the following error from one of the
> nodes:
>
> --------------------------------------------------------------------------
> Failed to modify the attributes of a queue pair (QP):
>
> Hostname: euler04
> Mask for QP attributes to be modified: 65537
> Error: Invalid argument
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Open MPI has detected that there are UD-capable Verbs devices on your
> system, but none of them were able to be setup properly. This may
> indicate a problem on this system.
>
> Your job will continue, but Open MPI will ignore the "ud" oob component
> in this run.
>
> Hostname: euler04
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Failed to modify the attributes of a queue pair (QP):
>
> Hostname: euler04
> Mask for QP attributes to be modified: 65537
> Error: Invalid argument
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Open MPI has detected that there are UD-capable Verbs devices on your
> system, but none of them were able to be setup properly. This may
> indicate a problem on this system.
>
> Your job will continue, but Open MPI will ignore the "ud" oob component
> in this run.
>
> Hostname: euler04
> --------------------------------------------------------------------------
> [euler04][[29717,1],29][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
> error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],25][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
> error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],24][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
> error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],31][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
> error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],30][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
> error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],27][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
> error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],26][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
> error modifing QP to RTS errno says Invalid argument; errno=22
> [euler04][[29717,1],28][connect/btl_openib_connect_udcm.c:972:udcm_module_create_listen_qp]
> error modifing QP to RTS errno says Invalid argument; errno=22
>
> Note that I am compiling MPI from source on a shared NFS using the
> commands:
>
> ./configure --prefix=/path/to/NFS/
> make
> make install
>
> Also, my cluster configuration is the same on all of the nodes. I am
> running my job using /path/to/NFS/mpirun --hostfile hostfile
> ./executable_name. I do not receive any error when I exclude this host.
> Is this a hardware error? Should I try a different MPI version? Any help
> would be appreciated.
>
> Thanks very much in advance for your help,
> Dimitris
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

-- 
Jeff Squyres
jsquy...@cisco.com