Hi all,
The configuration might be a bit exotic:
Kernel 4.1.5 vanilla, Mellanox OFED 3.0-2.0.1
ccc174: 1 x dual-port ConnectX-3
mini4:  2 x single-port ConnectX-2
mini2:  8 x single-port ConnectX-2
MIS20025
Here is what works and what does not:
Using the oob connection manager in 1.7.3:
everything works, but latencies are much worse than with 1.8.8.
Using udcm in 1.8.8:
everything works as long as I exclude mlx4_0:2 by setting:
--mca btl_openib_if_include 'mlx4_0:1,mlx4_1:1,mlx4_2:1,mlx4_3:1,mlx4_4:1,mlx4_5:1,mlx4_6:1,mlx4_7:1'
If I include mlx4_0:2, I get:
[mini4][[62272,1],4][connect/btl_openib_connect_udcm.c:1907:udcm_process_messages]
could not initialize cpc data for endpoint
libibverbs: ibv_create_ah failed to query port.
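For reference, the include list used in the udcm run above does not have to be typed by hand. The following is only a minimal sketch: it assumes the devices are named mlx4_0 through mlx4_7 (as on mini2) and that only port 1 of each should be used. The helper name build_if_include is hypothetical and not part of Open MPI.

```shell
#!/bin/sh
# Hypothetical helper: print "mlx4_0:<port>,...,mlx4_<last_dev>:<port>".
# Assumes devices are numbered contiguously starting at mlx4_0.
build_if_include() {
    last_dev=$1
    port=$2
    list=""
    i=0
    while [ "$i" -le "$last_dev" ]; do
        # Add a comma only if the list already has an entry.
        list="${list:+$list,}mlx4_${i}:${port}"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

# Devices 0..7, port 1 only, which leaves out mlx4_0:2.
IF_INCLUDE=$(build_if_include 7 1)
echo "$IF_INCLUDE"
```

The result can then be passed on the command line as --mca btl_openib_if_include "$IF_INCLUDE".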
Using rdmacm in 1.8.8:
this only works between ccc174 and mini4. Running across all three nodes
produces:
mpirun --mca btl_openib_cpc_include rdmacm \
    --mca btl_openib_warn_default_gid_prefix 0 \
    --hostfile ~/hostlist -np 40 ./osu_alltoall
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: mini2
Local device: mlx4_7
Local port: 1
CPCs attempted: rdmacm
--------------------------------------------------------------------------
[ccc174][[61500,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
[btl_openib_proc.c:157] ompi_modex_recv failed for peer [[61500,1],9]
Any help would be much appreciated.
Regards,
Tobias