Hi all,

The configuration might be a bit exotic:

Kernel 4.1.5 vanilla, Mellanox OFED 3.0-2.0.1

ccc174 1 x dual port ConnectX-3
mini4   2 x single port ConnectX-2
mini2   8 x single port ConnectX-2
MIS20025

Here is what works and what does not:

using oob connection manager in 1.7.3:
everything works, except latencies are really bad compared to 1.8.8

udcm in 1.8.8:
everything works as long as I exclude mlx4_0:2 by setting:
--mca btl_openib_if_include 'mlx4_0:1,mlx4_1:1,mlx4_2:1,mlx4_3:1,mlx4_4:1,mlx4_5:1,mlx4_6:1,mlx4_7:1'
If I include mlx4_0:2, I get:
[mini4][[62272,1],4][connect/btl_openib_connect_udcm.c:1907:udcm_process_messages] could not initialize cpc data for endpoint
libibverbs: ibv_create_ah failed to query port.
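
For reference, the include list used above can be generated instead of typed out. This is only a minimal sketch, assuming a POSIX shell and that the devices are named mlx4_0 through mlx4_7 as on mini2; the variable name include_list is just for illustration:

```shell
# Build "mlx4_0:1,mlx4_1:1,...,mlx4_7:1" (port 1 of each device only).
# Assumption: device names follow the mlx4_<n> pattern seen on mini2.
include_list=""
for i in 0 1 2 3 4 5 6 7; do
    # Prepend a comma only when the list is already non-empty.
    include_list="${include_list:+${include_list},}mlx4_${i}:1"
done
echo "$include_list"
```

The resulting string can then be passed as the value of --mca btl_openib_if_include, which leaves out mlx4_0:2 in the same way as the hand-written list.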

rdmacm in 1.8.8 only works between ccc174 and mini4; running across all three nodes produces:

mpirun --mca btl_openib_cpc_include rdmacm --mca btl_openib_warn_default_gid_prefix 0 --hostfile ~/hostlist -np 40 ./osu_alltoall
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           mini2
  Local device:         mlx4_7
  Local port:           1
  CPCs attempted:       rdmacm
--------------------------------------------------------------------------
[ccc174][[61500,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create] [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[61500,1],9]

Any help would be much appreciated.

Regards,
Tobias
