Jeff,

On 01-Oct-11 1:01 AM, Konz, Jeffrey (SSA Solution Centers) wrote:
> Encountered a problem when trying to run OpenMPI 1.5.4 with RoCE over a 10GbE fabric.
>
> Got this run time error:
>
> An invalid CPC name was specified via the btl_openib_cpc_include MCA parameter.
>
>   Local host: atl3-14
>   btl_openib_cpc_include value: rdmacm
>   Invalid name: rdmacm
>   All possible valid names: oob,xoob
> --------------------------------------------------------------------------
> [atl3-14:07184] mca: base: components_open: component btl / openib open function failed
> [atl3-12:09178] mca: base: components_open: component btl / openib open function failed
>
> Used these options to mpirun:
> "--mca btl openib,self,sm --mca btl_openib_cpc_include rdmacm -mca btl_openib_if_include mlx4_0:2"
>
> We have a Mellanox LOM with two ports; the first is an IB port, the second is a 10GbE port.
> Running over the IB port and TCP over the 10GbE port both work fine.
>
> Built OpenMPI with the option "--enable-openib-rdmacm".
> Our system has OFED 1.5.2 with librdmacm-1.0.13-1.
>
> I noticed this output from the configure script:
> checking rdma/rdma_cma.h usability... no
> checking rdma/rdma_cma.h presence... no
> checking for rdma/rdma_cma.h... no
> checking whether IBV_LINK_LAYER_ETHERNET is declared... yes
> checking if RDMAoE support is enabled... yes
> checking for infiniband/driver.h... yes
> checking if ConnectX XRC support is enabled... yes
> checking if dynamic SL is enabled... no
> checking if OpenFabrics RDMACM support is enabled... no
>
> Are we missing a build option or a piece of software?
> Config.log and output from "ompi_info --all" are attached.
You shouldn't use the "--enable-openib-rdmacm" option - rdmacm support is enabled by default, provided that librdmacm is found on the machine.

So the question is why the OMPI configure script didn't find it. OMPI looks for the "rdma/rdma_cma.h" header. Do you have it on your build machine? The usual location of this file is /usr/include/rdma/rdma_cma.h.

Another possible reason: it appears that OMPI's configure test includes "rdma/rdma_cma.h" rather than <rdma/rdma_cma.h>. Please apply the following tiny fix to the OMPI source:

Index: ompi/config/ompi_check_openib.m4
===================================================================
--- ompi/config/ompi_check_openib.m4    (revision 25228)
+++ ompi/config/ompi_check_openib.m4    (working copy)
@@ -207,7 +207,7 @@
            [AC_CHECK_LIB([rdmacm], [rdma_create_id],
                [AC_MSG_CHECKING([for rdma_get_peer_addr])
                 $1_msg=no
-                AC_LINK_IFELSE([AC_LANG_PROGRAM([[#include "rdma/rdma_cma.h"
+                AC_LINK_IFELSE([AC_LANG_PROGRAM([[#include <rdma/rdma_cma.h>
                     ]], [[void *ret = (void*) rdma_get_peer_addr((struct rdma_cm_id*)0);]])],
                     [$1_have_rdmacm=1
                      $1_msg=yes])

Then run autogen.sh & configure and check whether rdmacm is found (a quick verification recipe is sketched at the end of this message).

-- YK

> % ibv_devinfo
> hca_id: mlx4_0
>     transport:          InfiniBand (0)
>     fw_ver:             2.9.1000
>     node_guid:          78e7:d103:0021:4464
>     sys_image_guid:     78e7:d103:0021:4467
>     vendor_id:          0x02c9
>     vendor_part_id:     26438
>     hw_ver:             0xB0
>     board_id:           HP_0200000003
>     phys_port_cnt:      2
>         port: 1
>             state:          PORT_ACTIVE (4)
>             max_mtu:        2048 (4)
>             active_mtu:     2048 (4)
>             sm_lid:         34
>             port_lid:       11
>             port_lmc:       0x00
>             link_layer:     IB
>
>         port: 2
>             state:          PORT_ACTIVE (4)
>             max_mtu:        2048 (4)
>             active_mtu:     1024 (3)
>             sm_lid:         0
>             port_lid:       0
>             port_lmc:       0x00
>             link_layer:     Ethernet
>
> % /sbin/ifconfig
> eth0    Link encap:Ethernet  HWaddr 78:E7:D1:21:44:60
>         inet addr:16.113.180.147  Bcast:16.113.183.255  Mask:255.255.252.0
>         inet6 addr: fe80::7ae7:d1ff:fe21:4460/64 Scope:Link
>         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>         RX packets:1861763 errors:0 dropped:0 overruns:0 frame:0
>         TX packets:1776402 errors:0 dropped:0 overruns:0 carrier:0
>         collisions:0 txqueuelen:1000
>         RX bytes:712448939 (679.4 MiB)  TX bytes:994111004 (948.0 MiB)
>         Memory:fb9e0000-fba00000
>
> eth2    Link encap:Ethernet  HWaddr 78:E7:D1:21:44:65
>         inet addr:10.10.0.147  Bcast:10.10.0.255  Mask:255.255.255.0
>         inet6 addr: fe80::78e7:d100:121:4465/64 Scope:Link
>         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>         RX packets:8519814 errors:0 dropped:0 overruns:0 frame:0
>         TX packets:8555715 errors:0 dropped:0 overruns:0 carrier:0
>         collisions:0 txqueuelen:1000
>         RX bytes:12370127778 (11.5 GiB)  TX bytes:12372246315 (11.5 GiB)
>
> ib0     Link encap:InfiniBand  HWaddr 80:00:00:4D:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>         inet addr:192.168.0.147  Bcast:192.168.0.255  Mask:255.255.255.0
>         inet6 addr: fe80::7ae7:d103:21:4465/64 Scope:Link
>         UP BROADCAST RUNNING MULTICAST  MTU:16384  Metric:1
>         RX packets:1989 errors:0 dropped:0 overruns:0 frame:0
>         TX packets:208 errors:0 dropped:0 overruns:0 carrier:0
>         collisions:0 txqueuelen:256
>         RX bytes:275196 (268.7 KiB)  TX bytes:19202 (18.7 KiB)
>
> lo      Link encap:Local Loopback
>         inet addr:127.0.0.1  Mask:255.0.0.0
>         inet6 addr: ::1/128 Scope:Host
>         UP LOOPBACK RUNNING  MTU:16436  Metric:1
>         RX packets:42224 errors:0 dropped:0 overruns:0 frame:0
>         TX packets:42224 errors:0 dropped:0 overruns:0 carrier:0
>         collisions:0 txqueuelen:0
>         RX bytes:3115668 (2.9 MiB)  TX bytes:3115668 (2.9 MiB)
>
> Thanks,
>
> -Jeff
>
> /**********************************************************/
> /*  Jeff Konz                    jeffrey.k...@hp.com       */
> /*  Solutions Architect          HPC Benchmarking          */
> /*  Americas Shared Solutions Architecture (SSA)           */
> /*  Hewlett-Packard Company                                */
> /*  Office: 248-491-7480         Mobile: 248-345-6857      */
> /**********************************************************/
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
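P.S. A rough sketch of the verification mentioned above - the librdmacm-devel package name, the config.out file name, and the bracketed placeholders are only examples for an RPM-based OFED 1.5.2 system, so adjust them for your setup:

  # 1. Verify that the header OMPI's configure looks for is installed
  ls -l /usr/include/rdma/rdma_cma.h
  rpm -q librdmacm librdmacm-devel

  # 2. After applying the patch, regenerate and rerun configure,
  #    then confirm that RDMACM support is now detected
  ./autogen.sh
  ./configure [your usual options] 2>&1 | tee config.out
  grep -i "RDMACM support" config.out    # expect "... yes"

  # 3. Once rebuilt and installed, the openib BTL should accept rdmacm
  #    as a CPC, so the original mpirun command should no longer fail:
  mpirun --mca btl openib,self,sm \
         --mca btl_openib_cpc_include rdmacm \
         --mca btl_openib_if_include mlx4_0:2 [your app]

If step 1 shows that the header is missing, installing the librdmacm development package from your OFED distribution should resolve the configure failure on its own.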