On Feb 28, 2011, at 12:49 PM, Jagga Soorma wrote:

> -bash-3.2$ mpiexec --mca btl openib,self -mca btl_openib_warn_default_gid_prefix 0 -np 2 --hostfile mpihosts /home/jagga/osu-micro-benchmarks-3.3/openmpi/ofed-1.5.2/bin/osu_latency

Your use of btl_openib_warn_default_gid_prefix may have brought up a subtle issue in Open MPI's verbs support. More below.

> # OSU MPI Latency Test v3.3
> # Size            Latency (us)
> [amber04][[10252,1],1][connect/btl_openib_connect_oob.c:325:qp_connect_all] error modifing QP to RTR errno says Invalid argument
> [amber04][[10252,1],1][connect/btl_openib_connect_oob.c:815:rml_recv_cb] error in endpoint reply start connect

Looking at this error message and your ibv_devinfo output:

> [root@amber03 ~]# ibv_devinfo
> hca_id: mlx4_0
>         transport:              InfiniBand (0)
>         fw_ver:                 2.7.9294
>         node_guid:              78e7:d103:0021:8884
>         sys_image_guid:         78e7:d103:0021:8887
>         vendor_id:              0x02c9
>         vendor_part_id:         26438
>         hw_ver:                 0xB0
>         board_id:               HP_0200000003
>         phys_port_cnt:          2
>                 port:   1
>                         state:          PORT_ACTIVE (4)
>                         max_mtu:        2048 (4)
>                         active_mtu:     2048 (4)
>                         sm_lid:         1
>                         port_lid:       20
>                         port_lmc:       0x00
>                         link_layer:     IB
>
>                 port:   2
>                         state:          PORT_ACTIVE (4)
>                         max_mtu:        2048 (4)
>                         active_mtu:     1024 (3)
>                         sm_lid:         0
>                         port_lid:       0
>                         port_lmc:       0x00
>                         link_layer:     Ethernet

It looks like you have one HCA port configured as IB and the other as Ethernet.

I'm wondering if OMPI is not taking the device transport into account and is *only* using the subnet ID to determine reachability (i.e., I'm wondering if we didn't anticipate multiple devices/ports with the same subnet ID but with different transports). I pointed this out to Mellanox yesterday; I think they're following up on it.

In the meantime, a workaround might be to set a non-default subnet ID on your IB network. That should allow Open MPI to tell these networks apart without additional help.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
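
The reachability concern raised above can be illustrated with a small toy model. This is only a sketch of the suspected logic, not Open MPI's actual code: the `Port` class and both `reachable_*` functions are hypothetical names invented for this example, and it assumes that the Ethernet port on amber04 reports the same default subnet ID (0xfe80000000000000) as the IB port on amber03.

```python
from dataclasses import dataclass

# Default InfiniBand GID prefix (subnet ID) used when none is configured.
DEFAULT_SUBNET_ID = 0xFE80000000000000


@dataclass(frozen=True)
class Port:
    """Hypothetical description of one HCA port (not an Open MPI type)."""
    host: str
    port_num: int
    subnet_id: int
    link_layer: str  # "IB" or "Ethernet", as printed by ibv_devinfo


def reachable_by_subnet_only(a: Port, b: Port) -> bool:
    """Suspected behavior: compare subnet IDs and nothing else."""
    return a.subnet_id == b.subnet_id


def reachable_with_link_layer(a: Port, b: Port) -> bool:
    """Stricter check: subnet IDs and link layers must both match."""
    return a.subnet_id == b.subnet_id and a.link_layer == b.link_layer


# Assumed configuration: both hosts have port 1 as IB and port 2 as Ethernet,
# and both ports carry the default subnet ID.
ib_port = Port("amber03", 1, DEFAULT_SUBNET_ID, "IB")
eth_port = Port("amber04", 2, DEFAULT_SUBNET_ID, "Ethernet")

# A subnet-only check wrongly pairs the IB port with the Ethernet port.
print(reachable_by_subnet_only(ib_port, eth_port))   # True
# Taking the link layer into account rejects the pairing.
print(reachable_with_link_layer(ib_port, eth_port))  # False

# Workaround from the message: give the IB fabric a non-default subnet ID
# (the value below is an arbitrary illustration).
ib_port_custom = Port("amber03", 1, 0xFE80000000000001, "IB")
print(reachable_by_subnet_only(ib_port_custom, eth_port))  # False
```

With a non-default subnet ID on the IB fabric, even the subnet-only comparison no longer matches an IB port to an Ethernet port, which is why the workaround should help regardless of whether the stricter check exists.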