On Feb 28, 2011, at 12:49 PM, Jagga Soorma wrote:

> -bash-3.2$ mpiexec --mca btl openib,self -mca btl_openib_warn_default_gid_prefix 0 \
>     -np 2 --hostfile mpihosts \
>     /home/jagga/osu-micro-benchmarks-3.3/openmpi/ofed-1.5.2/bin/osu_latency

Your use of btl_openib_warn_default_gid_prefix may have brought up a subtle 
issue in Open MPI's verbs support.  More below.

> # OSU MPI Latency Test v3.3
> # Size            Latency (us)
> [amber04][[10252,1],1][connect/btl_openib_connect_oob.c:325:qp_connect_all] 
> error modifing QP to RTR errno says Invalid argument
> [amber04][[10252,1],1][connect/btl_openib_connect_oob.c:815:rml_recv_cb] 
> error in endpoint reply start connect

Looking at this error message and your ibv_devinfo output:

> [root@amber03 ~]# ibv_devinfo 
> hca_id:    mlx4_0
>     transport:            InfiniBand (0)
>     fw_ver:                2.7.9294
>     node_guid:            78e7:d103:0021:8884
>     sys_image_guid:            78e7:d103:0021:8887
>     vendor_id:            0x02c9
>     vendor_part_id:            26438
>     hw_ver:                0xB0
>     board_id:            HP_0200000003
>     phys_port_cnt:            2
>         port:    1
>             state:            PORT_ACTIVE (4)
>             max_mtu:        2048 (4)
>             active_mtu:        2048 (4)
>             sm_lid:            1
>             port_lid:        20
>             port_lmc:        0x00
>             link_layer:        IB
> 
>         port:    2
>             state:            PORT_ACTIVE (4)
>             max_mtu:        2048 (4)
>             active_mtu:        1024 (3)
>             sm_lid:            0
>             port_lid:        0
>             port_lmc:        0x00
>             link_layer:        Ethernet

It looks like one port on your HCA is running IB and the other is running 
Ethernet.

I'm wondering if OMPI is not taking the device transport into account and is 
*only* using the subnet ID to determine reachability (i.e., I'm wondering if we 
didn't anticipate multiple devices/ports with the same subnet ID but with 
different transports).  I pointed this out to Mellanox yesterday; I think 
they're following up on it.
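
To make the suspicion concrete, here is a rough Python sketch of the two 
reachability rules.  It is not Open MPI's actual code; the Port class and both 
function names are made up for illustration.  It also assumes that the 
Ethernet port reports the same default subnet prefix as the IB port, and that 
amber04 has the same port layout as amber03 (your output only shows amber03).

```python
from dataclasses import dataclass

# The InfiniBand default GID (subnet) prefix.
DEFAULT_SUBNET_PREFIX = 0xFE80000000000000


@dataclass(frozen=True)
class Port:
    """Hypothetical description of one HCA port (illustration only)."""
    host: str
    device: str
    port_num: int
    subnet_prefix: int
    link_layer: str  # "IB" or "Ethernet", as printed by ibv_devinfo


def reachable_by_subnet_only(a: Port, b: Port) -> bool:
    # The suspected behavior: only the subnet ID is compared.
    return a.subnet_prefix == b.subnet_prefix


def reachable_by_subnet_and_link_layer(a: Port, b: Port) -> bool:
    # A stricter rule: the link layer has to match as well.
    return a.subnet_prefix == b.subnet_prefix and a.link_layer == b.link_layer


# Assumed values: both ports carry the default prefix.
ib_port = Port("amber03", "mlx4_0", 1, DEFAULT_SUBNET_PREFIX, "IB")
eth_port = Port("amber04", "mlx4_0", 2, DEFAULT_SUBNET_PREFIX, "Ethernet")

if __name__ == "__main__":
    print("subnet only:          ", reachable_by_subnet_only(ib_port, eth_port))
    print("subnet and link layer:",
          reachable_by_subnet_and_link_layer(ib_port, eth_port))
```

Under the first rule the IB port on one node and the Ethernet port on the 
other look mutually reachable, and pairing them up could plausibly lead to a 
QP setup failure like the one above.  Under the second rule they are kept 
apart.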

In the meantime, a workaround might be to set a non-default subnet ID on your 
IB network.  That should allow Open MPI to tell these networks apart without 
additional help.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

