Do you have the same HCA adapter type on all of your machines?
In the error log I see an mlx4 error message, and mlx4 is the ConnectX driver,
but ibv_devinfo shows an older HCA (mthca0).
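
One way to check this across a cluster is to collect the ibv_devinfo output from every node and group the nodes by the driver prefix of hca_id. The following is only a minimal sketch: it assumes the per-node output has already been gathered into a dict keyed by hostname (how it is gathered is not shown), and the helper names parse_devinfo, driver_family and group_nodes_by_family are hypothetical, not part of any OFED or Open MPI tool.

```python
import re
from collections import defaultdict

FIELDS = r"\s*(hca_id|fw_ver|vendor_part_id|board_id):\s*(\S+)"

def parse_devinfo(text):
    """Return one dict per HCA found in ibv_devinfo text output."""
    devices = []
    for line in text.splitlines():
        m = re.match(FIELDS, line)
        if not m:
            continue
        key, value = m.groups()
        if key == "hca_id":
            # Each hca_id line starts a new device record.
            devices.append({})
        if devices:
            devices[-1][key] = value
    return devices

def driver_family(hca_id):
    """Strip the device index: 'mthca0' -> 'mthca', 'mlx4_0' -> 'mlx4'."""
    return re.sub(r"_?\d+$", "", hca_id)

def group_nodes_by_family(outputs):
    """Map driver family -> set of node names reporting such an HCA.

    `outputs` is assumed to be {hostname: ibv_devinfo text}.
    """
    groups = defaultdict(set)
    for node, text in outputs.items():
        for dev in parse_devinfo(text):
            groups[driver_family(dev["hca_id"])].add(node)
    return dict(groups)
```

If group_nodes_by_family returns more than one key (for example both mthca and mlx4), the nodes do not all have the same adapter type, and the nodes listed under mlx4 are the ones to look at for the mlx4 error.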
Pasha
Jeff Layton wrote:
Pasha,
Here you go... :) Thanks for looking at this.
Jeff
hca_id: mthca0
fw_ver: 4.8.200
node_guid: 0003:ba00:0100:38ac
sys_image_guid: 0003:ba00:0100:38af
vendor_id: 0x02c9
vendor_part_id: 25208
hw_ver: 0xA0
board_id: MT_00B0010001
phys_port_cnt: 2
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffff000
max_qp: 64512
max_qp_wr: 65535
device_cap_flags: 0x00001c76
max_sge: 59
max_sge_rd: 0
max_cq: 65408
max_cqe: 131071
max_mr: 131056
max_pd: 32768
max_qp_rd_atom: 4
max_ee_rd_atom: 0
max_res_rd_atom: 258048
max_qp_init_rd_atom: 128
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_HCA (1)
max_ee: 0
max_rdd: 0
max_mw: 0
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 0
max_mcast_grp: 8192
max_mcast_qp_attach: 56
max_total_mcast_qp_attach: 458752
max_ah: 0
max_fmr: 0
max_srq: 960
max_srq_wr: 65535
max_srq_sge: 31
max_pkeys: 64
local_ca_ack_delay: 15
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 41
port_lid: 41
port_lmc: 0x00
max_msg_sz: 0x80000000
port_cap_flags: 0x02510a6a
max_vl_num: 8 (4)
bad_pkey_cntr: 0x0
qkey_viol_cntr: 0x0
sm_sl: 0
pkey_tbl_len: 64
gid_tbl_len: 32
subnet_timeout: 18
init_type_reply: 0
active_width: 4X (2)
active_speed: 2.5 Gbps (1)
phys_state: LINK_UP (5)
GID[ 0]: fe80:0000:0000:0000:0003:ba00:0100:38ad
port: 2
state: PORT_DOWN (1)
max_mtu: 2048 (4)
active_mtu: 512 (2)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
max_msg_sz: 0x80000000
port_cap_flags: 0x02510a68
max_vl_num: 8 (4)
bad_pkey_cntr: 0x0
qkey_viol_cntr: 0x0
sm_sl: 0
pkey_tbl_len: 64
gid_tbl_len: 32
subnet_timeout: 0
init_type_reply: 0
active_width: 4X (2)
active_speed: 2.5 Gbps (1)
phys_state: POLLING (2)
GID[ 0]: fe80:0000:0000:0000:0003:ba00:0100:38ae
Jeff,
Can you please provide more information about your HCA type
(ibv_devinfo -v)?
Do you see this error immediately at startup, or do you get it during
your run?
Thanks,
Pasha
Jeff Layton wrote:
Evening everyone,
I'm running a CFD code on IB and I've encountered an error I'm not
sure about and I'm looking for some guidance on where to start
looking. Here's the error:
mlx4: local QP operation err (QPN 260092, WQE index 9a9e0000, vendor
syndrome 6f, opcode = 5e)
[0,1,6][btl_openib_component.c:1392:btl_openib_component_progress]
from compute-2-0.local to: compute-2-0.local error
polling HP CQ with status LOCAL QP OPERATION ERROR status number 2
for wr_id 37742320 opcode 0
mpirun noticed that job rank 0 with PID 21220 on node
compute-2-0.local exited on signal 15 (Terminated).
78 additional processes aborted (not shown)
This is openmpi-1.2.9rc2 (sorry - need to upgrade to 1.3.0). The
code works correctly for smaller cases, but when I run larger cases
I get this error.
I'm heading to bed but I'll check email tomorrow (so to sleep and
run but it's been a long day).
TIA!
Jeff
------------------------------------------------------------------------
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users