Both Open MPI 1.4.1 and 1.4.2 exhibit the same behavior with OFED 1.5. After some more digging through our update logs, it turns out the upgrade wasn't to OFED 1.4 after all.
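(For anyone retracing this on their own systems: rather than digging through update logs, the installed release can be confirmed with the ofed_info utility that ships with OFED. Something like the following; the exact output format may differ between releases:

$ ofed_info -s
OFED-1.5:
)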
All of the ibv_*_pingpong tests appear to work correctly. I'll try running a few more tests (an np=2 run over two nodes, some of the OSU benchmarks, etc.).

ibv_devinfo shows the following on all of my systems (the GUID values differ from node to node, obviously). Also, it's normal for port 2 to be down, as it isn't connected yet:

# ibv_devinfo -v
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.6.000
        node_guid:                      0023:7dff:ff94:9078
        sys_image_guid:                 0023:7dff:ff94:907b
        vendor_id:                      0x02c9
        vendor_part_id:                 25418
        hw_ver:                         0xA0
        board_id:                       HP_09D0000001
        phys_port_cnt:                  2
        max_mr_size:                    0xffffffffffffffff
        page_size_cap:                  0xfffffe00
        max_qp:                         261568
        max_qp_wr:                      16351
        device_cap_flags:               0x006c9c76
        max_sge:                        32
        max_sge_rd:                     0
        max_cq:                         65408
        max_cqe:                        4194303
        max_mr:                         524272
        max_pd:                         32764
        max_qp_rd_atom:                 16
        max_ee_rd_atom:                 0
        max_res_rd_atom:                4185088
        max_qp_init_rd_atom:            128
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_HCA (1)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         0
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                2
        max_mcast_grp:                  8192
        max_mcast_qp_attach:            56
        max_total_mcast_qp_attach:      458752
        max_ah:                         0
        max_fmr:                        0
        max_srq:                        65472
        max_srq_wr:                     16383
        max_srq_sge:                    31
        max_pkeys:                      128
        local_ca_ack_delay:             15
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               192
                        port_lmc:               0x00
                        max_msg_sz:             0x40000000
                        port_cap_flags:         0x02510868
                        max_vl_num:             8 (4)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           128
                        gid_tbl_len:            128
                        subnet_timeout:         17
                        init_type_reply:        0
                        active_width:           4X (2)
                        active_speed:           5.0 Gbps (2)
                        phys_state:             LINK_UP (5)
                        GID[  0]:               fe80:0000:0000:0000:0023:7dff:ff94:9079
                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        max_msg_sz:             0x40000000
                        port_cap_flags:         0x02510868
                        max_vl_num:             8 (4)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           128
                        gid_tbl_len:            128
                        subnet_timeout:         0
                        init_type_reply:        0
                        active_width:           4X (2)
                        active_speed:           2.5 Gbps (1)
                        phys_state:             POLLING (2)
                        GID[  0]:               fe80:0000:0000:0000:0023:7dff:ff94:907a

On Tue, 2010-07-27 at 06:14 -0400, Terry Dontje wrote:
> A clarification from your previous email: you had your code working
> with OMPI 1.4.1 but an older version of OFED? Then you upgraded to
> OFED 1.4 and things stopped working? It sounds like your current
> system is set up with OMPI 1.4.2 and OFED 1.5. Anyway, I am a little
> confused as to when things might have actually broken.
>
> My first guess would be that something is wrong with the OFED setup.
> Have you checked the status of your IB devices with ibv_devinfo? Have
> you run any of the OFED RC tests, like ibv_rc_pingpong?
>
> If the above seems OK, have you tried running a simpler OMPI test
> like connectivity? I would see if a simple np=2 run spanning two
> nodes fails.
>
> What OS distribution and version are you running?
>
> --td
>
> Brian Smith wrote:
> > In case my previous e-mail is too vague for anyone to address,
> > here's a backtrace from my application. This version, compiled with
> > Intel 11.1.064 (Open MPI 1.4.2 built with gcc 4.4.2), hangs during
> > MPI_Alltoall instead. Running on 16 CPUs: Opteron 2427, Mellanox
> > Technologies MT25418, OFED 1.5.
> >
> > strace on all ranks repeatedly shows:
> >
> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6,
> > events=POLLIN}, {fd=7, events=POLLIN}, {fd=10, events=POLLIN},
> > {fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 7, 0) = 0 (Timeout)
> > ...
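[Side note for anyone trying to reproduce this without VASP or Charmm: judging from frames #4-#7 in the backtraces below, the tuned alltoall is doing a pairwise exchange where each step posts a non-blocking recv and send and then waits on both requests. The sketch below is my approximation of that pattern from the backtrace, not code from Open MPI or from our application; the ring schedule, tag, and buffer handling are made up for illustration, and COUNT just matches the scount/rcount in the traces.]

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT 28000  /* message size taken from scount/rcount in the traces */

int main(int argc, char **argv)
{
    int rank, size, step;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sbuf = calloc((size_t)size * COUNT, sizeof(double));
    double *rbuf = calloc((size_t)size * COUNT, sizeof(double));

    /* Pairwise exchange: at step k, receive from rank-k and send to
     * rank+k (mod np). Each step posts a non-blocking recv and send,
     * then waits on both requests; the wait is where our ranks sit
     * forever in ompi_request_default_wait_all. */
    for (step = 1; step < size; step++) {
        int dst = (rank + step) % size;
        int src = (rank - step + size) % size;
        MPI_Request req[2];

        MPI_Irecv(rbuf + (size_t)src * COUNT, COUNT, MPI_DOUBLE,
                  src, 13, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sbuf + (size_t)dst * COUNT, COUNT, MPI_DOUBLE,
                  dst, 13, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }

    if (rank == 0)
        printf("pairwise exchange completed\n");

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}

[An np=16 run of something like this over two nodes should exercise the same openib code path as the hung collective.]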
> >
> > gdb --pid=<pid>
> > (gdb) bt
> > #0  sm_fifo_read () at btl_sm.h:267
> > #1  mca_btl_sm_component_progress () at btl_sm_component.c:391
> > #2  0x00002b00085116ea in opal_progress () at runtime/opal_progress.c:207
> > #3  0x00002b0007def215 in opal_condition_wait (count=2,
> >     requests=0x7fffd27802a0, statuses=0x7fffd2780270)
> >     at ../opal/threads/condition.h:99
> > #4  ompi_request_default_wait_all (count=2, requests=0x7fffd27802a0,
> >     statuses=0x7fffd2780270) at request/req_wait.c:262
> > #5  0x00002b0007e805b7 in ompi_coll_tuned_sendrecv_actual
> >     (sendbuf=0x2aaac2c4c210, scount=28000, sdatatype=0x2b0008198ea0,
> >     dest=6, stag=-13, recvbuf=<value optimized out>, rcount=28000,
> >     rdatatype=0x2b0008198ea0, source=10, rtag=-13, comm=0x16ad7420,
> >     status=0x0) at coll_tuned_util.c:55
> > #6  0x00002b0007e8705f in ompi_coll_tuned_sendrecv (sbuf=0x2aaac2b04010,
> >     scount=28000, sdtype=0x2b0008198ea0, rbuf=0x2aaac99a2010,
> >     rcount=28000, rdtype=0x2b0008198ea0, comm=0x16ad7420,
> >     module=0x16ad8450) at coll_tuned_util.h:60
> > #7  ompi_coll_tuned_alltoall_intra_pairwise (sbuf=0x2aaac2b04010,
> >     scount=28000, sdtype=0x2b0008198ea0, rbuf=0x2aaac99a2010,
> >     rcount=28000, rdtype=0x2b0008198ea0, comm=0x16ad7420,
> >     module=0x16ad8450) at coll_tuned_alltoall.c:70
> > #8  0x00002b0007e0a71f in PMPI_Alltoall (sendbuf=0x2aaac2b04010,
> >     sendcount=28000, sendtype=0x2b0008198ea0, recvbuf=0x2aaac99a2010,
> >     recvcount=28000, recvtype=0x2b0008198ea0, comm=0x16ad7420)
> >     at palltoall.c:84
> > #9  0x00002b0007b8bc86 in mpi_alltoall_f (sendbuf=0x2aaac2b04010 "",
> >     sendcount=0x7fffd27806a0, sendtype=<value optimized out>,
> >     recvbuf=0x2aaac99a2010 "...", recvcount=0x7fffd27806a4,
> >     recvtype=0xb67490, comm=0x12d9ba0, ierr=0x7fffd27806a8)
> >     at palltoall_f.c:76
> > #10 0x00000000004634cc in m_sumf_d_ ()
> > #11 0x0000000000463072 in m_sum_z_ ()
> > #12 0x00000000004c8a8b in mrg_grid_rc_ ()
> > #13 0x00000000004ffc5e in rhosym_ ()
> > #14 0x0000000000610dc6 in us_mp_set_charge_ ()
> > #15 0x0000000000771c43 in elmin_ ()
> > #16 0x0000000000453853 in MAIN__ ()
> > #17 0x000000000042f15c in main ()
> >
> > On other processes:
> >
> > (gdb) bt
> > #0  0x0000003692a0b725 in pthread_spin_lock () from /lib64/libpthread.so.0
> > #1  0x00002aaaaacdfa7b in ibv_cmd_create_qp () from /usr/lib64/libmlx4-rdmav2.so
> > #2  0x00002b9dc1db3ff8 in progress_one_device ()
> >     at /usr/include/infiniband/verbs.h:884
> > #3  btl_openib_component_progress () at btl_openib_component.c:3451
> > #4  0x00002b9dc24736ea in opal_progress () at runtime/opal_progress.c:207
> > #5  0x00002b9dc1d51215 in opal_condition_wait (count=2,
> >     requests=0x7fffece3cc20, statuses=0x7fffece3cbf0)
> >     at ../opal/threads/condition.h:99
> > #6  ompi_request_default_wait_all (count=2, requests=0x7fffece3cc20,
> >     statuses=0x7fffece3cbf0) at request/req_wait.c:262
> > #7  0x00002b9dc1de25b7 in ompi_coll_tuned_sendrecv_actual
> >     (sendbuf=0x2aaac2c4c210, scount=28000, sdatatype=0x2b9dc20faea0,
> >     dest=6, stag=-13, recvbuf=<value optimized out>, rcount=28000,
> >     rdatatype=0x2b9dc20faea0, source=10, rtag=-13, comm=0x1745b420,
> >     status=0x0) at coll_tuned_util.c:55
> > #8  0x00002b9dc1de905f in ompi_coll_tuned_sendrecv (sbuf=0x2aaac2b04010,
> >     scount=28000, sdtype=0x2b9dc20faea0, rbuf=0x2aaac99a2010,
> >     rcount=28000, rdtype=0x2b9dc20faea0, comm=0x1745b420,
> >     module=0x1745c450) at coll_tuned_util.h:60
> > #9  ompi_coll_tuned_alltoall_intra_pairwise (sbuf=0x2aaac2b04010,
> >     scount=28000, sdtype=0x2b9dc20faea0, rbuf=0x2aaac99a2010,
> >     rcount=28000, rdtype=0x2b9dc20faea0, comm=0x1745b420,
> >     module=0x1745c450) at coll_tuned_alltoall.c:70
> > #10 0x00002b9dc1d6c71f in PMPI_Alltoall (sendbuf=0x2aaac2b04010,
> >     sendcount=28000, sendtype=0x2b9dc20faea0, recvbuf=0x2aaac99a2010,
> >     recvcount=28000, recvtype=0x2b9dc20faea0, comm=0x1745b420)
> >     at palltoall.c:84
> > #11 0x00002b9dc1aedc86 in mpi_alltoall_f (sendbuf=0x2aaac2b04010 "",
> >     sendcount=0x7fffece3d020, sendtype=<value optimized out>,
> >     recvbuf=0x2aaac99a2010 "...", recvcount=0x7fffece3d024,
> >     recvtype=0xb67490, comm=0x12d9ba0, ierr=0x7fffece3d028)
> >     at palltoall_f.c:76
> > #12 0x00000000004634cc in m_sumf_d_ ()
> > #13 0x0000000000463072 in m_sum_z_ ()
> > #14 0x00000000004c8a8b in mrg_grid_rc_ ()
> > #15 0x00000000004ffc5e in rhosym_ ()
> > #16 0x0000000000610dc6 in us_mp_set_charge_ ()
> > #17 0x0000000000771c43 in elmin_ ()
> > #18 0x0000000000453853 in MAIN__ ()
> > #19 0x000000000042f15c in main ()
> >
> > I set up padb to collect a full report on the processes and have
> > attached it to this message. Let me know if I can provide anything
> > further.
> >
> > Thanks,
> > -Brian
> >
> > On Wed, 2010-07-21 at 10:07 -0400, Brian Smith wrote:
> > > Hi, All,
> > >
> > > A couple of applications that I'm using -- VASP and CHARMM -- end
> > > up "stuck" (for lack of a better word) in a waitall call after
> > > some non-blocking send/recv activity. This only happens when using
> > > the openib BTL. I've followed a couple of bugs where this seemed
> > > to happen in some previous revisions and tried the work-arounds
> > > provided, but to no avail. I'm going to try running against a
> > > previous version to see if it's a regression of some sort, but
> > > this problem didn't seem to exist in 1.4.1 until our systems were
> > > updated to OFED >= 1.4. Any suggestions besides the obvious "well,
> > > downgrade from >= 1.4"? What additional info can I provide to
> > > help?
> > >
> > > Thanks,
> > > -Brian
> >
> > ____________________________________________________________________
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Oracle
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.650.633.7054
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com
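P.S. Two isolation runs I'm planning to try next, in case it saves anyone a round trip (the command lines are sketches; "./app" stands in for our actual job, and I haven't confirmed yet that either one helps):

# Take the openib BTL out of the picture entirely and run over sm/TCP:
mpirun --mca btl self,sm,tcp -np 16 ./app

# Keep openib, but skip the tuned collectives in favor of the basic ones:
mpirun --mca coll ^tuned -np 16 ./app

If the hang disappears with the first but not the second, that points at openib itself; if the second alone clears it, the tuned pairwise alltoall is the more likely suspect.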