Errr...  That's not good.  :-(

Do you have a small example that you can share that duplicates the problem?



On Jun 6, 2008, at 1:51 AM, Matt Hughes wrote:

2008/6/4 Jeff Squyres <jsquy...@cisco.com>:
Would it be possible for you to try a trunk nightly tarball snapshot,
perchance?

I have attempted to use openmpi-1.3a1r18569.  After some pain getting
MPI_Comm_spawn to work (I will write about that in a separate
message), I was able to get my app started.  It segfaults in
btl_openib_handle_incoming() by dereferencing a null pointer:

#0  0x0000000000000000 in ?? ()
#1  0x0000002a98059777 in btl_openib_handle_incoming (openib_btl=0xb8b900,
   ep=0xbecb70, frag=0xc8da80, byte_len=24) at btl_openib_component.c:2129
#2  0x0000002a9805b674 in handle_wc (hca=0xb80670, cq=0, wc=0x7fbfffdfd0)
   at btl_openib_component.c:2397
#3  0x0000002a9805bbef in poll_hca (hca=0xb80670, count=1)
   at btl_openib_component.c:2508
#4  0x0000002a9805c1ac in progress_one_hca (hca=0xb80670)
   at btl_openib_component.c:2616
#5  0x0000002a9805c24f in btl_openib_component_progress ()
   at btl_openib_component.c:2641
#6  0x0000002a97f42308 in mca_bml_r2_progress () at bml_r2.c:93
#7  0x0000002a95a44c2c in opal_progress () at runtime/opal_progress.c:187
#8  0x0000002a97d1f10c in opal_condition_wait (c=0x2a958b8b40, m=0x2a958b8bc0)
   at ../../../../opal/threads/condition.h:100
#9  0x0000002a97d1ef88 in ompi_request_wait_completion (req=0xbdfc80)
   at ../../../../ompi/request/request.h:381
#10 0x0000002a97d1ee64 in mca_pml_ob1_recv (addr=0xc52d14, count=1,
   datatype=0x2a958abe60, src=1, tag=-19, comm=0xbe0cf0, status=0x0)
   at pml_ob1_irecv.c:104
#11 0x0000002a98c1b182 in ompi_coll_tuned_gather_intra_basic_linear (
   sbuf=0x7fbfffe984, scount=1, sdtype=0x2a958abe60, rbuf=0xc52d10,
   rcount=1, rdtype=0x2a958abe60, root=0, comm=0xbe0cf0, module=0xda00e0)
   at coll_tuned_gather.c:408
#12 0x0000002a98c07fc1 in ompi_coll_tuned_gather_intra_dec_fixed (
   sbuf=0x7fbfffe984, scount=1, sdtype=0x2a958abe60, rbuf=0xc52d10,
   rcount=1, rdtype=0x2a958abe60, root=0, comm=0xbe0cf0, module=0xda00e0)
   at coll_tuned_decision_fixed.c:723
#13 0x0000002a95715f0f in PMPI_Gather (sendbuf=0x7fbfffe984, sendcount=1,
   sendtype=0x2a958abe60, recvbuf=0xc52d10, recvcount=1,
   recvtype=0x2a958abe60, root=0, comm=0xbe0cf0) at pgather.c:141

This same build works fine with the TCP component, and with 1.2.6 it at
least doesn't crash.  The only thing that may be unusual about my build
of Open MPI 1.3 is that it is configured with --without-memory-manager
(the memory manager seems to cause crashes in another library I use).
I tried rebuilding without that option, i.e. with the memory manager
enabled, but it failed in the same way.

mch




On May 29, 2008, at 3:50 AM, Matt Hughes wrote:

I have a program which uses MPI::Comm::Spawn to start processes on
compute nodes (c0-0, c0-1, etc.).  The communication between the
compute nodes consists of ISend and IRecv pairs, while the communication
between the head node and the compute nodes consists of gather and bcast
operations.  After executing ~80 successful loops (gather/bcast pairs),
I get this error message from the head node process during a gather call:

[0,1,0][btl_openib_component.c:1332:btl_openib_component_progress]
from headnode.local to: c0-0 error polling HP CQ with status WORK
REQUEST FLUSHED ERROR status number 5 for wr_id 18504944 opcode 1
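
In case it helps to picture the pattern, a minimal sketch would look
something like the code below.  This is not my actual program: the
worker count, loop count, and self-spawning layout are placeholders,
and the ISend/IRecv traffic between the compute-node ranks is left out.

/* Hypothetical sketch of the pattern described above -- not the real
 * application.  One head-node rank spawns a few workers, merges the
 * resulting intercommunicator, and loops over gather/bcast. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, inter, world;
    int rank, size, i, val;
    int *gathered = NULL;
    int nworkers = 4;            /* placeholder worker count */

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Head-node process: spawn the compute-node ranks (re-using this
         * binary just to keep the sketch self-contained). */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, nworkers, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &inter, MPI_ERRCODES_IGNORE);
    } else {
        /* Spawned worker: talk to the head node over the parent intercomm. */
        inter = parent;
    }

    /* Merge so the head node becomes rank 0 of one intracommunicator. */
    MPI_Intercomm_merge(inter, parent != MPI_COMM_NULL, &world);
    MPI_Comm_rank(world, &rank);
    MPI_Comm_size(world, &size);
    if (rank == 0) {
        gathered = malloc(size * sizeof(int));
    }

    /* In the real run the failure shows up after roughly 80 iterations;
     * the ISend/IRecv exchanges between workers are omitted here. */
    for (i = 0; i < 200; i++) {
        val = rank + i;
        MPI_Gather(&val, 1, MPI_INT, gathered, 1, MPI_INT, 0, world);
        MPI_Bcast(&val, 1, MPI_INT, 0, world);
    }

    free(gathered);
    MPI_Comm_free(&world);
    MPI_Finalize();
    return 0;
}

The head-node rank would be launched on its own (e.g. mpiexec -n 1),
with the compute-node ranks coming entirely from the spawn call.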

The relevant environment variables:
OMPI_MCA_btl_openib_rd_num=128
OMPI_MCA_btl_openib_verbose=1
OMPI_MCA_btl_base_verbose=1
OMPI_MCA_btl_openib_rd_low=75
OMPI_MCA_btl_base_debug=1
OMPI_MCA_btl_openib_warn_no_hca_params_found=0
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=self,openib

If rd_low and rd_num are left at their default values, the program
simply hangs in the gather call after about 20 iterations (each
iteration being a gather and a bcast).

Can anyone shed any light on what this error message means or what
might be done about it?

Thanks,
mch


--
Jeff Squyres
Cisco Systems



--
Jeff Squyres
Cisco Systems
