Dylan -- Sorry for the delay in replying.
On an offhand guess, does the problem go away if you run with: --mca mpi_leave_pinned 0 ?

On Mar 20, 2012, at 3:35 PM, Dylan Nelson wrote:

> Hello,
>
> I've been having trouble for a while now running some OpenMPI+IB jobs on
> multiple tasks. The problems are all "hangs" and are not reproducible - the
> same execution started again will in general proceed just fine where
> previously it got stuck, but then gets stuck somewhere else later. These
> stuck processes are pegged at 100% CPU usage and remain there for days if
> not killed.
>
> The same kind of problem exists in Open MPI 1.2.5, 1.4.2, and 1.5.3 (for the
> code I am running). This is quite possibly some problem in the
> configuration/cluster - I am not claiming that it is a bug in Open MPI - but
> I was hopeful that someone might have a guess as to what is going on.
>
> In the ancient 1.2.5, the problem manifests as follows (I attach gdb to the
> stalled process on one of the child nodes):
>
> --------------------------------------------------------------------
>
> (gdb) bt
> #0  0x00002b8135b3f699 in ibv_cmd_create_qp () from /usr/lib64/libmlx4-rdmav2.so
> #1  0x00002b8135b3faa6 in ibv_cmd_create_qp () from /usr/lib64/libmlx4-rdmav2.so
> #2  0x00002b813407bff1 in btl_openib_component_progress () from /n/sw/openmpi-1.2.5-gcc-4.1.2/lib/openmpi/mca_btl_openib.so
> #3  0x00002b8133e6f04a in mca_bml_r2_progress () from /n/sw/openmpi-1.2.5-gcc-4.1.2/lib/openmpi/mca_bml_r2.so
> #4  0x00002b812f52c9ba in opal_progress () from /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libopen-pal.so.0
> #5  0x00002b812f067b05 in ompi_request_wait_all () from /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libmpi.so.0
> #6  0x0000000000000000 in ?? ()
> (gdb) next
> Single stepping until exit from function ibv_cmd_create_qp, which has no line number information.
> 0x00002b8135b3f358 in pthread_spin_unlock@plt () from /usr/lib64/libmlx4-rdmav2.so
> (gdb) next
> Single stepping until exit from function pthread_spin_unlock@plt, which has no line number information.
> 0x00000038c860b760 in pthread_spin_unlock () from /lib64/libpthread.so.0
> (gdb) next
> Single stepping until exit from function pthread_spin_unlock, which has no line number information.
> 0x00002b8135b3fc21 in ibv_cmd_create_qp () from /usr/lib64/libmlx4-rdmav2.so
> (gdb) next
> Single stepping until exit from function ibv_cmd_create_qp, which has no line number information.
> 0x00002b813407bff1 in btl_openib_component_progress () from /n/sw/openmpi-1.2.5-gcc-4.1.2/lib/openmpi/mca_btl_openib.so
> (gdb) next
> Single stepping until exit from function btl_openib_component_progress, which has no line number information.
> 0x00002b8133e6f04a in mca_bml_r2_progress () from /n/sw/openmpi-1.2.5-gcc-4.1.2/lib/openmpi/mca_bml_r2.so
> (gdb) next
> Single stepping until exit from function mca_bml_r2_progress, which has no line number information.
> 0x00002b812f52c9ba in opal_progress () from /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libopen-pal.so.0
> (gdb) next
> Single stepping until exit from function opal_progress, which has no line number information.
> 0x00002b812f067b05 in ompi_request_wait_all () from /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libmpi.so.0
> (gdb) next
> Single stepping until exit from function ompi_request_wait_all, which has no line number information.
>
> ---hang--- (infinite loop?)
>
> --------------------------------------------------------------------
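>
> For anyone trying to map that stack back to application code:
> ompi_request_wait_all() is where blocking completion (MPI_Waitall() and
> friends, as well as blocking sends/receives internally) spins while polling
> opal_progress(). A generic sketch of that shape - not our actual code; the
> partner rank, tag, and message size below are placeholders - would be:
>
> --------------------------------------------------------------------
>
> #include <mpi.h>
> #include <stdlib.h>
>
> /* Generic sketch: a nonblocking exchange whose completion wait ends up
>  * in ompi_request_wait_all(), matching frame #5 of the trace above. */
> int main(int argc, char **argv)
> {
>     int rank, nprocs, partner, count = 1 << 20;  /* placeholder size */
>     char *sendbuf, *recvbuf;
>     MPI_Request reqs[2];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>
>     partner = rank ^ 1;               /* pair up adjacent ranks */
>     sendbuf = calloc(count, 1);
>     recvbuf = calloc(count, 1);
>
>     if (partner < nprocs) {
>         MPI_Irecv(recvbuf, count, MPI_BYTE, partner, 35, MPI_COMM_WORLD, &reqs[0]);
>         MPI_Isend(sendbuf, count, MPI_BYTE, partner, 35, MPI_COMM_WORLD, &reqs[1]);
>         /* The wait below is what sits in ompi_request_wait_all(),
>          * spinning at 100% CPU in opal_progress() until both requests
>          * complete - which, in our hangs, never happens. */
>         MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
>     }
>
>     free(sendbuf);
>     free(recvbuf);
>     MPI_Finalize();
>     return 0;
> }
>
> --------------------------------------------------------------------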
>
> On a different task:
>
> --------------------------------------------------------------------
>
> 0x00002ba2383b4982 in opal_progress () from /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libopen-pal.so.0
> (gdb) bt
> #0  0x00002ba2383b4982 in opal_progress () from /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libopen-pal.so.0
> #1  0x00002ba237eefb05 in ompi_request_wait_all () from /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libmpi.so.0
> #2  0x0000000000000000 in ?? ()
> (gdb) next
> Single stepping until exit from function opal_progress, which has no line number information.
> 0x00002ba237eefb05 in ompi_request_wait_all () from /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libmpi.so.0
> (gdb) next
> Single stepping until exit from function ompi_request_wait_all, which has no line number information.
>
> ---hang---
>
> --------------------------------------------------------------------
>
> On 1.5.3, a similar "hang" happens, but the backtrace goes back to the
> original call in our code, which is an MPI_Sendrecv():
>
> --------------------------------------------------------------------
>
> 3510            OPAL_THREAD_UNLOCK(&endpoint->eager_rdma_local.lock);
> (gdb) bt
> #0  progress_one_device () at btl_openib_component.c:3510
> #1  btl_openib_component_progress () at btl_openib_component.c:3541
> #2  0x00002b722f348b35 in opal_progress () at runtime/opal_progress.c:207
> #3  0x00002b722f287025 in opal_condition_wait (buf=0x2aaaab636298, count=251328, datatype=0x6ef240, dst=12, tag=35, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x6ee430) at ../../../../opal/threads/condition.h:99
> #4  ompi_request_wait_completion (buf=0x2aaaab636298, count=251328, datatype=0x6ef240, dst=12, tag=35, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x6ee430) at ../../../../ompi/request/request.h:377
> #5  mca_pml_ob1_send (buf=0x2aaaab636298, count=251328, datatype=0x6ef240, dst=12, tag=35, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x6ee430) at pml_ob1_isend.c:125
> #6  0x00002b722f1cb568 in PMPI_Sendrecv (sendbuf=0x2aaba9587398, sendcount=251328, sendtype=0x6ef240, dest=12, sendtag=35, recvbuf=0x2aaba7a555f8, recvcount=259008, recvtype=0x6ef240, source=12, recvtag=35, comm=0x6ee430, status=0x6f2160) at psendrecv.c:84
> #7  0x0000000000472fd5 in voronoi_ghost_search (T=0xf70b40) at voronoi_ghost_search.c:190
> #8  0x00000000004485c6 in create_mesh () at voronoi.c:107
> #9  0x0000000000411b1c in run () at run.c:215
> #10 0x0000000000410d8a in main (argc=3, argv=0x7fff3fc25948) at main.c:190
> (gdb) next
> 3466            for(i = 0; i < c; i++) {
> (gdb) next
> 3467                endpoint = device->eager_rdma_buffers[i];
> (gdb) next
> 3469                if(!endpoint)
> (gdb) next
> 3472                OPAL_THREAD_LOCK(&endpoint->eager_rdma_local.lock);
> (gdb) next
> 3473                frag = MCA_BTL_OPENIB_GET_LOCAL_RDMA_FRAG(endpoint,
> (gdb) next
> 3476                if(MCA_BTL_OPENIB_RDMA_FRAG_LOCAL(frag)) {
> (gdb) next
> 3510            OPAL_THREAD_UNLOCK(&endpoint->eager_rdma_local.lock);
>
> --------------------------------------------------------------------
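>
> For reference, from frame #6 the call at voronoi_ghost_search.c:190 is
> essentially of the following shape (the counts, ranks, and tags are the
> literal values from the backtrace; the buffer and datatype names are
> invented placeholders for our actual variables):
>
> --------------------------------------------------------------------
>
> #include <mpi.h>
>
> /* Reconstructed from frame #6 above: exchange ghost data with one
>  * neighboring rank. "point_type" stands in for whatever derived
>  * datatype lives at 0x6ef240 in the trace. */
> void exchange_ghosts(void *send_buf, void *recv_buf,
>                      MPI_Datatype point_type, MPI_Comm mesh_comm)
> {
>     MPI_Status status;
>
>     MPI_Sendrecv(send_buf, 251328, point_type, /* dest   */ 12, /* sendtag */ 35,
>                  recv_buf, 259008, point_type, /* source */ 12, /* recvtag */ 35,
>                  mesh_comm, &status);
> }
>
> --------------------------------------------------------------------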
>
> The OS is: Linux version 2.6.18-194.32.1.el5 (mockbu...@builder10.centos.org)
> (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48))
>
> The output from ibv_devinfo:
>
> --------------------------------------------------------------------
>
> hca_id: mlx4_0
>         transport:              InfiniBand (0)
>         fw_ver:                 2.5.000
>         node_guid:              0018:8b90:97fe:2149
>         sys_image_guid:         0018:8b90:97fe:214c
>         vendor_id:              0x02c9
>         vendor_part_id:         25418
>         hw_ver:                 0xA0
>         board_id:               DEL08C0000001
>         phys_port_cnt:          2
>                 port:   1
>                         state:          PORT_ACTIVE (4)
>                         max_mtu:        2048 (4)
>                         active_mtu:     2048 (4)
>                         sm_lid:         2
>                         port_lid:       166
>                         port_lmc:       0x00
>
>                 port:   2
>                         state:          PORT_DOWN (1)
>                         max_mtu:        2048 (4)
>                         active_mtu:     2048 (4)
>                         sm_lid:         0
>                         port_lid:       0
>                         port_lmc:       0x00
>
> --------------------------------------------------------------------
>
> I am no MPI expert, but am hopeful for any suggestions. Thanks!
>
> Dylan Nelson

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/