Dylan --

Sorry for the delay in replying.

As an offhand guess, does the problem go away if you run with:

  --mca mpi_leave_pinned 0

?
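
For example, on the mpirun command line (the process count and binary name below are placeholders; substitute your own):

  # hypothetical launch line; replace -np 64 and ./your_app with your real job
  mpirun --mca mpi_leave_pinned 0 -np 64 ./your_app

Equivalently, you can set the MCA parameter in the environment before launching:

  export OMPI_MCA_mpi_leave_pinned=0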


On Mar 20, 2012, at 3:35 PM, Dylan Nelson wrote:

> Hello,
> 
> I've been having trouble for a while now running some OpenMPI+IB jobs on
> multiple tasks. The problems are all "hangs" and are not reproducible - the
> same execution, started again, will generally proceed just fine past the point
> where it previously got stuck, but then gets stuck later. These stuck processes
> are pegged at 100% CPU usage and remain there for days if not killed.
> 
> The same kind of problem exists in oMPI 1.2.5, 1.4.2, and 1.5.3 (for the
> code I am running). This could quite possibly be some problem in the
> configuration/cluster - I am not claiming that it is a bug in oMPI - but I was
> just hopeful that someone might have a guess as to what is going on.
> 
> In ancient 1.2.5 the problem manifests as follows (I attached gdb to the
> stalled process on one of the child nodes):
> 
> --------------------------------------------------------------------
> 
> (gdb) bt
> #0  0x00002b8135b3f699 in ibv_cmd_create_qp () from
> /usr/lib64/libmlx4-rdmav2.so
> #1  0x00002b8135b3faa6 in ibv_cmd_create_qp () from
> /usr/lib64/libmlx4-rdmav2.so
> #2  0x00002b813407bff1 in btl_openib_component_progress ()
>   from /n/sw/openmpi-1.2.5-gcc-4.1.2/lib/openmpi/mca_btl_openib.so
> #3  0x00002b8133e6f04a in mca_bml_r2_progress () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib/openmpi/mca_bml_r2.so
> #4  0x00002b812f52c9ba in opal_progress () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libopen-pal.so.0
> #5  0x00002b812f067b05 in ompi_request_wait_all () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libmpi.so.0
> #6  0x0000000000000000 in ?? ()
> (gdb) next
> Single stepping until exit from function ibv_cmd_create_qp, which has no
> line number information.
> 0x00002b8135b3f358 in pthread_spin_unlock@plt () from
> /usr/lib64/libmlx4-rdmav2.so
> (gdb) next
> Single stepping until exit from function pthread_spin_unlock@plt, which has
> no line number information.
> 0x00000038c860b760 in pthread_spin_unlock () from /lib64/libpthread.so.0
> (gdb) next
> Single stepping until exit from function pthread_spin_unlock, which has no
> line number information.
> 0x00002b8135b3fc21 in ibv_cmd_create_qp () from /usr/lib64/libmlx4-rdmav2.so
> (gdb) next
> Single stepping until exit from function ibv_cmd_create_qp, which has no
> line number information.
> 0x00002b813407bff1 in btl_openib_component_progress ()
>   from /n/sw/openmpi-1.2.5-gcc-4.1.2/lib/openmpi/mca_btl_openib.so
> (gdb) next
> Single stepping until exit from function btl_openib_component_progress,
> which has no line number information.
> 0x00002b8133e6f04a in mca_bml_r2_progress () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib/openmpi/mca_bml_r2.so
> (gdb) next
> Single stepping until exit from function mca_bml_r2_progress, which has no
> line number information.
> 0x00002b812f52c9ba in opal_progress () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libopen-pal.so.0
> (gdb) next
> Single stepping until exit from function opal_progress, which has no line
> number information.
> 0x00002b812f067b05 in ompi_request_wait_all () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libmpi.so.0
> (gdb) next
> Single stepping until exit from function ompi_request_wait_all, which has no
> line number information.
> 
> ---hang--- (infinite loop?)
> 
> On a different task:
> 
> 0x00002ba2383b4982 in opal_progress () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libopen-pal.so.0
> (gdb) bt
> #0  0x00002ba2383b4982 in opal_progress () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libopen-pal.so.0
> #1  0x00002ba237eefb05 in ompi_request_wait_all () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libmpi.so.0
> #2  0x0000000000000000 in ?? ()
> (gdb) next
> Single stepping until exit from function opal_progress, which has no line
> number information.
> 0x00002ba237eefb05 in ompi_request_wait_all () from
> /n/sw/openmpi-1.2.5-gcc-4.1.2/lib64/libmpi.so.0
> (gdb) next
> Single stepping until exit from function ompi_request_wait_all, which has no
> line number information.
> 
> ---hang---
> 
> --------------------------------------------------------------------
> 
> On 1.5.3 a similar "hang" problem happens, but the backtrace goes back to the
> original code call, which is an MPI_Sendrecv():
> 
> --------------------------------------------------------------------
> 
> 3510                OPAL_THREAD_UNLOCK(&endpoint->eager_rdma_local.lock);
> (gdb) bt
> #0  progress_one_device () at btl_openib_component.c:3510
> #1  btl_openib_component_progress () at btl_openib_component.c:3541
> #2  0x00002b722f348b35 in opal_progress () at runtime/opal_progress.c:207
> #3  0x00002b722f287025 in opal_condition_wait (buf=0x2aaaab636298,
> count=251328, datatype=0x6ef240, dst=12, tag=35,
>    sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x6ee430) at
> ../../../../opal/threads/condition.h:99
> #4  ompi_request_wait_completion (buf=0x2aaaab636298, count=251328,
> datatype=0x6ef240, dst=12, tag=35,
>    sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x6ee430) at
> ../../../../ompi/request/request.h:377
> #5  mca_pml_ob1_send (buf=0x2aaaab636298, count=251328, datatype=0x6ef240,
> dst=12, tag=35,
>    sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x6ee430) at
> pml_ob1_isend.c:125
> #6  0x00002b722f1cb568 in PMPI_Sendrecv (sendbuf=0x2aaba9587398,
> sendcount=251328, sendtype=0x6ef240, dest=12,
>    sendtag=35, recvbuf=0x2aaba7a555f8, recvcount=259008, recvtype=0x6ef240,
> source=12, recvtag=35, comm=0x6ee430,
>    status=0x6f2160) at psendrecv.c:84
> #7  0x0000000000472fd5 in voronoi_ghost_search (T=0xf70b40) at
> voronoi_ghost_search.c:190
> #8  0x00000000004485c6 in create_mesh () at voronoi.c:107
> #9  0x0000000000411b1c in run () at run.c:215
> #10 0x0000000000410d8a in main (argc=3, argv=0x7fff3fc25948) at main.c:190
> (gdb) next
> 3466        for(i = 0; i < c; i++) {
> (gdb) next
> 3467            endpoint = device->eager_rdma_buffers[i];
> (gdb) next
> 3469            if(!endpoint)
> (gdb) next
> 3472            OPAL_THREAD_LOCK(&endpoint->eager_rdma_local.lock);
> (gdb) next
> 3473            frag = MCA_BTL_OPENIB_GET_LOCAL_RDMA_FRAG(endpoint,
> (gdb) next
> 3476            if(MCA_BTL_OPENIB_RDMA_FRAG_LOCAL(frag)) {
> (gdb) next
> 3510                OPAL_THREAD_UNLOCK(&endpoint->eager_rdma_local.lock);
> 
> --------------------------------------------------------------------
> 
> The OS is: Linux version 2.6.18-194.32.1.el5
> (mockbu...@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat
> 4.1.2-48))
> 
> The output from ibv_devinfo:
> 
> --------------------------------------------------------------------
> 
> hca_id: mlx4_0
>        transport:                      InfiniBand (0)
>        fw_ver:                         2.5.000
>        node_guid:                      0018:8b90:97fe:2149
>        sys_image_guid:                 0018:8b90:97fe:214c
>        vendor_id:                      0x02c9
>        vendor_part_id:                 25418
>        hw_ver:                         0xA0
>        board_id:                       DEL08C0000001
>        phys_port_cnt:                  2
>                port:   1
>                        state:                  PORT_ACTIVE (4)
>                        max_mtu:                2048 (4)
>                        active_mtu:             2048 (4)
>                        sm_lid:                 2
>                        port_lid:               166
>                        port_lmc:               0x00
> 
>                port:   2
>                        state:                  PORT_DOWN (1)
>                        max_mtu:                2048 (4)
>                        active_mtu:             2048 (4)
>                        sm_lid:                 0
>                        port_lid:               0
>                        port_lmc:               0x00
> 
> --------------------------------------------------------------------
> 
> I am no MPI expert, but I am hopeful for any suggestions. Thanks!
> 
> Dylan Nelson
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

