Dear Open MPI users list,

From time to time, I experience a mutex deadlock in Open MPI 1.1.2. The stack trace is available at the end of this mail.

The deadlock seems to be caused by lines 118 & 119 of the ompi/mca/btl/tcp/btl_tcp.c file, in the function mca_btl_tcp_add_procs:

    OBJ_RELEASE(tcp_endpoint);
    OPAL_THREAD_UNLOCK(&tcp_proc->proc_lock);

(Of course, I did not check whether the line numbers have changed since 1.1.2.) Releasing tcp_endpoint causes a call to mca_btl_tcp_proc_remove, which attempts to acquire the mutex tcp_proc->proc_lock; that mutex is already held by the same thread, because of the OPAL_THREAD_LOCK(&tcp_proc->proc_lock) at line 103 of the same file.

Swapping the two lines above (i.e., releasing the mutex before destructing tcp_endpoint) seems to be sufficient to fix the deadlock. Or should the changes made in the mca_btl_tcp_proc_insert function be reverted instead, rather than releasing the mutex before releasing tcp_endpoint? As far as I can tell, the problem still appears in trunk revision 13359.
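To illustrate the pattern outside of Open MPI, here is a minimal standalone sketch using plain pthreads (the function and variable names below are mine, not Open MPI's; I am assuming that opal_mutex_lock behaves like an error-checking pthread mutex in debug builds, which would explain the "Resource deadlock avoided" message in the trace):

#define _XOPEN_SOURCE 600
#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_mutex_t proc_lock;

/* Stands in for mca_btl_tcp_proc_remove(), reached through the destructor
 * that OBJ_RELEASE(tcp_endpoint) runs. */
static void fake_proc_remove(void)
{
    int rc = pthread_mutex_lock(&proc_lock);  /* second acquisition by the same thread */
    if (rc != 0) {
        fprintf(stderr, "second lock failed: %s\n", strerror(rc));  /* "Resource deadlock avoided" */
        return;
    }
    pthread_mutex_unlock(&proc_lock);
}

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&proc_lock, &attr);

    pthread_mutex_lock(&proc_lock);    /* like the OPAL_THREAD_LOCK at line 103 */
    fake_proc_remove();                /* like OBJ_RELEASE(tcp_endpoint) at line 118 */
    pthread_mutex_unlock(&proc_lock);  /* unlocking before the release avoids the problem */

    pthread_mutex_destroy(&proc_lock);
    pthread_mutexattr_destroy(&attr);
    return 0;
}

With a default (non-error-checking) mutex the second lock would simply hang instead of being reported, which is the silent variant of the same problem.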
Second point. Is there any reason why MPI_Comm_spawn is restricted to executing the new process(es) only on hosts listed either in the --host option or in the hostfile? Or did I miss something?
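For illustration, here is a minimal sketch of the kind of call I have in mind; "nodeX" and "./worker" are made-up placeholders for a machine that is not in my hostfile and for the spawned executable, and "host" is the reserved MPI_Comm_spawn info key:

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm intercomm;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "nodeX");  /* a node NOT listed in --host or the hostfile */

    /* "./worker" is a placeholder for the executable to spawn. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info,
                   0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}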
Best regards,
Jeremy

------------------------------------------------------------------------------

stack trace as dumped by open-mpi (gdb version follows):

opal_mutex_lock(): Resource deadlock avoided
Signal:6 info.si_errno:0(Success) si_code:-6()
[0] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/libopal.so.0 [0x8addeb]
[1] func:/lib/tls/libpthread.so.0 [0x176e40]
[2] func:/lib/tls/libc.so.6(abort+0x1d5) [0xa294e5]
[3] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_btl_tcp.so [0x65f8a3]
[4] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_proc_remove+0x2a) [0x65fff0]
[5] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_btl_tcp.so [0x65cb24]
[6] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_btl_tcp.so [0x659465]
[7] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_add_procs+0x10f) [0x65927b]
[8] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x1bb) [0x628023]
[9] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xd6) [0x61650b]
[10] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/libmpi.so.0(ompi_comm_get_rport+0x1f8) [0xb82303]
[11] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/libmpi.so.0(ompi_comm_connect_accept+0xbb) [0xb81b43]
[12] func:/home1/jbuisson/soft/openmpi-1.1.2/lib/libmpi.so.0(PMPI_Comm_spawn+0x3de) [0xbb671a]
[13] func:/home1/jbuisson/target/bin/mpi-spawner(__gxx_personality_v0+0x3d2) [0x804bb8e]
[14] func:/home1/jbuisson/target/bin/mpi-spawner [0x804bdff]
[15] func:/home1/jbuisson/target/bin/mpi-spawner [0x804bfd4]
[16] func:/lib/tls/libc.so.6(__libc_start_main+0xda) [0xa1578a]
[17] func:/home1/jbuisson/target/bin/mpi-spawner(__gxx_personality_v0+0x75) [0x804b831]
*** End of error message ***

Same stack, dumped by gdb:

#0  0x00176357 in __pause_nocancel () from /lib/tls/libpthread.so.0
#1  0x008ade9b in opal_show_stackframe (signo=6, info=0xbfff9290, p=0xbfff9310) at stacktrace.c:306
#2  <signal handler called>
#3  0x00a27cdf in raise () from /lib/tls/libc.so.6
#4  0x00a294e5 in abort () from /lib/tls/libc.so.6
#5  0x0065f8a3 in opal_mutex_lock (m=0x8ff8250) at ../../../../opal/threads/mutex_unix.h:104
#6  0x0065fff0 in mca_btl_tcp_proc_remove (btl_proc=0x8ff8220, btl_endpoint=0x900eba0) at btl_tcp_proc.c:296
#7  0x0065cb24 in mca_btl_tcp_endpoint_destruct (endpoint=0x900eba0) at btl_tcp_endpoint.c:99
#8  0x00659465 in opal_obj_run_destructors (object=0x900eba0) at ../../../../opal/class/opal_object.h:405
#9  0x0065927b in mca_btl_tcp_add_procs (btl=0x8e57c30, nprocs=1, ompi_procs=0x8ff7ac8, peers=0x8ff7ad8, reachable=0xbfff98e4) at btl_tcp.c:118
#10 0x00628023 in mca_bml_r2_add_procs (nprocs=1, procs=0x8ff7ac8, bml_endpoints=0x8ff60b8, reachable=0xbfff98e4) at bml_r2.c:231
#11 0x0061650b in mca_pml_ob1_add_procs (procs=0xbfff9930, nprocs=1) at pml_ob1.c:133
#12 0x00b82303 in ompi_comm_get_rport (port=0x0, send_first=0, proc=0x8e51c70, tag=2000) at communicator/comm_dyn.c:305
#13 0x00b81b43 in ompi_comm_connect_accept (comm=0x8ff8ce0, root=0, port=0x0, send_first=0, newcomm=0xbfff9a38, tag=2000) at communicator/comm_dyn.c:85
#14 0x00bb671a in PMPI_Comm_spawn (command=0x8ff88f0 "/home1/jbuisson/target/bin/sample-npb-ft-pp", argv=0xbfff9b40, maxprocs=1, info=0x8ff73e0, root=0, comm=0x8ff8ce0, intercomm=0xbfff9aa4, array_of_errcodes=0x0) at pcomm_spawn.c:110