Some further details on this thread:

I have read the answer you provided in the thread "MPI_Comm_Spawn" posted by 
rozzen.vincent.
I have reproduced the same behavior on my Debian sarge installation, i.e.:
1) MPI_Comm_spawn fails after 31 spawns (when "--disable-threads" is set)
2) MPI applications lock up when "--enable-threads" is set

* For issue 1)
The Open MPI 1.2 release solves the problem, so it does not seem to be a 
system limitation. In any case, it is now behind us.

* For issue 2)
I have been in contact with Rozenn. After a short discussion with her, I ran a 
new test with Open MPI 1.2 (stable version) configured with "--enable-debug".

The gdb log is fairly explicit about the deadlock situation:
-----------------------------------------------------
main*******************************
main : Start MPI*
opal_mutex_lock(): Resource deadlock avoided
[host10:20607] *** Process received signal ***
[host10:20607] Signal: Aborted (6)
[host10:20607] Signal code:  (-6)
[host10:20607] [ 0] [0xffffe440]
[host10:20607] [ 1] /lib/tls/libc.so.6(abort+0x1d2) [0x4029cfa2]
[host10:20607] [ 2] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40061d25]
[host10:20607] [ 3] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x4006030e]
[host10:20607] [ 4] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40061e23]
[host10:20607] [ 5] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40060175]
[host10:20607] [ 6] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40061da3]
[host10:20607] [ 7] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40062315]
[host10:20607] [ 8] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0(ompi_proc_unpack+0x15a) [0x40061392]
[host10:20607] [ 9] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0(ompi_comm_connect_accept+0x45c) [0x4004dd62]
[host10:20607] [10] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0(PMPI_Comm_spawn+0x346) [0x400949a8]
[host10:20607] [11] spawn(main+0xe2) [0x80489a6]
[host10:20607] [12] /lib/tls/libc.so.6(__libc_start_main+0xf4) [0x40288974]
[host10:20607] [13] spawn [0x8048821]
[host10:20607] *** End of error message ***
[host10:20602] [0,0,0]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
------------------------------------------------------------------------------------


So it seems that the deadlock is in the spawn code.
I have also found that the spawned program is itself blocked in the spawn 
mechanism.
Below is a gdb backtrace from the spawned program.


------------------------------------------------------------------------------------------
#0  0x4019c436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#1  0x40199893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
#2  0xbffff4b8 in ?? ()
#3  0xbffff4b8 in ?? ()
#4  0x00000000 in ?? ()
#5  0x400a663c in __JCR_LIST__ () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#6  0x400a663c in __JCR_LIST__ () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#7  0x401347a4 in opal_condition_t_class () from /usr/local/Mpi/CURRENT_MPI/lib/libopen-pal.so.0
#8  0xbffff4e8 in ?? ()
#9  0x400554a8 in ompi_proc_construct () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#10 0x400554a8 in ompi_proc_construct () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#11 0x40056946 in ompi_proc_find_and_add () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#12 0x4005609e in ompi_proc_unpack () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#13 0x400481cd in ompi_comm_connect_accept () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#14 0x40049b2a in ompi_comm_dyn_init () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#15 0x40058e6d in ompi_mpi_init () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#16 0x4007e122 in PMPI_Init_thread () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#17 0x08048a3b in main (argc=1, argv=0xbffff844) at ExeToSpawned6.c:31
-----------------------------------------------------------------------------------------------
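
For reference, the "Resource deadlock avoided" text in the first log is the 
glibc message for the EDEADLK error code, which pthread_mutex_lock() returns 
when a thread tries to re-lock an error-checking mutex that it already holds. 
The small sketch below reproduces that error outside of MPI. It rests on an 
assumption that I have not verified in the Open MPI sources, namely that the 
debug build uses error-checking mutexes; the function name 
relock_error_checking_mutex is made up for this illustration.

```c
#define _XOPEN_SOURCE 700   /* exposes PTHREAD_MUTEX_ERRORCHECK under strict -std modes */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Open MPI): lock an error-checking mutex
 * twice from the same thread and return the result of the second
 * pthread_mutex_lock() call. Returns -1 if the setup itself fails. */
int relock_error_checking_mutex(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;
    int rc;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    if (pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK) != 0 ||
        pthread_mutex_init(&mutex, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    /* First lock: succeeds, the calling thread now owns the mutex. */
    if (pthread_mutex_lock(&mutex) != 0) {
        pthread_mutex_destroy(&mutex);
        return -1;
    }

    /* Second lock by the owner: an error-checking mutex reports EDEADLK
     * instead of blocking forever. */
    rc = pthread_mutex_lock(&mutex);
    if (rc != 0)
        fprintf(stderr, "second lock: %s\n", strerror(rc));

    /* The mutex is still held exactly once, so one unlock releases it. */
    pthread_mutex_unlock(&mutex);
    pthread_mutex_destroy(&mutex);
    return rc;
}
```

Calling relock_error_checking_mutex() from a small main() and linking with 
-pthread should return EDEADLK. If the assumption above holds, the first log 
would mean that a thread inside the ompi_proc_unpack call path tried to take 
a lock it already owned. With a default (non error-checking) mutex the same 
mistake would simply hang, which may be what the second backtrace shows.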

Hopefully this helps you investigate.



Herve
