A few additional details on this thread: I have read the answer you provided in the "MPI_Comm_spawn" thread posted by rozzen.vincent, and I have reproduced the same behaviour on my Debian Sarge installation, i.e.:
1) MPI_Comm_spawn fails after 31 spawns when "--disable-threads" is set;
2) MPI applications lock up when "--enable-threads" is set.
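For reference, the parent side of my test is essentially a loop over MPI_Comm_spawn. Below is only a sketch of its shape; the executable name, spawn count and requested thread level are illustrative, not the exact sources:

-----------------------------------------------------
/* spawn.c - minimal parent-side sketch (illustrative only) */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int i, provided;
    MPI_Comm child;

    /* MPI_THREAD_MULTIPLE is an assumption; the real test's requested
       level may differ. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    for (i = 0; i < 100; i++) {
        /* With --disable-threads this used to fail after 31 spawns;
           with --enable-threads it deadlocks inside MPI_Comm_spawn. */
        MPI_Comm_spawn("./ExeToSpawned", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
        printf("spawn %d done\n", i);
        MPI_Comm_disconnect(&child);
    }

    MPI_Finalize();
    return 0;
}
-----------------------------------------------------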
* For issue 1), the Open MPI 1.2 release solves the problem, so it does not seem to be a system limitation; in any case, it is now behind us.
* For issue 2), I have been in contact with Rozenn. After talking with her, I ran a new test against Open MPI 1.2 (stable) configured with "--enable-debug". The gdb log is a little more explicit about the deadlock situation:

-----------------------------------------------------
main*******************************
main : Start MPI*
opal_mutex_lock(): Resource deadlock avoided
[host10:20607] *** Process received signal ***
[host10:20607] Signal: Aborted (6)
[host10:20607] Signal code: (-6)
[host10:20607] [ 0] [0xffffe440]
[host10:20607] [ 1] /lib/tls/libc.so.6(abort+0x1d2) [0x4029cfa2]
[host10:20607] [ 2] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40061d25]
[host10:20607] [ 3] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x4006030e]
[host10:20607] [ 4] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40061e23]
[host10:20607] [ 5] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40060175]
[host10:20607] [ 6] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40061da3]
[host10:20607] [ 7] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0 [0x40062315]
[host10:20607] [ 8] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0(ompi_proc_unpack+0x15a) [0x40061392]
[host10:20607] [ 9] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0(ompi_comm_connect_accept+0x45c) [0x4004dd62]
[host10:20607] [10] /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0(PMPI_Comm_spawn+0x346) [0x400949a8]
[host10:20607] [11] spawn(main+0xe2) [0x80489a6]
[host10:20607] [12] /lib/tls/libc.so.6(__libc_start_main+0xf4) [0x40288974]
[host10:20607] [13] spawn [0x8048821]
[host10:20607] *** End of error message ***
[host10:20602] [0,0,0]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
------------------------------------------------------------------------------------

So it seems that the lock is in the spawn code. I have also discovered that the spawned program is itself stuck in the spawn mechanism. Below is a gdb backtrace from the spawned program:

------------------------------------------------------------------------------------------
#0  0x4019c436 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#1  0x40199893 in _L_mutex_lock_26 () from /lib/tls/libpthread.so.0
#2  0xbffff4b8 in ?? ()
#3  0xbffff4b8 in ?? ()
#4  0x00000000 in ?? ()
#5  0x400a663c in __JCR_LIST__ () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#6  0x400a663c in __JCR_LIST__ () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#7  0x401347a4 in opal_condition_t_class () from /usr/local/Mpi/CURRENT_MPI/lib/libopen-pal.so.0
#8  0xbffff4e8 in ?? ()
#9  0x400554a8 in ompi_proc_construct () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#10 0x400554a8 in ompi_proc_construct () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#11 0x40056946 in ompi_proc_find_and_add () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#12 0x4005609e in ompi_proc_unpack () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#13 0x400481cd in ompi_comm_connect_accept () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#14 0x40049b2a in ompi_comm_dyn_init () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#15 0x40058e6d in ompi_mpi_init () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#16 0x4007e122 in PMPI_Init_thread () from /usr/local/Mpi/CURRENT_MPI/lib/libmpi.so.0
#17 0x08048a3b in main (argc=1, argv=0xbffff844) at ExeToSpawned6.c:31
-----------------------------------------------------------------------------------------------

Hopefully this can help you to investigate.
Herve
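P.S. The spawned side (ExeToSpawned6.c in my test) is basically just MPI_Init_thread followed by MPI_Comm_get_parent. Here is a minimal sketch of what the child does when it hangs; this is placeholder code, not the exact source:

-----------------------------------------------------
/* ExeToSpawned.c - minimal child-side sketch (placeholder, not the real ExeToSpawned6.c) */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int provided;
    MPI_Comm parent;

    /* The backtrace above shows the child never returns from here:
       MPI_Init_thread -> ompi_mpi_init -> ompi_comm_dyn_init
       -> ompi_comm_connect_accept -> ompi_proc_unpack.
       MPI_THREAD_MULTIPLE is an assumption about the requested level. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    MPI_Comm_get_parent(&parent);
    if (parent != MPI_COMM_NULL)
        MPI_Comm_disconnect(&parent);

    MPI_Finalize();
    return 0;
}
-----------------------------------------------------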