> For 1.8, could you try running Helgrind and see what happens?

Helgrind complains about loads of "possible data race" warnings but does not detect anything wrong when the actual deadlock occurs. When I exit the program it does report that a thread still holds a lock, but it does not reveal the lock's address in a way that is meaningful to me:
==26762== Thread #1: Exiting thread still holds 1 lock
==26762==    at 0x5A81B4D: waitpid (waitpid.c:41)
==26762==    by 0x4F0A289: scm_waitpid (posix.c:560)
==27182==    by 0x5A7BF09: pthread_mutex_lock (pthread_mutex_lock.c:61)
==26762==    by 0x4E8FCBF: deval (eval.c:4229)
==27182==    by 0x4C25BEF: pthread_mutex_lock (hg_intercepts.c:488)
==27182==    by 0x4EF6606: scm_i_thread_put_to_sleep (threads.c:1676)
==26762==    by 0x4E89B4F: scm_i_eval_x (eval.c:5900)
==27182==    by 0x4E96D93: scm_i_gc (gc.c:550)
==27182==    by 0x4E96CBC: scm_gc_for_newcell (gc.c:507)
==26762==    by 0x4E8FCED: deval (eval.c:4232)
==27182==    by 0x4EAC1B8: scm_cell (inline.h:122)
==26762==    by 0x4E89C62: scm_i_eval (eval.c:5910)
==26762==    by 0x4E710D7: scm_start_stack (debug.c:457)
==26762==    by 0x4E71199: scm_m_start_stack (debug.c:473)
==26762==
==27182==    by 0x4E91F5E: scm_dapply (eval.c:5012)
==27182==

(How pthread_mutex_lock appears to call scm_waitpid is not clear to me.)

I don't know exactly how Helgrind works, so I cannot be sure it is supposed to detect a thread locking a mutex it already owns (especially after a fork). Why this does not happen with guile2 is still a mystery.

My theory is that the thread calling open-process already owns scm_i_port_table_mutex when open-process is called, so the later port-for-each call deadlocks. But guile2's open-process does the same fork (not vfork) and takes the same scm_i_port_table_mutex in port-for-each, which is still not recursive, and yet it does not deadlock. So either my theory is wrong in the first place, or the path that calls open-process while scm_i_port_table_mutex is locked disappeared in guile2, perhaps because of the change of garbage collector (I believe the GC also grabs this lock). Or maybe the deadlock involves another lock in addition to this one.

I'm going to turn scm_i_port_table_mutex into a recursive mutex to try to invalidate my theory.
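The theory above hinges on one property of fork(): the child gets only the forking thread, but it gets a copy of every mutex in whatever state it was in at the moment of the fork. A minimal POSIX sketch of that, not Guile code: `table_mutex` is a hypothetical stand-in for scm_i_port_table_mutex and `probe_mutex_after_fork()` is an illustrative helper. It locks the mutex, forks, and lets the child try to take it again with pthread_mutex_trylock(), so the sketch reports EBUSY instead of hanging the way a blocking lock in port-for-each would.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for scm_i_port_table_mutex: a plain mutex with
   default (non-recursive) attributes. */
static pthread_mutex_t table_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Lock table_mutex, fork while still holding it, and report what
   pthread_mutex_trylock() returns in the child.  Returns the child's
   trylock result (0 or an errno value), or -1 on error. */
static int probe_mutex_after_fork(void)
{
    if (pthread_mutex_lock(&table_mutex) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        pthread_mutex_unlock(&table_mutex);
        return -1;
    }
    if (pid == 0) {
        /* Child: the mutex was copied in its locked state.  A blocking
           pthread_mutex_lock() here would wait forever, since no thread
           in the child will ever unlock it; trylock just reports it. */
        int rc = pthread_mutex_trylock(&table_mutex);
        _exit(rc);  /* errno values are small enough for an exit status */
    }

    /* Parent: release its own copy of the lock, then collect the result. */
    pthread_mutex_unlock(&table_mutex);
    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling `probe_mutex_after_fork()` from a single-threaded `main` should give EBUSY. The parent being single-threaded matters: in a multithreaded process POSIX only allows async-signal-safe calls in the child until it execs, so the sketch deliberately avoids that case.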
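For the planned experiment, the mutex type is chosen through a mutex attribute when using plain POSIX threads (Guile's internals go through their own scm_i_pthread_* wrappers, which this sketch does not use). The sketch below, whose helper names are hypothetical, also tries PTHREAD_MUTEX_ERRORCHECK, which makes a same-thread relock fail with EDEADLK instead of hanging; that is one way to flag a self-deadlock without relying on Helgrind. It only exercises relocking inside a single process and says nothing about how either type behaves when the lock was inherited across fork().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Initialise *m as a mutex of the given POSIX type, e.g.
   PTHREAD_MUTEX_RECURSIVE or PTHREAD_MUTEX_ERRORCHECK.
   Returns 0 on success or an errno value on failure. */
static int init_typed_mutex(pthread_mutex_t *m, int type)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, type);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);  /* attr is no longer needed */
    return rc;
}

/* Lock a fresh mutex of the given type twice from the calling thread and
   return what the second pthread_mutex_lock() reported, or -1 on setup
   failure.  The default type is deliberately not probed: relocking it
   may simply hang. */
static int relock_result(int type)
{
    pthread_mutex_t m;
    if (init_typed_mutex(&m, type) != 0)
        return -1;
    if (pthread_mutex_lock(&m) != 0) {
        pthread_mutex_destroy(&m);
        return -1;
    }

    int second = pthread_mutex_lock(&m);

    /* A recursive mutex must be unlocked once per successful lock. */
    if (second == 0)
        pthread_mutex_unlock(&m);
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return second;
}
```

Per POSIX, `relock_result(PTHREAD_MUTEX_RECURSIVE)` should return 0 and `relock_result(PTHREAD_MUTEX_ERRORCHECK)` should return EDEADLK.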
Sorry, I'm thinking aloud, but maybe this gives you a better idea?