I'm working on an implementation of SCSH-style "process forms" for Guile, and I'm noticing occasional hangs. I think I understand the root cause, and I'd like people to double-check my analysis.

My code forks its process using the "primitive-fork" function. The function's return value indicates whether the current process is the parent or the child. The parent and child have user-level data that start out identical but can vary independently thereafter: stacks and heaps. They also have kernel-level data that stay shared, such as the files behind their file descriptors; all we can do to stop sharing those is to drop our handles to them. Mutexes are the crucial case. As far as I can tell, the child gets a copy of every mutex in whatever state it was in at the moment of the fork, but only the forking thread exists in the child, so a mutex that some other thread held at that moment stays locked in the child with nobody left to release it.

The BDW-GC implementation is configured to be thread safe, in case Guile runs multiple threads. According to <http://www.hboehm.info/gc/scale.html>: "It causes the collector to acquire a lock around essentially all allocation and garbage collection activity."

That means that after the child process spawns, its copy of the GC lock can be stuck in the locked state. If the child then needs to do work in the GC layer, it blocks: the signal delivery thread was holding the mutex at the time of the fork (it holds the mutex until it gets some data on its reporting pipe), and that thread does not exist in the child. The hang only happens when the race between the fork and the signal delivery thread ends up in the wrong order, which is why it is occasional.

Based on this comment from scm_fork(), I should be seeing a warning when I fork with a running thread:

    scm_i_finalizer_pre_fork ();
    if (scm_ilength (scm_all_threads ()) != 1)
      /* Other threads may be holding on to resources that Guile needs --
         it is not safe to permit one thread to fork while others are
         running.

         In addition, POSIX clearly specifies that if a multi-threaded
         program forks, the child must only call functions that are
         async-signal-safe.  We can't guarantee that in general.  The best
         we can do is to allow forking only very early, before any call to
         sigaction spawns the signal-handling thread.  */
      scm_display
        (scm_from_latin1_string
         ("warning: call to primitive-fork while multiple threads are running;\n"
          " further behavior unspecified. See \"Processes\" in the\n"
          " manual, for more information.\n"),
         scm_current_warning_port ());

(This is all Guile 2.2 code.) The call to scm_i_finalizer_pre_fork() killed off the finalization thread, so we're safe there:

    void
    scm_i_finalizer_pre_fork (void)
    {
    #if SCM_USE_PTHREAD_THREADS
      if (automatic_finalization_p)
        {
          stop_finalization_thread ();
          GC_set_finalizer_notifier (spawn_finalizer_thread);
        }
    #endif
    }

But nothing stops the signal delivery thread. In fact, scm_all_threads() explicitly skips the signal delivery thread, so we don't get a warning:

    {
      /* We can not allocate while holding the thread_admin_mutex because
         of the way GC is done.  */
      int n = thread_count;
      scm_i_thread *t;
      SCM list = scm_c_make_list (n, SCM_UNSPECIFIED), *l;

      scm_i_pthread_mutex_lock (&thread_admin_mutex);
      l = &list;
      for (t = all_threads; t && n > 0; t = t->next_thread)
        {
          if (t != scm_i_signal_delivery_thread)
            {
              SCM_SETCAR (*l, t->handle);
              l = SCM_CDRLOC (*l);
            }
          n--;
        }
      *l = SCM_EOL;
      scm_i_pthread_mutex_unlock (&thread_admin_mutex);
      return list;
    }

The signal delivery thread is running in order to support SCSH's "early" auto-reap policy, triggered by SIGCHLD. The alternative is the "late" policy, which triggers after garbage collections. That's not good for parents that do lots of spawning but generate very little garbage compared to their heap size: they end up with lots of zombies.

One solution to support the "early" policy might be to tweak scm_fork() so it:

1. Blocks signals.
2. Records the current custom handlers.
3. Resets all handlers.
4. Kills the signal delivery thread.
5. Forks.
6. Starts the signal delivery thread in parent and child.
7. Re-loads the custom handlers in parent and child.
8. Unblocks signals.

Does anyone have other possibilities?

I don't think there's a safe, general solution for running "identical" finalizers in the parent and the child, so shutting down the finalizer in the child is the best we can do.
Is it worth restarting just the parent's finalizer thread after forking?

Other, independent, cleanup opportunities:

- The docs for "primitive-fork" need to mention that calling "primitive-fork" shuts down finalizers for the parent and the child.
- Calling "restore-signals" should stop any running signal delivery thread, to bring Guile back to a consistent state.

Thanks,
Derek

--
Derek Upham
s...@blarg.net