Excerpts from Iain Buclaw's message of September 3, 2025 9:19 pm:
> Excerpts from Rainer Orth's message of September 3, 2025 10:20 am:
>>>>
>>>> I regularly (but not always) see timeouts on Solaris, both on sparc and
>>>> x86:
>>>>
>>>> WARNING: libphobos.gc/forkgc2.d execution test program timed out.
>>>> FAIL: libphobos.gc/forkgc2.d execution test
>>>> WARNING: libphobos.gc/startbackgc.d execution test program timed out.
>>>> FAIL: libphobos.gc/startbackgc.d execution test
>>
>> I haven't tried investigating what's wrong on Solaris with those two,
>> but they sure are annoying, especially since they are so unreliable:
>> sometimes both PASS, sometimes one or the other, sometimes both.
>>
>> I'd thought about skipping them on Solaris, too, just to avoid the noise
>> and the timeouts, but haven't gotten around to that.
>>
>> However, fixing this at the root would certainly be best.
>>
>
> I currently have a gdb session on cfarm, process has hung for forkgc2,
> and just looking at the backtrace.
>
> * There are 11 threads in total (main + 10 new'd Threads)
> * All threads are suspended (in sigsuspend) except for two
> * The first of those threads is the one that's requested all threads to
>   suspend using pthread_kill(SIGRTMIN), and is stuck inside a sem_wait
>   for one more call to sem_post().
> * The second is stuck in a SpinLock.lock loop, called from
>   _prefork_handler() inside forkx() inside fork() - my guess would be
>   the handler being called is _d_gcx_atfork_prepare().
> * Specific to Solaris, I've clocked this line in the forkx
>   implementation:
>
> https://github.com/illumos/illumos-gate/blob/a21856a054bd854f39d1d55a6b0d547cb0d2039f/usr/src/lib/libc/port/threads/scalls.c#L177
>
> I think what's going on is that the thread that wants to do a GC
> collection has issued a signal to all threads, but Solaris has called
> sigoff() in the last thread being fork'd, so the signal never reaches.
>
> This behaviour does not change when COLLECT_FORK is disabled, so Solaris
> would still be affected.
>

I forgot to mention, thread #1 that wants to do a GC has control of the
SpinLock.  So that's why thread #2 is stuck in its current loop.

The order of operations that leads to Solaris hanging at runtime is:

1. Thread #1 calls GC.lockNR() and has hold of the global GC SpinLock.
2. Thread #2 calls fork().  It too calls GC.lockNR() in
   _d_gcx_atfork_prepare() and is waiting for the global lock.
3. Thread #1 decides to call thread_suspendAll() and will never release
   the GC lock until all threads are suspended.
4. Thread #2 will never suspend because Solaris has set sigoff() on it
   until the pthread_atfork prepare handler has returned (it won't).

It would appear that there should be some other fine-grained lock to
prevent this kind of deadlock.

Iain.