Excerpts from Iain Buclaw's message of September 3, 2025 9:19 pm:
> Excerpts from Rainer Orth's message of September 3, 2025 10:20 am:
>>>> 
>>>> I regularly (but not always) see timeouts on Solaris, both on sparc and
>>>> x86:
>>>> 
>>>> WARNING: libphobos.gc/forkgc2.d execution test program timed out.
>>>> FAIL: libphobos.gc/forkgc2.d execution test
>>>> WARNING: libphobos.gc/startbackgc.d execution test program timed out.
>>>> FAIL: libphobos.gc/startbackgc.d execution test
>> 
>> I haven't tried investigating what's wrong on Solaris with those two,
>> but they sure are annoying, especially since they are so unreliable:
>> sometimes both PASS, sometimes one or the other, sometimes both.
>> 
>> I'd thought about skipping them on Solaris, too, just to avoid the noise
>> and the timeouts, but haven't gotten around to that.
>> 
>> However, fixing this at the root would certainly be best.
>> 
> 
> I currently have a gdb session open on cfarm where the forkgc2 process 
> has hung, and am just looking at the backtrace.
> 
> * There are 11 threads in total (main + 10 new'd Threads)
> * All threads are suspended (in sigsuspend) except for two
> * The first of those threads is the one that's requested all threads to 
>   suspend using pthread_kill(SIGRTMIN), and is stuck inside a sem_wait 
>   for one more call to sem_post().
> * The second is stuck in a SpinLock.lock loop, called from 
>   _prefork_handler() inside forkx() inside fork() - my guess would be 
>   the handler being called is _d_gcx_atfork_prepare().
> * Specific to Solaris, I've clocked this line in the forkx 
>   implementation:
> 
> https://github.com/illumos/illumos-gate/blob/a21856a054bd854f39d1d55a6b0d547cb0d2039f/usr/src/lib/libc/port/threads/scalls.c#L177
> 
> I think what's going on is that the thread that wants to do a GC 
> collection has issued a signal to all threads, but Solaris has called 
> sigoff() in the last thread, the one inside fork(), so the signal never 
> reaches it.
> 
> This behaviour does not change when COLLECT_FORK is disabled, so Solaris 
> would still be affected.
> 

I forgot to mention: thread #1, the one that wants to do a GC, holds the 
SpinLock. That's why thread #2 is stuck in its current loop.

The order of operations that leads to the hang on Solaris at runtime is:
1. Thread #1 calls GC.lockNR() and has hold of the global GC SpinLock.
2. Thread #2 calls fork(). It too calls GC.lockNR() in 
   _d_gcx_atfork_prepare() and is waiting for the global lock.
3. Thread #1 decides to call thread_suspendAll() and will never release 
   the GC lock until all threads are suspended.
4. Thread #2 will never suspend because Solaris has set sigoff() on it 
   until the pthread_atfork prepare handler has returned (which it won't, 
   as it is still waiting for the GC lock).
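
To make the circular wait explicit, here is a toy model of the four steps 
as a "cannot finish until" graph. It is only a sketch: the node names and 
the find_cycle() helper are made up for illustration and are not druntime 
or Solaris code.

```python
# Toy model of the hang. Each entry reads "key cannot finish until value".
# All names here are illustrative labels, not real druntime/libc symbols.

def find_cycle(waits_for, start):
    """Follow the single 'waits for' edge from each node, starting at
    `start`. Return the list of nodes forming a cycle, or None if the
    chain ends without looping back."""
    path = []
    node = start
    while node is not None and node not in path:
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        return None
    return path[path.index(node):]

waits_for = {
    # Step 3: thread #1 waits until every thread has acknowledged the suspend.
    "thread1: thread_suspendAll": "thread2: suspend signal delivered",
    # Step 4: signals to thread #2 are deferred until the prepare handler returns.
    "thread2: suspend signal delivered": "thread2: prepare handler returns",
    # Step 2: the prepare handler is spinning on the global GC lock.
    "thread2: prepare handler returns": "GC lock released",
    # Steps 1 and 3: thread #1 keeps the lock until the suspend completes.
    "GC lock released": "thread1: thread_suspendAll",
}

cycle = find_cycle(waits_for, "thread1: thread_suspendAll")
print(" -> ".join(cycle + cycle[:1]))

# Hypothetical variant: if the suspend signal could be delivered while the
# prepare handler is still running, that edge disappears and the chain ends.
no_sigoff = dict(waits_for)
del no_sigoff["thread2: suspend signal delivered"]
print(find_cycle(no_sigoff, "thread1: thread_suspendAll"))
```

In this model all four steps sit on one cycle, so breaking any single edge 
(the deferred signal, or holding the GC lock across the suspend) is enough 
to unblock it.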

It would appear that some other, finer-grained lock is needed to 
prevent this kind of deadlock.

Iain.
