Hi, On 2022-04-09 16:10:02 -0700, Andres Freund wrote: > It's not that (although I still suspect it's a problem). It's a self-deadlock, > because StandbyTimeoutHandler(), which ResolveRecoveryConflictWithBufferPin() > *explicitly enables*, calls SendRecoveryConflictWithBufferPin(). Which does > CancelDBBackends(). Which ResolveRecoveryConflictWithBufferPin() also calls, > if the deadlock timeout is reached. > > To make it easier to hit, I put a pg_usleep(10000) in CancelDBBackends(), and > boom: > > [... backtrace ... ] > > it's reproducible on linux. > > > I'm lacking words I dare to put in an email to describe how bad an idea it is > to call CancelDBBackends() from within a timeout function, particularly when > the function enabling the timeout also calls that function itself. Before > disabling timers.
It's been broken in different ways all the way back to 9.0, from what I can see, but I didn't check every single version. Afaics the fix is to nuke the idea of doing anything substantial in the signal handler from orbit, and instead just set a flag in the handler. Then check whether the timeout happend after ProcWaitForSignal() and call SendRecoveryConflictWithBufferPin(). Also worth noting that the disable_all_timeouts() calls appears to break STARTUP_PROGRESS_TIMEOUT. Greetings, Andres Freund