On Sun, Apr 10, 2022 at 11:00 AM Andres Freund <and...@anarazel.de> wrote: > On 2022-04-09 14:39:16 -0700, Andres Freund wrote: > > On 2022-04-09 17:00:41 -0400, Tom Lane wrote: > > > Thomas Munro <thomas.mu...@gmail.com> writes: > > > > Unlike most "procsignal" handler routines, RecoveryConflictInterrupt() > > > > doesn't just set a sig_atomic_t flag and poke the latch. Is the extra > > > > stuff it does safe? For example, is this call stack OK (to pick one > > > > that jumps out, but not the only one)? > > > > > > > procsignal_sigusr1_handler > > > > -> RecoveryConflictInterrupt > > > > -> HoldingBufferPinThatDelaysRecovery > > > > -> GetPrivateRefCount > > > > -> GetPrivateRefCountEntry > > > > -> hash_search(...hash table that might be in the middle of an > > > > update...) > > > > > > Ugh. That one was safe before somebody decided we needed a hash table > > > for buffer refcounts, but it's surely not safe now. > > > > Mea culpa. This is 4b4b680c3d6d - from 2014. > > Whoa. There's way worse: StandbyTimeoutHandler() calls > SendRecoveryConflictWithBufferPin(), which calls CancelDBBackends(), which > acquires lwlocks etc. > > Which very plausibly is the cause for the issue I'm investigating in > https://www.postgresql.org/message-id/20220409220054.fqn5arvbeesmxdg5%40alap3.anarazel.de
Huh. I wouldn't have started a separate thread for this if I'd realised I was getting close to the cause of the CI failure... I thought this was an incidental observation. Anyway, I made a first attempt at fixing this SIGUSR1 problem (I think Andres is looking at the SIGALRM problem in the other thread). Instead of bothering to create N different XXXPending variables for the different conflict "reasons", I used an array. Other than that, it's much like existing examples. The existing use of the global variable RecoveryConflictReason seems a little woolly. Doesn't it get clobbered every time a signal arrives, even if we determine that there is no conflict? Not sure why that's OK, but anyway, this patch always sets it together with RecoveryConflictPending = true.
From 1ba60808a23159b8d99cfec70111b6724a55e57b Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Tue, 12 Apr 2022 07:33:59 +1200 Subject: [PATCH] Fix recovery conflict SIGUSR1 handling. We shouldn't be doing real work in a signal handler. Move the check into the next CFI. (There's a related problem in the recovery conflict SIGALRM handling, for another patch.) Discussion: https://postgr.es/m/CA%2BhUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO%3DnEYKSg%40mail.gmail.com --- src/backend/storage/ipc/procsignal.c | 12 +++---- src/backend/tcop/postgres.c | 53 ++++++++++++++++++---------- src/include/storage/procsignal.h | 4 ++- src/include/tcop/tcopprot.h | 3 +- 4 files changed, 45 insertions(+), 27 deletions(-) diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c index f41563a0a4..ce08580b5b 100644 --- a/src/backend/storage/ipc/procsignal.c +++ b/src/backend/storage/ipc/procsignal.c @@ -653,22 +653,22 @@ procsignal_sigusr1_handler(SIGNAL_ARGS) HandleLogMemoryContextInterrupt(); if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE)) - RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE); + HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE); if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_TABLESPACE)) - RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_TABLESPACE); + HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_TABLESPACE); if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_LOCK)) - RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_LOCK); + HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_LOCK); if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT)) - RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT); + HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT); if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK)) - RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK); + HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK); if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN)) - RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN); + HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN); SetLatch(MyLatch); diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 95dc2e2c83..a89066badb 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -172,6 +172,7 @@ static bool EchoQuery = false; /* -E switch */ static bool UseSemiNewlineNewline = false; /* -j switch */ /* whether or not, and why, we were canceled by conflict with recovery */ +volatile sig_atomic_t RecoveryConflictInterruptPending[NUM_PROCSIGNALS]; static bool RecoveryConflictPending = false; static bool RecoveryConflictRetryable = true; static ProcSignalReason RecoveryConflictReason; @@ -2993,22 +2994,31 @@ FloatExceptionHandler(SIGNAL_ARGS) } /* - * RecoveryConflictInterrupt: out-of-line portion of recovery conflict - * handling following receipt of SIGUSR1. Designed to be similar to die() - * and StatementCancelHandler(). Called only by a normal user backend - * that begins a transaction during recovery. + * Tell the next CHECK_FOR_INTERRUPTS() to check for a particular type of + * recovery conflict. Runs in a SIGUSR1 handler. */ void -RecoveryConflictInterrupt(ProcSignalReason reason) +HandleRecoveryConflictInterrupt(ProcSignalReason reason) { - int save_errno = errno; + RecoveryConflictInterruptPending[reason] = true; + InterruptPending = true; + /* latch will be set by procsignal_sigusr1_handler */ +} +/* + * Check one recovery conflict reason. This is called when the corresponding + * RecoveryConflictInterruptPending flags is set. If we decide that a conflict + * exists, then RecoveryConflictReason and RecoveryConflictPending will be set, + * to be handled later in the same invocation of ProcessInterrupts(). + */ +static void +ProcessRecoveryConflictInterrupt(ProcSignalReason reason) +{ /* * Don't joggle the elbow of proc_exit */ if (!proc_exit_inprogress) { - RecoveryConflictReason = reason; switch (reason) { case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK: @@ -3084,9 +3094,9 @@ RecoveryConflictInterrupt(ProcSignalReason reason) if (IsAbortedTransactionBlockState()) return; + RecoveryConflictReason = reason; RecoveryConflictPending = true; QueryCancelPending = true; - InterruptPending = true; break; } @@ -3094,9 +3104,9 @@ RecoveryConflictInterrupt(ProcSignalReason reason) /* FALLTHROUGH */ case PROCSIG_RECOVERY_CONFLICT_DATABASE: + RecoveryConflictReason = reason; RecoveryConflictPending = true; ProcDiePending = true; - InterruptPending = true; break; default: @@ -3115,15 +3125,6 @@ RecoveryConflictInterrupt(ProcSignalReason reason) if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE) RecoveryConflictRetryable = false; } - - /* - * Set the process latch. This function essentially emulates signal - * handlers like die() and StatementCancelHandler() and it seems prudent - * to behave similarly as they do. - */ - SetLatch(MyLatch); - - errno = save_errno; } /* @@ -3147,6 +3148,22 @@ ProcessInterrupts(void) return; InterruptPending = false; + /* + * Have we been asked to check for a recovery conflict? Processing these + * interrupts may result in RecoveryConflictPending and related variables + * being set, to be handled further down. + */ + for (int i = PROCSIG_RECOVERY_CONFLICT_FIRST; + i <= PROCSIG_RECOVERY_CONFLICT_LAST; + ++i) + { + if (RecoveryConflictInterruptPending[i]) + { + RecoveryConflictInterruptPending[i] = false; + ProcessRecoveryConflictInterrupt(i); + } + } + if (ProcDiePending) { ProcDiePending = false; diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h index ee636900f3..26d045950c 100644 --- a/src/include/storage/procsignal.h +++ b/src/include/storage/procsignal.h @@ -37,12 +37,14 @@ typedef enum PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */ /* Recovery conflict reasons */ - PROCSIG_RECOVERY_CONFLICT_DATABASE, + PROCSIG_RECOVERY_CONFLICT_FIRST, + PROCSIG_RECOVERY_CONFLICT_DATABASE = PROCSIG_RECOVERY_CONFLICT_FIRST, PROCSIG_RECOVERY_CONFLICT_TABLESPACE, PROCSIG_RECOVERY_CONFLICT_LOCK, PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, PROCSIG_RECOVERY_CONFLICT_BUFFERPIN, PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK, + PROCSIG_RECOVERY_CONFLICT_LAST = PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK, NUM_PROCSIGNALS /* Must be last! */ } ProcSignalReason; diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h index 87e408b719..6c3c91aeea 100644 --- a/src/include/tcop/tcopprot.h +++ b/src/include/tcop/tcopprot.h @@ -73,8 +73,7 @@ extern void die(SIGNAL_ARGS); extern void quickdie(SIGNAL_ARGS) pg_attribute_noreturn(); extern void StatementCancelHandler(SIGNAL_ARGS); extern void FloatExceptionHandler(SIGNAL_ARGS) pg_attribute_noreturn(); -extern void RecoveryConflictInterrupt(ProcSignalReason reason); /* called from SIGUSR1 - * handler */ +extern void HandleRecoveryConflictInterrupt(ProcSignalReason reason); extern void ProcessClientReadInterrupt(bool blocked); extern void ProcessClientWriteInterrupt(bool blocked); -- 2.30.2