On Sun, Apr 10, 2022 at 11:00 AM Andres Freund <and...@anarazel.de> wrote:
> On 2022-04-09 14:39:16 -0700, Andres Freund wrote:
> > On 2022-04-09 17:00:41 -0400, Tom Lane wrote:
> > > Thomas Munro <thomas.mu...@gmail.com> writes:
> > > > Unlike most "procsignal" handler routines, RecoveryConflictInterrupt()
> > > > doesn't just set a sig_atomic_t flag and poke the latch.  Is the extra
> > > > stuff it does safe?  For example, is this call stack OK (to pick one
> > > > that jumps out, but not the only one)?
> > >
> > > > procsignal_sigusr1_handler
> > > > -> RecoveryConflictInterrupt
> > > >  -> HoldingBufferPinThatDelaysRecovery
> > > >   -> GetPrivateRefCount
> > > >    -> GetPrivateRefCountEntry
> > > >     -> hash_search(...hash table that might be in the middle of an 
> > > > update...)
> > >
> > > Ugh.  That one was safe before somebody decided we needed a hash table
> > > for buffer refcounts, but it's surely not safe now.
> >
> > Mea culpa. This is 4b4b680c3d6d - from 2014.
>
> Whoa. There's way worse: StandbyTimeoutHandler() calls
> SendRecoveryConflictWithBufferPin(), which calls CancelDBBackends(), which
> acquires lwlocks etc.
>
> Which very plausibly is the cause for the issue I'm investigating in
> https://www.postgresql.org/message-id/20220409220054.fqn5arvbeesmxdg5%40alap3.anarazel.de

Huh.  I wouldn't have started a separate thread for this if I'd
realised I was getting close to the cause of the CI failure... I
thought this was an incidental observation.  Anyway, I made a first
attempt at fixing this SIGUSR1 problem (I think Andres is looking at
the SIGALRM problem in the other thread).

Instead of bothering to create N different XXXPending variables for
the different conflict "reasons", I used an array.  Other than that,
it's much like existing examples.

The existing use of the global variable RecoveryConflictReason seems a
little woolly.  Doesn't it get clobbered every time a signal arrives,
even if we determine that there is no conflict?  Not sure why that's
OK, but anyway, this patch always sets it together with
RecoveryConflictPending = true.
From 1ba60808a23159b8d99cfec70111b6724a55e57b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Tue, 12 Apr 2022 07:33:59 +1200
Subject: [PATCH] Fix recovery conflict SIGUSR1 handling.

We shouldn't be doing real work in a signal handler.  Move the check
into the next CFI.

(There's a related problem in the recovery conflict SIGALRM handling,
for another patch.)

Discussion: https://postgr.es/m/CA%2BhUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO%3DnEYKSg%40mail.gmail.com
---
 src/backend/storage/ipc/procsignal.c | 12 +++----
 src/backend/tcop/postgres.c          | 53 ++++++++++++++++++----------
 src/include/storage/procsignal.h     |  4 ++-
 src/include/tcop/tcopprot.h          |  3 +-
 4 files changed, 45 insertions(+), 27 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index f41563a0a4..ce08580b5b 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -653,22 +653,22 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 		HandleLogMemoryContextInterrupt();
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_TABLESPACE))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_LOCK))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_LOCK);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_LOCK);
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
 
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
-		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
+		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
 	SetLatch(MyLatch);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 95dc2e2c83..a89066badb 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -172,6 +172,7 @@ static bool EchoQuery = false;	/* -E switch */
 static bool UseSemiNewlineNewline = false;	/* -j switch */
 
 /* whether or not, and why, we were canceled by conflict with recovery */
+volatile sig_atomic_t RecoveryConflictInterruptPending[NUM_PROCSIGNALS];
 static bool RecoveryConflictPending = false;
 static bool RecoveryConflictRetryable = true;
 static ProcSignalReason RecoveryConflictReason;
@@ -2993,22 +2994,31 @@ FloatExceptionHandler(SIGNAL_ARGS)
 }
 
 /*
- * RecoveryConflictInterrupt: out-of-line portion of recovery conflict
- * handling following receipt of SIGUSR1. Designed to be similar to die()
- * and StatementCancelHandler(). Called only by a normal user backend
- * that begins a transaction during recovery.
+ * Tell the next CHECK_FOR_INTERRUPTS() to check for a particular type of
+ * recovery conflict.  Runs in a SIGUSR1 handler.
  */
 void
-RecoveryConflictInterrupt(ProcSignalReason reason)
+HandleRecoveryConflictInterrupt(ProcSignalReason reason)
 {
-	int			save_errno = errno;
+	RecoveryConflictInterruptPending[reason] = true;
+	InterruptPending = true;
+	/* latch will be set by procsignal_sigusr1_handler */
+}
 
+/*
+ * Check one recovery conflict reason.  This is called when the corresponding
+ * RecoveryConflictInterruptPending flags is set.  If we decide that a conflict
+ * exists, then RecoveryConflictReason and RecoveryConflictPending will be set,
+ * to be handled later in the same invocation of ProcessInterrupts().
+ */
+static void
+ProcessRecoveryConflictInterrupt(ProcSignalReason reason)
+{
 	/*
 	 * Don't joggle the elbow of proc_exit
 	 */
 	if (!proc_exit_inprogress)
 	{
-		RecoveryConflictReason = reason;
 		switch (reason)
 		{
 			case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
@@ -3084,9 +3094,9 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 					if (IsAbortedTransactionBlockState())
 						return;
 
+					RecoveryConflictReason = reason;
 					RecoveryConflictPending = true;
 					QueryCancelPending = true;
-					InterruptPending = true;
 					break;
 				}
 
@@ -3094,9 +3104,9 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 				/* FALLTHROUGH */
 
 			case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+				RecoveryConflictReason = reason;
 				RecoveryConflictPending = true;
 				ProcDiePending = true;
-				InterruptPending = true;
 				break;
 
 			default:
@@ -3115,15 +3125,6 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
 		if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE)
 			RecoveryConflictRetryable = false;
 	}
-
-	/*
-	 * Set the process latch. This function essentially emulates signal
-	 * handlers like die() and StatementCancelHandler() and it seems prudent
-	 * to behave similarly as they do.
-	 */
-	SetLatch(MyLatch);
-
-	errno = save_errno;
 }
 
 /*
@@ -3147,6 +3148,22 @@ ProcessInterrupts(void)
 		return;
 	InterruptPending = false;
 
+	/*
+	 * Have we been asked to check for a recovery conflict?  Processing these
+	 * interrupts may result in RecoveryConflictPending and related variables
+	 * being set, to be handled further down.
+	 */
+	for (int i = PROCSIG_RECOVERY_CONFLICT_FIRST;
+		 i <= PROCSIG_RECOVERY_CONFLICT_LAST;
+		 ++i)
+	{
+		if (RecoveryConflictInterruptPending[i])
+		{
+			RecoveryConflictInterruptPending[i] = false;
+			ProcessRecoveryConflictInterrupt(i);
+		}
+	}
+
 	if (ProcDiePending)
 	{
 		ProcDiePending = false;
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index ee636900f3..26d045950c 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -37,12 +37,14 @@ typedef enum
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 
 	/* Recovery conflict reasons */
-	PROCSIG_RECOVERY_CONFLICT_DATABASE,
+	PROCSIG_RECOVERY_CONFLICT_FIRST,
+	PROCSIG_RECOVERY_CONFLICT_DATABASE = PROCSIG_RECOVERY_CONFLICT_FIRST,
 	PROCSIG_RECOVERY_CONFLICT_TABLESPACE,
 	PROCSIG_RECOVERY_CONFLICT_LOCK,
 	PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
 	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
 	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
+	PROCSIG_RECOVERY_CONFLICT_LAST = PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
 
 	NUM_PROCSIGNALS				/* Must be last! */
 } ProcSignalReason;
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 87e408b719..6c3c91aeea 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -73,8 +73,7 @@ extern void die(SIGNAL_ARGS);
 extern void quickdie(SIGNAL_ARGS) pg_attribute_noreturn();
 extern void StatementCancelHandler(SIGNAL_ARGS);
 extern void FloatExceptionHandler(SIGNAL_ARGS) pg_attribute_noreturn();
-extern void RecoveryConflictInterrupt(ProcSignalReason reason); /* called from SIGUSR1
-																 * handler */
+extern void HandleRecoveryConflictInterrupt(ProcSignalReason reason);
 extern void ProcessClientReadInterrupt(bool blocked);
 extern void ProcessClientWriteInterrupt(bool blocked);
 
-- 
2.30.2

Reply via email to