On Tue, Jul 26, 2022 at 07:22:52PM -0400, Tom Lane wrote: > Thomas Munro <thomas.mu...@gmail.com> writes: > > ... The regular expression machinery is capable of > > consuming a lot of CPU, and does CANCEL_REQUESTED(nfa->v->re) > > frequently to avoid getting stuck. With the patch as it stands, that > > would never be true. > > Surely that can't be too hard to fix. We might have to refactor > the code around QueryCancelPending a little bit so that callers > can ask "do we need a query cancel now?" without actually triggering > a longjmp ... but why would that be problematic?
It could work. The problems are like those of making code safe to run in a signal handler. You can use e.g. snprintf in rcancelrequested(), but you still can't use palloc() or ereport(). I see at least these strategies: 1. Accept that recovery conflict checks run after a regex call completes. 2. Have rcancelrequested() return true unconditionally if we need a conflict check. If there's no actual conflict, restart the regex. 3. Have rcancelrequested() run the conflict check, including elog-using PostgreSQL code. On longjmp(), accept the leak of regex mallocs. 4. Have rcancelrequested() run the conflict check, including elog-using PostgreSQL code. On longjmp(), escalate to FATAL. 5. Write the conflict check code to dutifully avoid longjmp(). 6. Convert src/backend/regex to use palloc, so longjmp() is fine. I would tend to pick (3). (6) could come later and remove the drawback of (3). Does one of those unblock the patch, or not? === I found this thread because $SUBJECT is causing more buildfarm failures lately. Here are just the ones with symptom "timed out waiting for match: (?^:User was holding a relation lock for too long)": sysname │ snapshot │ branch │ bfurl ───────────┼─────────────────────┼───────────────┼──────────────────────────────────────────────────────────────────────────────────────────────── wrasse │ 2022-09-16 09:19:06 │ REL_15_STABLE │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-09-16%2009%3A19%3A06 francolin │ 2022-09-24 02:02:23 │ REL_15_STABLE │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=francolin&dt=2022-09-24%2002%3A02%3A23 wrasse │ 2022-10-19 08:49:16 │ HEAD │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-10-19%2008%3A49%3A16 wrasse │ 2022-11-16 16:59:23 │ HEAD │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-11-16%2016%3A59%3A23 wrasse │ 2022-11-17 09:58:48 │ REL_15_STABLE │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-11-17%2009%3A58%3A48 wrasse │ 2022-11-21 22:17:20 │ REL_15_STABLE │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-11-21%2022%3A17%3A20 wrasse │ 2022-11-22 21:52:26 │ REL_15_STABLE │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-11-22%2021%3A52%3A26 wrasse │ 2022-11-25 09:16:44 │ REL_15_STABLE │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-11-25%2009%3A16%3A44 wrasse │ 2022-12-04 23:33:26 │ HEAD │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-12-04%2023%3A33%3A26 wrasse │ 2022-12-07 11:48:54 │ HEAD │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-12-07%2011%3A48%3A54 wrasse │ 2022-12-07 20:58:49 │ HEAD │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-12-07%2020%3A58%3A49 wrasse │ 2022-12-09 12:19:40 │ HEAD │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-12-09%2012%3A19%3A40 wrasse │ 2022-12-09 15:29:45 │ HEAD │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-12-09%2015%3A29%3A45 wrasse │ 2022-12-15 09:29:52 │ HEAD │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-12-15%2009%3A29%3A52 wrasse │ 2022-12-23 07:37:06 │ REL_15_STABLE │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-12-23%2007%3A37%3A06 wrasse │ 2022-12-23 10:32:05 │ HEAD │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-12-23%2010%3A32%3A05 wrasse │ 2022-12-23 17:47:17 │ HEAD │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-12-23%2017%3A47%3A17 (17 rows) I can reproduce that symptom reliably, on GNU/Linux, with the attached patch adding sleeps. The key log bit: 2022-09-16 11:50:37.338 CEST [15022:4] 031_recovery_conflict.pl LOG: statement: BEGIN; 2022-09-16 11:50:37.339 CEST [15022:5] 031_recovery_conflict.pl LOG: statement: LOCK TABLE test_recovery_conflict_table1 IN ACCESS SHARE MODE; 2022-09-16 11:50:37.341 CEST [15022:6] 031_recovery_conflict.pl LOG: statement: SELECT 1; 2022-09-16 11:50:38.076 CEST [14880:17] LOG: recovery still waiting after 11.482 ms: recovery conflict on lock 2022-09-16 11:50:38.076 CEST [14880:18] DETAIL: Conflicting process: 15022. 2022-09-16 11:50:38.076 CEST [14880:19] CONTEXT: WAL redo at 0/34243F0 for Standby/LOCK: xid 733 db 16385 rel 16386 2022-09-16 11:50:38.196 CEST [15022:7] 031_recovery_conflict.pl FATAL: terminating connection due to conflict with recovery 2022-09-16 11:50:38.196 CEST [15022:8] 031_recovery_conflict.pl DETAIL: User transaction caused buffer deadlock with recovery. 2022-09-16 11:50:38.196 CEST [15022:9] 031_recovery_conflict.pl HINT: In a moment you should be able to reconnect to the database and repeat your command. 2022-09-16 11:50:38.197 CEST [15022:10] 031_recovery_conflict.pl LOG: disconnection: session time: 0:00:01.041 user=nm database=test_db host=[local] 2022-09-16 11:50:38.198 CEST [14880:20] LOG: recovery finished waiting after 132.886 ms: recovery conflict on lock The second DETAIL should be "User was holding a relation lock for too long." The backend in question is idle in transaction. RecoveryConflictInterrupt() for PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK won't see IsWaitingForLock(), so it will find no conflict. However, RecoveryConflictReason remains clobbered, hence the wrong DETAIL message. Incidentally, the affected test contains comment "# DROP TABLE containing block which standby has in a pinned buffer". The standby holds no pin at that moment; the LOCK TABLE pins system catalog pages, but it drops every pin it acquires.
Author: Noah Misch <n...@leadboat.com> Commit: Noah Misch <n...@leadboat.com> diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c index 7767657..a1c0d38 100644 --- a/src/backend/storage/ipc/procsignal.c +++ b/src/backend/storage/ipc/procsignal.c @@ -638,6 +638,8 @@ void procsignal_sigusr1_handler(SIGNAL_ARGS) { int save_errno = errno; + static bool slept = false; + bool got; if (CheckProcSignal(PROCSIG_CATCHUP_INTERRUPT)) HandleCatchupInterrupt(); @@ -663,13 +665,21 @@ procsignal_sigusr1_handler(SIGNAL_ARGS) if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_TABLESPACE)) RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_TABLESPACE); + got = CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK); + if (got && !slept) + { + slept = true; + /* probably must exceed max_standby_streaming_delay */ + pg_usleep(60 * 1000); + } + if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_LOCK)) RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_LOCK); if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT)) RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT); - if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK)) + if (got) RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK); if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN)) diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c index f43229d..204ae80 100644 --- a/src/backend/storage/ipc/standby.c +++ b/src/backend/storage/ipc/standby.c @@ -854,6 +854,7 @@ ResolveRecoveryConflictWithBufferPin(void) * is basically no so long. But we should fix this? */ SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK); + pg_usleep(90 * 1000); } /* diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 01d264b..854e0db 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3003,6 +3003,9 @@ RecoveryConflictInterrupt(ProcSignalReason reason) { int save_errno = errno; + /* Not elog(), which would process interrupts. */ + fprintf(stderr, "reason = %d inprog=%d\n", reason, proc_exit_inprogress); + /* * Don't joggle the elbow of proc_exit */