Thanks for reviewing and adding your thoughts, My comments are inline.

On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy
<bharath.rupireddyforpostg...@gmail.com> wrote:
>
> The same hang issue can occur(though I'm not able to back it up with a
> use case), in the cases from wherever the EmitErrorReport() is called
> from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as
> autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
> postgres.c.
>

I'm not sure if this can occur in other cases.

> >
> > One of the fixes could be to call BackgroundWorkerUnblockSignals just
> > after sigsetjmp. I'm not sure if this is the best solution.
> > Robert & myself had a discussion about the problem yesterday. We felt
> > this is a genuine problem with the parallel worker error handling and
> > need to be fixed.
> >
>
> Note that, in all sigsetjmp blocks, we intentionally
> HOLD_INTERRUPTS(), to not cause any issues while performing error
> handling, I'm concerned here that now, if we directly call
> BackgroundWorkerUnblockSignals() which will open up all the signals
> and our main intention of holding interrupts or block signals may go
> away.
>
> Since the main problem for this hang issue is because of blocking
> SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1,
> instead of unblocking all signals? I tried this with parallel copy
> hang, the issue is resolved.
>

On putting further thoughts on this, I feel just unblocking SIGUSR1
would be the right approach in this case. I'm attaching a new patch
which unblocks SIGUSR1 signal. I have verified that the original issue
with WIP parallel copy patch gets fixed. I have made changes only in
bgworker.c as we require the parallel worker to receive this signal
and continue processing. I have not included the changes for other
processes as I'm not sure if this scenario is applicable for other
processes.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From b283bdd47e3bc132c1b8ef0ea59d1c12cb8c0ffb Mon Sep 17 00:00:00 2001
From: Vignesh C <vignes...@gmail.com>
Date: Thu, 23 Jul 2020 12:35:45 +0530
Subject: [PATCH] Fix for Parallel worker hangs while handling errors.

Worker is not able to receive the signals while processing error flow. Worker
hangs in this case because when the worker is started the signals will be
masked using sigprocmask. Unblocking of signals is done by calling
BackgroundWorkerUnblockSignals in ParallelWorkerMain. Now due to error
handling the worker has jumped to setjmp in StartBackgroundWorker function.
Here the signals are in blocked state, hence the signal is not received by the
worker process.
---
 src/backend/postmaster/bgworker.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index beb5e85..a0af580 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -747,6 +747,17 @@ StartBackgroundWorker(void)
 	 */
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
+		/*
+		 * Unblock SIGUSR1 signal, it was blocked when the postmaster forker us.
+		 * In case of parallel workers, leader process will send SIGUSR1 signal
+		 * to the worker process(worker process will be in waiting state as
+		 * there is no space available) to indicate shared memory space is freed
+		 * up. Once the signal is received worker process will start populating
+		 * the error message further.
+		 */
+		sigdelset(&BlockSig, SIGUSR1);
+		PG_SETMASK(&BlockSig);
+
 		/* Since not using PG_TRY, must reset error stack by hand */
 		error_context_stack = NULL;
 
-- 
1.8.3.1

Reply via email to