Re: Parallel worker hangs while handling errors.

vignesh C Fri, 24 Jul 2020 18:33:20 -0700

On Fri, Jul 24, 2020 at 12:41 PM Bharath Rupireddy
<bharath.rupireddyforpostg...@gmail.com> wrote:
>
> On Thu, Jul 23, 2020 at 12:54 PM vignesh C <vignes...@gmail.com> wrote:
> >
> > Thanks for reviewing and adding your thoughts, My comments are inline.
> >
> > On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy
> > <bharath.rupireddyforpostg...@gmail.com> wrote:
> > >
> > > The same hang issue can occur(though I'm not able to back it up with a
> > > use case), in the cases from wherever the EmitErrorReport() is called
> > > from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as
> > > autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
> > > postgres.c.
> > >
> >
> > I'm not sure if this can occur in other cases.
> >
>
> I checked further on this point: Yes, it can't occur for the other
> cases, as mq_putmessage() gets only used for parallel
> workers(ParallelWorkerMain()  --> pq_redirect_to_shm_mq()).
>
> >
> > > Note that, in all sigsetjmp blocks, we intentionally
> > > HOLD_INTERRUPTS(), to not cause any issues while performing error
> > > handling, I'm concerned here that now, if we directly call
> > > BackgroundWorkerUnblockSignals() which will open up all the signals
> > > and our main intention of holding interrupts or block signals may go
> > > away.
> > >
> > > Since the main problem for this hang issue is because of blocking
> > > SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1,
> > > instead of unblocking all signals? I tried this with parallel copy
> > > hang, the issue is resolved.
> > >
> >
> > On putting further thoughts on this, I feel just unblocking SIGUSR1
> > would be the right approach in this case. I'm attaching a new patch
> > which unblocks SIGUSR1 signal. I have verified that the original issue
> > with WIP parallel copy patch gets fixed. I have made changes only in
> > bgworker.c as we require the parallel worker to receive this signal
> > and continue processing. I have not included the changes for other
> > processes as I'm not sure if this scenario is applicable for other
> > processes.
> >
>
> Since we are sure that this hang issue can occur only for parallel
> workers, and the change in StartBackgroundWorker's sigsetjmp's block
> should only be made for parallel worker cases. And also there can be a
> lot of other callbacks execution and processing in proc_exit() for
> which we might not need the SIGUSR1 unblocked. So, let's undo the
> unblocking right after EmitErrorReport() to not cause any new issues.
>
> Attaching a modified v2 patch: it has the unblocking for only for
> parallel workers, undoing it after EmitErrorReport(), and some
> adjustments in the comment.
>


I have made slight changes on top of the patch to remove duplicate
code, attached v3 patch for the same.
The parallel worker hang issue gets resolved, make check & make
check-world passes.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

From b48849d091302dfdac4fc004195b3c532fdb9dcb Mon Sep 17 00:00:00 2001
From: Vignesh C <vignes...@gmail.com>
Date: Sat, 25 Jul 2020 06:38:40 +0530
Subject: [PATCH v3] Fix for Parallel worker hangs while handling errors.

Worker is not able to receive the signals while processing error flow. Worker
hangs in this case because when the worker is started the signals will be
masked using sigprocmask. Unblocking of signals is done by calling
BackgroundWorkerUnblockSignals in ParallelWorkerMain. Now due to error
handling the worker has jumped to setjmp in StartBackgroundWorker function.
Here the signals are in blocked state, hence the signal is not received by the
worker process.

Authors: Vignesh C, Bharath Rupireddy
---
 src/backend/postmaster/bgworker.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index beb5e85..0b9f214 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -671,6 +671,23 @@ bgworker_sigusr1_handler(SIGNAL_ARGS)
 }
 
 /*
+ * update_parallel_worker_sigmask - add or remove a signal from sigmask.
+ */
+static void
+update_parallel_worker_sigmask(BackgroundWorker *worker, int signum,
+							   bool isadd)
+{
+	if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
+	{
+		if (isadd)
+			sigaddset(&BlockSig, signum);
+		else
+			sigdelset(&BlockSig, signum);
+		PG_SETMASK(&BlockSig);
+	}
+}
+
+/*
  * Start a new background worker
  *
  * This is the main entry point for background worker, to be called from
@@ -747,6 +764,16 @@ StartBackgroundWorker(void)
 	 */
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
+		/*
+		 * In case of parallel workers, unblock SIGUSR1 signal, it was blocked
+		 * when the postmaster forked us. Leader process will send SIGUSR1 signal
+		 * to the worker process(worker process will be in waiting state as
+		 * there is no space available) to indicate shared memory space is freed
+		 * up. Once the signal is received worker process will start populating
+		 * the error message further.
+		 */
+		update_parallel_worker_sigmask(worker, SIGUSR1, false);
+
 		/* Since not using PG_TRY, must reset error stack by hand */
 		error_context_stack = NULL;
 
@@ -757,6 +784,14 @@ StartBackgroundWorker(void)
 		EmitErrorReport();
 
 		/*
+		 * Undo the unblocking of SIGUSR1 which was done above, as to
+		 * not cause any further issues from unblocking SIGUSR1 during
+		 * the execution of callbacks and other processing that will be
+		 * done during proc_exit().
+		 */
+		update_parallel_worker_sigmask(worker, SIGUSR1, true);
+
+		/*
 		 * Do we need more cleanup here?  For shmem-connected bgworkers, we
 		 * will call InitProcess below, which will install ProcKill as exit
 		 * callback.  That will take care of releasing locks, etc.
-- 
1.8.3.1

Re: Parallel worker hangs while handling errors.

Reply via email to