Parallel worker hangs while handling errors.

vignesh C Fri, 03 Jul 2020 02:12:06 -0700

Hi,

Parallel worker hangs while handling errors.


Analysis:
When there is an error in the parallel worker process, we will call
ereport/elog with the error message. Worker will then jump from
errfinish to setjmp in StartBackgroundWorker function which was set
earlier. Then the worker process will then send the error message
through the shared memory to the leader process. Shared memory size is
ok 16K, if the error message is less than 16K it works fine. If there
is a bigger error message, the worker process will wait for the leader
process to read the message, free up some memory in shared memory and
set the latch. The worker will be waiting at the below back trace:
#4  0x000000000090480c in WaitLatch (latch=0x7f2b39f6b454,
wakeEvents=33, timeout=0, wait_event_info=134217753) at latch.c:368
#5  0x0000000000787c7f in mq_putmessage (msgtype=69 'E', s=0x2f24350
"SERROR", len=230015) at pqmq.c:171
#6  0x000000000078712e in pq_endmessage (buf=0x7ffe721c4370) at pqformat.c:301
#7  0x0000000000ac1749 in send_message_to_frontend (edata=0xfe91a0
<errordata>) at elog.c:3327
#8  0x0000000000abdf5b in EmitErrorReport () at elog.c:1460

Leader process then identifies that there are some messages that need
to be processed, it copies the messages and sets the latch so that the
worker process can copy the remaining message from the below function:
shm_mq_inc_bytes_read -> SetLatch(&sender->procLatch);, Worker is not
able to receive any signal at this point of time & hangs infinitely
Worker hangs in this case because when the worker is started the
signals will be masked using sigprocmask. Unblocking of signals is
done by calling BackgroundWorkerUnblockSignals in ParallelWorkerMain.
Now due to error handling the worker has jumped to setjmp in
StartBackgroundWorker function. Here the signals are in a blocked
state, hence the signal is not received by the worker process.

One of the fixes could be to call BackgroundWorkerUnblockSignals just
after sigsetjmp. I'm not sure if this is the best solution.
Robert & myself had a discussion about the problem yesterday. We felt
this is a genuine problem with the parallel worker error handling and
need to be fixed.
I could reproduce this issue when there is an error during copy of
toast data using parallel copy, this project is an in-progress
project. I don't have a test case to reproduce on the head. Any
suggestions for a test case on head?
The Attached patch has the fix for the same.

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

From a2177fdb0896160f209db4eebc5b4d80eb341e42 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Fri, 3 Jul 2020 12:18:55 +0530
Subject: [PATCH] Fix for Parallel worker hangs while handling errors.

Worker is not able to receive the signals while processing error flow. Worker
hangs in this case because when the worker is started the signals will be
masked using sigprocmask. Unblocking of signals is done by calling
BackgroundWorkerUnblockSignals in ParallelWorkerMain. Now due to error
handling the worker has jumped to setjmp in StartBackgroundWorker function.
Here the signals are in blocked state, hence the signal is not received by the
worker process.
---
 src/backend/postmaster/bgworker.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index beb5e85..9663907 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -747,6 +747,11 @@ StartBackgroundWorker(void)
 	 */
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
+		/*
+		 * Unblock signals (they were blocked when the postmaster forked us)
+		 */
+		BackgroundWorkerUnblockSignals();
+
 		/* Since not using PG_TRY, must reset error stack by hand */
 		error_context_stack = NULL;
 
-- 
1.8.3.1

Parallel worker hangs while handling errors.

Reply via email to