Hi,

A parallel worker hangs while handling errors.
Analysis:
When an error occurs in a parallel worker process, the worker calls ereport/elog with the error message. The worker then jumps from errfinish to the sigsetjmp point that was set up earlier in the StartBackgroundWorker function. From there, the worker sends the error message to the leader process through shared memory. The shared memory size is 16K, so an error message smaller than 16K works fine. With a bigger error message, the worker has to wait for the leader process to read the message, free up some space in shared memory and set the latch. The worker waits with the following backtrace:

#4  0x000000000090480c in WaitLatch (latch=0x7f2b39f6b454, wakeEvents=33, timeout=0, wait_event_info=134217753) at latch.c:368
#5  0x0000000000787c7f in mq_putmessage (msgtype=69 'E', s=0x2f24350 "SERROR", len=230015) at pqmq.c:171
#6  0x000000000078712e in pq_endmessage (buf=0x7ffe721c4370) at pqformat.c:301
#7  0x0000000000ac1749 in send_message_to_frontend (edata=0xfe91a0 <errordata>) at elog.c:3327
#8  0x0000000000abdf5b in EmitErrorReport () at elog.c:1460

The leader process then notices that there are messages to be processed. It copies them and sets the latch so that the worker can copy the remainder of the message. This happens in:

shm_mq_inc_bytes_read -> SetLatch(&sender->procLatch);

The worker is not able to receive any signal at this point and hangs forever.

The worker hangs because signals are masked using sigprocmask when the worker is started. They are unblocked by calling BackgroundWorkerUnblockSignals in ParallelWorkerMain. During error handling, however, the worker has jumped back to the sigsetjmp in StartBackgroundWorker. sigsetjmp is called there with savemask = 1, so the jump restores the signal mask that was in effect at that point, which is the blocked state. Hence the signal is not received by the worker process.

One possible fix is to call BackgroundWorkerUnblockSignals just after sigsetjmp. I'm not sure if this is the best solution.
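To illustrate the signal-mask behaviour described above, here is a minimal standalone POSIX sketch. It is not PostgreSQL code: the function names (change_mask, sigusr1_is_blocked, blocked_in_recovery_path) are hypothetical, and SIGUSR1 is only used as a stand-in for the latch wakeup signal. It blocks the signal, records a sigsetjmp recovery point with savemask = 1, unblocks the signal as the worker's main function would, and then jumps back as the error path would.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>

/* Apply a signal mask change; sigprocmask only fails on invalid arguments. */
static void
change_mask(int how, const sigset_t *set, sigset_t *old)
{
    int rc = sigprocmask(how, set, old);

    assert(rc == 0);
    (void) rc;
}

/* Return 1 if SIGUSR1 is currently blocked in this process, else 0. */
static int
sigusr1_is_blocked(void)
{
    sigset_t current;

    /* With a NULL set, sigprocmask only reports the current mask. */
    change_mask(SIG_BLOCK, NULL, &current);
    return sigismember(&current, SIGUSR1);
}

/*
 * Hypothetical model of the worker's startup and error path.
 * Returns 1 if SIGUSR1 is blocked in the recovery path, 0 otherwise.
 */
int
blocked_in_recovery_path(int unblock_after_jump)
{
    sigjmp_buf recovery_point;
    sigset_t usr1;
    sigset_t saved;
    int blocked;

    sigemptyset(&usr1);
    sigaddset(&usr1, SIGUSR1);

    /* Signals start out blocked, as they are when the worker is forked. */
    change_mask(SIG_BLOCK, &usr1, &saved);

    /* savemask = 1: the mask in effect right now is stored in the buffer. */
    if (sigsetjmp(recovery_point, 1) == 0)
    {
        /* Normal path: the worker's main function unblocks signals ... */
        change_mask(SIG_UNBLOCK, &usr1, NULL);
        /* ... and a later error jumps back to the recovery point. */
        siglongjmp(recovery_point, 1);
    }

    /* Recovery path: the saved (blocked) mask has been restored. */
    if (unblock_after_jump)
        change_mask(SIG_UNBLOCK, &usr1, NULL);  /* the proposed kind of fix */

    blocked = sigusr1_is_blocked();

    /* Restore the caller's original mask before returning. */
    change_mask(SIG_SETMASK, &saved, NULL);
    return blocked;
}
```

In this sketch, blocked_in_recovery_path(0) reports the signal as blocked after the jump even though it was unblocked on the normal path, and blocked_in_recovery_path(1) reports it as unblocked once the recovery path unblocks it again.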
Robert and I discussed the problem yesterday. We felt this is a genuine problem with the parallel worker error handling and needs to be fixed.

I could reproduce this issue with an error during the copy of toast data using parallel copy, which is still an in-progress project. I don't have a test case to reproduce it on HEAD. Any suggestions for a test case on HEAD?

The attached patch has the fix for this.

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From a2177fdb0896160f209db4eebc5b4d80eb341e42 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Fri, 3 Jul 2020 12:18:55 +0530
Subject: [PATCH] Fix for Parallel worker hangs while handling errors.

Worker is not able to receive the signals while processing error flow. Worker
hangs in this case because when the worker is started the signals will be
masked using sigprocmask. Unblocking of signals is done by calling
BackgroundWorkerUnblockSignals in ParallelWorkerMain. Now due to error
handling the worker has jumped to setjmp in StartBackgroundWorker function.
Here the signals are in blocked state, hence the signal is not received by the
worker process.
---
 src/backend/postmaster/bgworker.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index beb5e85..9663907 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -747,6 +747,11 @@ StartBackgroundWorker(void)
 	 */
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
+		/*
+		 * Unblock signals (they were blocked when the postmaster forked us)
+		 */
+		BackgroundWorkerUnblockSignals();
+
 		/* Since not using PG_TRY, must reset error stack by hand */
 		error_context_stack = NULL;
 
-- 
1.8.3.1