I wrote:
> What we've apparently got here is that signals were received
> so fast that the postmaster ran out of stack space.  I remember
> Andres complaining about this as a theoretical threat, but I
> hadn't seen it in the wild before.

> I haven't finished investigating though, as there are some things
> that remain to be explained.

I still don't have a good explanation for why this only seems to happen in the pg_upgrade test sequence. However, I did notice something very interesting: the postmaster crashes after consuming only about 1MB of stack space. This is despite the prevailing setting of "ulimit -s" being 8192 (8MB). I also confirmed that the value of max_stack_depth within the crashed process is 2048, which implies that get_stack_depth_rlimit got some value larger than 2MB from getrlimit(RLIMIT_STACK). And yet, here we have a crash, and the process memory map confirms that only 1MB was allocated in the stack region. So it's really hard to explain that as anything except a kernel bug: sometimes, the kernel doesn't give us as much stack as it promised it would. And the machine is not loaded enough for there to be any rational resource-exhaustion excuse for that.

This matches up with the intermittent infinite_recurse failures we've been seeing in the buildfarm. Those are happening across a range of systems, but they're (almost) all Linux-based ppc64, suggesting that there's a longstanding arch-specific kernel bug involved. For reference, I scraped the attached list of such failures in the last three months. I wonder whether we can get the attention of any kernel hackers about that.

Anyway, as to what to do about it --- it occurred to me to wonder why we are relying on having the signal handlers block and unblock signals manually, when we could tell sigaction() that we'd like signals blocked. It is reasonable to expect that the signal support is designed to not recursively consume stack space in the face of a series of signals, while the way we are doing it clearly opens us up to recursive space consumption. The stack trace I showed before proves that the recursion happens at the points where the signal handlers unblock signals.
As a quick hack I made the attached patch, and it seems to fix the problem on wobbegong's host. I don't see crashes any more, and watching the postmaster's stack space consumption, it stays comfortably at a tad under 200KB (probably the default initial allocation), while without the patch it tends to blow up to 700KB or more even in runs that don't crash.

This patch isn't committable as-is because it will (I suppose) break things on Windows; we still need the old way there for lack of sigaction(). But that could be fixed with a few #ifdefs. I'm also kind of tempted to move pqsignal_no_restart into backend/libpq/pqsignal.c (where BlockSig is defined) and maybe rename it, but I'm not sure to what.

This issue might go away if we switched to a postmaster implementation that doesn't do work in the signal handlers, but I'm not entirely convinced of that. The existing handlers don't seem to consume a lot of stack space in themselves (there aren't many local variables in them). The bulk of the stack consumption is seemingly in the platform's signal infrastructure, so that we might still have a stack consumption issue even with fairly trivial handlers, if we don't tell sigaction to block signals. In any case, this fix seems potentially back-patchable, while we surely wouldn't risk back-patching a postmaster rewrite.

Comments?

			regards, tom lane
 sysname      | architecture     | operating_system   | sys_owner       | branch        | snapshot            | stage           | err
--------------+------------------+--------------------+-----------------+---------------+---------------------+-----------------+---------------------------------------------------------------------------------------------------------------
 cavefish     | ppc64le (POWER9) | Ubuntu             | Mark Wong       | HEAD          | 2019-07-13 03:49:38 | pg_upgradeCheck | 2019-07-13 04:01:23.437 UTC [9365:71] DETAIL: Failed process was running: select infinite_recurse();
 pintail      | ppc64le (POWER9) | Debian GNU/Linux   | Mark Wong       | REL_12_STABLE | 2019-07-13 19:36:51 | Check           | 2019-07-13 19:39:29.013 UTC [31086:5] DETAIL: Failed process was running: select infinite_recurse();
 bonito       | ppc64le (POWER9) | Fedora             | Mark Wong       | HEAD          | 2019-07-19 23:13:01 | Check           | 2019-07-19 23:16:33.330 UTC [24191:70] DETAIL: Failed process was running: select infinite_recurse();
 takin        | ppc64le          | opensuse           | Mark Wong       | HEAD          | 2019-07-24 08:24:56 | Check           | 2019-07-24 08:28:01.735 UTC [16366:75] DETAIL: Failed process was running: select infinite_recurse();
 quokka       | ppc64            | RHEL               | Sandeep Thakkar | HEAD          | 2019-07-31 02:00:07 | pg_upgradeCheck | 2019-07-31 03:04:04.043 BST [5d40f709.776a:5] DETAIL: Failed process was running: select infinite_recurse();
 elasmobranch | ppc64le (POWER9) | openSUSE Leap      | Mark Wong       | HEAD          | 2019-08-01 03:13:38 | Check           | 2019-08-01 03:19:05.394 UTC [22888:62] DETAIL: Failed process was running: select infinite_recurse();
 buri         | ppc64le (POWER9) | CentOS Linux       | Mark Wong       | HEAD          | 2019-08-02 00:10:23 | Check           | 2019-08-02 00:17:11.075 UTC [28222:73] DETAIL: Failed process was running: select infinite_recurse();
 urocryon     | ppc64le          | debian             | Mark Wong       | HEAD          | 2019-08-02 05:43:46 | Check           | 2019-08-02 05:51:51.944 UTC [2724:64] DETAIL: Failed process was running: select infinite_recurse();
 batfish      | ppc64le          | Ubuntu             | Mark Wong       | HEAD          | 2019-08-04 19:02:36 | pg_upgradeCheck | 2019-08-04 19:08:11.728 UTC [23899:79] DETAIL: Failed process was running: select infinite_recurse();
 buri         | ppc64le (POWER9) | CentOS Linux       | Mark Wong       | REL_12_STABLE | 2019-08-07 00:03:29 | pg_upgradeCheck | 2019-08-07 00:11:24.500 UTC [1405:5] DETAIL: Failed process was running: select infinite_recurse();
 quokka       | ppc64            | RHEL               | Sandeep Thakkar | REL_12_STABLE | 2019-08-08 02:43:45 | pg_upgradeCheck | 2019-08-08 03:47:38.115 BST [5d4b8d3f.cdd7:5] DETAIL: Failed process was running: select infinite_recurse();
 quokka       | ppc64            | RHEL               | Sandeep Thakkar | HEAD          | 2019-08-08 14:00:08 | Check           | 2019-08-08 15:02:59.770 BST [5d4c2b88.cad9:5] DETAIL: Failed process was running: select infinite_recurse();
 mereswine    | ARMv7            | Linux debian-armhf | Clarence Ho     | REL_11_STABLE | 2019-08-11 02:10:12 | InstallCheck-C  | 2019-08-11 02:36:10.159 PDT [5004:4] DETAIL: Failed process was running: select infinite_recurse();
 takin        | ppc64le          | opensuse           | Mark Wong       | HEAD          | 2019-08-11 08:02:48 | Check           | 2019-08-11 08:05:57.789 UTC [11500:67] DETAIL: Failed process was running: select infinite_recurse();
 mereswine    | ARMv7            | Linux debian-armhf | Clarence Ho     | REL_12_STABLE | 2019-08-11 09:52:46 | pg_upgradeCheck | 2019-08-11 04:21:16.756 PDT [6804:5] DETAIL: Failed process was running: select infinite_recurse();
 mereswine    | ARMv7            | Linux debian-armhf | Clarence Ho     | HEAD          | 2019-08-11 11:29:27 | pg_upgradeCheck | 2019-08-11 07:15:28.454 PDT [9954:76] DETAIL: Failed process was running: select infinite_recurse();
 demoiselle   | ppc64le (POWER9) | openSUSE Leap      | Mark Wong       | HEAD          | 2019-08-11 14:51:38 | pg_upgradeCheck | 2019-08-11 14:57:29.422 UTC [9436:70] DETAIL: Failed process was running: select infinite_recurse();
 buri         | ppc64le (POWER9) | CentOS Linux       | Mark Wong       | HEAD          | 2019-08-15 00:09:57 | Check           | 2019-08-15 00:17:43.282 UTC [2831:68] DETAIL: Failed process was running: select infinite_recurse();
 urocryon     | ppc64le          | debian             | Mark Wong       | HEAD          | 2019-08-19 06:28:34 | Check           | 2019-08-19 06:39:25.749 UTC [26357:66] DETAIL: Failed process was running: select infinite_recurse();
 urocryon     | ppc64le          | debian             | Mark Wong       | HEAD          | 2019-08-21 06:34:47 | Check           | 2019-08-21 06:37:39.089 UTC [14505:73] DETAIL: Failed process was running: select infinite_recurse();
 demoiselle   | ppc64le (POWER9) | openSUSE Leap      | Mark Wong       | REL_12_STABLE | 2019-09-04 14:42:08 | pg_upgradeCheck | 2019-09-04 14:56:15.219 UTC [11008:5] DETAIL: Failed process was running: select infinite_recurse();
 pintail      | ppc64le (POWER9) | Debian GNU/Linux   | Mark Wong       | REL_12_STABLE | 2019-09-07 19:22:48 | pg_upgradeCheck | 2019-09-07 19:27:20.789 UTC [25645:5] DETAIL: Failed process was running: select infinite_recurse();
 quokka       | ppc64            | RHEL               | Sandeep Thakkar | REL_12_STABLE | 2019-09-10 02:00:15 | Check           | 2019-09-10 03:03:17.711 BST [5d77045a.5776:5] DETAIL: Failed process was running: select infinite_recurse();
 buri         | ppc64le (POWER9) | CentOS Linux       | Mark Wong       | HEAD          | 2019-09-17 23:12:33 | Check           | 2019-09-17 23:19:45.769 UTC [20920:77] DETAIL: Failed process was running: select infinite_recurse();
 shoveler     | ppc64le (POWER8) | Debian GNU/Linux   | Mark Wong       | HEAD          | 2019-09-18 13:52:55 | Check           | 2019-09-18 13:56:11.273 UTC [563:71] DETAIL: Failed process was running: select infinite_recurse();
 buri         | ppc64le (POWER9) | CentOS Linux       | Mark Wong       | HEAD          | 2019-09-19 00:01:54 | Check           | 2019-09-19 00:09:30.734 UTC [11775:67] DETAIL: Failed process was running: select infinite_recurse();
 gadwall      | ppc64le (POWER9) | Debian GNU/Linux   | Mark Wong       | HEAD          | 2019-09-21 12:26:50 | Check           | 2019-09-21 12:31:16.199 UTC [7119:70] DETAIL: Failed process was running: select infinite_recurse();
 quokka       | ppc64            | RHEL               | Sandeep Thakkar | HEAD          | 2019-09-24 14:00:11 | pg_upgradeCheck | 2019-09-24 15:04:49.272 BST [5d8a2276.cba9:5] DETAIL: Failed process was running: select infinite_recurse();
 urocryon     | ppc64le          | debian             | Mark Wong       | HEAD          | 2019-09-25 06:24:24 | Check           | 2019-09-25 06:31:54.876 UTC [26608:76] DETAIL: Failed process was running: select infinite_recurse();
 pintail      | ppc64le (POWER9) | Debian GNU/Linux   | Mark Wong       | HEAD          | 2019-09-26 19:33:59 | Check           | 2019-09-26 19:39:25.850 UTC [6259:69] DETAIL: Failed process was running: select infinite_recurse();
 shoveler     | ppc64le (POWER8) | Debian GNU/Linux   | Mark Wong       | HEAD          | 2019-09-28 13:54:16 | Check           | 2019-09-28 13:59:02.354 UTC [7586:71] DETAIL: Failed process was running: select infinite_recurse();
 buri         | ppc64le (POWER9) | CentOS Linux       | Mark Wong       | REL_12_STABLE | 2019-09-28 23:14:23 | pg_upgradeCheck | 2019-09-28 23:22:13.987 UTC [20133:5] DETAIL: Failed process was running: select infinite_recurse();
 gadwall      | ppc64le (POWER9) | Debian GNU/Linux   | Mark Wong       | HEAD          | 2019-10-02 12:44:49 | Check           | 2019-10-02 12:50:17.823 UTC [10840:76] DETAIL: Failed process was running: select infinite_recurse();
 cavefish     | ppc64le (POWER9) | Ubuntu             | Mark Wong       | HEAD          | 2019-10-04 04:37:58 | Check           | 2019-10-04 04:46:03.804 UTC [27768:69] DETAIL: Failed process was running: select infinite_recurse();
 cavefish     | ppc64le (POWER9) | Ubuntu             | Mark Wong       | HEAD          | 2019-10-07 03:22:37 | pg_upgradeCheck | 2019-10-07 03:28:05.031 UTC [2991:68] DETAIL: Failed process was running: select infinite_recurse();
 bufflehead   | ppc64le (POWER8) | openSUSE Leap      | Mark Wong       | HEAD          | 2019-10-09 20:46:56 | pg_upgradeCheck | 2019-10-09 20:51:47.408 UTC [18136:86] DETAIL: Failed process was running: select infinite_recurse();
 vulpes       | ppc64le          | fedora             | Mark Wong       | HEAD          | 2019-10-11 08:53:50 | Check           | 2019-10-11 08:57:59.370 UTC [14908:77] DETAIL: Failed process was running: select infinite_recurse();
 shoveler     | ppc64le (POWER8) | Debian GNU/Linux   | Mark Wong       | HEAD          | 2019-10-11 13:54:38 | pg_upgradeCheck | 2019-10-11 14:01:53.903 UTC [5911:76] DETAIL: Failed process was running: select infinite_recurse();
(38 rows)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 85f15a5..fff83b7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2640,8 +2640,6 @@ SIGHUP_handler(SIGNAL_ARGS)
 {
 	int			save_errno = errno;
 
-	PG_SETMASK(&BlockSig);
-
 	if (Shutdown <= SmartShutdown)
 	{
 		ereport(LOG,
@@ -2700,8 +2698,6 @@ SIGHUP_handler(SIGNAL_ARGS)
 #endif
 	}
 
-	PG_SETMASK(&UnBlockSig);
-
 	errno = save_errno;
 }
 
@@ -2714,8 +2710,6 @@ pmdie(SIGNAL_ARGS)
 {
 	int			save_errno = errno;
 
-	PG_SETMASK(&BlockSig);
-
 	ereport(DEBUG2,
 			(errmsg_internal("postmaster received signal %d",
 							 postgres_signal_arg)));
@@ -2880,8 +2874,6 @@ pmdie(SIGNAL_ARGS)
 			break;
 	}
 
-	PG_SETMASK(&UnBlockSig);
-
 	errno = save_errno;
 }
 
@@ -2895,8 +2887,6 @@ reaper(SIGNAL_ARGS)
 	int			pid;			/* process id of dead child process */
 	int			exitstatus;		/* its exit status */
 
-	PG_SETMASK(&BlockSig);
-
 	ereport(DEBUG4,
 			(errmsg_internal("reaping dead processes")));
 
@@ -3212,8 +3202,6 @@ reaper(SIGNAL_ARGS)
 	PostmasterStateMachine();
 
 	/* Done with signal handler */
-	PG_SETMASK(&UnBlockSig);
-
 	errno = save_errno;
 }
 
@@ -5114,8 +5102,6 @@ sigusr1_handler(SIGNAL_ARGS)
 {
 	int			save_errno = errno;
 
-	PG_SETMASK(&BlockSig);
-
 	/* Process background worker state change. */
 	if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
 	{
@@ -5272,8 +5258,6 @@ sigusr1_handler(SIGNAL_ARGS)
 		signal_child(StartupPID, SIGUSR2);
 	}
 
-	PG_SETMASK(&UnBlockSig);
-
 	errno = save_errno;
 }
 
diff --git a/src/port/pqsignal.c b/src/port/pqsignal.c
index ecb9ca2..93a039b 100644
--- a/src/port/pqsignal.c
+++ b/src/port/pqsignal.c
@@ -65,7 +65,11 @@ pqsignal(int signo, pqsigfunc func)
  *
  * On Windows, this would be identical to pqsignal(), so don't bother.
  */
-#ifndef WIN32
+#ifndef FRONTEND
+
+extern sigset_t UnBlockSig,
+			BlockSig,
+			StartupBlockSig;
 
 pqsigfunc
 pqsignal_no_restart(int signo, pqsigfunc func)
@@ -74,7 +78,7 @@ pqsignal_no_restart(int signo, pqsigfunc func)
 				oact;
 
 	act.sa_handler = func;
-	sigemptyset(&act.sa_mask);
+	act.sa_mask = BlockSig;
 	act.sa_flags = 0;
 #ifdef SA_NOCLDSTOP
 	if (signo == SIGCHLD)