I noticed this recent crash on lorikeet: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2021-05-05%2009%3A19%3A29
The relevant bits of the log seem to be:

    2021-05-05 05:36:22.011 EDT [60926716.bb24:1] ERROR:  could not map dynamic shared memory segment
    ...
    2021-05-05 05:36:22.013 EDT [609266c5.b793:4] LOG:  background worker "parallel worker" (PID 47908) exited with exit code 1
    ...
    TRAP: FailedAssertion("BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count <= MAX_PARALLEL_WORKER_LIMIT", File: "/home/andrew/bf/root/REL_13_STABLE/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c", Line: 1016)
    *** starting debugger for pid 47743, tid 1264
    2021-05-05 05:36:26.629 EDT [609266c5.b793:5] LOG:  server process (PID 47743) exited with exit code 127

So we had a parallel worker fail to start, whereupon its leader went down with an assertion failure. I know that the parallel-worker code is held together with chewing gum and baling wire, but that's a bit much.

Looking at the indicated code, we find

    /*
     * If this is a parallel worker, check whether there are already too many
     * parallel workers; if so, don't register another one.  Our view of
     * parallel_terminate_count may be slightly stale, but that doesn't really
     * matter: we would have gotten the same result if we'd arrived here
     * slightly earlier anyway.  There's no help for it, either, since the
     * postmaster must not take locks; a memory barrier wouldn't guarantee
     * anything useful.
     */
    if (parallel && (BackgroundWorkerData->parallel_register_count -
                     BackgroundWorkerData->parallel_terminate_count) >=
        max_parallel_workers)
    {
        Assert(BackgroundWorkerData->parallel_register_count -
               BackgroundWorkerData->parallel_terminate_count <=
               MAX_PARALLEL_WORKER_LIMIT);
        LWLockRelease(BackgroundWorkerLock);
        return false;
    }

I would like to know on what grounds that Assert isn't insanity. We just finished pointing out that we might see a stale parallel_terminate_count, which ISTM says specifically that parallel_register_count minus parallel_terminate_count might be larger than expected.
Admittedly, it seems unlikely that the difference could exceed MAX_PARALLEL_WORKER_LIMIT = 1024 in a regression test run where the limit on the number of parallel workers is only 8. What I think is more likely, given that these counters are unsigned, is that the difference was actually negative, wrapping around to a huge value. That could be a bug, or it could be an expectable race condition, or it could just be some flakiness on lorikeet's part (that machine has had a lot of issues lately).

I trawled the buildfarm logs going back 180 days and found no other instances of this assertion, which seems to be evidence in favor of the "lorikeet got flaky" theory. But it's not proof.

In any case, I see zero value in this assertion, so I propose we remove it. If we don't remove it, it needs serious revision, because it seems absolutely obvious to me that it could trigger when there is nothing wrong. A system pushing the limit on the number of parallel workers would be at considerable risk.

			regards, tom lane