On 21 August 2017 at 10:57, Craig Ringer <cr...@2ndquadrant.com> wrote:
> Hi all > > I've noticed a possible bug / design limitation where shm_mq_wait_internal > sleep in a latch wait forever, and the postmaster gets stuck waiting for > the bgworker the wait is running in to exit. > > This happens when the shm_mq does not have an associated bgworker handle > registered because the other end is not known at mq creation time or is a > normal backend not a bgworker. So a BGW handle cannot be passed. > > shm_mq_wait_internal() will CHECK_FOR_INTERRUPTS() when its latch wait is > interrupted by a SIGTERM. But it doesn't actually respond to SIGTERM in any > way; it just merrily resets its latch and keeps looping. > > It will bail out correctly on SIGQUIT. > > If the proc waiting to attach was known at queue creation time and was a > bgworker, we'd pass a bgworker handle and the mq would notice it failed to > start and stop waiting. There's only a problem if no bgworker handle can be > supplied. > > The underlying problem is that CHECK_FOR_INTERRUPTS() doesn't care about > SIGTERM or have any way to test for it. And we don't have any global > management of SIGTERM like we do SIGQUIT so the shm_mq_wait_internal loop > can't test for it. > > The only ways I can see to fix this are: > > * Generalize SIGTERM handling across postgres, so there's a global > "got_SIGTERM" global that shm_mq_wait_internal can test to break out of its > loop, and every backend's signal handler must set it. Lots of churn. > > * In a proc's signal handler, use globals set before entry and after exit > from shm_mq operations to detect if we're currently in shm_mq and promote > SIGTERM to SIGQUIT by sending a new signal to ourselves. Or set up state so > CHECK_FOR_INTERRUPTS() will notice when the handler returns. > > * Allow passing of a *bool that tests for SIGTERM, or a function pointer > called on each iteration to test whether looping should continue, to be > passed to shm_mq_attach. So if you can't supply a bgw handle, you supply > that instead. Provide a shm_mq_set_handle equivalent for it too. > > Any objections to the last approach? > BTW, you can work around it in extension code for existing versions with something like this in your bgworker: volatile bool in_shm_mq = false; void my_handle_sigterm(SIGNAL_ARGS) { ... if (in_shm_mq) { /* * shm_mq can get stuck in shm_mq_wait_internal on SIGTERM; see * https://www.postgresql.org/message-id/CAMsr+YHmm=01lsueyr6ydz8clgfnk_fgdgi+qxujf+jelpv...@mail.gmail.com * * To work around this we keep track of whether we're in shmem_mq * and generate a fake interrupt for CHECK_FOR_INTERRUPTS() to * process if so. * * The guard around in_shm_mq may not be necessary, but without * it any SIGTERM will likely cause ereport(FATAL) with * "terminating connection due to administrator command" * which isn't ideal. */ InterruptPending = true; ProcDiePending = true; } .... } inline static shm_mq_handle * myext_shm_mq_attach(shm_mq *mq, dsm_segment *seg, BackgroundWorkerHandle *handle) { shm_mq_handle *ret; in_shm_mq = true; ret = shm_mq_attach(mq, seg, handle); in_shm_mq = false; return ret; } /* repeated for shm_mq_receive, shm_mq_send, shm_mq_sendv, shm_mq_wait_for_attach */ You can instead use non-blocking sends instead, and sleep on your own latch, doing the same work as shm_mq_wait_internal yourself. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services