On 2020/09/18 9:30, Thomas Munro wrote:
On Thu, Sep 17, 2020 at 10:47 PM Heikki Linnakangas <hlinn...@iki.fi> wrote:
Hmm, so for speedy response to postmaster death, you're relying on the
loops to have other postmaster-death checks besides
HandleStartupProcInterrupts(), in the form of WL_EXIT_ON_PM_DEATH. That
seems a bit fragile, at the very least it needs a comment in
HandleStartupProcInterrupts() to call it out.
Surely that's the direction we want to go in, though, no? Commit
cfdf4dc4 was intended to standardise the way we react to postmaster
death where waiting is involved. I updated the comment in
HandleStartupProcInterrupts() to highlight that the
PostmasterIsAlive() check in there is only for the benefit of
CPU-bound loops.
Note that there's one more loop in ShutdownWalRcv() that uses pg_usleep().
Updating that one required me to invent a new wait_event for
pg_stat_activity, which seems like progress.
Unfortunately, while I was doing that I realised that WaitLatch()
without WL_SET_LATCH was broken by commit 3347c982bab (in master
only), in a way not previously reached by the tests. So 0001 is a
patch to fix that.
- pgstat_report_wait_start(WAIT_EVENT_RECOVERY_PAUSE);
- pg_usleep(1000000L); /* 1000 ms */
- pgstat_report_wait_end();
+ WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 1000,
+ WAIT_EVENT_RECOVERY_PAUSE);
This change may cause at most one second delay against the standby
promotion request during WAL replay pause? It's only one second,
but I'd like to avoid this (unnecessary) wait to shorten the failover time
as much as possible basically. So what about using WL_SET_LATCH here?
But when using WL_SET_LATCH, one concern is that walreceiver can
wake up the startup process too frequently even during WAL replay pause.
This is also problematic? I'm ok with this, but if not, using pg_usleep()
might be better as the original code does.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION