On Tue, Jun 24, 2025 at 9:53 PM Tom Lane <t...@sss.pgh.pa.us> wrote: > > Amit Kapila <amit.kapil...@gmail.com> writes: > > On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <t...@sss.pgh.pa.us> wrote: > >> 1. WaitForReplicationWorkerAttach sometimes has to clear a process > >> latch event so that it can keep waiting for the worker to launch. > >> It neglects to set the latch again, allowing ApplyLauncherMain > >> to miss events. > > > There was a previous discussion to fix this behavior. Heikki has > > proposed a similar fix for this, but at the caller. See the patch > > attached in email [1]. > > Ah, thank you for that link. I vaguely recalled that we'd discussed > this strange behavior before, but I did not realize that anyone had > diagnosed a cause. I don't much like any of the patches proposed > in that thread though --- they seem overcomplicated or off-point. > > I do not think we can switch to having two latches here. The > only reason WaitForReplicationWorkerAttach needs to pay attention > to the process latch at all is that it needs to service > CHECK_FOR_INTERRUPTS() conditions in a timely fashion, which is > also true in the ApplyLauncherMain loop. We can't realistically > expect other processes to signal different latches depending on > where the launcher is waiting, so those cases have to be triggered > by the same latch. > > However, WaitForReplicationWorkerAttach can't service any > latch-setting conditions other than those managed by > CHECK_FOR_INTERRUPTS(). So if CHECK_FOR_INTERRUPTS() returns, > we have some other triggering condition, which is the business > of some outer code level to deal with. Having it re-set the latch > to allow that to happen promptly after it returns seems like a > pretty straightforward answer to me.
Yeah that makes sense, we can not simply wait on another latch which is not managed by CHECK_FOR_INTERRUPTS(), and IMHO the proposed patch looks simple for resolving this issue. -- Regards, Dilip Kumar Google