Amit Kapila <amit.kapil...@gmail.com> writes:
> On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
>> 1. WaitForReplicationWorkerAttach sometimes has to clear a process
>> latch event so that it can keep waiting for the worker to launch.
>> It neglects to set the latch again, allowing ApplyLauncherMain
>> to miss events.
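To make the quoted problem concrete, here is a heavily simplified
sketch of the wait-for-attach pattern.  This is not the actual
launcher.c code: the worker-state check is a hypothetical placeholder,
and the re-set at the end is just the shape of the fix I argue for
further down.

    /*
     * Simplified sketch only.  worker_has_attached_or_died() stands in
     * for the real shared-memory checks in WaitForReplicationWorkerAttach.
     */
    for (;;)
    {
        int     rc;

        CHECK_FOR_INTERRUPTS();

        if (worker_has_attached_or_died())  /* hypothetical placeholder */
            break;

        /* Sleep briefly; wake on timeout or on our own process latch. */
        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                       10L, WAIT_EVENT_BGWORKER_STARTUP);

        if (rc & WL_LATCH_SET)
        {
            /*
             * Clearing the latch here is what can swallow an event that
             * the outer ApplyLauncherMain loop was waiting for.
             */
            ResetLatch(MyLatch);
            CHECK_FOR_INTERRUPTS();
        }
    }

    /*
     * Proposed shape of the fix: re-set the latch before returning, so
     * that any condition we absorbed is seen promptly by the caller's
     * WaitLatch.
     */
    SetLatch(MyLatch);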
> There was a previous discussion to fix this behavior. Heikki has
> proposed a similar fix for this, but at the caller. See the patch
> attached in email [1].

Ah, thank you for that link.  I vaguely recalled that we'd discussed
this strange behavior before, but I did not realize that anyone had
diagnosed a cause.  I don't much like any of the patches proposed in
that thread though --- they seem overcomplicated or off-point.

I do not think we can switch to having two latches here.  The only
reason WaitForReplicationWorkerAttach needs to pay attention to the
process latch at all is that it needs to service
CHECK_FOR_INTERRUPTS() conditions in a timely fashion, which is also
true in the ApplyLauncherMain loop.  We can't realistically expect
other processes to signal different latches depending on where the
launcher is waiting, so those cases have to be triggered by the same
latch.  However, WaitForReplicationWorkerAttach can't service any
latch-setting conditions other than those managed by
CHECK_FOR_INTERRUPTS().  So if CHECK_FOR_INTERRUPTS() returns, we have
some other triggering condition, which is the business of some outer
code level to deal with.  Having it re-set the latch to allow that to
happen promptly after it returns seems like a pretty straightforward
answer to me.

I do note Heikki's concern about whether we could get rid of the
sleep-and-retry looping in WaitForReplicationWorkerAttach in favor of
getting signaled somehow, and I agree with that as a possible future
improvement.  But I don't especially see why that demands another
latch; in fact, unless we want to teach WaitLatch to be able to wait
on more than one latch, it *can't* be a separate latch from the one
that receives CHECK_FOR_INTERRUPTS() conditions.

>> 4. In process_syncing_tables_for_apply (the other caller of
>> logicalrep_worker_launch), it seems okay to ignore the
>> result of logicalrep_worker_launch, but I think it should
>> fill hentry->last_start_time before not after the call.

> With this, won't we end up retrying to launch the worker sooner if the
> launch took time, but still failed to launch the worker?

That code already does update last_start_time unconditionally, and I
think that's the right behavior for the same reason that it's right
for ApplyLauncherMain to do ApplyLauncherSetWorkerStartTime whether or
not logicalrep_worker_launch succeeds.  If the worker launch fails, we
don't want to retry instantly, we want to wait
wal_retrieve_retry_interval before retrying.

My desire to change this code is just based on the idea that it's not
clear what else if anything looks at this hashtable, and by the time
that logicalrep_worker_launch returns the system state could be a lot
different.  (For instance, the worker could have started and failed
already.)  So, just as in ApplyLauncherMain, I'd rather store the
start time before calling logicalrep_worker_launch.

BTW, it strikes me that if we're going to leave
process_syncing_tables_for_apply() ignoring the result of
logicalrep_worker_launch, it'd be smart to insert an explicit (void)
cast to show that that's intentional.  Otherwise Coverity is likely to
complain about how we're ignoring the result in one place and not the
other.
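Concretely, the shape I have in mind there is roughly this (a sketch
only: the argument list of logicalrep_worker_launch is elided, and the
surrounding tablesync.c variables are approximated):

    /* Record the attempt first, whether or not the launch works out. */
    hentry->last_start_time = GetCurrentTimestamp();

    /* ... then launch, explicitly discarding the result. */
    (void) logicalrep_worker_launch(...);

			regards, tom lane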