Amit Kapila <amit.kapil...@gmail.com> writes:
> On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
>> 1. WaitForReplicationWorkerAttach sometimes has to clear a process
>> latch event so that it can keep waiting for the worker to launch.
>> It neglects to set the latch again, allowing ApplyLauncherMain
>> to miss events.
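To make the quoted problem concrete, here is a heavily simplified
sketch of the wait-for-attach pattern.  This is not the actual
launcher.c code: the worker-state check is a hypothetical placeholder,
and the re-set at the end is just the shape of the fix I argue for
further down.

    /*
     * Simplified sketch only.  worker_has_attached_or_died() stands in
     * for the real shared-memory checks in WaitForReplicationWorkerAttach.
     */
    for (;;)
    {
        int     rc;

        CHECK_FOR_INTERRUPTS();

        if (worker_has_attached_or_died())  /* hypothetical placeholder */
            break;

        /* Sleep briefly; wake on timeout or on our own process latch. */
        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                       10L, WAIT_EVENT_BGWORKER_STARTUP);

        if (rc & WL_LATCH_SET)
        {
            /*
             * Clearing the latch here is what can swallow an event that
             * the outer ApplyLauncherMain loop was waiting for.
             */
            ResetLatch(MyLatch);
            CHECK_FOR_INTERRUPTS();
        }
    }

    /*
     * Proposed shape of the fix: re-set the latch before returning, so
     * that any condition we absorbed is seen promptly by the caller's
     * WaitLatch.
     */
    SetLatch(MyLatch);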
> There was a previous discussion to fix this behavior. Heikki has
> proposed a similar fix for this, but at the caller. See the patch
> attached in email [1].

Ah, thank you for that link.  I vaguely recalled that we'd discussed
this strange behavior before, but I did not realize that anyone had
diagnosed a cause.  I don't much like any of the patches proposed in
that thread though --- they seem overcomplicated or off-point.

I do not think we can switch to having two latches here.  The only
reason WaitForReplicationWorkerAttach needs to pay attention to the
process latch at all is that it needs to service
CHECK_FOR_INTERRUPTS() conditions in a timely fashion, which is also
true in the ApplyLauncherMain loop.  We can't realistically expect
other processes to signal different latches depending on where the
launcher is waiting, so those cases have to be triggered by the same
latch.  However, WaitForReplicationWorkerAttach can't service any
latch-setting conditions other than those managed by
CHECK_FOR_INTERRUPTS().  So if CHECK_FOR_INTERRUPTS() returns, we have
some other triggering condition, which is the business of some outer
code level to deal with.  Having it re-set the latch to allow that to
happen promptly after it returns seems like a pretty straightforward
answer to me.

I do note Heikki's concern about whether we could get rid of the
sleep-and-retry looping in WaitForReplicationWorkerAttach in favor of
getting signaled somehow, and I agree with that as a possible future
improvement.  But I don't especially see why that demands another
latch; in fact, unless we want to teach WaitLatch to be able to wait
on more than one latch, it *can't* be a separate latch from the one
that receives CHECK_FOR_INTERRUPTS() conditions.

>> 4. In process_syncing_tables_for_apply (the other caller of
>> logicalrep_worker_launch), it seems okay to ignore the
>> result of logicalrep_worker_launch, but I think it should
>> fill hentry->last_start_time before not after the call.

> With this, won't we end up retrying to launch the worker sooner if the
> launch took time, but still failed to launch the worker?

That code already does update last_start_time unconditionally, and I
think that's the right behavior for the same reason that it's right
for ApplyLauncherMain to do ApplyLauncherSetWorkerStartTime whether or
not logicalrep_worker_launch succeeds.  If the worker launch fails, we
don't want to retry instantly, we want to wait
wal_retrieve_retry_interval before retrying.

My desire to change this code is just based on the idea that it's not
clear what else if anything looks at this hashtable, and by the time
that logicalrep_worker_launch returns the system state could be a lot
different.  (For instance, the worker could have started and failed
already.)  So, just as in ApplyLauncherMain, I'd rather store the
start time before calling logicalrep_worker_launch.

BTW, it strikes me that if we're going to leave
process_syncing_tables_for_apply() ignoring the result of
logicalrep_worker_launch, it'd be smart to insert an explicit (void)
cast to show that that's intentional.  Otherwise Coverity is likely to
complain about how we're ignoring the result in one place and not the
other.
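Concretely, the shape I have in mind there is roughly this (a sketch
only: the argument list of logicalrep_worker_launch is elided, and the
surrounding tablesync.c variables are approximated):

    /* Record the attempt first, whether or not the launch works out. */
    hentry->last_start_time = GetCurrentTimestamp();

    /* ... then launch, explicitly discarding the result. */
    (void) logicalrep_worker_launch(...);

			regards, tom lane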