On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> I've been annoyed for awhile because, while a parallel check-world
> run usually takes a bit over a minute on my machine, sometimes it
> takes between three and four minutes.  I was finally able to
> track down what is happening, and it's this: sometimes one or
> another of the src/test/subscription tests takes an extra three
> minutes because the logical replication launcher is sleeping
> instead of launching the next task.  It eventually reaches its
> hard-wired maximum wait of DEFAULT_NAPTIME_PER_CYCLE (3min),
> wakes up and notices it has something to do, and then we're
> on our merry way again.  I'm not sure how often this is a problem
> in the real world, but it happens often enough to be annoying
> during development.
>
> There are two distinct bugs involved:
>
> 1. WaitForReplicationWorkerAttach sometimes has to clear a process
> latch event so that it can keep waiting for the worker to launch.
> It neglects to set the latch again, allowing ApplyLauncherMain
> to miss events.

Agreed.
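
Just to spell out the idea as I understand it, the shape of the fix would
be roughly like this (a rough sketch only, not your actual patch; the
dropped_latch flag name is mine):

        if (rc & WL_LATCH_SET)
        {
            ResetLatch(MyLatch);
            CHECK_FOR_INTERRUPTS();
            /* remember that we consumed a latch event meant for the caller */
            dropped_latch = true;
        }

        ...

        /* before returning, on every exit path */
        if (dropped_latch)
            SetLatch(MyLatch);

That way ApplyLauncherMain still gets the wakeup it was owed instead of
sleeping out the full naptime.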

>
> 2. ApplyLauncherMain ignores a failure return from
> logicalrep_worker_launch, which is bad because (unless it has
> another worker launch pending) it will then sleep for
> DEFAULT_NAPTIME_PER_CYCLE before reconsidering.  What it ought to do
> is try again after wal_retrieve_retry_interval.  This situation can
> arise in resource-exhaustion cases (cf. the early exits in
> logicalrep_worker_launch), but what's more likely in the regression
> tests is that the worker stops due to some error condition before
> WaitForReplicationWorkerAttach sees it attached, which is then duly
> reported as a failure.

Agreed.
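
For (2), I imagine the fix looks roughly like the following in
ApplyLauncherMain's per-subscription loop, reusing the existing wait_time
bookkeeping (a sketch only, with the launch arguments elided; not your
actual patch):

        if (!logicalrep_worker_launch(...))
        {
            /*
             * Launch failed, or the worker died before attaching.  Don't
             * sleep the full DEFAULT_NAPTIME_PER_CYCLE; retry after
             * wal_retrieve_retry_interval instead.
             */
            wait_time = Min(wait_time, wal_retrieve_retry_interval);
        }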

>
> It's possible to make the test slowness extremely reproducible
> with this change, which widens the race condition window for
> both problems:
>
> diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
> index 1c3c051403d..724e82bcdc1 100644
> --- a/src/backend/replication/logical/launcher.c
> +++ b/src/backend/replication/logical/launcher.c
> @@ -214,7 +214,7 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
>                  */
>                 rc = WaitLatch(MyLatch,
>                                            WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> -                                          10L, WAIT_EVENT_BGWORKER_STARTUP);
> +                                          1000L, WAIT_EVENT_BGWORKER_STARTUP);
>
>                 if (rc & WL_LATCH_SET)
>                 {
>
> I don't recommend that as a permanent change, but it's helpful
> for testing the attached patch.

Thanks. That makes reproduction easier.

>
> In the attached, I made two other non-cosmetic changes:
>
> 3. In WaitForReplicationWorkerAttach, capture worker->in_use
> before not after releasing LogicalRepWorkerLock.  Maybe there
> is a reason why that's not a dangerous race condition, but
> it sure is un-obvious to me.

Looking at the code, this does seem dangerous. We return from that branch
when worker->in_use = false, which means the slot is free. But by the time
we release LogicalRepWorkerLock and read worker->in_use again for the
return value, the slot may have been claimed by some other worker, so we
would return true instead of false, and the caller would wrongly assume
that the worker it launched attached successfully. That in turn might keep
the launcher from starting the worker again. Probably rare, but not
impossible.

+1 for this change.
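
In other words, something like this (sketch only; the local variable name
is mine) reads the flag while the lock is still held:

        if (!worker->in_use || worker->proc)
        {
            /*
             * Read in_use before releasing the lock; the slot could be
             * recycled for a different worker as soon as we let go.
             */
            bool    worker_in_use = worker->in_use;

            LWLockRelease(LogicalRepWorkerLock);
            return worker_in_use;
        }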

>
> 4. In process_syncing_tables_for_apply (the other caller of
> logicalrep_worker_launch), it seems okay to ignore the
> result of logicalrep_worker_launch, but I think it should
> fill hentry->last_start_time before not after the call.
> Otherwise we might be changing a hashtable entry that's
> no longer relevant to this worker.

A hash entry is associated with a table, not with the worker. If the
worker fails to launch, the entry records the time at which a launch was
attempted for that table, so that the next attempt is suitably spaced in
time. I am not able to follow your last statement, though: what is the
entry's relevance to the worker?

But your change makes this code similar to ApplyLauncherMain(), which
deals with subscriptions. +1 for the consistency.
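
So in process_syncing_tables_for_apply that would look roughly like this
(a sketch, with the launch arguments elided):

        if (!found ||
            TimestampDifferenceExceeds(hentry->last_start_time, now,
                                       wal_retrieve_retry_interval))
        {
            /* record the attempt time before launching, as in ApplyLauncherMain */
            hentry->last_start_time = now;
            (void) logicalrep_worker_launch(...);
        }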

--
Best Wishes,
Ashutosh Bapat

