On Fri, 5 Jul 2024 at 18:38, Heikki Linnakangas <hlinn...@iki.fi> wrote:
>
> On 05/07/2024 14:07, vignesh C wrote:
> > On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas <hlinn...@iki.fi> wrote:
> >>
> >> I don't quite understand the problem we're trying to fix:
> >>
> >>> Currently the launcher's latch is used for the following: a) worker
> >>> process attach, b) worker process exit and c) subscription creation.
> >>> Since this same latch is used for multiple cases, the launcher process
> >>> is not able to handle concurrent scenarios like: a) the launcher has
> >>> started a new apply worker and is waiting for it to attach, and b) a
> >>> CREATE SUBSCRIPTION sub2 command sends the launcher a wakeup signal.
> >>> In this scenario, both of them will set the launcher's latch, and the
> >>> launcher is not able to identify that both operations have occurred:
> >>> 1) the worker has attached, and 2) a subscription has been created so
> >>> an apply worker should be started. As a result the apply worker for
> >>> the newly created subscription does not get started immediately, but
> >>> only after the timeout of 180 seconds.
> >>
> >> I don't see how we could miss a notification. Yes, both events will set
> >> the same latch. Why is that a problem?
> >
> > The launcher process waits for the apply worker to attach in the
> > WaitForReplicationWorkerAttach function. The launcher's latch gets set
> > concurrently for another CREATE SUBSCRIPTION and for the apply worker
> > attaching. The launcher detects that the latch is set while waiting in
> > WaitForReplicationWorkerAttach, resets the latch, proceeds to the main
> > loop and waits for DEFAULT_NAPTIME_PER_CYCLE (as the latch has already
> > been reset). Further details are provided below.
> >
> >> The loop will see that the new worker has attached, and that the new
> >> subscription has been created, and process both events. Right?
> >
> > Since the latch is reset in WaitForReplicationWorkerAttach, it skips
> > processing the create subscription event.
> >
> > Detailing this further:
> > Consider executing two concurrent CREATE SUBSCRIPTION commands: first
> > CREATE SUBSCRIPTION sub1, followed immediately by CREATE SUBSCRIPTION
> > sub2. In a few random scenarios, the following events may unfold:
> > After the first CREATE SUBSCRIPTION command (sub1), the backend will
> > set the launcher's latch because of ApplyLauncherWakeupAtCommit.
> > Subsequently, the launcher process will reset the latch and identify
> > the addition of a new subscription in the pg_subscription list. The
> > launcher will then request the postmaster to start the apply worker
> > background process for sub1 and to notify the launcher once the apply
> > worker (sub1) has been started. The launcher will then wait for the
> > apply worker (sub1) to attach in the WaitForReplicationWorkerAttach
> > function.
> > Meanwhile, the second CREATE SUBSCRIPTION command (sub2), which was
> > executed concurrently, also sets the launcher's latch (because of
> > ApplyLauncherWakeupAtCommit).
> > While the launcher remains in the WaitForReplicationWorkerAttach
> > function waiting for the apply worker of sub1 to start, the postmaster
> > will start the apply worker for subscription sub1 and send a SIGUSR1
> > signal to the launcher process via ReportBackgroundWorkerPID. Upon
> > receiving this signal, the launcher process will set its latch.
> >
> > Now, the launcher's latch has been set concurrently after the apply
> > worker for sub1 is started and the execution of the CREATE
> > SUBSCRIPTION sub2 command.
> >
> > At this juncture, the launcher, which had been awaiting the attachment
> > of the apply worker, detects that the latch is set and proceeds to
> > reset it. The reset operation of the latch occurs within the following
> > section of code in WaitForReplicationWorkerAttach:
> > ...
> >         rc = WaitLatch(MyLatch,
> >                        WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> >                        10L, WAIT_EVENT_BGWORKER_STARTUP);
> >
> >         if (rc & WL_LATCH_SET)
> >         {
> >             ResetLatch(MyLatch);
> >             CHECK_FOR_INTERRUPTS();
> >         }
> > ...
> >
> > After resetting the latch here, the activation signal intended to
> > start the apply worker for subscription sub2 is no longer present. The
> > launcher will return to the ApplyLauncherMain function, where it will
> > await the DEFAULT_NAPTIME_PER_CYCLE, which is 180 seconds, before
> > processing the create subscription request (i.e., creating a new apply
> > worker for sub2).
> > The issue arises from the latch being reset in
> > WaitForReplicationWorkerAttach, which can occasionally delay the
> > synchronization of table data for the subscription.
>
> Ok, I see it now. Thanks for the explanation. So the problem isn't using
> the same latch for different purposes per se. It's that we're trying to
> use it in a nested fashion, resetting it in the inner loop.
>
> Looking at the proposed patch more closely:
>
> > @@ -221,13 +224,13 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
> >              * We need timeout because we generally don't get notified via latch
> >              * about the worker attach. But we don't expect to have to wait long.
> >              */
> > -           rc = WaitLatch(MyLatch,
> > +           rc = WaitLatch(&worker->launcherWakeupLatch,
> >                            WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> >                            10L, WAIT_EVENT_BGWORKER_STARTUP);
> >
> >             if (rc & WL_LATCH_SET)
> >             {
> > -               ResetLatch(MyLatch);
> > +               ResetLatch(&worker->launcherWakeupLatch);
> >                 CHECK_FOR_INTERRUPTS();
> >             }
> >     }
>
> The comment here is now outdated, right? The patch adds an explicit
> SetLatch() call to ParallelApplyWorkerMain(), to notify the launcher
> about the attachment.
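
Right, that comment no longer holds with the patch. The intended
interaction is roughly the following (a simplified sketch only, not the
actual patch; the launcherWakeupLatch name is taken from the hunk quoted
above, while the exact placement of the own/disown/set calls is an
assumption for illustration):

    /* Launcher, before asking the postmaster to start the worker: */
    OwnLatch(&worker->launcherWakeupLatch);

    /* Launcher, in the WaitForReplicationWorkerAttach() wait shown above: */
    rc = WaitLatch(&worker->launcherWakeupLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                   10L, WAIT_EVENT_BGWORKER_STARTUP);
    if (rc & WL_LATCH_SET)
    {
        ResetLatch(&worker->launcherWakeupLatch);
        CHECK_FOR_INTERRUPTS();
    }

    /* Launcher, once the worker has attached (or failed to start): */
    DisownLatch(&worker->launcherWakeupLatch);

    /* Worker, after registering itself in its LogicalRepWorker slot: */
    SetLatch(&MyLogicalRepWorker->launcherWakeupLatch);

With this, the launcher's own process latch (MyLatch) is not reset inside
WaitForReplicationWorkerAttach, so a wakeup set by a concurrent CREATE
SUBSCRIPTION is still pending when the launcher returns to
ApplyLauncherMain().
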
I will update the comment for this.

> Is the launcherWakeupLatch field protected by some lock, to protect
> which process owns it? OwnLatch() is called in
> logicalrep_worker_stop_internal() without holding a lock for example. Is
> there a guarantee that only one process can do that at the same time?

I have analyzed a few scenarios; I will analyze the remaining scenarios
and update on this.

> What happens if a process never calls DisownLatch(), e.g. because it
> bailed out with an error while waiting for the worker to attach or stop?

I tried a lot of scenarios by erroring out after the OwnLatch call and
have not hit the PANIC in OwnLatch yet. I will try a few more scenarios
that I have in mind to see if we can hit the PANIC in OwnLatch, and will
update on this.

> As an alternative, smaller fix, I think we could do the attached. It
> forces the launcher's main loop to do another iteration, if it calls
> logicalrep_worker_launch(). That extra iteration should pick up any
> missed notifications.

This also works. I had proposed another solution along similar lines at
[1], where we don't reset the latch in the inner loop (a rough sketch of
that idea is at the end of this mail). I'm not sure which solution we
should proceed with. Thoughts?

> Your solution with an additional latch seems better because it makes
> WaitForReplicationWorkerAttach() react more quickly, without the 10 s
> wait. I'm surprised we have that in the first place, 10 s seems like a
> pretty long time to wait for a parallel apply worker to start. Why was
> that ever OK?

Here 10 means 10 milliseconds, so it just waits for 10 milliseconds
before going to the next iteration.

[1] - https://www.postgresql.org/message-id/CALDaNm10R7L0Dxq%2B-J%3DPp3AfM_yaokpbhECvJ69QiGH8-jQquw%40mail.gmail.com

Regards,
Vignesh
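
PS: Here is a rough sketch of the "don't reset the latch in the inner
loop" idea mentioned above (illustrative only; this is not a restatement
of the actual patch at [1]):

        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                       10L, WAIT_EVENT_BGWORKER_STARTUP);

        if (rc & WL_LATCH_SET)
        {
            /*
             * Leave the latch set instead of calling ResetLatch(MyLatch)
             * here, so that the WaitLatch() in ApplyLauncherMain() returns
             * immediately and the main loop re-scans pg_subscription
             * instead of sleeping for DEFAULT_NAPTIME_PER_CYCLE. Note that
             * while the latch stays set, this inner WaitLatch() also
             * returns immediately, so the 10 ms timeout no longer throttles
             * the attach polling.
             */
            CHECK_FOR_INTERRUPTS();
        }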