Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-10 Thread Bertrand Drouvot
Hi, On Tue, Apr 08, 2025 at 03:27:41PM +0900, Michael Paquier wrote: > On Tue, Apr 08, 2025 at 06:19:02AM +, Bertrand Drouvot wrote: > > - A new injection_points_wakeup_detach() function that is holding the > > spinlock > > during the whole duration to ensure that no process can wait in betwe

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-09 Thread Bertrand Drouvot
Hi, On Tue, Apr 08, 2025 at 02:13:20PM +0900, Michael Paquier wrote: > On Tue, Apr 08, 2025 at 02:00:35AM +, Hayato Kuroda (Fujitsu) wrote: > > Your patch looks good to me and it could pass on my env. PSA patches for > > PG16. > > Patch for PG17 is not changed, just renamed. > > @@ -1287,6 +

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-09 Thread Michael Paquier
On Wed, Apr 09, 2025 at 12:07:31PM +0530, Amit Kapila wrote: > On Wed, Apr 9, 2025 at 11:24 AM Bertrand Drouvot > wrote: > I can't think of a good reason to have this DEBUG1 as there is no > predictability of it getting generated even with tests using an > injection point. OTOH, I don't have any o

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-09 Thread Amit Kapila
On Wed, Apr 9, 2025 at 11:24 AM Bertrand Drouvot wrote: > > On Wed, Apr 09, 2025 at 12:03:06PM +0900, Michael Paquier wrote: > > On Tue, Apr 08, 2025 at 06:42:53AM +, Bertrand Drouvot wrote: > > > Fully agree. Will need to find another way to prevent a process to wait > > > between the > > >

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-08 Thread Bertrand Drouvot
Hi, On Wed, Apr 09, 2025 at 12:03:06PM +0900, Michael Paquier wrote: > On Tue, Apr 08, 2025 at 06:42:53AM +, Bertrand Drouvot wrote: > > Fully agree. Will need to find another way to prevent a process to wait > > between the > > wakeup and the detach. I'll open a dedicated thread. > > By the

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-08 Thread Michael Paquier
On Tue, Apr 08, 2025 at 06:42:53AM +, Bertrand Drouvot wrote: > Fully agree. Will need to find another way to prevent a process to wait > between the > wakeup and the detach. I'll open a dedicated thread. By the way, there is a small thing that's itching me a bit about the change done in LogS

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-08 Thread Michael Paquier
On Tue, Apr 08, 2025 at 06:19:02AM +, Bertrand Drouvot wrote: > - A new injection_points_wakeup_detach() function that is holding the spinlock > during the whole duration to ensure that no process can wait in between the > wakeup and the detach. That would not a correct spinlock use. injectio

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-08 Thread Andres Freund
Hi, On 2025-04-08 02:00:35 +, Hayato Kuroda (Fujitsu) wrote: > > I have changed quite a few comments and commit message for the PG17 > > patch in the attached. Can you update PG16 patch based on this and > > also use the same commit message as used in attached for all the three > > patches? >

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-07 Thread Michael Paquier
On Tue, Apr 08, 2025 at 02:00:35AM +, Hayato Kuroda (Fujitsu) wrote: > Your patch looks good to me and it could pass on my env. PSA patches for PG16. > Patch for PG17 is not changed, just renamed. @@ -1287,6 +1288,17 @@ LogStandbySnapshot(void) Assert(XLogStandbyInfoActive()); +#ifdef

RE: Fix 035_standby_logical_decoding.pl race conditions

2025-04-07 Thread Hayato Kuroda (Fujitsu)
Dear Amit, > > I have changed quite a few comments and commit message for the PG17 > patch in the attached. Can you update PG16 patch based on this and > also use the same commit message as used in attached for all the three > patches? Your patch looks good to me and it could pass on my env. PSA

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-07 Thread Amit Kapila
On Thu, Apr 3, 2025 at 11:29 AM Hayato Kuroda (Fujitsu) wrote: > I have changed quite a few comments and commit message for the PG17 patch in the attached. Can you update PG16 patch based on this and also use the same commit message as used in attached for all the three patches? -- With Regards

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-07 Thread Bertrand Drouvot
Hi, On Mon, Apr 07, 2025 at 03:16:07PM +0530, Amit Kapila wrote: > On Thu, Apr 3, 2025 at 3:15 PM Amit Kapila wrote: > > Hmm, but adding some additional smarts also makes this test less easy > > to backpatch. I see your points related to the benefits, but I still > > mildly prefer to go with the

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-07 Thread Amit Kapila
On Thu, Apr 3, 2025 at 3:15 PM Amit Kapila wrote: > > On Thu, Apr 3, 2025 at 12:29 PM Bertrand Drouvot > wrote: > > > > On Thu, Apr 03, 2025 at 05:34:10AM +, Hayato Kuroda (Fujitsu) wrote: > > > Dear Bertrand, Amit, > > > > > > > > I do prefer v5-PG17-2 as it is "closer" to HEAD. That said, I

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-07 Thread Bertrand Drouvot
Hi Kuroda-san, On Mon, Apr 07, 2025 at 06:15:13AM +, Hayato Kuroda (Fujitsu) wrote: > I had been debugging and found the case that VACUUM FULL also has a timing > issue. > This means the we cannot keep the testcase. > > PSA the reproducer for PG17. IIUC this can happen even in PG16. > I cons

RE: Fix 035_standby_logical_decoding.pl race conditions

2025-04-06 Thread Hayato Kuroda (Fujitsu)
Dear Bertrand, > I wonder if we could not keep this test and make the slot active for the > vacuum full case. Looking at drongo's failure in [1], there is no occurence > of "vacuum full" and that's probably linked to Andres's explanation in [2]: > > " > a VACUUM FULL on pg_class is > used, which

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-05 Thread Amit Kapila
On Thu, Apr 3, 2025 at 11:04 AM Hayato Kuroda (Fujitsu) wrote: > > Dear Bertrand, Amit, > > > > I do prefer v5-PG17-2 as it is "closer" to HEAD. That said, I think that > > > we should > > > keep the slots active and only avoid doing the checks for them (they are > > invalidated > > > that's fine

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-05 Thread Bertrand Drouvot
On Wed, Apr 02, 2025 at 12:13:52PM +, Hayato Kuroda (Fujitsu) wrote: > Dear Amit, Bertrand, > > > The other idea to simplify the changes for backbranches: > > sub reactive_slots_change_hfs_and_wait_for_xmins > > { > > ... > > +  my ($slot_prefix, $hsf, $invalidated, $needs_active_slot) = @_; >

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-05 Thread Bertrand Drouvot
Hi, On Thu, Apr 03, 2025 at 05:34:10AM +, Hayato Kuroda (Fujitsu) wrote: > Dear Bertrand, Amit, > > > > I do prefer v5-PG17-2 as it is "closer" to HEAD. That said, I think that > > > we should > > > keep the slots active and only avoid doing the checks for them (they are > > invalidated > >

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-05 Thread Amit Kapila
On Wed, Apr 2, 2025 at 2:06 PM Bertrand Drouvot wrote: > > Hi Kuroda-san, > > On Wed, Apr 02, 2025 at 07:16:25AM +, Hayato Kuroda (Fujitsu) wrote: > > As far v4-0001: > > === 1 > > +# would advance an active replication slot's catalog_xmin > > s/would/could/? I mean the system also needs to be

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-04 Thread Bertrand Drouvot
Hi, On Wed, Apr 02, 2025 at 03:04:07PM +0530, Amit Kapila wrote: > I have changed it based on your suggestions and made a few other > changes in the comments. Please see attached. Thanks! > * > +  if (IS_INJECTION_POINT_ATTACHED("log-running-xacts")) > > It is better to name the injection point

RE: Fix 035_standby_logical_decoding.pl race conditions

2025-04-03 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Bertrand, > The other idea to simplify the changes for backbranches: > sub reactive_slots_change_hfs_and_wait_for_xmins > { > ... > +  my ($slot_prefix, $hsf, $invalidated, $needs_active_slot) = @_; > >   # create the logical slots > -  create_logical_slots($node_standby, $slot_prefix)

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-03 Thread Amit Kapila
On Thu, Apr 3, 2025 at 12:29 PM Bertrand Drouvot wrote: > > On Thu, Apr 03, 2025 at 05:34:10AM +, Hayato Kuroda (Fujitsu) wrote: > > Dear Bertrand, Amit, > > > > > > I do prefer v5-PG17-2 as it is "closer" to HEAD. That said, I think > > > > that we should > > > > keep the slots active and on

RE: Fix 035_standby_logical_decoding.pl race conditions

2025-04-02 Thread Hayato Kuroda (Fujitsu)
> Isn't patch 0001-Fix-invalid-referring-of-hash-ref-for-replication-sl > unrelated to this thread? Or am, I missing something? I did attach wrongly, PSA correct set. Sorry for inconvenience. Best regards, Hayato Kuroda FUJITSU LIMITED v5-0001-Stabilize-035_standby_logical_decoding.pl-by-usin

RE: Fix 035_standby_logical_decoding.pl race conditions

2025-04-02 Thread Hayato Kuroda (Fujitsu)
Dear Bertrand, Amit, > > I do prefer v5-PG17-2 as it is "closer" to HEAD. That said, I think that we > > should > > keep the slots active and only avoid doing the checks for them (they are > invalidated > > that's fine, they are not that's fine too). > > > > I don't mind doing that, but there is

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-02 Thread Amit Kapila
On Wed, Apr 2, 2025 at 8:30 PM Bertrand Drouvot wrote: > > On Wed, Apr 02, 2025 at 12:13:52PM +, Hayato Kuroda (Fujitsu) wrote: > > Dear Amit, Bertrand, > > > > > The other idea to simplify the changes for backbranches: > > > sub reactive_slots_change_hfs_and_wait_for_xmins > > > { > > > ... >

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-02 Thread Amit Kapila
On Wed, Apr 2, 2025 at 2:06 PM Bertrand Drouvot wrote: > > > As far PG17-v4-0001: > > > === 4 > > It looks like that check_slots_conflict_reason() is not called with > checks_active_slot > as argument. > > === 5 > > I think that we could remove the need for the drop_active_slot parameter in > dro

RE: Fix 035_standby_logical_decoding.pl race conditions

2025-04-02 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Bertrand, > You have not added any injection point for the above case. Isn't it > possible that if running_xact record is logged concurrently to the > pruning record, it should move the active slot on standby, and the > same failure should occur in this case as well? I considered that

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-02 Thread Bertrand Drouvot
Hi Kuroda-san, On Wed, Apr 02, 2025 at 07:16:25AM +, Hayato Kuroda (Fujitsu) wrote: > Dear Amit, Bertrand, > > > You have not added any injection point for the above case. Isn't it > > possible that if running_xact record is logged concurrently to the > > pruning record, it should move the ac

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-01 Thread Amit Kapila
On Tue, Apr 1, 2025 at 2:02 PM Bertrand Drouvot wrote: > > Hi Kuroda-san, > > On Tue, Apr 01, 2025 at 01:22:49AM +, Hayato Kuroda (Fujitsu) wrote: > > Dear Bertrand, > > > > Thanks for the updated patch! > > > > s/to avoid the seeing a xl_running_xacts/to avoid seeing a > > > xl_running_xacts

RE: Fix 035_standby_logical_decoding.pl race conditions

2025-04-01 Thread Hayato Kuroda (Fujitsu)
Dear Bertrand, > s/to avoid the seeing a xl_running_xacts/to avoid seeing a xl_running_xacts/? Fixed. > > === 2 (Nit) > > /* For testing slot invalidation due to the conflict */ > > Not sure "due to the conflict" is needed. > OK, removed. > About PG17-v2-0001 > > === 3 > > The commit

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-01 Thread Amit Kapila
On Tue, Apr 1, 2025 at 6:53 AM Hayato Kuroda (Fujitsu) wrote: > With respect to 0001, can't this problem happen for the following case as well? # Recovery conflict: Invalidate conflicting slots, including in-use slots # Scenario 5: conflict due to on-access pruning. You have not added any inject

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-01 Thread Bertrand Drouvot
Hi, On Tue, Apr 01, 2025 at 04:55:06PM +0530, Amit Kapila wrote: > On Tue, Apr 1, 2025 at 2:02 PM Bertrand Drouvot > wrote: > > > But if we "really" want to produce a "new" WAL record, what about using > > LogLogicalMessage()? > > > > We are using injection points for testing purposes, which me

RE: Fix 035_standby_logical_decoding.pl race conditions

2025-04-01 Thread Hayato Kuroda (Fujitsu)
Dear Bertrand, > > > s/to avoid the seeing a xl_running_xacts/to avoid seeing a > > > xl_running_xacts/? > > > > Fixed. Sorry, I misunderstood your comment and wrongly fixed. I will address in next version. > === 1 > > +* XXX What value should we return here? Originally this >

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-04-01 Thread Bertrand Drouvot
Hi Kuroda-san, On Tue, Apr 01, 2025 at 01:22:49AM +, Hayato Kuroda (Fujitsu) wrote: > Dear Bertrand, > Thanks for the updated patch! > > s/to avoid the seeing a xl_running_xacts/to avoid seeing a > > xl_running_xacts/? > > Fixed. hmm, not sure as I still can see: +# Note that injection_

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-03-31 Thread Bertrand Drouvot
Hi Kuroda-san and Amit, On Fri, Mar 28, 2025 at 09:02:29AM +, Hayato Kuroda (Fujitsu) wrote: > Dear Amit, > > > > > Right, I think this is a better idea. I like it too and the bonus point is that this injection point can be used in more tests (more use cases). A few comments: About v

RE: Fix 035_standby_logical_decoding.pl race conditions

2025-03-28 Thread Hayato Kuroda (Fujitsu)
Dear Amit, > > Right, I think this is a better idea. I have a few comments: > 1. > + /* > + * In 035_standby_logical_decoding.pl, RUNNING_XACTS could move slots's > + * xmin forward and cause random failures. > > No need to use test file name in code comments. Fixed. > 2. The comments atop wai

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-03-27 Thread Amit Kapila
On Wed, Mar 26, 2025 at 1:17 PM Hayato Kuroda (Fujitsu) wrote: > > > Seeing all these failures, I wonder whether we can reliably test > > active slots apart from wal_level change test (aka Scenario 6: > > incorrect wal_level on primary.). Sure, we can try by having some > > injection point kind of

RE: Fix 035_standby_logical_decoding.pl race conditions

2025-03-26 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Bertrand, > Seeing all these failures, I wonder whether we can reliably test > active slots apart from wal_level change test (aka Scenario 6: > incorrect wal_level on primary.). Hmm, agreed. We do not have good solution to stabilize tests, at least for now. I've created a patch for PG1

RE: Fix 035_standby_logical_decoding.pl race conditions

2025-03-26 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Bertrand, > Seeing all these failures, I wonder whether we can reliably test > active slots apart from wal_level change test (aka Scenario 6: > incorrect wal_level on primary.). Sure, we can try by having some > injection point kind of tests, but is it really worth because, anyway > the

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-03-25 Thread Amit Kapila
On Fri, Mar 21, 2025 at 9:48 PM Bertrand Drouvot wrote: > > So, I'm not sure I like the idea that much, but thinking out loud: I wonder if > we could bypass the "active" slot checks in 16 and 17 and use injection > points as > proposed as of 18 (as we need the injection points changes proposed in

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-03-24 Thread Bertrand Drouvot
Hi Kuroda-san, On Mon, Mar 24, 2025 at 04:54:21AM +, Hayato Kuroda (Fujitsu) wrote: > > So, I'm not sure I like the idea that much, but thinking out loud: I wonder > > if > > we could bypass the "active" slot checks in 16 and 17 and use injection > > points as > > proposed as of 18 (as we ne

RE: Fix 035_standby_logical_decoding.pl race conditions

2025-03-24 Thread Hayato Kuroda (Fujitsu)
Dear Bertrand, I'm also working on the thread to resolve the random failure. > Yes, that's also my understanding. It's also easy to "simulate" by adding > a checkpoint on the primary and a long enough sleep after we launched our sql > in > wait_until_vacuum_can_remove(). Thanks for letting me k

RE: Fix 035_standby_logical_decoding.pl race conditions

2025-03-23 Thread Hayato Kuroda (Fujitsu)
Dear Bertrand, > > SIGSTOP signal for pg_recvlogical may do the idea, > > Yeah, but would we be "really" testing an "active" slot? Yeah, this is also a debatable point. > At the end we want to produce an invalidation that may or not happen on a real > environment. The corner case is in the test

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-03-21 Thread Bertrand Drouvot
Hi Kuroda-san, On Fri, Mar 21, 2025 at 12:28:10PM +, Hayato Kuroda (Fujitsu) wrote: > I'm also working on the thread to resolve the random failure. Thanks for looking at it! > I've also tried the idea with the living transaction via background_psql(), > but I got the same result. The test co

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-03-19 Thread Bertrand Drouvot
Hi, On Wed, Mar 19, 2025 at 12:12:19PM +0530, Amit Kapila wrote: > On Mon, Feb 10, 2025 at 8:12 PM Bertrand Drouvot > wrote: > > > > Please find attached a patch to $SUBJECT. > > > > In rare circumstances (and on slow machines) it is possible that a > > xl_running_xacts > > is emitted and that t

Re: Fix 035_standby_logical_decoding.pl race conditions

2025-03-18 Thread Amit Kapila
On Mon, Feb 10, 2025 at 8:12 PM Bertrand Drouvot wrote: > > Please find attached a patch to $SUBJECT. > > In rare circumstances (and on slow machines) it is possible that a > xl_running_xacts > is emitted and that the catalog_xmin of a logical slot on the standby advances > past the conflict poin

Fix 035_standby_logical_decoding.pl race conditions

2025-02-10 Thread Bertrand Drouvot
Hi hackers, Please find attached a patch to $SUBJECT. In rare circumstances (and on slow machines) it is possible that a xl_running_xacts is emitted and that the catalog_xmin of a logical slot on the standby advances past the conflict point. In that case, no conflict is reported and the test fai