Re: Fix 035_standby_logical_decoding.pl race conditions

Amit Kapila Mon, 07 Apr 2025 02:46:38 -0700

On Thu, Apr 3, 2025 at 3:15 PM Amit Kapila <[email protected]> wrote:
>
> On Thu, Apr 3, 2025 at 12:29 PM Bertrand Drouvot
> <[email protected]> wrote:
> >
> > On Thu, Apr 03, 2025 at 05:34:10AM +0000, Hayato Kuroda (Fujitsu) wrote:
> > > Dear Bertrand, Amit,
> > >
> > > > > I do prefer v5-PG17-2 as it is "closer" to HEAD. That said, I think 
> > > > > that we should
> > > > > keep the slots active and only avoid doing the checks for them (they 
> > > > > are
> > > > invalidated
> > > > > that's fine, they are not that's fine too).
> > > > >
> > > >
> > > > I don't mind doing that, but there is no benefit in making slots
> > > > active unless we can validate them. And we will end up adding some
> > > > more checks, as in function check_slots_conflict_reason without any
> > > > advantage.
> >
> > I think that there is advantage. The pros are:
> >
> > - the test would be closer to HEAD from a behavioural point of view
> > - it's very rare to hit the corner cases: so the test would behave the same
> > as on HEAD most of the time (and when it does not that would not hurt as the
> > checks are nor done)
> > - Kuroda-San's patch removes "or die "replication slot stats of 
> > vacuum_full_activeslot not updated"
> > while keeping the slot active is able to keep it (should the slot being 
> > invalidated
> > or not). But more on that in the comment === 1 below.
> >
> > > I feel Kuroda-San's second patch is simple, and we have
> > > > fewer chances to make mistakes and easy to maintain in the future as
> > > > well.
> >
> > Yeah maybe but the price to pay is to discard the pros above. That said, 
> > I'm also
> > fine with Kuroda-San's patch if both of you feel that it's better.
> >
> > > I have concerns for Bertrand's patch that it could introduce another 
> > > timing
> > > issue. E.g., if the activated slots are not invalidated, dropping slots 
> > > is keep
> > > being activated so the dropping might be fail.
> >
> > Yeah, but the drop is done with "$node_standby->psql" so that the test does 
> > not
> > produce an error. It would produce an error should we use 
> > "$node_standby->safe_psql"
> > instead.
> >
> > > I did not reproduce this but
> > > something like this can happen if we activate slots.
> >
> > You can see it that way (+ reproducer.txt):
> >
> > "
> > +       my $bdt = $node_standby->safe_psql('postgres', qq[SELECT * from 
> > pg_replication_slots]);
> > +       note "BDT: $bdt";
> > +
> >         $node_standby->psql('postgres',
> >                 qq[SELECT pg_drop_replication_slot('$inactive_slot')]);
> > "
> >
> > You'd see the slot being active and the "$node_standby->psql" not reporting
> > any error.
> >
>
> Hmm, but adding some additional smarts also makes this test less easy
> to backpatch. I see your points related to the benefits, but I still
> mildly prefer to go with the lesser changes approach for backbranches
> patch. Normally, we don't enhance backbranches code without making
> equivalent changes in HEAD, so adding some new bugs only in
> backbranches has a lesser chance.
>


Bertrand, do you agree with the fewer changes approach (where active
slots won't be tested) for backbranches? I think now that we have
established that the vacuum full test is also prone to failure due to
race condition in the test, this is the only remaining open point.

-- 
With Regards,
Amit Kapila.

Re: Fix 035_standby_logical_decoding.pl race conditions

Reply via email to