On Fri, Jan 19, 2024 at 09:03:01AM +0000, Bertrand Drouvot wrote:
> On Fri, Jan 19, 2024 at 09:00:01AM +0300, Alexander Lakhin wrote:
>> 15.01.2024 12:00, Alexander Lakhin wrote:
>>> If this approach looks promising to you, maybe we could add a submodule to
>>> perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
>>> 019_replslot_limit) as well.
>>> 
>>> Personally I think that having such a functionality for using in tests
>>> might be useful not only to avoid some "problematic" behaviour but also to
>>> test the opposite cases.
>> 
>> After spending a few days on it, I've discovered two more issues:
>> https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
>> https://www.postgresql.org/message-id/b0102688-6d6c-c86a-db79-e0e91d245b1a%40gmail.com
>> 
>> (The latter is not related to bgwriter directly, but it was discovered
>> thanks to the RUNNING_XACTS record flew in WAL in a lucky moment.)
>> 
>> So it becomes clear that the 035 test is not the only one, which might
>> suffer from bgwriter's activity,
> 
> Yeah... thanks for sharing!
> 
>> and inventing a way to stop bgwriter/
>> control it is a different subject, getting out of scope of the current
>> issue.
> 
> Agree.
> 
>> 15.01.2024 11:49, Bertrand Drouvot wrote:
>> Maybe it would be a right move to commit the fix, and then think about
>> more rare failures.
> 
> +1

Yeah, agreed to make things more stable before making them fancier.

>> 2) Shall we align the 035_standby_logical_decoding with
>> 031_recovery_conflict in regard to improving stability of vacuum?
> 
> Yeah, I think that could make sense.

Probably.  That can always be done as a separate change, after a few
runs of the slow buildfarm members able to reproduce the failure.

>> I see the following options for this:
>> a) use wait_until_vacuum_can_remove() and autovacuum = off in both tests;
>> b) use FREEZE and autovacuum = off in both tests;
>> c) use wait_until_vacuum_can_remove() in 035, FREEZE in 031, and
>>  autovacuum = off in both.
> 
> I'd vote for a) as I've the feeling it's "easier" to understand (and I'm not
> sure using FREEZE would give full "stabilization predictability", at least for
> 035_standby_logical_decoding.pl). That said I did not test what the outcome 
> would
> be for 031_recovery_conflict.pl by making use of a).

Yeah, I think I agree here with a), as v7 does for 035.

+# Launch $sql and wait for a new snapshot that has a newer horizon before
+# doing the vacuum with $vac_option on $to_vac.
+sub wait_until_vacuum_can_remove

This had better document what the arguments of this routine are,
because that's really unclear.  $to_vac is the relation that will be
vacuumed, for one.

Also, wouldn't it be better to document in the test why
txid_current_snapshot() is chosen?  Contrary to txid_current(), it
does not generate a Transaction/COMMIT to make the test more
predictible, something you have mentioned upthread, and implied in the
test.

-  INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal

This removes two INSERTs on flush_wal and refactors the code to do the
INSERT in wait_until_vacuum_can_remove(), using a SQL comment to
document a reference about the reason why an INSERT is used.  Couldn't
you just use a normal comment here?

>> I've re-tested the v6 patch today and confirmed that it makes the test
>> more stable. I ran (with bgwriter_delay = 10000 in temp.config) 20 tests in
>> parallel and got failures ('inactiveslot slot invalidation is logged with
>> vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied.
>> (With unlimited CPU, when average test duration is around 70 seconds.)
>> 
>> But with v6 applied, 60 iterations succeeded.
> 
> Nice! Thanks for the testing!

I need to review what you have more thoroughly, but is it OK to assume
that both of you are happy with the latest version of the patch in
terms of stability gained?  That's still not the full picture, still a
good step towards that.
--
Michael

Attachment: signature.asc
Description: PGP signature

Reply via email to