On Fri, Jan 19, 2024 at 09:03:01AM +0000, Bertrand Drouvot wrote: > On Fri, Jan 19, 2024 at 09:00:01AM +0300, Alexander Lakhin wrote: >> 15.01.2024 12:00, Alexander Lakhin wrote: >>> If this approach looks promising to you, maybe we could add a submodule to >>> perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in >>> 019_replslot_limit) as well. >>> >>> Personally I think that having such a functionality for using in tests >>> might be useful not only to avoid some "problematic" behaviour but also to >>> test the opposite cases. >> >> After spending a few days on it, I've discovered two more issues: >> https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com >> https://www.postgresql.org/message-id/b0102688-6d6c-c86a-db79-e0e91d245b1a%40gmail.com >> >> (The latter is not related to bgwriter directly, but it was discovered >> thanks to the RUNNING_XACTS record flew in WAL in a lucky moment.) >> >> So it becomes clear that the 035 test is not the only one, which might >> suffer from bgwriter's activity, > > Yeah... thanks for sharing! > >> and inventing a way to stop bgwriter/ >> control it is a different subject, getting out of scope of the current >> issue. > > Agree. > >> 15.01.2024 11:49, Bertrand Drouvot wrote: >> Maybe it would be a right move to commit the fix, and then think about >> more rare failures. > > +1
Yeah, agreed to make things more stable before making them fancier. >> 2) Shall we align the 035_standby_logical_decoding with >> 031_recovery_conflict in regard to improving stability of vacuum? > > Yeah, I think that could make sense. Probably. That can always be done as a separate change, after a few runs of the slow buildfarm members able to reproduce the failure. >> I see the following options for this: >> a) use wait_until_vacuum_can_remove() and autovacuum = off in both tests; >> b) use FREEZE and autovacuum = off in both tests; >> c) use wait_until_vacuum_can_remove() in 035, FREEZE in 031, and >> autovacuum = off in both. > > I'd vote for a) as I've the feeling it's "easier" to understand (and I'm not > sure using FREEZE would give full "stabilization predictability", at least for > 035_standby_logical_decoding.pl). That said I did not test what the outcome > would > be for 031_recovery_conflict.pl by making use of a). Yeah, I think I agree here with a), as v7 does for 035. +# Launch $sql and wait for a new snapshot that has a newer horizon before +# doing the vacuum with $vac_option on $to_vac. +sub wait_until_vacuum_can_remove This had better document what the arguments of this routine are, because that's really unclear. $to_vac is the relation that will be vacuumed, for one. Also, wouldn't it be better to document in the test why txid_current_snapshot() is chosen? Contrary to txid_current(), it does not generate a Transaction/COMMIT to make the test more predictible, something you have mentioned upthread, and implied in the test. - INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal This removes two INSERTs on flush_wal and refactors the code to do the INSERT in wait_until_vacuum_can_remove(), using a SQL comment to document a reference about the reason why an INSERT is used. Couldn't you just use a normal comment here? >> I've re-tested the v6 patch today and confirmed that it makes the test >> more stable. I ran (with bgwriter_delay = 10000 in temp.config) 20 tests in >> parallel and got failures ('inactiveslot slot invalidation is logged with >> vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied. >> (With unlimited CPU, when average test duration is around 70 seconds.) >> >> But with v6 applied, 60 iterations succeeded. > > Nice! Thanks for the testing! I need to review what you have more thoroughly, but is it OK to assume that both of you are happy with the latest version of the patch in terms of stability gained? That's still not the full picture, still a good step towards that. -- Michael
signature.asc
Description: PGP signature