On Mon, Jul 8, 2024 at 2:25 PM Melanie Plageman <melanieplage...@gmail.com> wrote: > Attached v3 has one important additional component in the test -- I > use pg_stat_progress_vacuum to confirm that we actually do more than > one pass of index vacuuming. Otherwise, it would have been trivial for > the test to incorrectly pass.
That's a good idea. > I could still use another pair of eyes on the test (looking out for > stability enhancing measures I could take). First, the basics: I found that your test failed reliably without your fix, and passed reliably with your fix. Next, performance: the total test runtime (as indicated by "time meson test -q recovery/043_vacuum_horizon_floor") was comparable to other recovery/* TAP tests. The new vacuum_horizon_floor test is a little on the long running side, as these tests go, but I think that's fine. Minor nitpicking about the comments in your TAP test: * It is necessary but not sufficient for your test to "skewer" maybe_needed, relative to OldestXmin. Obviously, it is not sufficient because the test can only fail when VACUUM prunes a heap page after the backend's horizons have been "skewered" in this sense. Pruning is when we get stuck, and if there's no more pruning then there's no opportunity for VACUUM to get stuck. Perhaps this point should be noted directly in the comments. You could add a sentence immediately after the existing sentence "Then vacuum's first pass will continue and pruning...". This new sentence would then add commentary such as "Finally, vacuum's second pass over the heap...". * Perhaps you should point out that you're using VACUUM FREEZE for this because it'll force the backend to always get a cleanup lock. This is something you rely on to make the repro reliable, but that's it. In other words, point out to the reader that this bug has nothing to do with freezing; it just so happens to be convenient to use VACUUM FREEZE this way, due to implementation details. * The sentence "VACUUM proceeds with pruning and does a visibility check on each tuple..." describes the bug in terms of the current state of things on Postgres 17, but Postgres 17 hasn't been released just yet. Does that really make sense? If you want to describe the invariant that caused heap_pre_freeze_checks() to error-out on HEAD/Postgres 17, then the commit message of your fix seems like the right place for that. You could reference these errors in passing. The errors seem fairly incidental to the real problem, at least to me. I think that there is some chance that this test will break the build farm in whatever way, since there is a long history of VACUUM not quite behaving as expected with these sorts of tests. I think that you should commit the test case separately, first thing in the morning, and then keep an eye on the build farm for the rest of the day. I don't think that it's sensible to bend over backwards, just to avoid breaking the build farm in this way. Thanks for working on this -- Peter Geoghegan