On Mon, Jul 22, 2024 at 10:54 PM Masahiko Sawada <sawada.m...@gmail.com> wrote: > > On Mon, Jul 22, 2024 at 6:26 PM Melanie Plageman > <melanieplage...@gmail.com> wrote: > > > > On Mon, Jul 22, 2024 at 6:36 PM Tom Lane <t...@sss.pgh.pa.us> wrote: > > > > > > Melanie Plageman <melanieplage...@gmail.com> writes: > > > > We've only run tests with this commit on some of the back branches for > > > > some of these animals. Of those, I don't see any failures so far. So, > > > > it seems the test instability is just related to trying to get > > > > multiple passes of index vacuuming reliably with TIDStore. > > > > > > > AFAICT, all the 32bit machine failures are timeouts waiting for the > > > > standby to catch up (mamba, gull, merswine). Unfortunately, the > > > > failures on copperhead (a 64 bit machine) are because we don't > > > > actually succeed in triggering a second vacuum pass. This would not be > > > > fixed by a longer timeout. > > > > > > Ouch. This seems to me to raise the importance of getting a better > > > way to test multiple-index-vacuum-passes. Peter argued upthread > > > that we don't need a better way, but I don't see how that argument > > > holds water if copperhead was not reaching it despite being 64-bit. > > > (Did you figure out exactly why it doesn't reach the code?) > > > > I wasn't able to reproduce the failure (failing to do > 1 index vacuum > > pass) on my local machine (which is 64 bit) without decreasing the > > number of tuples inserted. The copperhead failure confuses me because > > the speed of the machine should *not* affect how much space the dead > > item TIDStore takes up. I would have bet money that the same number > > and offsets of dead tuples per page in a relation would take up the > > same amount of space in a TIDStore on any 64-bit system -- regardless > > of how slowly it runs vacuum. > > Looking at copperhead's failure logs, I could not find that "VACUUM > (VERBOSE, FREEZE) vac_horizon_floor_table;" wrote the number of index > scans in logs. Is there any clue that made you think the test failed > to do multiple index vacuum passes?
The vacuum doesn't actually finish because I have a cursor that keeps it from finishing and then I query pg_stat_progress_vacuum after the first index vacuuming round should have happened and it did not do the index vacuum: [20:39:34.645](351.522s) # poll_query_until timed out executing this query: # # SELECT index_vacuum_count > 0 # FROM pg_stat_progress_vacuum # WHERE datname='test_db' AND relid::regclass = 'vac_horizon_floor_table'::regclass; # # expecting this output: # t # last actual query output: # f https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-07-22%2015%3A00%3A11 I suppose it is possible that it did in fact time out and the index vacuum was still in progress. But most of the other "too slow" failures were when the standby was trying to catch up. Usually the pg_stat_progress_vacuum test fails because we didn't actually do that index vacuuming round yet. - Melanie