On Tue, Jul 23, 2024 at 5:43 AM Melanie Plageman <melanieplage...@gmail.com> wrote:
>
> On Mon, Jul 22, 2024 at 10:54 PM Masahiko Sawada <sawada.m...@gmail.com>
> wrote:
> >
> > On Mon, Jul 22, 2024 at 6:26 PM Melanie Plageman
> > <melanieplage...@gmail.com> wrote:
> > >
> > > On Mon, Jul 22, 2024 at 6:36 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
> > > >
> > > > Melanie Plageman <melanieplage...@gmail.com> writes:
> > > > > We've only run tests with this commit on some of the back branches for
> > > > > some of these animals. Of those, I don't see any failures so far. So,
> > > > > it seems the test instability is just related to trying to get
> > > > > multiple passes of index vacuuming reliably with TIDStore.
> > > >
> > > > > AFAICT, all the 32bit machine failures are timeouts waiting for the
> > > > > standby to catch up (mamba, gull, merswine). Unfortunately, the
> > > > > failures on copperhead (a 64 bit machine) are because we don't
> > > > > actually succeed in triggering a second vacuum pass. This would not be
> > > > > fixed by a longer timeout.
> > > >
> > > > Ouch. This seems to me to raise the importance of getting a better
> > > > way to test multiple-index-vacuum-passes. Peter argued upthread
> > > > that we don't need a better way, but I don't see how that argument
> > > > holds water if copperhead was not reaching it despite being 64-bit.
> > > > (Did you figure out exactly why it doesn't reach the code?)
> > >
> > > I wasn't able to reproduce the failure (failing to do > 1 index vacuum
> > > pass) on my local machine (which is 64 bit) without decreasing the
> > > number of tuples inserted. The copperhead failure confuses me because
> > > the speed of the machine should *not* affect how much space the dead
> > > item TIDStore takes up. I would have bet money that the same number
> > > and offsets of dead tuples per page in a relation would take up the
> > > same amount of space in a TIDStore on any 64-bit system -- regardless
> > > of how slowly it runs vacuum.
> >
> > Looking at copperhead's failure logs, I could not find that "VACUUM
> > (VERBOSE, FREEZE) vac_horizon_floor_table;" wrote the number of index
> > scans in logs. Is there any clue that made you think the test failed
> > to do multiple index vacuum passes?
>
> The vacuum doesn't actually finish because I have a cursor that keeps
> it from finishing and then I query pg_stat_progress_vacuum after the
> first index vacuuming round should have happened and it did not do the
> index vacuum:
>
> [20:39:34.645](351.522s) # poll_query_until timed out executing this query:
> #
> # SELECT index_vacuum_count > 0
> # FROM pg_stat_progress_vacuum
> # WHERE datname='test_db' AND relid::regclass =
> 'vac_horizon_floor_table'::regclass;
> #
> # expecting this output:
> # t
> # last actual query output:
> # f
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-07-22%2015%3A00%3A11
>
> I suppose it is possible that it did in fact time out and the index
> vacuum was still in progress. But most of the other "too slow"
> failures were when the standby was trying to catch up. Usually the
> pg_stat_progress_vacuum test fails because we didn't actually do that
> index vacuuming round yet.
Thank you for your explanation! I understand the test case now.

I figured out why the two-round index vacuum was not triggered on
copperhead even though it's a 64-bit system. In short, this test case
depends on MEMORY_CONTEXT_CHECKING (or USE_ASSERT_CHECKING) being on.

In this test case, every BlocktableEntry is 16 bytes: the header is 8
bytes and the offset bitmap is 8 bytes (covering offsets up to 63). We
calculate the memory size to allocate in a bump memory context
(required_size in BumpAlloc()) as follows:

#ifdef MEMORY_CONTEXT_CHECKING
    /* ensure there's always space for the sentinel byte */
    chunk_size = MAXALIGN(size + 1);
#else
    chunk_size = MAXALIGN(size);
#endif

(snip)

    required_size = chunk_size + Bump_CHUNKHDRSZ;

Without MEMORY_CONTEXT_CHECKING, if size is 16 bytes, required_size is
also 16 bytes, as the size is already 8-byte aligned and
Bump_CHUNKHDRSZ is 0. With MEMORY_CONTEXT_CHECKING, on the other hand,
required_size is bumped to 40 bytes, as chunk_size becomes 24 bytes and
Bump_CHUNKHDRSZ is 16 bytes. Therefore, with MEMORY_CONTEXT_CHECKING we
allocate more memory per entry and use more bump memory blocks, which
is what fills up the TidStore in the test case.

We can easily reproduce this test failure with a PostgreSQL server
built without --enable-cassert. It seems that copperhead is the sole BF
animal that runs recovery-check without using --enable-cassert.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
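
P.S. To make the arithmetic above easy to double-check outside the
server, here is a minimal standalone sketch. It is not the actual
PostgreSQL code: MAXALIGN is replaced by a local 8-byte alignment
macro, and the 16-byte chunk header used under MEMORY_CONTEXT_CHECKING
is hard-coded as an assumption.

#include <stdio.h>
#include <stddef.h>

/* stand-in for MAXALIGN, assuming 8-byte maximum alignment */
#define ALIGN8(sz)  (((sz) + 7) & ~(size_t) 7)

static size_t
bump_required_size(size_t size, int memory_context_checking)
{
    /* assumed Bump_CHUNKHDRSZ: 0 normally, 16 bytes with checking */
    size_t      chunk_hdr = memory_context_checking ? 16 : 0;
    size_t      chunk_size;

    if (memory_context_checking)
        chunk_size = ALIGN8(size + 1);  /* room for the sentinel byte */
    else
        chunk_size = ALIGN8(size);

    return chunk_size + chunk_hdr;
}

int
main(void)
{
    /* 16-byte BlocktableEntry: 8-byte header + 8-byte offset bitmap */
    printf("without MEMORY_CONTEXT_CHECKING: %zu bytes\n",
           bump_required_size(16, 0));      /* prints 16 */
    printf("with MEMORY_CONTEXT_CHECKING:    %zu bytes\n",
           bump_required_size(16, 1));      /* prints 40 */
    return 0;
}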