On Tue, Jul 23, 2024 at 5:43 AM Melanie Plageman <melanieplage...@gmail.com> wrote:
>
> On Mon, Jul 22, 2024 at 10:54 PM Masahiko Sawada <sawada.m...@gmail.com>
> wrote:
> >
> > On Mon, Jul 22, 2024 at 6:26 PM Melanie Plageman
> > <melanieplage...@gmail.com> wrote:
> > >
> > > On Mon, Jul 22, 2024 at 6:36 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
> > > >
> > > > Melanie Plageman <melanieplage...@gmail.com> writes:
> > > > > We've only run tests with this commit on some of the back branches for
> > > > > some of these animals. Of those, I don't see any failures so far. So,
> > > > > it seems the test instability is just related to trying to get
> > > > > multiple passes of index vacuuming reliably with TIDStore.
> > > >
> > > > > AFAICT, all the 32bit machine failures are timeouts waiting for the
> > > > > standby to catch up (mamba, gull, merswine). Unfortunately, the
> > > > > failures on copperhead (a 64 bit machine) are because we don't
> > > > > actually succeed in triggering a second vacuum pass. This would not be
> > > > > fixed by a longer timeout.
> > > >
> > > > Ouch. This seems to me to raise the importance of getting a better
> > > > way to test multiple-index-vacuum-passes. Peter argued upthread
> > > > that we don't need a better way, but I don't see how that argument
> > > > holds water if copperhead was not reaching it despite being 64-bit.
> > > > (Did you figure out exactly why it doesn't reach the code?)
> > >
> > > I wasn't able to reproduce the failure (failing to do > 1 index vacuum
> > > pass) on my local machine (which is 64 bit) without decreasing the
> > > number of tuples inserted. The copperhead failure confuses me because
> > > the speed of the machine should *not* affect how much space the dead
> > > item TIDStore takes up. I would have bet money that the same number
> > > and offsets of dead tuples per page in a relation would take up the
> > > same amount of space in a TIDStore on any 64-bit system -- regardless
> > > of how slowly it runs vacuum.
> >
> > Looking at copperhead's failure logs, I could not find that "VACUUM
> > (VERBOSE, FREEZE) vac_horizon_floor_table;" wrote the number of index
> > scans in logs. Is there any clue that made you think the test failed
> > to do multiple index vacuum passes?
>
> The vacuum doesn't actually finish because I have a cursor that keeps
> it from finishing and then I query pg_stat_progress_vacuum after the
> first index vacuuming round should have happened and it did not do the
> index vacuum:
>
> [20:39:34.645](351.522s) # poll_query_until timed out executing this query:
> #
> # SELECT index_vacuum_count > 0
> # FROM pg_stat_progress_vacuum
> # WHERE datname='test_db' AND relid::regclass =
> 'vac_horizon_floor_table'::regclass;
> #
> # expecting this output:
> # t
> # last actual query output:
> # f
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-07-22%2015%3A00%3A11
>
> I suppose it is possible that it did in fact time out and the index
> vacuum was still in progress. But most of the other "too slow"
> failures were when the standby was trying to catch up. Usually the
> pg_stat_progress_vacuum test fails because we didn't actually do that
> index vacuuming round yet.
Thank you for your explanation! I understand the test case now.

I figured out why the two-round index vacuum was not triggered on
copperhead even though it's a 64-bit system. In short, this test case
depends on MEMORY_CONTEXT_CHECKING (or USE_ASSERT_CHECKING) being on.

In this test case, every BlocktableEntry is 16 bytes: the header is 8
bytes and the offset bitmap is 8 bytes (covering offsets up to 63). We
calculate the memory size to allocate in a bump memory context
(required_size in BumpAlloc()) as follows:

#ifdef MEMORY_CONTEXT_CHECKING
    /* ensure there's always space for the sentinel byte */
    chunk_size = MAXALIGN(size + 1);
#else
    chunk_size = MAXALIGN(size);
#endif

(snip)

    required_size = chunk_size + Bump_CHUNKHDRSZ;

Without MEMORY_CONTEXT_CHECKING, if size is 16 bytes, required_size is
also 16 bytes, as the size is already 8-byte aligned and
Bump_CHUNKHDRSZ is 0. With MEMORY_CONTEXT_CHECKING, on the other hand,
required_size is bumped to 40 bytes, as chunk_size becomes 24 bytes and
Bump_CHUNKHDRSZ is 16 bytes. Therefore, with MEMORY_CONTEXT_CHECKING we
allocate more memory per entry and use more bump memory blocks, which
is what fills up the TidStore in the test case.

We can easily reproduce this test failure with a PostgreSQL server
built without --enable-cassert. It seems that copperhead is the sole BF
animal that runs recovery-check without using --enable-cassert.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
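
P.S. To make the arithmetic above easy to double-check outside the
server, here is a minimal standalone sketch. It is not the actual
PostgreSQL code: MAXALIGN is replaced by a local 8-byte alignment
macro, and the 16-byte chunk header used under MEMORY_CONTEXT_CHECKING
is hard-coded as an assumption.

#include <stdio.h>
#include <stddef.h>

/* stand-in for MAXALIGN, assuming 8-byte maximum alignment */
#define ALIGN8(sz)  (((sz) + 7) & ~(size_t) 7)

static size_t
bump_required_size(size_t size, int memory_context_checking)
{
    /* assumed Bump_CHUNKHDRSZ: 0 normally, 16 bytes with checking */
    size_t      chunk_hdr = memory_context_checking ? 16 : 0;
    size_t      chunk_size;

    if (memory_context_checking)
        chunk_size = ALIGN8(size + 1);  /* room for the sentinel byte */
    else
        chunk_size = ALIGN8(size);

    return chunk_size + chunk_hdr;
}

int
main(void)
{
    /* 16-byte BlocktableEntry: 8-byte header + 8-byte offset bitmap */
    printf("without MEMORY_CONTEXT_CHECKING: %zu bytes\n",
           bump_required_size(16, 0));      /* prints 16 */
    printf("with MEMORY_CONTEXT_CHECKING:    %zu bytes\n",
           bump_required_size(16, 1));      /* prints 40 */
    return 0;
}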