On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john.nay...@enterprisedb.com> wrote:
>
> On Tue, Feb 28, 2023 at 10:09 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> >
> > On Tue, Feb 28, 2023 at 10:20 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> > >
> > > On Tue, Feb 28, 2023 at 3:42 PM John Naylor
> > > <john.nay...@enterprisedb.com> wrote:
> > > >
> > > > On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada
> > > > <sawada.m...@gmail.com> wrote:
> > > > >
> > > > > On Wed, Feb 22, 2023 at 6:55 PM John Naylor
> > > > > <john.nay...@enterprisedb.com> wrote:
> > > > > >
> > > > > > That doesn't seem useful to me. If we've done enough testing to
> > > > > > reassure us the new way always gives the same answer, the old way
> > > > > > is not needed at commit time. If there is any doubt it will always
> > > > > > give the same answer, then the whole patchset won't be committed.
> > > > >
> > > > > My idea is to make the bug investigation easier but on
> > > > > reflection, it seems not the best idea given this purpose.
> > > >
> > > > My concern with TIDSTORE_DEBUG is that it adds new code that mimics
> > > > the old tid array. As I've said, that doesn't seem like a good thing
> > > > to carry forward forevermore, in any form. Plus, comparing new code
> > > > with new code is not the same thing as comparing existing code with
> > > > new code. That was my idea upthread.
> > > >
> > > > Maybe the effort my idea requires is too much vs. the likelihood of
> > > > finding a problem. In any case, it's clear that if I want that level
> > > > of paranoia, I'm going to have to do it myself.
> > > >
> > > > > What do you think about the attached patch? Please note that it
> > > > > also includes the changes for minimum memory requirement.
> > > >
> > > > Most of the asserts look logical, or at least harmless.
> > > >
> > > > - int max_off; /* the maximum offset number */
> > > > + OffsetNumber max_off; /* the maximum offset number */
> > > >
> > > > I agree with using the specific type for offsets here, but I'm not
> > > > sure why this change belongs in this patch. If we decided against
> > > > the new asserts, this would be easy to lose.
> > >
> > > Right. I'll separate this change as a separate patch.
> > >
> > > > This change, however, defies common sense:
> > > >
> > > > +/*
> > > > + * The minimum amount of memory required by TidStore is 2MB, the current minimum
> > > > + * valid value for the maintenance_work_mem GUC. This is required to allocate the
> > > > + * DSA initial segment, 1MB, and some meta data. This number is applied also to
> > > > + * the local TidStore cases for simplicity.
> > > > + */
> > > > +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */
> > > >
> > > > + /* Sanity check for the max_bytes */
> > > > + if (max_bytes < TIDSTORE_MIN_MEMORY)
> > > > +     elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided",
> > > > +          TIDSTORE_MIN_MEMORY, max_bytes);
> > > >
> > > > Aside from the fact that this elog's something that would never get
> > > > past development, the #define just adds a hard-coded copy of
> > > > something that is already hard-coded somewhere else, whose size
> > > > depends on an implementation detail in a third place.
> > > >
> > > > This also assumes that all users of tid store are limited by
> > > > maintenance_work_mem. Andres thought of an example of some day
> > > > unifying with tidbitmap.c, and maybe other applications will be
> > > > limited by work_mem.
> > > >
> > > > But now that I'm looking at the guc tables, I am reminded that
> > > > work_mem's minimum is 64kB, so this highlights a design problem:
> > > > There is obviously no requirement that the minimum work_mem has to
> > > > be >= a single DSA segment, even though operations like parallel
> > > > hash and parallel bitmap heap scan are limited by work_mem.
> > >
> > > Right.
> > >
> > > > It would be nice to find out what happens with these parallel
> > > > features when work_mem is tiny (maybe parallelism is not even
> > > > considered?).
> > >
> > > IIUC both don't care about the allocated DSA segment size. Parallel
> > > hash accounts actual tuple (+ header) size as used memory but doesn't
> > > consider how much DSA segment is allocated behind. Both parallel hash
> > > and parallel bitmap scan can work even with work_mem = 64kB, but when
> > > checking the total DSA segment size allocated during these operations,
> > > it was 1MB.
> > >
> > > I realized that there is a similar memory limit design issue also on
> > > the non-shared tidstore cases. We deduct 70kB from max_bytes but it
> > > won't work fine with work_mem = 64kB. Probably we need to reconsider
> > > it. FYI 70kB comes from the maximum slab block size for node256.
> >
> > Currently, we calculate the slab block size enough to allocate 32
> > chunks from there. For node256, the leaf node is 2,088 bytes and the
> > slab block size is 66,816 bytes. One idea to fix this issue to
> > decrease it.
>
> I think we're trying to solve the wrong problem here. I need to study this
> more, but it seems that code that needs to stay within a memory limit only
> needs to track what's been allocated in chunks within a block, since
> writing there is what invokes a page fault.
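For concreteness, the kind of block-level check I understand this to mean
might look roughly like the sketch below. tidstore_is_full(), rt_context,
and max_bytes are hypothetical names just for illustration, not code in the
patch, and the shared case would need an equivalent way of asking the
dsa_area for its total allocated size:

#include "postgres.h"
#include "utils/memutils.h"

/*
 * Sketch: decide whether the TID store should stop accepting items by
 * asking the memory context how much block space it has allocated,
 * rather than tracking every chunk.  This can overshoot max_bytes by
 * at most the last allocated block.
 */
static bool
tidstore_is_full(MemoryContext rt_context, size_t max_bytes)
{
	return MemoryContextMemAllocated(rt_context, true) > max_bytes;
}

If that behavior is acceptable, the caller would simply stop adding TIDs
once this returns true.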
Right. I think we've discussed before what to use for calculating the
*used* memory amount, but I don't remember the conclusion. I was confused
by the fact that we use different approaches to calculate the amount of
used memory: parallel hash and tidbitmap count the allocated chunk sizes,
whereas hash_agg_check_limits() in nodeAgg.c uses
MemoryContextMemAllocated(), which counts the allocated block sizes.

> If we're not keeping track of each and every chunk space, for speed, it
> doesn't follow that we need to keep every block allocation within the
> configured limit. I'm guessing we can just ask the context if the block
> space has gone *over* the limit, and we can assume that the last
> allocation we perform will only fault one additional page. We need to
> have a clear answer on this before doing anything else.
>
> If that's correct, and I'm not positive yet, we can get rid of all the
> fragile assumptions about things the tid store has no business knowing
> about, as well as the guc change.

True.

> I'm not sure how this affects progress reporting, because it would be
> nice if it didn't report dead_tuple_bytes bigger than
> max_dead_tuple_bytes.

Yes, the progress reporting could be confusing. In particular, in shared
tidstore cases, dead_tuple_bytes could end up much bigger than
max_dead_tuple_bytes.

What we probably need are functions for MemoryContext and dsa_area that
return the amount of memory that has been allocated, without tracking
every chunk space. For example, such functions could work the way
SlabStats() does: iterate over every block and calculate the total/free
memory usage.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com