On Fri, Mar 3, 2023 at 8:04 PM John Naylor <john.nay...@enterprisedb.com> wrote:
>
> On Wed, Mar 1, 2023 at 6:59 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> >
> > On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john.nay...@enterprisedb.com> wrote:
> > >
> > > I think we're trying to solve the wrong problem here. I need to study
> > > this more, but it seems that code that needs to stay within a memory
> > > limit only needs to track what's been allocated in chunks within a block,
> > > since writing there is what invokes a page fault.
> >
> > Right. I guess we've discussed what we use for calculating the *used*
> > memory amount but I don't remember.
> >
> > I think I was confused by the fact that we use some different
> > approaches to calculate the amount of used memory. Parallel hash and
> > tidbitmap use the allocated chunk size whereas hash_agg_check_limits()
> > in nodeAgg.c uses MemoryContextMemAllocated(), which uses the
> > allocated block size.
>
> That's good to know. The latter says:
>
>     * After adding a new group to the hash table, check whether we need to enter
>     * spill mode. Allocations may happen without adding new groups (for instance,
>     * if the transition state size grows), so this check is imperfect.
>
> I'm willing to claim that vacuum can be imperfect also, given the tid store's
> properties: 1) on average much more efficient in used space, and 2) no longer
> bound by the 1GB limit.
>
> > > I'm not sure how this affects progress reporting, because it would be
> > > nice if it didn't report dead_tuple_bytes bigger than
> > > max_dead_tuple_bytes.
> >
> > Yes, the progress reporting could be confusable. Particularly, in
> > shared tidstore cases, the dead_tuple_bytes could be much bigger than
> > max_dead_tuple_bytes. Probably what we need might be functions for
> > MemoryContext and dsa_area to get the amount of memory that has been
> > allocated, by not tracking every chunk space. For example, the
> > functions would be like what SlabStats() does; iterate over every
> > block and calculates the total/free memory usage.
>
> I'm not sure we need to invent new infrastructure for this. Looking at v29 in
> vacuumlazy.c, the order of operations for memory accounting is:
>
> First, get the block-level space -- stop and vacuum indexes if we exceed the
> limit:
>
>     /*
>      * Consider if we definitely have enough space to process TIDs on page
>      * already. If we are close to overrunning the available space for
>      * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
>      * this page.
>      */
>     if (TidStoreIsFull(vacrel->dead_items))
>
> --> which is basically "if (TidStoreMemoryUsage(ts) > ts->control->max_bytes)"
>
> Then, after pruning the current page, store the tids and then get the
> block-level space again:
>
>     else if (prunestate.num_offsets > 0)
>     {
>         /* Save details of the LP_DEAD items from the page in dead_items */
>         TidStoreSetBlockOffsets(...);
>
>         pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
>                                      TidStoreMemoryUsage(dead_items));
>     }
>
> Since the block-level measurement is likely overestimating quite a bit, I
> propose to simply reverse the order of the actions here, effectively
> reporting progress for the *last page* and not the current one: First update
> progress with the current memory usage, then add tids for this page. If this
> allocated a new block, only a small bit of that will be written to. If this
> block pushes it over the limit, we will detect that up at the top of the
> loop. It's kind of like our earlier attempts at a "fudge factor", but simpler
> and less brittle. And, as far as OS pages we have actually written to, I
> think it'll effectively respect the memory limit, at least in the local mem
> case. And the numbers will make sense.
>
> Thoughts?
It looks like it works, but it still doesn't work in the case where a shared
tidstore is created with a 64kB memory limit, right? TidStoreMemoryUsage()
returns 1MB and TidStoreIsFull() returns true from the beginning.

BTW I realized that since the caller can pass a dsa_area to the tidstore (and
radix tree), if other data are allocated in the same DSA area,
TidStoreMemoryUsage() (and RT_MEMORY_USAGE()) returns memory usage that
includes not only the tidstore itself but also the other data. It's probably
better to document that the passed dsa_area should be dedicated to a tidstore
(or a radix tree).

> But now that I'm looking more closely at the details of memory accounting, I
> don't like that TidStoreMemoryUsage() is called twice per page pruned (see
> above). Maybe it wouldn't noticeably slow things down, but it's a bit sloppy.
> It seems like we should call it once per loop and save the result somewhere.
> If that's the right way to go, that possibly indicates that TidStoreIsFull()
> is not a useful interface, at least in this form.

Agreed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com