On Fri, Mar 3, 2023 at 8:04 PM John Naylor <john.nay...@enterprisedb.com> wrote:
>
> On Wed, Mar 1, 2023 at 6:59 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> >
> > On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john.nay...@enterprisedb.com> wrote:
> > >
> > > I think we're trying to solve the wrong problem here. I need to study
> > > this more, but it seems that code that needs to stay within a memory
> > > limit only needs to track what's been allocated in chunks within a block,
> > > since writing there is what invokes a page fault.
> >
> > Right. I guess we've discussed what we use for calculating the *used*
> > memory amount but I don't remember.
> >
> > I think I was confused by the fact that we use some different
> > approaches to calculate the amount of used memory. Parallel hash and
> > tidbitmap use the allocated chunk size whereas hash_agg_check_limits()
> > in nodeAgg.c uses MemoryContextMemAllocated(), which uses the
> > allocated block size.
>
> That's good to know. The latter says:
>
>     * After adding a new group to the hash table, check whether we need to enter
>     * spill mode. Allocations may happen without adding new groups (for instance,
>     * if the transition state size grows), so this check is imperfect.
>
> I'm willing to claim that vacuum can be imperfect also, given the tid store's
> properties: 1) on average much more efficient in used space, and 2) no longer
> bound by the 1GB limit.
>
> > > I'm not sure how this affects progress reporting, because it would be
> > > nice if it didn't report dead_tuple_bytes bigger than
> > > max_dead_tuple_bytes.
> >
> > Yes, the progress reporting could be confusable. Particularly, in
> > shared tidstore cases, the dead_tuple_bytes could be much bigger than
> > max_dead_tuple_bytes. Probably what we need might be functions for
> > MemoryContext and dsa_area to get the amount of memory that has been
> > allocated, by not tracking every chunk space. For example, the
> > functions would be like what SlabStats() does; iterate over every
> > block and calculates the total/free memory usage.
>
> I'm not sure we need to invent new infrastructure for this. Looking at v29 in
> vacuumlazy.c, the order of operations for memory accounting is:
>
> First, get the block-level space -- stop and vacuum indexes if we exceed the
> limit:
>
>     /*
>      * Consider if we definitely have enough space to process TIDs on page
>      * already. If we are close to overrunning the available space for
>      * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
>      * this page.
>      */
>     if (TidStoreIsFull(vacrel->dead_items))
>
> --> which is basically "if (TidStoreMemoryUsage(ts) > ts->control->max_bytes)"
>
> Then, after pruning the current page, store the tids and then get the
> block-level space again:
>
>     else if (prunestate.num_offsets > 0)
>     {
>         /* Save details of the LP_DEAD items from the page in dead_items */
>         TidStoreSetBlockOffsets(...);
>
>         pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
>                                      TidStoreMemoryUsage(dead_items));
>     }
>
> Since the block-level measurement is likely overestimating quite a bit, I
> propose to simply reverse the order of the actions here, effectively
> reporting progress for the *last page* and not the current one: First update
> progress with the current memory usage, then add tids for this page. If this
> allocated a new block, only a small bit of that will be written to. If this
> block pushes it over the limit, we will detect that up at the top of the
> loop. It's kind of like our earlier attempts at a "fudge factor", but simpler
> and less brittle. And, as far as OS pages we have actually written to, I
> think it'll effectively respect the memory limit, at least in the local mem
> case. And the numbers will make sense.
>
> Thoughts?
It looks like it works, but it still doesn't work in the case where a shared
tidstore is created with a 64kB memory limit, right? TidStoreMemoryUsage()
returns 1MB and TidStoreIsFull() returns true from the beginning.

BTW I realized that since the caller can pass a dsa_area to the tidstore (and
radix tree), if other data are allocated in the same DSA area,
TidStoreMemoryUsage() (and RT_MEMORY_USAGE()) returns memory usage that
includes not only the tidstore itself but also the other data. It's probably
better to document that the passed dsa_area should be dedicated to a tidstore
(or a radix tree).

> But now that I'm looking more closely at the details of memory accounting, I
> don't like that TidStoreMemoryUsage() is called twice per page pruned (see
> above). Maybe it wouldn't noticeably slow things down, but it's a bit sloppy.
> It seems like we should call it once per loop and save the result somewhere.
> If that's the right way to go, that possibly indicates that TidStoreIsFull()
> is not a useful interface, at least in this form.

Agreed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com