On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john.nay...@enterprisedb.com> wrote:
>
> On Tue, Feb 28, 2023 at 10:09 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> >
> > On Tue, Feb 28, 2023 at 10:20 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> > >
> > > On Tue, Feb 28, 2023 at 3:42 PM John Naylor
> > > <john.nay...@enterprisedb.com> wrote:
> > > >
> > > > On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada
> > > > <sawada.m...@gmail.com> wrote:
> > > > >
> > > > > On Wed, Feb 22, 2023 at 6:55 PM John Naylor
> > > > > <john.nay...@enterprisedb.com> wrote:
> > > > > >
> > > > > > That doesn't seem useful to me. If we've done enough testing to
> > > > > > reassure us the new way always gives the same answer, the old way
> > > > > > is not needed at commit time. If there is any doubt it will always
> > > > > > give the same answer, then the whole patchset won't be committed.
> > > > >
> > > > > My idea is to make the bug investigation easier but on
> > > > > reflection, it seems not the best idea given this purpose.
> > > >
> > > > My concern with TIDSTORE_DEBUG is that it adds new code that mimics
> > > > the old tid array. As I've said, that doesn't seem like a good thing
> > > > to carry forward forevermore, in any form. Plus, comparing new code
> > > > with new code is not the same thing as comparing existing code with
> > > > new code. That was my idea upthread.
> > > >
> > > > Maybe the effort my idea requires is too much vs. the likelihood of
> > > > finding a problem. In any case, it's clear that if I want that level
> > > > of paranoia, I'm going to have to do it myself.
> > > >
> > > > > What do you think about the attached patch? Please note that it
> > > > > also includes the changes for minimum memory requirement.
> > > >
> > > > Most of the asserts look logical, or at least harmless.
> > > >
> > > > - int max_off; /* the maximum offset number */
> > > > + OffsetNumber max_off; /* the maximum offset number */
> > > >
> > > > I agree with using the specific type for offsets here, but I'm not
> > > > sure why this change belongs in this patch. If we decided against
> > > > the new asserts, this would be easy to lose.
> > >
> > > Right. I'll separate this change as a separate patch.
> > >
> > > > This change, however, defies common sense:
> > > >
> > > > +/*
> > > > + * The minimum amount of memory required by TidStore is 2MB, the current minimum
> > > > + * valid value for the maintenance_work_mem GUC. This is required to allocate the
> > > > + * DSA initial segment, 1MB, and some meta data. This number is applied also to
> > > > + * the local TidStore cases for simplicity.
> > > > + */
> > > > +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */
> > > >
> > > > + /* Sanity check for the max_bytes */
> > > > + if (max_bytes < TIDSTORE_MIN_MEMORY)
> > > > +     elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided",
> > > > +          TIDSTORE_MIN_MEMORY, max_bytes);
> > > >
> > > > Aside from the fact that this elog's something that would never get
> > > > past development, the #define just adds a hard-coded copy of
> > > > something that is already hard-coded somewhere else, whose size
> > > > depends on an implementation detail in a third place.
> > > >
> > > > This also assumes that all users of tid store are limited by
> > > > maintenance_work_mem. Andres thought of an example of some day
> > > > unifying with tidbitmap.c, and maybe other applications will be
> > > > limited by work_mem.
> > > >
> > > > But now that I'm looking at the guc tables, I am reminded that
> > > > work_mem's minimum is 64kB, so this highlights a design problem:
> > > > There is obviously no requirement that the minimum work_mem has to
> > > > be >= a single DSA segment, even though operations like parallel
> > > > hash and parallel bitmap heap scan are limited by work_mem.
> > >
> > > Right.
> > >
> > > > It would be nice to find out what happens with these parallel
> > > > features when work_mem is tiny (maybe parallelism is not even
> > > > considered?).
> > >
> > > IIUC both don't care about the allocated DSA segment size. Parallel
> > > hash accounts actual tuple (+ header) size as used memory but doesn't
> > > consider how much DSA segment is allocated behind. Both parallel hash
> > > and parallel bitmap scan can work even with work_mem = 64kB, but when
> > > checking the total DSA segment size allocated during these operations,
> > > it was 1MB.
> > >
> > > I realized that there is a similar memory limit design issue also on
> > > the non-shared tidstore cases. We deduct 70kB from max_bytes but it
> > > won't work fine with work_mem = 64kB. Probably we need to reconsider
> > > it. FYI 70kB comes from the maximum slab block size for node256.
> >
> > Currently, we calculate the slab block size enough to allocate 32
> > chunks from there. For node256, the leaf node is 2,088 bytes and the
> > slab block size is 66,816 bytes. One idea to fix this issue to
> > decrease it.
>
> I think we're trying to solve the wrong problem here. I need to study this
> more, but it seems that code that needs to stay within a memory limit only
> needs to track what's been allocated in chunks within a block, since
> writing there is what invokes a page fault.
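For concreteness, the kind of block-level check I understand this to mean
might look roughly like the sketch below. tidstore_is_full(), rt_context,
and max_bytes are hypothetical names just for illustration, not code in the
patch, and the shared case would need an equivalent way of asking the
dsa_area for its total allocated size:

#include "postgres.h"
#include "utils/memutils.h"

/*
 * Sketch: decide whether the TID store should stop accepting items by
 * asking the memory context how much block space it has allocated,
 * rather than tracking every chunk.  This can overshoot max_bytes by
 * at most the last allocated block.
 */
static bool
tidstore_is_full(MemoryContext rt_context, size_t max_bytes)
{
	return MemoryContextMemAllocated(rt_context, true) > max_bytes;
}

If that behavior is acceptable, the caller would simply stop adding TIDs
once this returns true.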
Right. I think we've discussed before what to use for calculating the
*used* memory amount, but I don't remember the conclusion. I was confused
by the fact that we use different approaches to calculate the amount of
used memory: parallel hash and tidbitmap count the allocated chunk sizes,
whereas hash_agg_check_limits() in nodeAgg.c uses
MemoryContextMemAllocated(), which counts the allocated block sizes.

> If we're not keeping track of each and every chunk space, for speed, it
> doesn't follow that we need to keep every block allocation within the
> configured limit. I'm guessing we can just ask the context if the block
> space has gone *over* the limit, and we can assume that the last
> allocation we perform will only fault one additional page. We need to
> have a clear answer on this before doing anything else.
>
> If that's correct, and I'm not positive yet, we can get rid of all the
> fragile assumptions about things the tid store has no business knowing
> about, as well as the guc change.

True.

> I'm not sure how this affects progress reporting, because it would be
> nice if it didn't report dead_tuple_bytes bigger than
> max_dead_tuple_bytes.

Yes, the progress reporting could be confusing. In particular, in shared
tidstore cases, dead_tuple_bytes could end up much bigger than
max_dead_tuple_bytes.

What we probably need are functions for MemoryContext and dsa_area that
return the amount of memory that has been allocated, without tracking
every chunk space. For example, such functions could work the way
SlabStats() does: iterate over every block and calculate the total/free
memory usage.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com