On Fri, Dec 15, 2023 at 10:30 AM John Naylor <johncnaylo...@gmail.com> wrote:
>
> On Thu, Dec 14, 2023 at 7:22 AM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> > In v45, 0001 - 0006 are from earlier versions but I've merged previous
> > updates. So the radix tree now has RT_SET() and RT_FIND() but not
> > RT_GET() and RT_SEARCH(). 0007 and 0008 are the updates from previous
> > versions that incorporated the above comments. 0009 patch integrates
> > tidstore with lazy vacuum.
>
> Excellent! I repeated a quick run of the small "test 1" with very low m_w_m
> from
>
> https://www.postgresql.org/message-id/CAFBsxsHrvTPUK%3DC1%3DxweJjGujja4Xjfgva3C8jnW3Shz6RBnFg%40mail.gmail.com
>
> ...and got similar results, so we still have good space-efficiency on this
> test:
>
> master:
> INFO: finished vacuuming "john.public.test": index scans: 9
> system usage: CPU: user: 56.83 s, system: 9.36 s, elapsed: 119.62 s
>
> v45:
> INFO: finished vacuuming "john.public.test": index scans: 1
> system usage: CPU: user: 6.82 s, system: 2.05 s, elapsed: 10.89 s

Thank you for testing it again. That's a very good result.

> For my next steps, I will finish the node-shrinking behavior and save
> for a later patchset. Not needed for tid store, but needs to happen
> because of assumptions in the code. Also, some time ago, I think I
> commented out RT_FREE_RECURSE to get something working, so I'll fix
> it, and look at other fixmes and todos.

Great!

> > Note that DSA segment problem is not
> > resolved yet in this patch.
>
> I remember you started a separate thread about this, but I don't think
> it got any attention. Maybe reply with a "TLDR;" and share a patch to
> allow controlling max segment size.

Yeah, I remember that thread. I'll send a reply.

> Some more comments:
>
> v45-0003:
>
> Since RT_ITERATE_NEXT_PTR works for tid store, do we even need
> RT_ITERATE_NEXT anymore? The former should handle fixed-length values
> just fine? If so, we should rename it to match the latter.

Agreed, I'll rename it.

> + * The caller is responsible for locking/unlocking the tree in shared mode.
>
> This is not new to v45, but this will come up again below. This needs
> more explanation: Since we're returning a pointer (to support
> variable-length values), the caller needs to maintain control until
> it's finished with the value.

Will fix.
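To make the contract clearer, I'm thinking of adding a short usage
sketch to the comment, along these lines (tree_lock, rt_find(), and
use_value() are placeholder names, not the actual ones in the patch):

/*
 * The returned pointer points into the tree, so the caller must hold
 * the tree's lock in shared mode until it is finished with the value.
 */
LWLockAcquire(tree_lock, LW_SHARED);

value = rt_find(tree, key);
if (value != NULL)
    use_value(value);    /* safe: the shared lock is still held */

LWLockRelease(tree_lock);

/* "value" must not be dereferenced after the lock is released */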
> v45-0005:
>
> + * Regarding the concurrency support, we use a single LWLock for the
> TidStore.
> + * The TidStore is exclusively locked when inserting encoded tids to the
> + * radix tree or when resetting itself. When searching on the TidStore or
> + * doing the iteration, it is not locked but the underlying radix tree is
> + * locked in shared mode.
>
> This is just stating facts without giving any reasons. Readers are
> going to wonder why it's inconsistent. The "why" is much more
> important than the "what". Even with that, this comment is also far
> from the relevant parts, and so will get out of date. Maybe we can
> just make sure each relevant function is explained individually.

Right, I'll fix it.

> v45-0007:
>
> -RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
> +RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, Size work_mem);
>
> Tid store calls this max_bytes -- can we use that name here, too?
> "work_mem" is highly specific.

While I agree that "work_mem" is highly specific, I avoided using
"max_bytes" in the radix tree because "max_bytes" sounds like there is
a memory limit, which the radix tree doesn't actually enforce. It
might be sufficient to mention that in a comment, though.

> - RT_PTR_ALLOC *slot;
> + RT_PTR_ALLOC *slot = NULL;
>
> We have a macro for invalid pointer because of DSA.

Will fix.

> v45-0008:
>
> - if (off < 1 || off > MAX_TUPLES_PER_PAGE)
> + if (unlikely(off < 1 || off > MAX_TUPLES_PER_PAGE))
>     elog(ERROR, "tuple offset out of range: %u", off);
>
> This is a superfluous distraction, since the error path is located way
> off in the cold segment of the binary.

Okay, will remove it.

> v45-0009:
>
> (just a few small things for now)
>
> - * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
> - *    vacrel->dead_items array.
> + * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
>
> I think we can keep as "listed in the TID store".
>
> - * Allocate dead_items (either using palloc, or in dynamic shared memory).
> - * Sets dead_items in vacrel for caller.
> + * Allocate a (local or shared) TidStore for storing dead TIDs. Sets
> dead_items
> + * in vacrel for caller.
>
> I think we want to keep "in dynamic shared memory". It's still true.
> I'm not sure anything needs to change here, actually.

Agreed with the above comments. Will fix them.

>  parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
> -            int nrequested_workers, int max_items,
> -            int elevel, BufferAccessStrategy bstrategy)
> +            int nrequested_workers, int vac_work_mem,
> +            int max_offset, int elevel,
> +            BufferAccessStrategy bstrategy)
>
> It seems very strange to me that this function has to pass the
> max_offset. In general, it's been simpler to assume we have a constant
> max_offset, but in this case that fact is not helping. Something to
> think about for later.

max_offset was previously used in the old TID encoding in tidstore.
Since tidstore now has an entry per block, I think we no longer need
it.

> - (errmsg("scanned index \"%s\" to remove %d row versions",
> + (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
>
> This should be signed int64.

Will fix.
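Since the message is translatable, I'd use %lld with an explicit cast
to long long rather than splicing INT64_FORMAT into the format string,
per the usual convention for printing int64 in errmsg(). Roughly like
this (the counter name below is just a stand-in):

ereport(elevel,
        (errmsg("scanned index \"%s\" to remove %lld row versions",
                vacrel->indname,
                (long long) num_dead_items)));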
> v45-0010:
>
> Thinking about this some more, I'm not sure we need to do anything
> different for the *starting* segment size. (Controlling *max* size
> does seem important, however.) For the corner case of m_w_m = 1MB,
> it's fine if vacuum quits pruning immediately after (in effect) it
> finds the DSA has gone to 2MB. It's not worth bothering with, IMO. If
> the memory accounting starts >1MB because we're adding the trivial
> size of some struct, let's just stop doing that. The segment
> allocations are what we care about.

IIUC it's for work_mem, whose minimum value is 64kB.

> v45-0011:
>
> + /*
> +  * max_bytes is forced to be at least 64kB, the current minimum valid
> +  * value for the work_mem GUC.
> +  */
> + max_bytes = Max(64 * 1024L, max_bytes);
>
> Why?

This is to avoid creating a radix tree with a very small amount of
memory. The minimum work_mem value is a reasonable lower bound that
PostgreSQL uses internally. It's actually copied from tuplesort.c.

> I believe I mentioned months ago that copying a hard-coded value
> that can get out of sync is not maintainable, but I don't even see the
> point of this part.

True.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com