On Wed, 9 Sep 2020 at 05:38, Thomas Munro <thomas.mu...@gmail.com> wrote:
>
> On Wed, Sep 9, 2020 at 3:47 AM David Rowley <dgrowle...@gmail.com> wrote:
> > On Tue, 8 Sep 2020 at 12:08, Thomas Munro <thomas.mu...@gmail.com> wrote:
> > > One thought is that if we're going to copy everything out and back in
> > > again, we might want to consider doing it in a
> > > memory-prefetcher-friendly order.  Would it be a good idea to
> > > rearrange the tuples to match line pointer order, so that the copying
> > > work and also later sequential scans are in a forward direction?
> >
> > That's an interesting idea but wouldn't that require both the copy to
> > the separate buffer *and* a qsort? That's the worst of both
> > implementations. We'd need some other data structure too in order to
> > get the index of the sorted array by reverse lineitem point, which
> > might require an additional array and an additional sort.
>
> Well I may not have had enough coffee yet but I thought you'd just
> have to spin through the item IDs twice.  Once to compute sum(lp_len)
> so you can compute the new pd_upper, and the second time to copy the
> tuples from their random locations on the temporary page to new
> sequential locations, so that afterwards item ID order matches offset
> order.
I think you were adequately caffeinated.  You're right that this is
fairly simple to do, but it looks even simpler than looping over the
array twice.  I think it's just a matter of looping over the
itemidbase backwards and putting the higher itemid tuples at the end
of the page.  I've done it this way in the attached patch.

I also added a presorted path which falls back on doing memmoves
without the temp buffer when the itemidbase array indicates that
higher lineitems all have higher offsets.  I'm doing the presort
check in the calling function since that loops over the lineitems
already.  When the tuples are already in that order we can just
memmove them in reverse order without overwriting any yet-to-be-moved
tuples.

Also, I added code to collapse the memcpy and memmove calls for
adjacent tuples so that we perform the minimal number of calls to
those functions.  Once we've previously compacted a page it seems
that the code is able to reduce the number of calls significantly.  I
added some logging and reviewed it after a run of the benchmark and
saw that for about 192 tuples we're mostly doing just 3-4 memcpys in
the non-presorted path and just 2 memmoves in the presorted code
path.

I also found that in my test the presorted path was only taken 12.39%
of the time.  Trying with 120 million UPDATEs instead of 12 million
in the test ended up reducing this to just 10.89%.  It seems that
it'll just be 1 or 2 tuples spoiling it, since new tuples will still
be added earlier in the page after we free up space to add more.

I also experimented with what would happen if I tried to collapse the
memcpys for copying to the temp buffer too.  The performance got a
little worse from doing that, so I left that code #ifdef'd out.

With the attached v3, performance is better.  The test now runs
recovery in 65.6 seconds, vs master's 148.5 seconds.  So about 2.2x
faster.

We should probably consider what else can be done to try to write
pages with tuples for earlier lineitems earlier in the page.  VACUUM
FULL and friends will switch back to the opposite order when
rewriting the heap.

Also fixed my missing libc debug symbols:

  24.90%  postgres  postgres             [.] PageRepairFragmentation
  15.26%  postgres  libc-2.31.so         [.] __memmove_avx_unaligned_erms
   9.61%  postgres  postgres             [.] hash_search_with_hash_value
   8.03%  postgres  postgres             [.] compactify_tuples
   6.25%  postgres  postgres             [.] XLogReadBufferExtended
   3.74%  postgres  postgres             [.] PinBuffer
   2.25%  postgres  postgres             [.] hash_bytes
   1.79%  postgres  postgres             [.] heap_xlog_update
   1.47%  postgres  postgres             [.] LWLockRelease
   1.44%  postgres  postgres             [.] XLogReadRecord
   1.33%  postgres  postgres             [.] PageGetHeapFreeSpace
   1.16%  postgres  postgres             [.] DecodeXLogRecord
   1.13%  postgres  postgres             [.] pg_comp_crc32c_sse42
   1.12%  postgres  postgres             [.] LWLockAttemptLock
   1.09%  postgres  postgres             [.] StartupXLOG
   0.90%  postgres  postgres             [.] ReadBuffer_common
   0.84%  postgres  postgres             [.] SlruSelectLRUPage
   0.74%  postgres  libc-2.31.so         [.] __memcmp_avx2_movbe
   0.68%  postgres  [kernel.kallsyms]    [k] copy_user_generic_string
   0.66%  postgres  postgres             [.] PageAddItemExtended
   0.66%  postgres  postgres             [.] PageIndexTupleOverwrite
   0.62%  postgres  postgres             [.] smgropen
   0.60%  postgres  postgres             [.] ReadPageInternal
   0.57%  postgres  postgres             [.] GetPrivateRefCountEntry
   0.52%  postgres  postgres             [.] heap_redo
   0.51%  postgres  postgres             [.] AdvanceNextFullTransactionIdPastXid

David
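To show the shape of the presorted path's adjacent-copy coalescing on
its own, here's a minimal standalone sketch.  The ItemSketch type,
compact_presorted() and the fixed-size demo page are simplified
stand-ins for illustration only, not the bufpage.c structures the
patch below actually uses:

    /*
     * Sketch of the presorted compaction path: items are sorted by
     * ascending offset, so we can walk them backwards and pack their
     * data against the end of the page, collapsing runs of adjacent
     * tuples into a single memmove.
     */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    typedef struct
    {
        uint16_t itemoff;     /* current offset of the tuple data in "page" */
        uint16_t alignedlen;  /* length of the tuple data */
    } ItemSketch;

    static void
    compact_presorted(char *page, size_t page_end, ItemSketch *items,
                      int nitems, uint16_t *newoff)
    {
        size_t  upper = page_end;
        size_t  copy_tail, copy_head;
        int     i;

        if (nitems == 0)
            return;

        copy_tail = copy_head = (size_t) items[nitems - 1].itemoff +
            items[nitems - 1].alignedlen;

        for (i = nitems - 1; i >= 0; i--)
        {
            /* a gap before this tuple ends the current run; flush it */
            if (copy_head != (size_t) items[i].itemoff + items[i].alignedlen)
            {
                memmove(page + upper, page + copy_head, copy_tail - copy_head);
                copy_tail = (size_t) items[i].itemoff + items[i].alignedlen;
            }
            upper -= items[i].alignedlen;
            copy_head = items[i].itemoff;
            newoff[i] = (uint16_t) upper;    /* new offset for this item */
        }

        /* flush the final (lowest) run */
        memmove(page + upper, page + copy_head, copy_tail - copy_head);
    }

    int
    main(void)
    {
        char        page[256];
        ItemSketch  items[] = {{100, 16}, {116, 16}, {164, 32}};
        uint16_t    newoff[3];

        memset(page, 0, sizeof(page));
        memset(page + 100, 'A', 16);
        memset(page + 116, 'B', 16);
        memset(page + 164, 'C', 32);

        compact_presorted(page, sizeof(page), items, 3, newoff);

        printf("new offsets: %u %u %u\n",
               (unsigned) newoff[0], (unsigned) newoff[1], (unsigned) newoff[2]);
        return 0;
    }

With that layout the 'C' tuple is moved on its own and the adjacent
'A' and 'B' tuples are collapsed into a single memmove, which is the
call-count reduction described above.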
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index d708117a40..339dfc0763 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -411,51 +411,187 @@ PageRestoreTempPage(Page tempPage, Page oldPage)
 }
 
 /*
- * sorting support for PageRepairFragmentation and PageIndexMultiDelete
+ * Item pruning support for PageRepairFragmentation and PageIndexMultiDelete
  */
-typedef struct itemIdSortData
+typedef struct itemIdCompactData
 {
     uint16      offsetindex;    /* linp array index */
     int16       itemoff;        /* page offset of item data */
     uint16      alignedlen;     /* MAXALIGN(item data len) */
-} itemIdSortData;
-typedef itemIdSortData *itemIdSort;
-
-static int
-itemoffcompare(const void *itemidp1, const void *itemidp2)
-{
-    /* Sort in decreasing itemoff order */
-    return ((itemIdSort) itemidp2)->itemoff -
-        ((itemIdSort) itemidp1)->itemoff;
-}
+} itemIdCompactData;
+typedef itemIdCompactData *itemIdCompact;
 
 /*
  * After removing or marking some line pointers unused, move the tuples to
  * remove the gaps caused by the removed items.
+ *
+ * Callers may pass 'presorted' as true if the itemidbase array is sorted in
+ * ascending order of itemoff.  This allows a slightly more optimal code path
+ * to be taken.
  */
 static void
-compactify_tuples(itemIdSort itemidbase, int nitems, Page page)
+compactify_tuples(itemIdCompact itemidbase, int nitems, Page page, bool presorted)
 {
     PageHeader  phdr = (PageHeader) page;
     Offset      upper;
     int         i;
 
-    /* sort itemIdSortData array into decreasing itemoff order */
-    qsort((char *) itemidbase, nitems, sizeof(itemIdSortData),
-          itemoffcompare);
-
-    upper = phdr->pd_special;
-    for (i = 0; i < nitems; i++)
+    if (presorted)
     {
-        itemIdSort  itemidptr = &itemidbase[i];
-        ItemId      lp;
+        Offset      copy_tail;
+        Offset      copy_head;
+        itemIdCompact itemidptr;
+
+        /*
+         * The order of lineitem offsets is already in the optimal order, i.e,
+         * lower item pointers have a lower offset.  This allows us to move
+         * the tuples up to the end of the heap in reverse order without
+         * having to worry about overwriting memory of other tuples during
+         * the move operation.
+         */
+
+        /*
+         * Double check the caller didn't mess up while setting the presorted
+         * flag to true.
+         */
+#ifdef USE_ASSERT_CHECKING
+        {
+            Offset      lastoff = 0;
+
+            for (i = 0; i < nitems; i++)
+            {
+                itemidptr = &itemidbase[i];
+                if (lastoff > itemidptr->itemoff)
+                    Assert(false);
+                lastoff = itemidptr->itemoff;
+            }
+        }
+#endif                          /* USE_ASSERT_CHECKING */
 
-        lp = PageGetItemId(page, itemidptr->offsetindex + 1);
-        upper -= itemidptr->alignedlen;
+        /*
+         * Do the tuple compactification.  Collapse memmove calls for
+         * adjacent tuples.
+         */
+        upper = phdr->pd_special;
+        copy_tail = copy_head = itemidbase[nitems - 1].itemoff + itemidbase[nitems - 1].alignedlen;
+        for (i = nitems - 1; i >= 0; i--)
+        {
+            ItemId      lp;
+
+            itemidptr = &itemidbase[i];
+            lp = PageGetItemId(page, itemidptr->offsetindex + 1);
+
+            if (copy_head != itemidptr->itemoff + itemidptr->alignedlen)
+            {
+                memmove((char *) page + upper,
+                        page + copy_head,
+                        copy_tail - copy_head);
+                copy_tail = itemidptr->itemoff + itemidptr->alignedlen;
+            }
+            upper -= itemidptr->alignedlen;
+            copy_head = itemidptr->itemoff;
+
+            lp->lp_off = upper;
+
+        }
+
+        /* move the remaining chunk */
         memmove((char *) page + upper,
-                (char *) page + itemidptr->itemoff,
-                itemidptr->alignedlen);
-        lp->lp_off = upper;
+                page + copy_head,
+                copy_tail - copy_head);
+    }
+    else
+    {
+        Offset      copy_tail;
+        Offset      copy_head;
+        itemIdCompact itemidptr;
+        PGAlignedBlock scratch;
+        char       *scratchptr = scratch.data;
+
+        /*
+         * The tuples in the itemidbase array may be in any order so in order
+         * to move these to the end of the page we must make a temp copy of
+         * each tuple before we copy them back into the page at the new
+         * position.
+         *
+         * If a large number of the tuples have been pruned (>75%) then we'll
+         * copy these into the temp buffer tuple-by-tuple, otherwise, we'll
+         * just copy the entire tuple section of the page in a single
+         * memcpy().  Doing this saves us doing large copies when we're
+         * pruning most of the tuples.
+         */
+        if (nitems < PageGetMaxOffsetNumber(page) / 4)
+        {
+            for (i = 0; i < nitems; i++)
+            {
+                itemIdCompact itemidptr = &itemidbase[i];
+
+                memcpy(scratchptr + itemidptr->itemoff, page + itemidptr->itemoff,
+                       itemidptr->alignedlen);
+            }
+        }
+        else
+        {
+#ifdef NOTUSED
+            /*
+             * Try to do the minimal amount of copying to get the required
+             * tuples into the temp buffer.
+             */
+            copy_tail = copy_head = itemidbase[0].itemoff + itemidbase[0].alignedlen;
+            for (i = 0; i < nitems; i++)
+            {
+                itemidptr = &itemidbase[i];
+
+                if (copy_head != itemidptr->itemoff + itemidptr->alignedlen)
+                {
+                    memcpy((char *) scratchptr + copy_head,
+                           page + copy_head,
+                           copy_tail - copy_head);
+                    copy_tail = itemidptr->itemoff + itemidptr->alignedlen;
+                }
+                copy_head = itemidptr->itemoff;
+            }
+
+            /* Copy the remaining chunk */
+            memcpy((char *) scratchptr + copy_head,
+                   page + copy_head,
+                   copy_tail - copy_head);
+#else
+            /* Copy the entire tuple space */
+            memcpy(scratchptr + phdr->pd_upper,
+                   page + phdr->pd_upper,
+                   phdr->pd_special - phdr->pd_upper);
+#endif
+        }
+
+        /*
+         * Do the tuple compactification.  Collapse memmove calls for
+         * adjacent tuples.
+         */
+        upper = phdr->pd_special;
+        copy_tail = copy_head = itemidbase[nitems - 1].itemoff + itemidbase[nitems - 1].alignedlen;
+        for (i = nitems - 1; i >= 0; i--)
+        {
+            ItemId      lp;
+
+            itemidptr = &itemidbase[i];
+            lp = PageGetItemId(page, itemidptr->offsetindex + 1);
+
+            if (copy_head != itemidptr->itemoff + itemidptr->alignedlen)
+            {
+                memcpy((char *) page + upper,
+                       scratchptr + copy_head,
+                       copy_tail - copy_head);
+                copy_tail = itemidptr->itemoff + itemidptr->alignedlen;
+            }
+            upper -= itemidptr->alignedlen;
+            copy_head = itemidptr->itemoff;
+
+            lp->lp_off = upper;
+
+        }
+
+        /* Copy the remaining chunk */
+        memcpy((char *) page + upper,
+               scratchptr + copy_head,
+               copy_tail - copy_head);
     }
 
     phdr->pd_upper = upper;
@@ -477,14 +613,16 @@ PageRepairFragmentation(Page page)
     Offset      pd_lower = ((PageHeader) page)->pd_lower;
     Offset      pd_upper = ((PageHeader) page)->pd_upper;
     Offset      pd_special = ((PageHeader) page)->pd_special;
-    itemIdSortData itemidbase[MaxHeapTuplesPerPage];
-    itemIdSort  itemidptr;
+    Offset      last_offset;
+    itemIdCompactData itemidbase[MaxHeapTuplesPerPage];
+    itemIdCompact itemidptr;
     ItemId      lp;
     int         nline,
                 nstorage,
                 nunused;
     int         i;
     Size        totallen;
+    bool        presorted = true;   /* For now */
 
     /*
      * It's worth the trouble to be more paranoid here than in most places,
@@ -509,6 +647,7 @@ PageRepairFragmentation(Page page)
     nline = PageGetMaxOffsetNumber(page);
     itemidptr = itemidbase;
     nunused = totallen = 0;
+    last_offset = 0;
     for (i = FirstOffsetNumber; i <= nline; i++)
     {
         lp = PageGetItemId(page, i);
@@ -518,6 +657,12 @@ PageRepairFragmentation(Page page)
         {
             itemidptr->offsetindex = i - 1;
             itemidptr->itemoff = ItemIdGetOffset(lp);
+
+            if (last_offset > itemidptr->itemoff)
+                presorted = false;
+            else
+                last_offset = itemidptr->itemoff;
+
             if (unlikely(itemidptr->itemoff < (int) pd_upper ||
                          itemidptr->itemoff >= (int) pd_special))
                 ereport(ERROR,
@@ -552,7 +697,7 @@ PageRepairFragmentation(Page page)
                      errmsg("corrupted item lengths: total %u, available space %u",
                             (unsigned int) totallen, pd_special - pd_lower)));
 
-        compactify_tuples(itemidbase, nstorage, page);
+        compactify_tuples(itemidbase, nstorage, page, presorted);
     }
 
     /* Set hint bit for PageAddItem */
@@ -831,9 +976,9 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
     Offset      pd_lower = phdr->pd_lower;
     Offset      pd_upper = phdr->pd_upper;
     Offset      pd_special = phdr->pd_special;
-    itemIdSortData itemidbase[MaxIndexTuplesPerPage];
+    itemIdCompactData itemidbase[MaxIndexTuplesPerPage];
     ItemIdData  newitemids[MaxIndexTuplesPerPage];
-    itemIdSort  itemidptr;
+    itemIdCompact itemidptr;
     ItemId      lp;
     int         nline,
                 nused;
@@ -932,7 +1077,7 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
     phdr->pd_lower = SizeOfPageHeaderData + nused * sizeof(ItemIdData);
 
     /* and compactify the tuple data */
-    compactify_tuples(itemidbase, nused, page);
+    compactify_tuples(itemidbase, nused, page, false);
 }