On Mon, 7 Sep 2020 at 19:47, David Rowley <dgrowle...@gmail.com> wrote:
> I wondered if we could get around that just by having another buffer
> somewhere and memcpy the tuples into that first then copy the tuples
> out that buffer back into the page. No need to worry about the order
> we do that in as there's no chance to overwrite memory belonging to
> other tuples.
>
> Doing that gives me 79.166 seconds in the same recovery test. Or about
> 46% faster, instead of 22% (I mistakenly wrote 28% yesterday)
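In code form, the quoted approach boils down to roughly the following. This is only a minimal sketch to make the two-pass idea concrete; the TupRef struct, compact_sketch() and the variable names are made up for illustration and are not the attached patch's actual data structures:

#include <string.h>

#define BLCKSZ 8192

typedef struct { int off, len; } TupRef;	/* hypothetical */

static void
compact_sketch(char *page, TupRef *tups, int ntups)
{
	char	scratch[BLCKSZ];
	int		upper = BLCKSZ;
	int		i;

	/*
	 * 1st pass: copy each tuple out to the scratch buffer, keeping it at
	 * its old page offset so we can find it again.
	 */
	for (i = 0; i < ntups; i++)
		memcpy(scratch + tups[i].off, page + tups[i].off, tups[i].len);

	/*
	 * 2nd pass: copy the tuples back, packed against the end of the page.
	 * Order doesn't matter: we read from the scratch copy, so we can never
	 * clobber a tuple we haven't moved yet.
	 */
	for (i = 0; i < ntups; i++)
	{
		upper -= tups[i].len;
		memcpy(page + upper, scratch + tups[i].off, tups[i].len);
	}
}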
I did some more thinking about this and considered if there's a way to just get rid of the sorting version of compactify_tuples() completely. In the version from yesterday, I fell back on the sort version for when more than half the tuples from the page were being pruned. I'd thought that in this case copying out *all* of the page from pd_upper up to pd_special (the tuple portion of the page) would maybe be more costly, since that would include (needlessly) copying all the pruned tuples too. The sort also becomes cheaper in that case since the number of items to sort is smaller, hence I thought it was a good idea to keep the old version for some cases.

However, I now think we can just fix this by conditionally copying all tuples in one big memcpy when not many tuples have been pruned, and when more tuples are pruned we can just do individual memcpys into the separate buffer.

I wrote a little .c program to try to figure out if there's a good cut-off point where one method becomes better than the other. I found that, generally, if we're pruning away more than about 75% of the tuples, then doing a memcpy() per non-pruned tuple is faster; otherwise, it seems better just to copy the entire tuple area of the page. See attached compact_test.c

I ran this and charted the cut-off at (nitems < totaltups / 4), (nitems < totaltups / 2) and (nitems < 16). (nitems here is the number of tuples remaining after pruning, so nitems < totaltups / 4 means more than 75% of the page's tuples were pruned.)

./compact_test 32 192
./compact_test 64 96
./compact_test 128 48
./compact_test 256 24
./compact_test 512 12

The / 4 one gives me the graphs with the smallest step when the method switches. See attached 48_tuples_per_page.png for comparison.

I've so far come up with the attached compactify_tuples_dgr_v2.patch.txt. Thomas pointed out to me off-list that using PGAlignedBlock is the general way to allocate page-sized data on the stack. I'm still classing this patch as PoC grade. I'll need to look a bit harder at the correctness of it all.

I did spend quite a bit of time trying to find a case where this is slower than master's version. I can't find a case where there's any noticeable slowdown. Using the same script from [1], I tried a few variations of the t1 table by adding an additional column to pad out the tuple to make it wider. Obviously, a wider tuple means fewer tuples on the page, so fewer tuples for master's qsort to sort during compactify_tuples().

I did manage to squeeze a bit more performance out of the test cases. Yesterday I got 79.166 seconds. This version gets me 76.623 seconds.

Here are the results for the various tuple widths:

narrow width row test: insert into t1 select x,0 from generate_series(1,10000000) x; (32 byte tuples)

patched: 76.623
master: 137.593

medium width row test: insert into t1 select x,0,md5(x::text) || md5((x+1)::Text) from generate_series(1,10000000) x; (about 64 byte tuples)

patched: 64.411
master: 95.576

wide row test: insert into t1 select x,0,(select string_agg(md5(y::text),'') from generate_Series(x,x+30) y) from generate_series(1,1000000) x; (1024 byte tuples)

patched: 88.653
master: 90.077

I also changed the test so that instead of having 10 million rows in the table and updating a random row 12 million times, I put just 10 rows in the table and updated them 12 million times. This results in compactify_tuples() pruning all but 1 row (since autovacuum can't keep up with this, each row does end up on a page by itself). I wanted to ensure I didn't regress a workload that master's qsort() version would have done very well at. qsorting 1 element is pretty fast.
10-row narrow test:

patched: 10.633 <--- small regression
master: 10.366

I could special case this and do a memmove without copying the tuple to another buffer, but I don't think the slowdown is enough to warrant having such a special case.

Another thing I tried: instead of compacting the page in compactify_tuples(), I got rid of that function entirely and did the compacting in the existing loop in PageRepairFragmentation(). This does require changing the ERROR check to a PANIC, since we may have already started shuffling tuples around by the time we find a corrupted line pointer. However, I was just unable to make this faster than the attached version. I'm still surprised at that, as it lets me get rid of the itemidbase array completely. The best run-time I got with this method on the original test was 86 seconds, so 10 seconds slower than what the attached can do. So I threw that idea away.

David

[1] https://www.postgresql.org/message-id/caaphdvokwqazhiuxet8jsqupjkdph8dnuzdfusx9p7dxrjd...@mail.gmail.com
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define BLCKSZ 8192
#define MaxHeapTuplesPerPage 291

typedef unsigned long long int64;
typedef unsigned short uint16;
typedef short int16;

typedef union PGAlignedBlock
{
	char		data[BLCKSZ];
	double		force_align_d;
	int64		force_align_i64;
} PGAlignedBlock;

typedef struct PageHeader
{
	int			pd_lower;
	int			pd_upper;
	int			pd_special;
} PageHeader;

typedef struct itemIdCompactData
{
	uint16		offsetindex;	/* linp array index */
	int16		itemoff;		/* page offset of item data */
	uint16		alignedlen;		/* MAXALIGN(item data len) */
} itemIdCompactData;
typedef itemIdCompactData *itemIdCompact;

static int
itemoffcompare(const void *itemidp1, const void *itemidp2)
{
	/* Return a random result so that qsort() shuffles the array */
	return rand() > RAND_MAX / 2 ? 1 : -1;
}

void
make_itemIdCompactData_array(itemIdCompactData *array, int size, int tupwidth)
{
	int			i;
	int			pd_upper = BLCKSZ;

	if (size * tupwidth > BLCKSZ)
	{
		fprintf(stderr, "tuple data bigger than BLCKSZ\n");
		exit(-1);
	}

	for (i = 0; i < size; i++)
	{
		pd_upper -= tupwidth;
		array[i].offsetindex = i + 1;
		array[i].itemoff = pd_upper;
		array[i].alignedlen = tupwidth;
	}

	/* shuffle the array into a random order */
	qsort((char *) array, size, sizeof(itemIdCompactData), itemoffcompare);
}

void
print_itemIdCompactData_array(itemIdCompactData *array, int size)
{
	int			i;

	for (i = 0; i < size; i++)
	{
		itemIdCompact itemidptr = &array[i];

		printf("item[%d] offset: %d, len: %u\n", itemidptr->offsetindex,
			   itemidptr->itemoff, itemidptr->alignedlen);
	}
}

static void
compactify_tuples(itemIdCompact itemidbase, int nitems, int totaltups,
				  PGAlignedBlock *page, PageHeader *phdr)
{
	int			upper;
	PGAlignedBlock scratch;
	char	   *scratchptr = scratch.data;
	int			i;

	/*
	 * The tuples in the itemidbase array may be in any order so in order to
	 * move these to the end of the page we must make a temp copy of each
	 * tuple before we copy them back into the page at the new position.
	 *
	 * If a large number of the tuples have been pruned (>75%) then we'll copy
	 * these into the temp buffer tuple-by-tuple, otherwise we'll just copy
	 * the entire tuple section of the page in a single memcpy().  Doing this
	 * saves us doing large copies when we're pruning most of the tuples.
	 */
	if (nitems < totaltups / 4)
	{
		for (i = 0; i < nitems; i++)
		{
			itemIdCompact itemidptr = &itemidbase[i];

			memcpy(scratchptr + itemidptr->itemoff,
				   page->data + itemidptr->itemoff,
				   itemidptr->alignedlen);
		}
	}
	else
	{
		/* Copy the entire tuple space */
		memcpy(scratchptr + phdr->pd_upper,
			   page->data + phdr->pd_upper,
			   phdr->pd_special - phdr->pd_upper);
	}

	upper = BLCKSZ;
	for (i = 0; i < nitems; i++)
	{
		itemIdCompact itemidptr = &itemidbase[i];

		upper -= itemidptr->alignedlen;
		memcpy((char *) page + upper,
			   scratchptr + itemidptr->itemoff,
			   itemidptr->alignedlen);
	}
}

int
main(int argc, char **argv)
{
	int			tupwidth;
	int			ntuples;
	int			unprunedtups;
	PGAlignedBlock page;
	PageHeader	header;
	itemIdCompactData itemidbase[MaxHeapTuplesPerPage];
	clock_t		start,
				end;
	int			i;

	if (argc < 3)
	{
		printf("Usage: %s <tupwidth> <ntuples>\n", argv[0]);
		return -1;
	}

	tupwidth = atoi(argv[1]);
	ntuples = atoi(argv[2]);

	if (ntuples > MaxHeapTuplesPerPage)
	{
		printf("ntuples cannot be above %d\n", MaxHeapTuplesPerPage);
		return -1;
	}

	header.pd_lower = 0;
	header.pd_upper = BLCKSZ - tupwidth * ntuples;
	header.pd_special = BLCKSZ;

	for (unprunedtups = 1; unprunedtups < ntuples; unprunedtups++)
	{
		make_itemIdCompactData_array(itemidbase, unprunedtups, tupwidth);

		start = clock();
		for (i = 0; i < 1000000; i++)
			compactify_tuples(itemidbase, unprunedtups, ntuples, &page, &header);
		end = clock();

		printf("%d : %g msec\n", unprunedtups,
			   (double) (end - start) / CLOCKS_PER_SEC * 1000);
	}

	return 0;
}
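For reference, compact_test.c is standalone, so it can be built and run with something like the following (the compiler invocation here is just an assumption, not necessarily what produced the numbers above):

cc -O2 -o compact_test compact_test.c
./compact_test 32 192

It times one million compactify_tuples() calls for each possible number of unpruned tuples at the given tuple width, printing one line per count, which is what the attached charts were made from.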
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index d708117a40..b20559230c 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -411,50 +411,67 @@ PageRestoreTempPage(Page tempPage, Page oldPage)
 }
 
 /*
- * sorting support for PageRepairFragmentation and PageIndexMultiDelete
+ * Item pruning support for PageRepairFragmentation and PageIndexMultiDelete
  */
-typedef struct itemIdSortData
+typedef struct itemIdCompactData
 {
 	uint16		offsetindex;	/* linp array index */
 	int16		itemoff;		/* page offset of item data */
 	uint16		alignedlen;		/* MAXALIGN(item data len) */
-} itemIdSortData;
-typedef itemIdSortData *itemIdSort;
-
-static int
-itemoffcompare(const void *itemidp1, const void *itemidp2)
-{
-	/* Sort in decreasing itemoff order */
-	return ((itemIdSort) itemidp2)->itemoff -
-		((itemIdSort) itemidp1)->itemoff;
-}
+} itemIdCompactData;
+typedef itemIdCompactData *itemIdCompact;
 
 /*
  * After removing or marking some line pointers unused, move the tuples to
  * remove the gaps caused by the removed items.
  */
 static void
-compactify_tuples(itemIdSort itemidbase, int nitems, Page page)
+compactify_tuples(itemIdCompact itemidbase, int nitems, Page page)
 {
 	PageHeader	phdr = (PageHeader) page;
 	Offset		upper;
+	PGAlignedBlock scratch;
+	char	   *scratchptr = scratch.data;
 	int			i;
 
-	/* sort itemIdSortData array into decreasing itemoff order */
-	qsort((char *) itemidbase, nitems, sizeof(itemIdSortData),
-		  itemoffcompare);
+	/*
+	 * The tuples in the itemidbase array may be in any order so in order to
+	 * move these to the end of the page we must make a temp copy of each
+	 * tuple before we copy them back into the page at the new position.
+	 *
+	 * If a large number of the tuples have been pruned (>75%) then we'll copy
+	 * these into the temp buffer tuple-by-tuple, otherwise we'll just copy
+	 * the entire tuple section of the page in a single memcpy().  Doing this
+	 * saves us doing large copies when we're pruning most of the tuples.
+	 */
+	if (nitems < PageGetMaxOffsetNumber(page) / 4)
+	{
+		for (i = 0; i < nitems; i++)
+		{
+			itemIdCompact itemidptr = &itemidbase[i];
+
+			memcpy(scratchptr + itemidptr->itemoff, page + itemidptr->itemoff,
+				   itemidptr->alignedlen);
+		}
+	}
+	else
+	{
+		/* Copy the entire tuple space */
+		memcpy(scratchptr + phdr->pd_upper,
+			   page + phdr->pd_upper,
+			   phdr->pd_special - phdr->pd_upper);
+	}
 
 	upper = phdr->pd_special;
 	for (i = 0; i < nitems; i++)
 	{
-		itemIdSort	itemidptr = &itemidbase[i];
+		itemIdCompact itemidptr = &itemidbase[i];
 		ItemId		lp;
 
 		lp = PageGetItemId(page, itemidptr->offsetindex + 1);
 		upper -= itemidptr->alignedlen;
-		memmove((char *) page + upper,
-				(char *) page + itemidptr->itemoff,
-				itemidptr->alignedlen);
+		memcpy((char *) page + upper,
+			   scratchptr + itemidptr->itemoff,
+			   itemidptr->alignedlen);
 		lp->lp_off = upper;
 	}
 
@@ -477,8 +494,8 @@ PageRepairFragmentation(Page page)
 	Offset		pd_lower = ((PageHeader) page)->pd_lower;
 	Offset		pd_upper = ((PageHeader) page)->pd_upper;
 	Offset		pd_special = ((PageHeader) page)->pd_special;
-	itemIdSortData itemidbase[MaxHeapTuplesPerPage];
-	itemIdSort	itemidptr;
+	itemIdCompactData itemidbase[MaxHeapTuplesPerPage];
+	itemIdCompact itemidptr;
 	ItemId		lp;
 	int			nline,
 				nstorage,
@@ -831,9 +848,9 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
 	Offset		pd_lower = phdr->pd_lower;
 	Offset		pd_upper = phdr->pd_upper;
 	Offset		pd_special = phdr->pd_special;
-	itemIdSortData itemidbase[MaxIndexTuplesPerPage];
+	itemIdCompactData itemidbase[MaxIndexTuplesPerPage];
 	ItemIdData	newitemids[MaxIndexTuplesPerPage];
-	itemIdSort	itemidptr;
+	itemIdCompact itemidptr;
 	ItemId		lp;
 	int			nline,
 				nused;