Hi,

In the GIN parallel build thread [0], I proposed adding a 'flush' callback to the tuplesort API [1], to be called by the tuplesort machinery after writing each run. This allows users of tuplesort to buffer write calls within each sorted run, and to use that buffer state to merge/deduplicate sorttuples before they are written to disk (assuming some invariants hold). With such deduplication, a user can significantly reduce both the number of sortable tuples and the total working set size, which in turn reduces the time spent reading and processing the runs, speeding up builds for tuplesorts that have many duplicate sort keys but mergeable output.
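To make the intended usage concrete, here is a minimal sketch of a buffering writetup/flushwrites pair. This is not code from the patches below; DedupWriteState, get_dedup_state, tuples_mergeable, merge_into_pending and write_one_tuple are all hypothetical names:

/*
 * Sketch only: buffer the most recent tuple of the current run, and
 * merge later tuples with an equal sort key into it instead of writing
 * them to the tape separately.
 */
typedef struct DedupWriteState
{
    SortTuple   pending;        /* last tuple of this run, not yet written */
    bool        have_pending;
} DedupWriteState;

static void
writetup_dedup(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
{
    TuplesortPublic *base = TuplesortstateGetPublic(state);
    DedupWriteState *dw = get_dedup_state(state);   /* hypothetical */

    if (dw->have_pending &&
        base->comparetup(&dw->pending, stup, state) == 0 &&
        tuples_mergeable(&dw->pending, stup))   /* e.g. size invariants */
    {
        /* Equal keys within one sorted run: absorb, write nothing yet. */
        merge_into_pending(&dw->pending, stup);
        return;
    }

    /* Key changed (or merge impossible): the buffered tuple is final. */
    if (dw->have_pending)
        write_one_tuple(state, tape, &dw->pending);

    dw->pending = *stup;
    dw->have_pending = true;
}

static void
flushwrites_dedup(Tuplesortstate *state, LogicalTape *tape)
{
    DedupWriteState *dw = get_dedup_state(state);   /* hypothetical */

    /* End of this run: no later tuple in the run can merge anymore. */
    if (dw->have_pending)
        write_one_tuple(state, tape, &dw->pending);
    dw->have_pending = false;
}

Because tuples arrive in sorted order within a run, a single pending tuple is enough to catch every mergeable neighbour; the flush callback then guarantees the buffer is drained before the end-of-run marker is written.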
As the GIN parallel index patch was already large and complex enough, it was suggested to split this into a separate patch. I said I'd try to build one, so here is a patch that adds the required API to the tuplesort internals and implements the 'buffer, deduplicate, flush' idea for nbtree index builds that can benefit from deduplication. If deduplication is enabled for the btree index, we now merge tuples up to the tuplesort buffer size while we're still extracting tuples from the heap, thereby reducing the temporary storage requirement of the index build.

One complication this patch needs to deal with is that parallel scans (and syncscan wraparounds) can cause some higher TIDs to appear before lower TIDs in two posting lists (e.g. lists [1, 4] and [2, 3]). Therefore, we must take some special care while loading the tree to make sure we only write out TIDs that are smaller than the latest tuple's TID; this is implemented with a growing TID buffer (see the sketch after the references below).

As part of these changes, the btree build behaviour changes in one respect. Where previously we would never create posting lists larger than the target posting list size (812 bytes), we can now emit some posting tuples with up to 10 [^3] TIDs and up to BTMaxItemSize in size, when the non-deduplicated base tuple would otherwise be large enough to consume that 812-byte posting list size limit.

Local testing shows that index builds can see 60%+ reduced temporary disk usage when indexing values with high duplication factors (storage comparable to the resulting index), and I've seen 50% reduced index build times for some text-based deduplication workloads.

Note: this isn't yet very polished, but it works. I'm reasonably happy with happy-path performance so far, but I've seen bad cases (a unique dataset without the UNIQUE specifier) regress by as much as 30%, so this is definitely not (yet?) a panacea. It should be noted that UNIQUE btree builds still use the old non-deduplicating tuplesort code, meaning they can be used to detect regressions in this patch (vs. a non-UNIQUE index build with otherwise the same definition).

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://postgr.es/m/flat/6ab4003f-a8b8-4d75-a67f-f25ad98582dc%40enterprisedb.com
[1] https://postgr.es/m/CAEze2WjB1vpxtvKuWVEThSaB-v4%2B8H0EXsOB%3DyLAv8pLcrQuKw%40mail.gmail.com
[2] https://postgr.es/m/CAEze2WiTAeZe4t5wAeRN834xFBqROPmjeK2XTstNko6bbVPX%3DA%40mail.gmail.com

[^3] Chosen by fair dice roll
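To illustrate the TID buffer rule mentioned above (a sketch with a hypothetical helper, not the attached patch's code): the buffered TIDs for the current key are kept sorted, and when the next tuple with the same key arrives, only the prefix of the buffer below that tuple's lowest TID is final; anything above it could still interleave with TIDs delivered by a later scan chunk.

#include "storage/itemptr.h"

/*
 * Sketch only: with posting lists [1, 4] and [2, 3] arriving in that
 * order, only TID 1 may be written out once [2, 3] arrives; TID 4 must
 * stay buffered until we know no smaller TID can still show up.
 */
static int
tids_safe_to_flush(ItemPointerData *buf, int nbuffered,
                   ItemPointer next_lowest_tid)
{
    int         nsafe = 0;

    /* buf[] is kept sorted; stop at the first TID we must hold back. */
    while (nsafe < nbuffered &&
           ItemPointerCompare(&buf[nsafe], next_lowest_tid) < 0)
        nsafe++;

    return nsafe;
}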
From bf0f1b64932dfc206bb6f94a875abf92b44c7da0 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 28 Aug 2024 15:28:37 +0200
Subject: [PATCH v0 1/2] Allow tuplesort implementations to buffer writes

Before, all writes to the sort tapes had to be completed during the
call to writetup(). That's sufficient when the user of tuplesort isn't
interested in merging sorted tuples, but btree (and, in the future,
GIN) sorts tuples to later merge them during insertion into the index.
If we merge the tuples before writing them to disk, we can save
significant disk space and IO.

As such, we allow WRITETUP to do whatever it wants while we're filling
a tape with tuples, and call FLUSHWRITES() at the end to mark the end
of that tape, so that the tuplesort can flush any remaining buffers to
disk.

By design, this does _not_ allow deduplication while the dataset is
still in memory. Writing data to disk is inherently expensive, so we're
likely to win time by spending some additional cycles on buffering the
data in the hope of not writing as much data. In memory, however, the
additional cycles may add too much overhead to be worthwhile.

Note that any implementation of tuple merging using the buffering
strategy enabled by this commit must also make sure that the merged
tuple is never larger than the sum of the sizes of the tuples it
replaces.
---
 src/include/utils/tuplesort.h              | 9 +++++++++
 src/backend/utils/sort/tuplesort.c         | 5 +++++
 src/backend/utils/sort/tuplesortvariants.c | 7 +++++++
 3 files changed, 21 insertions(+)

diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f6201..ed17ca00b6 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -194,6 +194,15 @@ typedef struct
 	void		(*writetup) (Tuplesortstate *state, LogicalTape *tape,
 							 SortTuple *stup);
 
+	/*
+	 * Flush any buffered writetup() writes.
+	 *
+	 * This is useful when writetup() buffers writes for more efficient
+	 * use of the tape's resources, e.g. when deduplicating or merging
+	 * sort tuples.
+	 */
+	void		(*flushwrites) (Tuplesortstate *state, LogicalTape *tape);
+
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
 	 * the already-read length of the stored tuple. The tuple is allocated
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index c960cfa823..fd838a3b1b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -395,6 +395,7 @@ struct Sharedsort
 #define REMOVEABBREV(state,stup,count)	((*(state)->base.removeabbrev) (state, stup, count))
 #define COMPARETUP(state,a,b)	((*(state)->base.comparetup) (a, b, state))
 #define WRITETUP(state,tape,stup)	((*(state)->base.writetup) (state, tape, stup))
+#define FLUSHWRITES(state,tape)	((state)->base.flushwrites ? (*(state)->base.flushwrites) (state, tape) : (void) 0)
 #define READTUP(state,stup,tape,len) ((*(state)->base.readtup) (state, stup, tape, len))
 #define FREESTATE(state)	((state)->base.freestate ? (*(state)->base.freestate) (state) : (void) 0)
 #define LACKMEM(state)		((state)->availMem < 0 && !(state)->slabAllocatorUsed)
@@ -2244,6 +2245,8 @@ mergeonerun(Tuplesortstate *state)
 		}
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	/*
 	 * When the heap empties, we're done.  Write an end-of-run marker on the
 	 * output tape.
@@ -2369,6 +2372,8 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		WRITETUP(state, state->destTape, stup);
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	state->memtupcount = 0;
 
 	/*
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index e07ba4ea4b..b9e8b3943e 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -199,6 +199,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	base->comparetup = comparetup_heap;
 	base->comparetup_tiebreak = comparetup_heap_tiebreak;
 	base->writetup = writetup_heap;
+	base->flushwrites = NULL;
 	base->readtup = readtup_heap;
 	base->haveDatum1 = true;
 	base->arg = tupDesc;		/* assume we need not copy tupDesc */
@@ -275,6 +276,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	base->comparetup = comparetup_cluster;
 	base->comparetup_tiebreak = comparetup_cluster_tiebreak;
 	base->writetup = writetup_cluster;
+	base->flushwrites = NULL;
 	base->readtup = readtup_cluster;
 	base->freestate = freestate_cluster;
 	base->arg = arg;
@@ -383,6 +385,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -462,6 +465,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	base->comparetup = comparetup_index_hash;
 	base->comparetup_tiebreak = comparetup_index_hash_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -506,6 +510,7 @@ tuplesort_begin_index_gist(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -561,6 +566,7 @@ tuplesort_begin_index_brin(int workMem,
 	base->removeabbrev = removeabbrev_index_brin;
 	base->comparetup = comparetup_index_brin;
 	base->writetup = writetup_index_brin;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index_brin;
 	base->haveDatum1 = true;
 	base->arg = NULL;
@@ -602,6 +608,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	base->comparetup = comparetup_datum;
 	base->comparetup_tiebreak = comparetup_datum_tiebreak;
 	base->writetup = writetup_datum;
+	base->flushwrites = NULL;
 	base->readtup = readtup_datum;
 	base->haveDatum1 = true;
 	base->arg = arg;
-- 
2.45.2
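A note on the design of 0001: every existing tuplesort_begin_* variant sets flushwrites = NULL, so the FLUSHWRITES() macro collapses to a no-op and their behaviour is unchanged. A variant opts in to buffered writes by registering both callbacks, roughly like this sketch, which reuses the hypothetical callbacks and deduplicate flag from earlier in this mail (the real registration lives in the attached 0002 patch and may differ):

    if (deduplicate)
    {
        /* opt in: writetup may buffer, flushwrites drains it per run */
        base->writetup = writetup_dedup;
        base->flushwrites = flushwrites_dedup;
    }
    else
    {
        /* unchanged path: FLUSHWRITES() is a no-op when NULL */
        base->writetup = writetup_index;
        base->flushwrites = NULL;
    }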
v0-0002-Add-tuplesort-level-deduplication-to-nbtree-build.patch
Description: Binary data