Hi,

In the GIN parallel build thread [0] I proposed adding a 'flush'
callback to the tuplesort API [1], to be called by the tuplesort
machinery after writing each run, so that users of tuplesort can
buffer write calls within each sorted run and use that buffer state to
merge/deduplicate sort tuples before writing them to disk (assuming
some invariants hold).
By deduplicating, the user can significantly reduce both the number
of sortable tuples and the total working set size, which in turn
reduces the time spent reading and processing those runs and speeds
up build times for tuplesorts that have many duplicate sort keys but
mergeable output.
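
To illustrate the 'buffer, deduplicate, flush' idea, here is a
minimal self-contained sketch (the names, types, and the in-memory
"tape" array are mine, not the patch's): the writetup step merges
tuples that share the buffered key, and the flush step writes out
whatever is still buffered at the end of a sorted run.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for a SortTuple and the tuplesort state. */
typedef struct
{
	int			key;
	int			ntids;		/* number of TIDs merged into this tuple */
} DedupTuple;

typedef struct
{
	DedupTuple	buf;		/* one-tuple write buffer */
	bool		valid;		/* is buf occupied? */
	DedupTuple	out[16];	/* stand-in for the sort tape */
	int			nout;
} DedupState;

/* Buffering writetup: merge runs of equal keys before writing. */
static void
dedup_writetup(DedupState *state, DedupTuple tup)
{
	if (state->valid && state->buf.key == tup.key)
	{
		/* same key as the buffered tuple: merge instead of writing */
		state->buf.ntids += tup.ntids;
		return;
	}
	if (state->valid)
		state->out[state->nout++] = state->buf; /* emit previous tuple */
	state->buf = tup;
	state->valid = true;
}

/* Flush callback: end of run, write out the remaining buffered tuple. */
static void
dedup_flushwrites(DedupState *state)
{
	if (state->valid)
		state->out[state->nout++] = state->buf;
	state->valid = false;
}
```

Within one sorted run, equal keys are adjacent, so a one-tuple buffer
is enough to catch every duplicate.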

As the GIN parallel index patch was already large and complex enough,
it was suggested that this feature be split off into its own patch. I
said I'd give that a try, so here is a patch that adds the required
API to the tuplesort internals and implements the 'buffer,
deduplicate, flush' idea for nbtree index builds that can benefit
from deduplication.

If deduplication is enabled for the btree index, we now merge tuples
up to the tuplesort buffer size while we're still extracting tuples
from the heap, thereby reducing the temporary storage requirement of
the index build.

One complication that this patch needs to deal with is that parallel
scans (and syncscan wraparounds) can cause some higher TIDs to appear
before lower TIDs in two posting lists (e.g. lists [1, 4] and [2, 3]).
Therefore, we must take special care while loading the tree to make
sure we only write out TIDs that are smaller than the latest tuple's
TID; this is implemented with a growing TID buffer.
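
A minimal sketch of that growing TID buffer (names and the int-TID
simplification are illustrative, not the patch's actual code):
incoming TIDs are collected into a sorted buffer, and only the prefix
below the newest tuple's first TID is known to be final and safe to
write out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Pending TIDs, kept sorted.  TIDs below the newest tuple's first
 * TID can no longer be preceded by anything from later input, so
 * that prefix is safe to emit; the rest stays buffered. */
typedef struct
{
	int		   *tids;
	int			ntids;
	int			cap;
} TidBuffer;

static int
cmp_tid(const void *a, const void *b)
{
	return *(const int *) a - *(const int *) b;
}

/* Add one incoming tuple's TIDs to the buffer and re-sort. */
static void
tidbuf_add(TidBuffer *buf, const int *tids, int n)
{
	if (buf->ntids + n > buf->cap)
	{
		buf->cap = (buf->ntids + n) * 2;
		buf->tids = realloc(buf->tids, buf->cap * sizeof(int));
	}
	memcpy(buf->tids + buf->ntids, tids, n * sizeof(int));
	buf->ntids += n;
	qsort(buf->tids, buf->ntids, sizeof(int), cmp_tid);
}

/* Move every buffered TID below 'bound' (the newest tuple's first
 * TID) into 'out'; return how many were extracted. */
static int
tidbuf_extract_below(TidBuffer *buf, int bound, int *out)
{
	int			n = 0;

	while (n < buf->ntids && buf->tids[n] < bound)
		n++;
	memcpy(out, buf->tids, n * sizeof(int));
	memmove(buf->tids, buf->tids + n,
			(buf->ntids - n) * sizeof(int));
	buf->ntids -= n;
	return n;
}
```

With the [1, 4] / [2, 3] example above, seeing the tuple that starts
at TID 2 only releases TID 1; TID 4 stays buffered until it is safe.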

This patch also changes btree build behaviour slightly. Where
previously we would never create posting lists larger than the target
posting list size (812 bytes), we can now emit posting tuples with up
to 10 [^3] TIDs and up to BTMaxItemSize in size, when the
non-deduplicated base tuple alone would otherwise be large enough to
consume that 812-byte posting list size limit.
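
As an illustration only (the function name and the exact acceptance
logic are my assumptions; only the 812-byte target and the
BTMaxItemSize cap come from the description above), the growth test
could look roughly like this:

```c
#include <assert.h>
#include <stdbool.h>

#define TARGET_POSTING_SIZE 812 /* target posting-list tuple size */

/* Hypothetical: may one more TID be merged into a posting tuple?
 * 'max_item_size' stands in for BTMaxItemSize of the index page. */
static bool
posting_tuple_may_grow(int cur_size, int cur_ntids,
					   int tid_size, int max_item_size)
{
	/* normal case: stay within the target posting list size */
	if (cur_size + tid_size <= TARGET_POSTING_SIZE)
		return true;

	/* oversized base tuple: allow up to 10 TIDs, capped by the
	 * page's maximum item size */
	return cur_ntids < 10 && cur_size + tid_size <= max_item_size;
}
```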

Local testing shows that index builds can see 60%+ reduced temporary
disk usage when indexing values with high duplication factors
(storage comparable to that of the resulting index), and I've seen
50% reduced index build times for some text-based deduplication
workloads.

Note: this isn't yet very polished, but it works. I'm reasonably happy
with happy-path performance so far, but I've seen bad cases (a unique
dataset indexed without the UNIQUE specifier) regress by as much as
30%, so this is definitely not (yet?) a panacea.

It should be noted that UNIQUE btree builds still make use of the old
non-duplicating tuplesort code, meaning they can be used to detect
regressions in this patch (vs a non-UNIQUE index build with otherwise
the same definition).

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] 
https://postgr.es/m/flat/6ab4003f-a8b8-4d75-a67f-f25ad98582dc%40enterprisedb.com
[1] 
https://postgr.es/m/CAEze2WjB1vpxtvKuWVEThSaB-v4%2B8H0EXsOB%3DyLAv8pLcrQuKw%40mail.gmail.com
[2] 
https://postgr.es/m/CAEze2WiTAeZe4t5wAeRN834xFBqROPmjeK2XTstNko6bbVPX%3DA%40mail.gmail.com

[^3] Chosen by fair dice roll
From bf0f1b64932dfc206bb6f94a875abf92b44c7da0 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 28 Aug 2024 15:28:37 +0200
Subject: [PATCH v0 1/2] Allow tuplesort implementations to buffer writes

Before, all writes to the sort tapes would have to be completed during
the call to writetup().  That's sufficient when the user of tuplesort
isn't interested in merging sorted tuples, but btree (and in the future,
GIN) sorts tuples to later merge them during insertion into the index.
If we merge the tuples before writing them to disk, we can save
significant disk space and IO.

As such, we allow WRITETUP to do whatever it wants when we're filling a
tape with tuples, and call FLUSHWRITES() at the end to mark the end of
that tape so that the tuplesort can flush any remaining buffers to disk.

By design, this does _not_ allow deduplication while the dataset is still
in memory. Writing data to disk is inherently expensive, so we're likely
to win time by spending some additional cycles on buffering the data in
the hopes of not writing as much data. However, in memory the additional
cycles may cause too much of an overhead to be useful.

Note that any implementation of tuple merging using the buffering
strategy enabled by this commit must make sure that a merged output
tuple is never larger than the combined size of the input tuples it
replaces.
---
 src/include/utils/tuplesort.h              | 9 +++++++++
 src/backend/utils/sort/tuplesort.c         | 5 +++++
 src/backend/utils/sort/tuplesortvariants.c | 7 +++++++
 3 files changed, 21 insertions(+)

diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f6201..ed17ca00b6 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -194,6 +194,15 @@ typedef struct
 	void		(*writetup) (Tuplesortstate *state, LogicalTape *tape,
 							 SortTuple *stup);
 
+	/*
+	 * Flush any buffered writetup() writes.
+	 *
+	 * This is useful when writetup() buffers writes for more efficient
+	 * use of the tape's resources, e.g. when deduplicating or merging
+	 * sort tuples.
+	 */
+	void		(*flushwrites) (Tuplesortstate *state, LogicalTape *tape);
+
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
 	 * the already-read length of the stored tuple.  The tuple is allocated
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index c960cfa823..fd838a3b1b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -395,6 +395,7 @@ struct Sharedsort
 #define REMOVEABBREV(state,stup,count)	((*(state)->base.removeabbrev) (state, stup, count))
 #define COMPARETUP(state,a,b)	((*(state)->base.comparetup) (a, b, state))
 #define WRITETUP(state,tape,stup)	((*(state)->base.writetup) (state, tape, stup))
+#define FLUSHWRITES(state,tape)	((state)->base.flushwrites ? (*(state)->base.flushwrites) (state, tape) : (void) 0)
 #define READTUP(state,stup,tape,len) ((*(state)->base.readtup) (state, stup, tape, len))
 #define FREESTATE(state)	((state)->base.freestate ? (*(state)->base.freestate) (state) : (void) 0)
 #define LACKMEM(state)		((state)->availMem < 0 && !(state)->slabAllocatorUsed)
@@ -2244,6 +2245,8 @@ mergeonerun(Tuplesortstate *state)
 		}
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	/*
 	 * When the heap empties, we're done.  Write an end-of-run marker on the
 	 * output tape.
@@ -2369,6 +2372,8 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		WRITETUP(state, state->destTape, stup);
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	state->memtupcount = 0;
 
 	/*
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index e07ba4ea4b..b9e8b3943e 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -199,6 +199,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	base->comparetup = comparetup_heap;
 	base->comparetup_tiebreak = comparetup_heap_tiebreak;
 	base->writetup = writetup_heap;
+	base->flushwrites = NULL;
 	base->readtup = readtup_heap;
 	base->haveDatum1 = true;
 	base->arg = tupDesc;		/* assume we need not copy tupDesc */
@@ -275,6 +276,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	base->comparetup = comparetup_cluster;
 	base->comparetup_tiebreak = comparetup_cluster_tiebreak;
 	base->writetup = writetup_cluster;
+	base->flushwrites = NULL;
 	base->readtup = readtup_cluster;
 	base->freestate = freestate_cluster;
 	base->arg = arg;
@@ -383,6 +385,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -462,6 +465,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	base->comparetup = comparetup_index_hash;
 	base->comparetup_tiebreak = comparetup_index_hash_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -506,6 +510,7 @@ tuplesort_begin_index_gist(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -561,6 +566,7 @@ tuplesort_begin_index_brin(int workMem,
 	base->removeabbrev = removeabbrev_index_brin;
 	base->comparetup = comparetup_index_brin;
 	base->writetup = writetup_index_brin;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index_brin;
 	base->haveDatum1 = true;
 	base->arg = NULL;
@@ -602,6 +608,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	base->comparetup = comparetup_datum;
 	base->comparetup_tiebreak = comparetup_datum_tiebreak;
 	base->writetup = writetup_datum;
+	base->flushwrites = NULL;
 	base->readtup = readtup_datum;
 	base->haveDatum1 = true;
 	base->arg = arg;
-- 
2.45.2

Attachment: v0-0002-Add-tuplesort-level-deduplication-to-nbtree-build.patch