I benchmarked this a couple weeks ago, and just now got around to
putting the results in a presentable form.

I took some of the more interesting cases from the B&M test Peter
shared upthread (best and worst performers in a quick test) and ran
them with a range of "m" values.

This is exploratory work, so the attached script doesn't quite
correspond to what's in the spreadsheet, but it gives some idea of the
details. I used 1 million records for everything, with prewarm and
turbo off.

-Random with modulus
This is the best case, beating our current sort on all but the lowest
cardinality inputs (which are only slightly slower).

-Sawtooth
This is the worst case, worse for all saw sizes. The case with 30 saws
is particularly bad. I tried adding a recursion-stopping heapsort,
but that made things worse. I had hoisted the presorted check out of
the recursion steps because it used the full comparator (for
simplicity); adding back a specialized check at every recursion
helped somewhat, but not enough to matter.

-Stagger [value = (i*multiplier + i) % num_records]
No clear winner here.
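For reference, the stagger inputs can be generated along these lines (a sketch only; the attached script is authoritative, and the multiplier is whatever factor is being swept):

```c
#include <stddef.h>

/*
 * Sketch of the B&M stagger distribution:
 * value = (i * multiplier + i) % num_records
 */
static void
fill_stagger(long *v, size_t num_records, long multiplier)
{
	for (size_t i = 0; i < num_records; i++)
		v[i] = ((long) i * multiplier + (long) i) % (long) num_records;
}
```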

-Miscellaneous cases
Random and ascending are repeated in the spreadsheet for reference.
-Zipfian is from the numpy random number generator using 1.01 as the
parameter. Patch looks good here.
-"s95" is 95% ascending with 5% random tail. This regresses a bit.
-"p5" is a best case for our current sort: 95% zeros, 2.5% random
negative values, and 2.5% random positive values. The first pivot is
pretty much guaranteed to be zero, so in master all zeros are moved to
the middle partition upon first recursion. The patch takes two passes
to bucket all zeros together (because branchless comparisons with
single machine instructions must be 2-way and not 3-way), but in the
grand scheme of things it amounts to only a small regression.
-Organ regresses substantially.
-Descending regresses but not as much as I expected since descending
input is supposed to be a bad case for Lomuto partitioning schemes.
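To illustrate the two-pass behavior in the p5 case, here is a standalone sketch on plain ints (a simplified swap-based variant, not the patch's gap-based SortTuple code; the names mirror the patch's partition_right/partition_left hooks): a 2-way comparison can only split "< pivot" from ">= pivot", so one pass cannot isolate the zeros the way a 3-way partition does.

```c
#include <stddef.h>

/* Elements < pivot end up in v[0..result); the rest, zeros included, go right. */
static size_t
part_right(int *v, size_t n, int pivot)
{
	size_t		i = 0;

	for (size_t j = 0; j < n; j++)
	{
		int			tmp = v[j];

		v[j] = v[i];
		v[i] = tmp;
		i += (v[i] < pivot);	/* comparison result is 0 or 1, no branch */
	}
	return i;
}

/* Elements <= pivot end up in v[0..result); this is what buckets duplicates. */
static size_t
part_left(int *v, size_t n, int pivot)
{
	size_t		i = 0;

	for (size_t j = 0; j < n; j++)
	{
		int			tmp = v[j];

		v[j] = v[i];
		v[i] = tmp;
		i += !(pivot < v[i]);
	}
	return i;
}
```

With a p5-style input and pivot 0, part_right moves the negatives left while zeros stay mixed with the positives; only a second part_left pass over the right side groups the zeros together, hence the small regression versus master's 3-way first recursion.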

These are mixed results, so I'm not inclined to pursue this further
this cycle. One thing could tip the scales in its favor in the future:
the planner doesn't yet know about the specialized comparators that
use single machine instructions for the first key. The patch has a
fairly tight performance range on (almost) unique inputs: whether it's
faster or slower than master, these inputs very often measure about
the same. That's not surprising, since there are no branches to
mispredict during partitioning. Having predictable worst cases would
be good for the planner.

One thing I didn't test is multiple keys (or resolving ties of an
abbreviated first key). When the patch buckets duplicates together, it
must (if necessary) recurse to that partition to resolve the
tiebreaks. That was simple to code, but untested for either
correctness or performance. It's a bit like the multi-key sort
proposal in this regard, but with only two distinct steps:

https://www.postgresql.org/message-id/PH7P220MB1533DA211DF219996760CBB7D9EB2%40PH7P220MB1533.NAMP220.PROD.OUTLOOK.COM
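A minimal sketch of that two-step idea on a two-int struct (hypothetical names; the patch operates on SortTuples and comparetup_tiebreak): one 2-way pass buckets the first-key duplicates, then only that bucket is sorted with the tiebreak comparator.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct
{
	int			key1;		/* stands in for datum1 */
	int			key2;		/* stands in for the tiebreak key */
} Tup;

/* tiebreak comparator: only consulted within an equal-key1 bucket */
static int
cmp_tiebreak(const void *pa, const void *pb)
{
	const Tup  *a = pa;
	const Tup  *b = pb;

	return (a->key2 > b->key2) - (a->key2 < b->key2);
}

/*
 * Assumes every element already has key1 >= pivot1, as after a
 * "partition right" step; groups key1 == pivot1 at the front.
 */
static size_t
bucket_equal_key1(Tup *v, size_t n, int pivot1)
{
	size_t		i = 0;

	for (size_t j = 0; j < n; j++)
	{
		Tup			tmp = v[j];

		v[j] = v[i];
		v[i] = tmp;
		i += (v[i].key1 <= pivot1);
	}
	return i;
}

/* step two: resolve ties only within the equal-key1 prefix */
static void
resolve_ties(Tup *v, size_t n, int pivot1)
{
	size_t		n_eq = bucket_equal_key1(v, n, pivot1);

	qsort(v, n_eq, sizeof(Tup), cmp_tiebreak);
}
```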

-- 
John Naylor
Amazon Web Services

Attachment: branchless-lomuto-20250210.ods
Description: application/vnd.oasis.opendocument.spreadsheet

Attachment: BL-perf-test-examples.sh
Description: application/shellscript

From ad030d722028f90ef2afe6e7d7a696bc1a8ae2eb Mon Sep 17 00:00:00 2001
From: John Naylor <john.nay...@postgresql.org>
Date: Mon, 25 Nov 2024 12:37:33 +0700
Subject: [PATCH v2] Branchless Lomuto partitioning

POC that only works on datatypes that use ssup_datum_signed_cmp.
Dev-only GUC needed because it can't yet handle NULLs or DESC.
---
 src/backend/utils/misc/guc_tables.c       |  12 ++
 src/backend/utils/sort/tuple_partition.h  |  38 +++++
 src/backend/utils/sort/tuple_small_sort.h |  30 ++++
 src/backend/utils/sort/tuplesort.c        | 192 +++++++++++++++++++++-
 src/include/utils/guc.h                   |   1 +
 src/include/utils/tuplesort.h             |   4 +
 6 files changed, 274 insertions(+), 3 deletions(-)
 create mode 100644 src/backend/utils/sort/tuple_partition.h
 create mode 100644 src/backend/utils/sort/tuple_small_sort.h

diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 71448bb4fd..b53ebfca09 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1720,6 +1720,18 @@ struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		/* XXX not for commit */
+		{"debug_branchless_sort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("WIP testing of branchless sort techniques"),
+			NULL,
+			GUC_EXPLAIN
+		},
+		&debug_branchless_sort,
+		false,
+		NULL, NULL, NULL
+	},
+
 #ifdef TRACE_SYNCSCAN
 	/* this is undocumented because not exposed in a standard build */
 	{
diff --git a/src/backend/utils/sort/tuple_partition.h b/src/backend/utils/sort/tuple_partition.h
new file mode 100644
index 0000000000..d84c3ae005
--- /dev/null
+++ b/src/backend/utils/sort/tuple_partition.h
@@ -0,0 +1,38 @@
+/*
+ * Based on pseudocode from https://orlp.net/blog/branchless-lomuto-partitioning/
+ *
+ * There is deliberately no include guard here.
+ */
+
+size_t i = 0;
+size_t j = 0;
+SortTuple pivot = *pivot_pos;
+
+Assert(n > 0);
+
+
+/* create gap at front */
+*pivot_pos = v[0];
+
+while (j < n - 1)
+{
+	v[j] = v[i];
+	j += 1;
+	v[i] = v[j];
+#ifdef PARTITION_LEFT
+	i += !CMP_2WAY(pivot, v[i]);
+#else
+	i += CMP_2WAY(v[i], pivot);
+#endif
+}
+
+v[j] = v[i];
+v[i] = pivot;
+#ifdef PARTITION_LEFT
+	i += !CMP_2WAY(pivot, v[i]);
+#else
+	i += CMP_2WAY(v[i], pivot);
+#endif
+
+/* i is the number of elements in the left partition */
+return i;
diff --git a/src/backend/utils/sort/tuple_small_sort.h b/src/backend/utils/sort/tuple_small_sort.h
new file mode 100644
index 0000000000..30737a286c
--- /dev/null
+++ b/src/backend/utils/sort/tuple_small_sort.h
@@ -0,0 +1,30 @@
+/*
+ * There is deliberately no include guard here.
+ */
+
+SortTuple	 *pl,
+			 *pm;
+
+for (pm = begin + 1; pm < begin + n; pm++)
+{
+	pl = pm;
+
+	/*
+	 * Compare first so we can avoid 2 moves for an element already
+	 * positioned correctly.
+	 */
+	if (CMP_3WAY(pl - 1, pl) > 0)
+	{
+		SortTuple tmp = *pl;
+
+		do
+		{
+			*pl = *(pl - 1);
+			pl--;
+		}
+		while (pl > begin && CMP_3WAY(pl - 1, &tmp) > 0);
+
+		*pl = tmp;
+	}
+
+}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 2ef32d53a4..623d278f4b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -122,6 +122,7 @@
 
 /* GUC variables */
 bool		trace_sort = false;
+bool		debug_branchless_sort = false;	/* XXX not for commit */
 
 #ifdef DEBUG_BOUNDED_SORT
 bool		optimize_bounded_sort = true;
@@ -619,6 +620,155 @@ qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
 #define ST_DEFINE
 #include "lib/sort_template.h"
 
+
+/*
+ * WIP: Branchless partitioning assumes NULLs have been handled already,
+ * so we don't consider them here.
+ * XXX: only works on first sort key, possibly abbreviated.
+ */
+#define LEADING_DATUM_CMP(a, b) \
+	ApplySortComparator((a)->datum1, false, \
+						(b)->datum1, false, ssup)
+
+static pg_noinline SortTuple *
+datum_med3(SortTuple *a,
+		   SortTuple *b,
+		   SortTuple *c, SortSupport ssup)
+{
+	return LEADING_DATUM_CMP(a, b) < 0 ?
+		(LEADING_DATUM_CMP(b, c) < 0 ? b : (LEADING_DATUM_CMP(a, c) < 0 ? c : a))
+		: (LEADING_DATUM_CMP(b, c) > 0 ? b : (LEADING_DATUM_CMP(a, c) < 0 ? a : c));
+}
+
+#define CMP_2WAY(a, b) (DatumGetInt64((a).datum1) < DatumGetInt64((b).datum1))
+#define CMP_3WAY(a,b) qsort_tuple_signed_compare(a,b,state)
+static size_t
+part_right_datum_signed_asc(SortTuple *v, size_t n, SortTuple *pivot_pos)
+{
+#include "tuple_partition.h"
+}
+
+static size_t
+part_left_datum_signed_asc(SortTuple *v, size_t n, SortTuple *pivot_pos)
+{
+#define PARTITION_LEFT
+#include "tuple_partition.h"
+#undef PARTITION_LEFT
+}
+
+static inline void
+small_sort_datum_signed(SortTuple *begin, size_t n, Tuplesortstate *state)
+{
+#include "tuple_small_sort.h"
+}
+#undef CMP_2WAY
+#undef CMP_3WAY
+
+
+static void
+qsort_tuple_datum(SortTuple *data, size_t n, Tuplesortstate *state, SortTuple *ancestor_pivot)
+{
+	SortTuple  *a = data,
+			   *pl,
+			   *pm,
+			   *pn;
+	size_t		n_left_part;
+	SortSupportData *ssup = state->base.sortKeys;
+
+
+loop:
+	CHECK_FOR_INTERRUPTS();
+
+	if (n < 7)
+	{
+		SortTuple  *begin;
+
+		if (ancestor_pivot != NULL &&
+			state->base.onlyKey == NULL)
+		{
+			/*
+			 * We must include the ancestor pivot, because the previous
+			 * partitioning step only compared the first key (possibly
+			 * abbreviated).
+			 */
+			begin = ancestor_pivot;
+			n++;
+		}
+		else
+			begin = a;
+
+		if (state->base.sortKeys[0].comparator == ssup_datum_signed_cmp)
+			small_sort_datum_signed(begin, n, state);
+
+		return;
+	}
+
+	pm = a + (n / 2);
+	if (n > 7)
+	{
+		pl = a;
+		pn = a + (n - 1);
+		if (n > 40)
+		{
+			size_t		d = (n / 8);
+
+			pl = datum_med3(pl, pl + d, pl + 2 * d, ssup);
+			pm = datum_med3(pm - d, pm, pm + d, ssup);
+			pn = datum_med3(pn - 2 * d, pn - d, pn, ssup);
+		}
+		pm = datum_med3(pl, pm, pn, ssup);
+	}
+
+	/*
+	 * Heuristic for when to bucket duplicates: If pivot compares equal to the
+	 * ancestor pivot, then there are likely a large number of duplicates in
+	 * this partition. In this case we "partition left", putting
+	 * elements equal to the pivot into the left partition, and greater elements
+	 * in the right partition.
+	 */
+	if (ancestor_pivot != NULL && LEADING_DATUM_CMP(ancestor_pivot, pm) == 0)
+	{
+		n_left_part = state->base.partition_left(a, n, pm);
+
+		/*
+		 * If the leading datum is authoritative, we are done. If not, we
+		 * recurse with a standard sort using the tiebreak comparator. We must
+		 * include both the current pivot and ancestor pivot.
+		 */
+		if (state->base.onlyKey == NULL)
+			qsort_tuple(ancestor_pivot, n_left_part + 1,
+						state->base.comparetup_tiebreak,
+						state);
+
+		/*
+		 * The only time this value must be correctly set to NULL is when we
+		 * enter the root partition. Setting it NULL here is an optimization:
+		 * Since all elements to the right of the current pivot are strictly
+		 * greater than it, we won't include it when we eventually
+		 * recurse to a small sort.
+		 */
+		ancestor_pivot = NULL;
+
+		a += n_left_part;
+		n -= n_left_part;
+		goto loop;
+	}
+	else
+		n_left_part = state->base.partition_right(a, n, pm);
+
+	/* WIP: Keep recursion simple for now. */
+
+	/* Recurse on left partition... */
+	qsort_tuple_datum(a, n_left_part, state, ancestor_pivot);
+
+	/* ..., then iterate on right partition to save stack space */
+	ancestor_pivot = a + n_left_part;
+	a = ancestor_pivot + 1;
+	n -= n_left_part + 1;
+	goto loop;
+}
+
+
 /*
  *		tuplesort_begin_xxx
  *
@@ -2679,6 +2829,23 @@ tuplesort_sort_memtuples(Tuplesortstate *state)
 
 	if (state->memtupcount > 1)
 	{
+		int presorted = 1;
+
+		/* one-time precheck for monotonic input */
+		for (SortTuple* pm = state->memtuples + 1;
+			 pm < state->memtuples + state->memtupcount;
+			 pm++)
+		{
+			CHECK_FOR_INTERRUPTS();
+			if (COMPARETUP(state, pm - 1, pm) > 0)
+			{
+				presorted = 0;
+				break;
+			}
+		}
+		if (presorted)
+			return;
+
 		/*
 		 * Do we have the leading column's value or abbreviation in datum1,
 		 * and is there a specialization for its comparator?
@@ -2695,9 +2862,28 @@ tuplesort_sort_memtuples(Tuplesortstate *state)
 #if SIZEOF_DATUM >= 8
 			else if (state->base.sortKeys[0].comparator == ssup_datum_signed_cmp)
 			{
-				qsort_tuple_signed(state->memtuples,
-								   state->memtupcount,
-								   state);
+				/* WIP: proof of concept for one datum type */
+				SortTuple  *pm;
+
+				if (debug_branchless_sort)
+				{
+					state->base.partition_right = part_right_datum_signed_asc;
+					state->base.partition_left = part_left_datum_signed_asc;
+					qsort_tuple_datum(state->memtuples,
+									  state->memtupcount,
+									  state, NULL);
+				}
+				else
+					qsort_tuple_signed(state->memtuples,
+									   state->memtupcount,
+									   state);
+
+				/* WIP: correctness check */
+				for (pm = state->memtuples + 1;
+					 pm < state->memtuples + state->memtupcount;
+					 pm++)
+					Assert(COMPARETUP(state, pm - 1, pm) <= 0);
+
 				return;
 			}
 #endif
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 1233e07d7d..b049db493d 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -300,6 +300,7 @@ extern PGDLLIMPORT int tcp_user_timeout;
 extern PGDLLIMPORT char *role_string;
 extern PGDLLIMPORT bool in_hot_standby_guc;
 extern PGDLLIMPORT bool trace_sort;
+extern PGDLLIMPORT bool debug_branchless_sort; /* XXX not for commit */
 
 #ifdef DEBUG_BOUNDED_SORT
 extern PGDLLIMPORT bool optimize_bounded_sort;
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index c63f1e5d6d..85a511fe7b 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -180,6 +180,10 @@ typedef struct
 	 */
 	SortTupleComparator comparetup_tiebreak;
 
+	/* Optional partition functions for single-instruction comparators */
+	size_t		(*partition_right) (SortTuple *begin, size_t n, SortTuple *pivot_pos);
+	size_t		(*partition_left) (SortTuple *begin, size_t n, SortTuple *pivot_pos);
+
 	/*
 	 * Alter datum1 representation in the SortTuple's array back from the
 	 * abbreviated key to the first column value.
-- 
2.48.1
