I wrote:

> The v1 patch
> has some optimizations, but in other ways things are simple and/or
> wasteful. Exactly how things fit together will be informed by what, if
> anything, has to be done to avoid regressions.

In v1, radix sort diverts to qsort_tuple for small partitions (similar
to how quicksort diverts to insertion sort), but qsort_tuple is
inefficient because the comparator is called via a function pointer.
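
For reference, qsort_tuple is instantiated from sort_template.h with a
runtime comparator pointer (abbreviated):

#define ST_SORT qsort_tuple
#define ST_ELEMENT_TYPE SortTuple
#define ST_COMPARE_RUNTIME_POINTER
#define ST_COMPARE_ARG_TYPE Tuplesortstate
#define ST_CHECK_FOR_INTERRUPTS
#define ST_SCOPE static
#define ST_DEFINE
#include "lib/sort_template.h"

so every comparison pays an indirect call that the specialized sorts avoid.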

I also thought having two different radix sorts was too complex, so I
wondered if it'd be better to get rid of the smaller radix sort (whose
control flow I find harder to understand, even ignoring the unsightly
goto) and have the larger sort divert to a new quicksort
specialization that compares on the conditioned datum. That allows
skipping the branches for NULL checks and sort-direction reversal.
I've done this in v2. It also makes sense to replace the three current
integer-comparison quicksorts with this single specialization.
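
To make "conditioned" concrete: condition_datum (in the patch) maps
datum1 to a Datum whose plain unsigned comparison gives the right
order. For int64 keys that's just flipping the sign bit, and DESC is
a bitwise complement:

	/* int64: flip the sign bit so unsigned order matches signed order */
	cond_datum1 = orig ^ (UINT64CONST(1) << 63);
	...
	if (ssup->ssup_reverse)
		cond_datum1 = ~cond_datum1;

e.g. -1 maps to 0x7FFF...FF and 0 maps to 0x8000...00, so -1 still
sorts first under unsigned comparison.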

v1 was careful to restore isnull1 to false when diverting to quicksort
for the tiebreak. v2 doesn't bother, since the only tiebreak in core
that looks at isnull1 is comparetup_datum_tiebreak, which is not
reachable by radix sort, since that requires a pass-by-value datum
that compares like an integer. This is a bit of a risk, since it's
possible a third-party fork could be doing something weird. That seems
unlikely, but it's something to keep in mind.

I used a standalone program (attached) to microbenchmark this new
fallback qsort vs. a single pass of radix sort on one byte, to get a
decent threshold value. This is not quite fair: the quicksort is then
completely finished, while the radix sort could still need to recurse
into the next byte(s), so these numbers could underestimate the
threshold. This is just to get an idea.

The numbers are in RDTSC ticks per element sorted.

cardinality: 256
number of elements:  100   qsort: 35.4 radix: 49.2
number of elements:  200   qsort: 34.9 radix: 38.1
number of elements:  400   qsort: 42.4 radix: 34.4
number of elements:  800   qsort: 95.0 radix: 29.2
number of elements: 1600   qsort: 115.0 radix: 22.4
number of elements: 3200   qsort: 125.5 radix: 19.4
number of elements: 6400   qsort: 128.1 radix: 17.6

With the highest cardinality possible in a single byte, radix sort is
actually not bad even at small input sizes. Notice that the time per
element consistently goes down as inputs get larger: smaller inputs
have large constant overheads, made worse by my unrolling of the
counting step.

cardinality: 2
number of elements:  100   qsort: 09.2 radix: 28.0
number of elements:  200   qsort: 09.1 radix: 19.5
number of elements:  400   qsort: 10.4 radix: 15.7
number of elements:  800   qsort: 10.1 radix: 14.5
number of elements: 1600   qsort: 10.4 radix: 13.7
number of elements: 3200   qsort: 15.8 radix: 13.6
number of elements: 6400   qsort: 22.2 radix: 13.8

This is an extreme best case for B&M quicksort, which here is
essentially O(n) -- the point at which the per-element time goes up
seems purely due to exceeding L1 cache. Radix sort needs a big input
to catch up, but it doesn't look awful, either.

cardinality: 16
number of elements:  100   qsort: 19.5 radix: 34.5
number of elements:  200   qsort: 18.7 radix: 22.6
number of elements:  400   qsort: 18.5 radix: 17.2
number of elements:  800   qsort: 25.0 radix: 14.8
number of elements: 1600   qsort: 43.8 radix: 13.8
number of elements: 3200   qsort: 51.2 radix: 13.2
number of elements: 6400   qsort: 59.0 radix: 12.8

This is still low cardinality, but it behaves more like the high-cardinality case.

I've set the threshold to 400 for now, but I'm not claiming that's the
end of the story. In addition to the underestimation mentioned above,
unrolling the counting step is a factor: unrolling makes smaller
inputs worse (and we can reach small inputs by recursing from larger
ones), but it seems important for large inputs with low cardinality
(worth a few percent, though I haven't shared those numbers yet).
We've found that a large input with only 4-5 distinct values just
barely wins with radix sort, so I'll be curious to see whether
unrolling is actually needed to prevent regressions there.

Other things to consider:

- I don't quite like how the NULL partitioning step looks, and it
could be costly when the distribution of NULLs is not predictable, so
I'm thinking of turning part of that into a branch-free cyclic
permutation (rough sketch after this list), similar to
https://www.postgresql.org/message-id/CANWCAZbAmaZ7P%2BARjS97sJLXsBB5CPZyzFgqNDiqe-L%2BBqXzug%40mail.gmail.com

- The quicksort on the NULL partition still compares isnull1 -- the
branches are predictable, but it may be worth adding a specialization
that skips them.
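
For the first point, here is a minimal sketch of the kind of
branch-free pass I have in mind (nulls-first case only; the helper
name is made up and the final shape may well differ): a Lomuto-style
loop where the swap is unconditional, so isnull1 feeds only an index
increment instead of a branch, and misplaced tuples move in cycles:

static size_t
partition_nulls_first(SortTuple *tuples, size_t n)
{
	size_t		dst = 0;

	for (size_t i = 0; i < n; i++)
	{
		SortTuple	tmp = tuples[i];

		/* unconditional swap; a no-op when i == dst */
		tuples[i] = tuples[dst];
		tuples[dst] = tmp;
		/* advance the NULL/not-NULL boundary only past NULLs */
		dst += tmp.isnull1;
	}
	/* number of NULLs, i.e. start of the not-NULL partition */
	return dst;
}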

--
John Naylor
Amazon Web Services
From 8d643a61386d10a7996f9cddb0965f18ec0fe832 Mon Sep 17 00:00:00 2001
From: John Naylor <[email protected]>
Date: Fri, 17 Oct 2025 09:57:43 +0700
Subject: [PATCH v2] Use radix sort when datum1 is an integer type

XXX regression tests don't pass for underspecified queries; this
is expected
---
 src/backend/utils/misc/guc_parameters.dat |   7 +
 src/backend/utils/sort/tuplesort.c        | 506 +++++++++++++++++++++-
 src/include/utils/guc.h                   |   1 +
 src/include/utils/tuplesort.h             |  12 +-
 4 files changed, 506 insertions(+), 20 deletions(-)

diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index d6fc8333850..f8fc6c88082 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -681,6 +681,13 @@
   boot_val => 'false',
 },
 
+{ name => 'wip_radix_sort', type => 'bool', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+  short_desc => 'Test radix sort for debugging.',
+  flags => 'GUC_NOT_IN_SAMPLE',
+  variable => 'wip_radix_sort',
+  boot_val => 'true',
+},
+
 # this is undocumented because not exposed in a standard build
 { name => 'trace_syncscan', type => 'bool', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
   short_desc => 'Generate debugging output for synchronized scanning.',
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 5d4411dc33f..ac91089ac42 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -104,6 +104,7 @@
 #include "commands/tablespace.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
+#include "port/pg_bitutils.h"
 #include "storage/shmem.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
@@ -122,6 +123,7 @@
 
 /* GUC variables */
 bool		trace_sort = false;
+bool		wip_radix_sort = true;
 
 #ifdef DEBUG_BOUNDED_SORT
 bool		optimize_bounded_sort = true;
@@ -490,6 +492,25 @@ static void tuplesort_updatemax(Tuplesortstate *state);
  * abbreviations of text or multi-key sorts.  There could be!  Is it worth it?
  */
 
+/* Used for conditioned datums, so we can ignore NULLs and sort direction. */
+static pg_attribute_always_inline int
+qsort_tuple_conditioned_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
+{
+	if (a->cond_datum1 < b->cond_datum1)
+		return -1;
+	if (a->cond_datum1 > b->cond_datum1)
+		return 1;
+
+	/*
+	 * No need to waste effort calling the tiebreak function when there are no
+	 * other keys to sort on.
+	 */
+	if (state->base.onlyKey != NULL)
+		return 0;
+
+	return state->base.comparetup_tiebreak(a, b, state);
+}
+
 /* Used if first key's comparator is ssup_datum_unsigned_cmp */
 static pg_attribute_always_inline int
 qsort_tuple_unsigned_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
@@ -567,6 +588,15 @@ qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
  * common comparison functions on pass-by-value leading datums.
  */
 
+#define ST_SORT qsort_tuple_conditioned
+#define ST_ELEMENT_TYPE SortTuple
+#define ST_COMPARE(a, b, state) qsort_tuple_conditioned_compare(a, b, state)
+#define ST_COMPARE_ARG_TYPE Tuplesortstate
+#define ST_CHECK_FOR_INTERRUPTS
+#define ST_SCOPE static
+#define ST_DEFINE
+#include "lib/sort_template.h"
+
 #define ST_SORT qsort_tuple_unsigned
 #define ST_ELEMENT_TYPE SortTuple
 #define ST_COMPARE(a, b, state) qsort_tuple_unsigned_compare(a, b, state)
@@ -615,6 +645,235 @@ qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
 #define ST_DEFINE
 #include "lib/sort_template.h"
 
+
+/*
+ * WIP: For now prefer test coverage of radix sort in Assert builds.
+ */
+#ifdef USE_ASSERT_CHECKING
+#define QSORT_THRESHOLD 0
+#else
+#define QSORT_THRESHOLD 400
+#endif
+
+typedef struct PartitionInfo
+{
+	union
+	{
+		size_t		count;
+		size_t		offset;
+	};
+	size_t		next_offset;
+}			PartitionInfo;
+
+static inline uint8_t
+extract_key(Datum key, int level)
+{
+	return (key >> (((SIZEOF_DATUM - 1) - level) * 8)) & 0xFF;
+}
+
+static inline void
+swap(SortTuple *a, SortTuple *b)
+{
+	SortTuple	tmp = *a;
+
+	*a = *b;
+	*b = tmp;
+}
+
+/*
+ * Condition datum to work with pure unsigned comparison,
+ * taking ASC/DESC into account as well.
+ */
+static inline Datum
+condition_datum(Datum orig, SortSupport ssup)
+{
+	Datum		cond_datum1;
+
+	if (ssup->comparator == ssup_datum_signed_cmp)
+	{
+		/* it was already cast to unsigned when stored */
+		cond_datum1 = orig ^ (UINT64CONST(1) << 63);
+	}
+	else if (ssup->comparator == ssup_datum_int32_cmp)
+	{
+		/*
+		 * First normalize to uint32. Technically, we don't need to do this,
+		 * but it forces the upper bytes to remain the same regardless of
+		 * sign.
+		 */
+		uint32		u32 = DatumGetUInt32(orig) ^ ((uint32) 1 << 31);
+
+		cond_datum1 = UInt32GetDatum(u32);
+	}
+	else
+	{
+		Assert(ssup->comparator == ssup_datum_unsigned_cmp);
+		cond_datum1 = orig;
+	}
+
+	if (ssup->ssup_reverse)
+		cond_datum1 = ~cond_datum1;
+
+	return cond_datum1;
+}
+
+/*
+ * Based on implementation in https://github.com/skarupke/ska_sort (Boost license),
+ * with the following changes:
+ *  - unroll loop in counting step
+ *  - count sorted partitions in every pass, rather than maintaining list of unsorted partitions
+ * TODO: match qsort API with number of elements rather than end pointer
+ */
+static void
+ska_byte_sort(SortTuple *begin,
+			  SortTuple *end, int level, Tuplesortstate *state)
+{
+	/* size_t		counts0[256] = {0}; */
+	size_t		counts1[256] = {0};
+	size_t		counts2[256] = {0};
+	size_t		counts3[256] = {0};
+	PartitionInfo partitions[256] = {0};
+	uint8_t		remaining_partitions[256] = {0};
+	size_t		total = 0;
+	int			num_partitions = 0;
+	int			num_remaining;
+	SortTuple  *ctup;
+
+	/* count key chunks, unrolled for speed */
+
+	for (ctup = begin; ctup + 4 < end; ctup += 4)
+	{
+		uint8		key_chunk0 = extract_key((ctup + 0)->cond_datum1, level);
+		uint8		key_chunk1 = extract_key((ctup + 1)->cond_datum1, level);
+		uint8		key_chunk2 = extract_key((ctup + 2)->cond_datum1, level);
+		uint8		key_chunk3 = extract_key((ctup + 3)->cond_datum1, level);
+
+		partitions[key_chunk0].count++;
+		counts1[key_chunk1]++;
+		counts2[key_chunk2]++;
+		counts3[key_chunk3]++;
+
+	}
+
+	for (size_t i = 0; i < 256; i++)
+		partitions[i].count += counts1[i] + counts2[i] + counts3[i];
+
+	for (; ctup < end; ctup++)
+	{
+		uint8		key_chunk;
+
+		key_chunk = extract_key(ctup->cond_datum1, level);
+		partitions[key_chunk].count++;
+	}
+
+	/* compute partition offsets */
+	for (int i = 0; i < 256; ++i)
+	{
+		size_t		count = partitions[i].count;
+
+		if (count)
+		{
+			partitions[i].offset = total;
+			total += count;
+			remaining_partitions[num_partitions] = i;
+			++num_partitions;
+		}
+		partitions[i].next_offset = total;
+	}
+
+	num_remaining = num_partitions;
+
+	/*
+	 * Permute tuples to correct partition. If we started with one partition,
+	 * there is nothing to do. If a permutation from a previous iteration
+	 * results in a single partition that hasn't been marked as sorted, we
+	 * know it's actually sorted.
+	 */
+	while (num_remaining > 1)
+	{
+		/*
+		 * We can only exit the loop when all partitions are sorted, so must
+		 * reset every iteration
+		 */
+		num_remaining = num_partitions;
+
+		for (int i = 0; i < num_partitions; i++)
+		{
+			uint8		idx = remaining_partitions[i];
+
+			PartitionInfo part = partitions[idx];
+
+			for (SortTuple *st = begin + part.offset;
+				 st < begin + part.next_offset;
+				 st++)
+			{
+				uint8		this_partition = extract_key(st->cond_datum1, level);
+				size_t		offset = partitions[this_partition].offset++;
+
+				Assert(begin + offset < end);
+				swap(st, begin + offset);
+			}
+
+			if (part.offset == part.next_offset)
+			{
+				/* partition is sorted; skip */
+				num_remaining--;
+			}
+		}
+	}
+
+	{
+		size_t		start_offset = 0;
+		SortTuple  *partition_begin = begin;
+
+		for (uint8_t *it = remaining_partitions, *stop = remaining_partitions + num_partitions;
+			 it != stop;
+			 ++it)
+		{
+			size_t		end_offset = partitions[*it].next_offset;
+			SortTuple  *partition_end = begin + end_offset;
+			ptrdiff_t	num_elements = end_offset - start_offset;
+
+			if (num_elements > 1)
+			{
+				if (level < SIZEOF_DATUM - 1)
+				{
+					if (num_elements < QSORT_THRESHOLD)
+					{
+						qsort_tuple_conditioned(partition_begin,
+												num_elements,
+												state);
+					}
+					else
+					{
+						ska_byte_sort(partition_begin,
+									  partition_end,
+									  level + 1,
+									  state);
+					}
+				}
+				else if (state->base.onlyKey == NULL)
+				{
+					/*
+					 * Finished radix sort on all bytes of cond_datum1
+					 * (possibly abbreviated), now qsort with tiebreak
+					 * comparator.
+					 * XXX comparetup_tiebreak cannot inspect isnull1. In core,
+					 * the only tiebreak that does so is comparetup_datum_tiebreak,
+					 * but we wouldn't have gotten here in that case.
+					 */
+					qsort_tuple(partition_begin,
+								num_elements,
+								state->base.comparetup_tiebreak,
+								state);
+				}
+			}
+			start_offset = end_offset;
+			partition_begin = partition_end;
+		}
+	}
+}
+
 /*
  *		tuplesort_begin_xxx
  *
@@ -2663,8 +2922,203 @@ sort_bounded_heap(Tuplesortstate *state)
 	state->boundUsed = true;
 }
 
+/* WIP: allow turning common prefix skipping off for testing */
+#define COMMON_PREFIX
+
+/*
+ * Compute conditioned datums for SortTuples so that a single
+ * unsigned comparison resolves the sort order up to comparetup_*_tiebreak.
+ * Then dispatch to either radix sort or a specialized qsort.
+ */
+static void
+sort_tuple_conditioned(Tuplesortstate *state)
+{
+	SortSupportData ssup = state->base.sortKeys[0];
+
+	bool		nulls_first = ssup.ssup_nulls_first;
+	SortTuple  *first = state->memtuples;
+	SortTuple  *last = state->memtuples + state->memtupcount;
+	SortTuple  *not_null_start;
+	size_t		d1,
+				d2,
+				not_null_count;
+#ifdef COMMON_PREFIX
+	Datum		first_datum = 0;
+	Datum		common_upper_bits = 0;
+#endif
+	int			common_prefix;
+
+	/*
+	 * Partition by isnull1, since we can only radix sort on non-NULL
+	 * elements.
+	 */
+
+	/*
+	 * Find the leftmost NOT NULL tuple if NULLS FIRST, or leftmost NULL
+	 * element if NULLS LAST.
+	 */
+	while (first < last && first->isnull1 == nulls_first)
+		first++;
+
+	/*
+	 * XXX We must start "last" after the final tuple to maintain the
+	 * invariant that it ends up one after the first partition, and the first
+	 * partition may correspond to the entire array. If "first" hasn't gotten
+	 * this far, we need to pre-decrement "last" before beginning its loop.
+	 */
+	if (first < last)
+		last--;
+
+	/*
+	 * Find the rightmost NULL tuple if NULLS FIRST, or rightmost NOT NULL
+	 * tuple if NULLS LAST.
+	 */
+	while (first < last && last->isnull1 != nulls_first)
+		last--;
+
+	/* swap pairs of tuples that are in the wrong order */
+	while (first < last)
+	{
+		swap(first, last);
+		while (first < last && first->isnull1 == nulls_first)
+			first++;
+		while (first < last && last->isnull1 != nulls_first)
+			last--;
+	}
+
+	d1 = last - state->memtuples;
+	d2 = state->memtupcount - d1;
+
+	Assert(last == first);
+	Assert(last + d2 == state->memtuples + state->memtupcount);
+	for (SortTuple *pm = state->memtuples;
+		 pm < state->memtuples + d1;
+		 pm++)
+		Assert(pm->isnull1 == nulls_first);
+	for (SortTuple *pm = last;
+		 pm < last + d2;
+		 pm++)
+		Assert(pm->isnull1 != nulls_first);
+
+	/*
+	 * Sort null partition using tiebreak comparator. XXX this will repeat the
+	 * NULL check for abbreviated keys.
+	 */
+	if (nulls_first)
+	{
+		qsort_tuple(state->memtuples,
+					d1,
+					state->base.comparetup_tiebreak,
+					state);
+		not_null_start = last;
+		not_null_count = d2;
+	}
+	else
+	{
+		qsort_tuple(last,
+					d2,
+					state->base.comparetup_tiebreak,
+					state);
+		not_null_start = state->memtuples;
+		not_null_count = d1;
+	}
+
+	/*
+	 * Condition datum so that unsigned comparison is order-preserving, and
+	 * compute the common prefix to skip unproductive recursion steps during
+	 * radix sort.
+	 */
+	for (SortTuple *tup = not_null_start;
+		 tup < not_null_start + not_null_count;
+		 tup++)
+	{
+		Datum		cond_datum1 = condition_datum(tup->datum1, &ssup);
+#ifdef COMMON_PREFIX
+		if (tup == not_null_start)
+		{
+			/* Need to start with some value, may as well be the first one. */
+			first_datum = cond_datum1;
+		}
+		else
+		{
+			Datum		this_common_bits;
+
+			/* The bits in common will be zero */
+			this_common_bits = first_datum ^ cond_datum1;
+
+			/*
+			 * We're really only interested in the case where the rightmost
+			 * one bit is further right, but this branch should be rare enough
+			 * not to waste cycles trying harder.
+			 */
+			if (this_common_bits > common_upper_bits)
+				common_upper_bits = this_common_bits;
+		}
+#endif
+		tup->cond_datum1 = cond_datum1;
+	}
+
+	if (not_null_count < QSORT_THRESHOLD)
+		qsort_tuple_conditioned(not_null_start,
+								not_null_count,
+								state);
+	else
+	{
+
+		/*
+		 * The upper bits are zero where all values are the same, if any. Turn
+		 * the byte position of the rightmost one bit into the byte where
+		 * radix sort should start bucketing. OR-ing in the lowest bit guards
+		 * against undefined behavior without changing the result.
+		 */
+#ifdef COMMON_PREFIX
+		common_prefix = sizeof(Datum) - 1 -
+			(pg_leftmost_one_pos64(common_upper_bits | 1) / BITS_PER_BYTE);
+#else
+		common_prefix = 0;
+#endif
+		/* perform the radix sort on the not-NULL partition */
+		ska_byte_sort(not_null_start,
+					  not_null_start + not_null_count,
+					  common_prefix,
+					  state);
+	}
+
+	/*
+	 * Restore fields that were overwritten with temporary conditioned datum1
+	 */
+	for (SortTuple *tup = not_null_start;
+		 tup < not_null_start + not_null_count;
+		 tup++)
+	{
+		/* need to restore NOT NULL */
+		tup->isnull1 = false;
+		/* be tidy */
+		tup->srctape = 0;
+	}
+}
+
+/* Verify sort using standard comparator. */
+static void
+check_sorted(Tuplesortstate *state)
+{
+#ifdef USE_ASSERT_CHECKING
+	for (SortTuple *pm = state->memtuples + 1;
+		 pm < state->memtuples + state->memtupcount;
+		 pm++)
+	{
+#if 0
+		Assert(COMPARETUP(state, pm - 1, pm) <= 0);
+#else
+		if (COMPARETUP(state, pm - 1, pm) > 0)
+			elog(ERROR, "SORT FAILED");
+#endif
+	}
+#endif
+}
+
 /*
- * Sort all memtuples using specialized qsort() routines.
+ * Sort all memtuples using specialized routines.
  *
  * Quicksort is used for small in-memory sorts, and external sort runs.
  */
@@ -2681,26 +3135,42 @@ tuplesort_sort_memtuples(Tuplesortstate *state)
 		 */
 		if (state->base.haveDatum1 && state->base.sortKeys)
 		{
-			if (state->base.sortKeys[0].comparator == ssup_datum_unsigned_cmp)
-			{
-				qsort_tuple_unsigned(state->memtuples,
-									 state->memtupcount,
-									 state);
-				return;
-			}
-			else if (state->base.sortKeys[0].comparator == ssup_datum_signed_cmp)
+			SortSupportData ssup = state->base.sortKeys[0];
+
+			if (wip_radix_sort)
 			{
-				qsort_tuple_signed(state->memtuples,
-								   state->memtupcount,
-								   state);
-				return;
+				if ((ssup.comparator == ssup_datum_unsigned_cmp ||
+					 ssup.comparator == ssup_datum_signed_cmp ||
+					 ssup.comparator == ssup_datum_int32_cmp))
+				{
+					sort_tuple_conditioned(state);
+					check_sorted(state);
+					return;
+				}
 			}
-			else if (state->base.sortKeys[0].comparator == ssup_datum_int32_cmp)
+			else
 			{
-				qsort_tuple_int32(state->memtuples,
-								  state->memtupcount,
-								  state);
-				return;
+				if (state->base.sortKeys[0].comparator == ssup_datum_unsigned_cmp)
+				{
+					qsort_tuple_unsigned(state->memtuples,
+										 state->memtupcount,
+										 state);
+					return;
+				}
+				else if (state->base.sortKeys[0].comparator == ssup_datum_signed_cmp)
+				{
+					qsort_tuple_signed(state->memtuples,
+									   state->memtupcount,
+									   state);
+					return;
+				}
+				else if (state->base.sortKeys[0].comparator == ssup_datum_int32_cmp)
+				{
+					qsort_tuple_int32(state->memtuples,
+									  state->memtupcount,
+									  state);
+					return;
+				}
 			}
 		}
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index f21ec37da89..bc6f7fa60f3 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -324,6 +324,7 @@ extern PGDLLIMPORT int tcp_user_timeout;
 extern PGDLLIMPORT char *role_string;
 extern PGDLLIMPORT bool in_hot_standby_guc;
 extern PGDLLIMPORT bool trace_sort;
+extern PGDLLIMPORT bool wip_radix_sort;
 
 #ifdef DEBUG_BOUNDED_SORT
 extern PGDLLIMPORT bool optimize_bounded_sort;
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..b2ecbbc9e51 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -149,8 +149,16 @@ typedef struct
 {
 	void	   *tuple;			/* the tuple itself */
 	Datum		datum1;			/* value of first key column */
-	bool		isnull1;		/* is first key column NULL? */
-	int			srctape;		/* source tape number */
+
+	union
+	{
+		struct
+		{
+			bool		isnull1;		/* is first key column NULL? */
+			int			srctape;		/* source tape number */
+		};
+		Datum		cond_datum1;		/* sort key for radix sort */
+	};
 } SortTuple;
 
 typedef int (*SortTupleComparator) (const SortTuple *a, const SortTuple *b,
-- 
2.51.0
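
/*
 * Attached standalone microbenchmark: the specialized qsort vs. one
 * radix-sort pass on a single key byte.  Build sketch (assumes a
 * PostgreSQL source tree to supply lib/sort_template.h):
 *
 *   gcc -O2 -I <postgres>/src/include <this file>.c
 */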

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#include <x86intrin.h>

/*#define DEBUG1*/

typedef uint8_t uint8;
typedef uint64_t uint64;
typedef uint64_t Datum;
typedef int64_t int64;
#define SIZEOF_DATUM 8
#define lengthof(array) (sizeof (array) / sizeof ((array)[0]))
#define Assert(x)
#define CppConcat(x, y)                 x##y
#define CHECK_FOR_INTERRUPTS()
#define Min(x, y)		((x) < (y) ? (x) : (y))

#if defined(__GNUC__)
#define pg_noinline __attribute__((noinline))
/* msvc via declspec */
#elif defined(_MSC_VER)
#define pg_noinline __declspec(noinline)
#else
#define pg_noinline
#endif

#if defined(__GNUC__) && defined(__OPTIMIZE__)
/* GCC supports always_inline via __attribute__ */
#define pg_attribute_always_inline __attribute__((always_inline)) inline
#elif defined(_MSC_VER)
/* MSVC has a special keyword for this */
#define pg_attribute_always_inline __forceinline
#else
/* Otherwise, the best we can do is to say "inline" */
#define pg_attribute_always_inline inline
#endif


typedef struct
{
	void	   *tuple;			/* the tuple itself */
	Datum		datum1;			/* value of first key column */
	union
	{
		struct
		{
			bool		isnull1;		/* is first key column NULL? */
			int			srctape;		/* source tape number */
		};
		Datum		cond_datum1;
	};
} SortTuple;


/* Used for conditioned datums, so we can ignore NULLs and sort direction. */
static pg_attribute_always_inline int
qsort_tuple_conditioned_compare(SortTuple *a, SortTuple *b)
{
	if (a->cond_datum1 < b->cond_datum1)
		return -1;
	if (a->cond_datum1 > b->cond_datum1)
		return 1;

	/*
	 * Standalone version: there are no other keys to sort on, so the
	 * tiebreak call from the real comparator is omitted here.
	 */
	return 0;
}

#define ST_SORT qsort_tuple_conditioned
#define ST_ELEMENT_TYPE SortTuple
#define ST_COMPARE(a, b) qsort_tuple_conditioned_compare(a, b)
#define ST_CHECK_FOR_INTERRUPTS
#define ST_SCOPE static
#define ST_DEFINE
#include "lib/sort_template.h"


typedef struct PartitionInfo
{
    union
    {
        size_t count;
        size_t offset;
    };
    size_t next_offset;
} PartitionInfo;

static inline uint8_t
extract_key(Datum key, int level)
{
	return (key >> (((SIZEOF_DATUM - 1) - level) * 8)) & 0xFF;
}

static inline void
swap(SortTuple *a, SortTuple *b)
{
	SortTuple tmp = *a;

	*a = *b;
	*b = tmp;
}


static pg_noinline void
ska_byte_sort(SortTuple *begin,
			  SortTuple *end, int level)
{
	/* size_t		counts0[256] = {0}; */
	size_t		counts1[256] = {0};
	size_t		counts2[256] = {0};
	size_t		counts3[256] = {0};
	PartitionInfo partitions[256] = {0};
	uint8_t		remaining_partitions[256] = {0};
	size_t		total = 0;
	int			num_partitions = 0;
	int			num_remaining;
	SortTuple  *ctup;

	/* count key chunks, unrolled for speed */

	for (ctup = begin; ctup + 4 < end; ctup += 4)
	{
		uint8		key_chunk0 = extract_key((ctup + 0)->cond_datum1, level);
		uint8		key_chunk1 = extract_key((ctup + 1)->cond_datum1, level);
		uint8		key_chunk2 = extract_key((ctup + 2)->cond_datum1, level);
		uint8		key_chunk3 = extract_key((ctup + 3)->cond_datum1, level);

		partitions[key_chunk0].count++;
		counts1[key_chunk1]++;
		counts2[key_chunk2]++;
		counts3[key_chunk3]++;

	}

	for (size_t i = 0; i < 256; i++)
		partitions[i].count += counts1[i] + counts2[i] + counts3[i];

	for (; ctup < end; ctup++)
	{
		uint8		key_chunk;

		key_chunk = extract_key(ctup->cond_datum1, level);
		partitions[key_chunk].count++;
	}

	/* compute partition offsets */
	for (int i = 0; i < 256; ++i)
	{
		size_t		count = partitions[i].count;

		if (count)
		{
			partitions[i].offset = total;
			total += count;
			remaining_partitions[num_partitions] = i;
			++num_partitions;
		}
		partitions[i].next_offset = total;
	}

	num_remaining = num_partitions;

	/*
	 * Permute tuples to correct partition. If we started with one partition,
	 * there is nothing to do. If a permutation from a previous iteration
	 * results in a single partition that hasn't been marked as sorted, we
	 * know it's actually sorted.
	 */
	while (num_remaining > 1)
	{
		/*
		 * We can only exit the loop when all partitions are sorted, so must
		 * reset every iteration
		 */
		num_remaining = num_partitions;

		for (int i = 0; i < num_partitions; i++)
		{
			uint8		idx = remaining_partitions[i];

			PartitionInfo part = partitions[idx];

			for (SortTuple *st = begin + part.offset;
				 st < begin + part.next_offset;
				 st++)
			{
				uint8		this_partition = extract_key(st->cond_datum1, level);
				size_t		offset = partitions[this_partition].offset++;

				Assert(begin + offset < end);
				swap(st, begin + offset);
			}

			if (part.offset == part.next_offset)
			{
				/* partition is sorted; skip */
				num_remaining--;
			}
		}
	}
	/* no recursion */
}



int
main(void)
{
	uint64_t start;
	uint64_t finish;
	double qticks;
	double rticks;
#define COUNT 8000
	int lengths[] = { 100, 200, 400, 800, 1600, 3200, 6400 };
	SortTuple *st = malloc(COUNT * sizeof(SortTuple));
	SortTuple *test_radix = malloc(COUNT * sizeof(SortTuple));
	SortTuple *test_qsort = malloc(COUNT * sizeof(SortTuple));

// 256 or less so that all entropy is in a single byte
#define CARDINALITY 256
	printf("cardinality: %d\n", CARDINALITY);
	for (int i=0; i<COUNT; i++)
	{
		// only lowest byte is populated
		int64 val = random() % CARDINALITY;
		SortTuple x = { .cond_datum1 = val };
		st[i] = x;
	}

	for (int j=0; j< lengthof(lengths); j++)
	{
		int len = lengths[j];
		qticks = rticks = 0;

		printf("number of elements: %4d   ", len);

#define NUM_MEASUREMENTS 1000000
		for (int k=0; k<NUM_MEASUREMENTS; k++)
		{
			// repopulate test
			memcpy(test_qsort, st, len * sizeof(SortTuple));

			start = __rdtsc();
			// only sort lowest byte
			qsort_tuple_conditioned(test_qsort, len);
			finish = __rdtsc();
			qticks += finish - start;

			memcpy(test_radix, st, len * sizeof(SortTuple));

#ifdef DEBUG1
			printf("before:\n");
			for (int i=0; i<len; i++)
				printf("%ld\n", test_radix[i].cond_datum1);
#endif

			start = __rdtsc();
			// only sort lowest byte
			ska_byte_sort(test_radix, test_radix + len, 7);
			finish = __rdtsc();
			rticks += finish - start;
		}

		printf("qsort: %04.1f radix: %04.1f\n", qticks / NUM_MEASUREMENTS / len,
											rticks / NUM_MEASUREMENTS / len);

#ifdef DEBUG1
		printf("after:\n");
		for (int i=0; i<len; i++)
			printf("%llu\n", (unsigned long long) test_radix[i].cond_datum1);
#endif
	}

}
