[HACKERS] Reusing abbreviated keys during second pass of ordered [set] aggregates

Peter Geoghegan Fri, 10 Jul 2015 15:06:08 -0700

Currently, there are certain aggregates that sort some tuples before
making a second pass over their memtuples array (through the tuplesort
"gettuple" interface), calling one or more attribute equality operator
functions as they go. For example, this occurs during the execution of
the following queries:


-- Salary is a numeric column:
SELECT count(DISTINCT salary) AS distinct_compensation_levels FROM employees;
-- Get most common distinct salary in organization:
SELECT mode() WITHIN GROUP (ORDER BY salary) AS
most_common_compensation FROM employees;

Since these examples involve numeric comparisons, the equality
operator calls involved in that second pass are expensive. There is no
equality fast-path for numeric equality as there is for text; I
imagine that in practice most calls to texteq() are resolved based on
raw datum size mismatch, which is dirt cheap (while the alternative, a
memcmp(), is still very cheap). Furthermore, numeric abbreviated keys
have a good chance of capturing the entropy of the original values
more effectively than we might expect with a text attribute. That
means that in the case of numeric, even sorted, adjacent pairs of
abbreviated keys have a pretty good chance of being distinct in the
real world (if their corresponding authoritative values are themselves
distinct).

I think we can avoid this expense much of the time.

Patch, Performance
===============

Attached patch makes several cases like this try and get away with
only using the pre-existing abbreviated keys from the sort (to
determine inequality). On my laptop, it removes 15% - 25% of the run
time of cases similar to the above, with a high cardinality numeric
attribute of uniformly distributed values. As usual with my work on
sorting, multiple concurrent sorts by multiple backends are helped
most; that puts the improvements for realistic cases involving a text
attribute at about 5% (the texteq() memcmp() is cheap, but not free)
with 4 clients using pgbench.

The numeric cases above are now at about the same speed as similar
queries that use a double precision floating point number attribute.
9.5-era testing shows that sort-heavy numeric cases are often about as
fast as "equivalent" double cases, so it seems worthwhile to mostly
eliminate this additional numeric overhead that cases like the above
still incur.

I imagine citext would benefit much more here once it has abbreviation
support (which should be a TODO item).

Omissions
--------------

I haven't attempted to accelerate AGG_SORTED strategy aggregates,
which looked like it would require more invasive changes. It seemed
like more trouble than it was worth, given that crossing group
boundaries is something that won't occur very often with typical
usage, and given the fact that text is naturally pretty fast there
anyway (who groups by a numeric attribute?). Similarly, I did not
teach setop_retrieve_direct() to consider the possibility that a set
operation (like a UNION) could reuse abbreviated keys if it happens to
be fed by a sort node. Finally, I did not accelerate the merging of
spools during B-Tree index builds, even though that actually seems
quite possible -- that case just seemed marginal. We currently have no
test coverage for that case [1], so I guess its performance can't be
too important.

SortSupport contract
================

As explained in the first patch's commit message, I revised the
contract that SortSupport has with opclasses. Should this proposal be
accepted, we should backpatch the first patch to 9.5, to prevent
anyone from building abbreviation support that won't work on 9.6 (even
if that is very unlikely). In general, it seems pretty silly to me to
not make use of the abbreviated keys that the sort already went to the
trouble of building, and it seems like the revision to the SortSupport
contract is not likely to notably limit any interesting cases in the
future, so I think that this will be uncontroversial.

[1] 
http://www.postgresql.org/message-id/flat/cam3swztvetvdysxee7py-8ar+-+xrapkjcze-fh_v+zwxmu...@mail.gmail.com#cam3swztvetvdysxee7py-8ar+-+xrapkjcze-fh_v+zwxmu...@mail.gmail.com
-- 
Peter Geoghegan

From 7ae66ebcda9de24b4b148ad90630d89a401feb68 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Mon, 6 Jul 2015 13:37:26 -0700
Subject: [PATCH 2/2] Reuse abbreviated keys in ordered [set] aggregates

When processing ordered aggregates following a sort that could make use
of the abbreviated key optimization, only call the equality operator to
compare successive pairs of tuples when their abbreviated keys were not
equal.  Only strict abbreviated key binary inequality is considered,
which is safe.
---
 src/backend/catalog/index.c            |  2 +-
 src/backend/executor/nodeAgg.c         | 20 ++++++++++++----
 src/backend/executor/nodeSort.c        |  2 +-
 src/backend/utils/adt/orderedsetaggs.c | 33 ++++++++++++++++++--------
 src/backend/utils/sort/tuplesort.c     | 42 ++++++++++++++++++++++++++++------
 src/include/utils/tuplesort.h          |  4 ++--
 6 files changed, 77 insertions(+), 26 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 4246554..52d6f19 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2982,7 +2982,7 @@ validate_index_heapscan(Relation heapRelation,
 			}
 
 			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  &ts_val, &ts_isnull);
+												  &ts_val, &ts_isnull, NULL);
 			Assert(tuplesort_empty || !ts_isnull);
 			indexcursor = (ItemPointer) DatumGetPointer(ts_val);
 		}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2bf48c5..9b932c4 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -476,7 +476,8 @@ fetch_input_tuple(AggState *aggstate)
 
 	if (aggstate->sort_in)
 	{
-		if (!tuplesort_gettupleslot(aggstate->sort_in, true, aggstate->sort_slot))
+		if (!tuplesort_gettupleslot(aggstate->sort_in, true,
+									aggstate->sort_slot, NULL))
 			return NULL;
 		slot = aggstate->sort_slot;
 	}
@@ -831,8 +832,8 @@ advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
  * The one-input case is handled separately from the multi-input case
  * for performance reasons: for single by-value inputs, such as the
  * common case of count(distinct id), the tuplesort_getdatum code path
- * is around 300% faster.  (The speedup for by-reference types is less
- * but still noticeable.)
+ * is around 300% faster.  (The speedup for by-reference types without
+ * abbreviated key support is less but still noticeable.)
  *
  * This function handles only one grouping set (already set in
  * aggstate->current_set).
@@ -850,6 +851,8 @@ process_ordered_aggregate_single(AggState *aggstate,
 	MemoryContext workcontext = aggstate->tmpcontext->ecxt_per_tuple_memory;
 	MemoryContext oldContext;
 	bool		isDistinct = (peraggstate->numDistinctCols > 0);
+	Datum		newAbbrevVal = (Datum) 0;
+	Datum		oldAbbrevVal = (Datum) 0;
 	FunctionCallInfo fcinfo = &peraggstate->transfn_fcinfo;
 	Datum	   *newVal;
 	bool	   *isNull;
@@ -869,7 +872,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 	 */
 
 	while (tuplesort_getdatum(peraggstate->sortstates[aggstate->current_set],
-							  true, newVal, isNull))
+							  true, newVal, isNull, &newAbbrevVal))
 	{
 		/*
 		 * Clear and select the working context for evaluation of the equality
@@ -887,6 +890,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 			haveOldVal &&
 			((oldIsNull && *isNull) ||
 			 (!oldIsNull && !*isNull &&
+			  oldAbbrevVal == newAbbrevVal &&
 			  DatumGetBool(FunctionCall2(&peraggstate->equalfns[0],
 										 oldVal, *newVal)))))
 		{
@@ -902,6 +906,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 				pfree(DatumGetPointer(oldVal));
 			/* and remember the new one for subsequent equality checks */
 			oldVal = *newVal;
+			oldAbbrevVal = newAbbrevVal;
 			oldIsNull = *isNull;
 			haveOldVal = true;
 		}
@@ -939,6 +944,8 @@ process_ordered_aggregate_multi(AggState *aggstate,
 	TupleTableSlot *slot2 = peraggstate->uniqslot;
 	int			numTransInputs = peraggstate->numTransInputs;
 	int			numDistinctCols = peraggstate->numDistinctCols;
+	Datum		newAbbrevVal = (Datum) 0;
+	Datum		oldAbbrevVal = (Datum) 0;
 	bool		haveOldValue = false;
 	int			i;
 
@@ -949,7 +956,7 @@ process_ordered_aggregate_multi(AggState *aggstate,
 		ExecClearTuple(slot2);
 
 	while (tuplesort_gettupleslot(peraggstate->sortstates[aggstate->current_set],
-								  true, slot1))
+								  true, slot1, &newAbbrevVal))
 	{
 		/*
 		 * Extract the first numTransInputs columns as datums to pass to the
@@ -960,6 +967,7 @@ process_ordered_aggregate_multi(AggState *aggstate,
 
 		if (numDistinctCols == 0 ||
 			!haveOldValue ||
+			newAbbrevVal != oldAbbrevVal ||
 			!execTuplesMatch(slot1, slot2,
 							 numDistinctCols,
 							 peraggstate->sortColIdx,
@@ -983,6 +991,8 @@ process_ordered_aggregate_multi(AggState *aggstate,
 
 				slot2 = slot1;
 				slot1 = tmpslot;
+				/* avoid execTuplesMatch() calls by reusing abbreviated keys */
+				oldAbbrevVal = newAbbrevVal;
 				haveOldValue = true;
 			}
 		}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index af1dccf..35dd993 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -137,7 +137,7 @@ ExecSort(SortState *node)
 	slot = node->ss.ps.ps_ResultTupleSlot;
 	(void) tuplesort_gettupleslot(tuplesortstate,
 								  ScanDirectionIsForward(dir),
-								  slot);
+								  slot, NULL);
 	return slot;
 }
 
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 39ed85b..1a5f696 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -453,7 +453,7 @@ percentile_disc_final(PG_FUNCTION_ARGS)
 			elog(ERROR, "missing row in percentile_disc");
 	}
 
-	if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull))
+	if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull, NULL))
 		elog(ERROR, "missing row in percentile_disc");
 
 	/*
@@ -553,7 +553,7 @@ percentile_cont_final_common(FunctionCallInfo fcinfo,
 	if (!tuplesort_skiptuples(osastate->sortstate, first_row, true))
 		elog(ERROR, "missing row in percentile_cont");
 
-	if (!tuplesort_getdatum(osastate->sortstate, true, &first_val, &isnull))
+	if (!tuplesort_getdatum(osastate->sortstate, true, &first_val, &isnull, NULL))
 		elog(ERROR, "missing row in percentile_cont");
 	if (isnull)
 		PG_RETURN_NULL();
@@ -564,7 +564,7 @@ percentile_cont_final_common(FunctionCallInfo fcinfo,
 	}
 	else
 	{
-		if (!tuplesort_getdatum(osastate->sortstate, true, &second_val, &isnull))
+		if (!tuplesort_getdatum(osastate->sortstate, true, &second_val, &isnull, NULL))
 			elog(ERROR, "missing row in percentile_cont");
 
 		if (isnull)
@@ -792,7 +792,7 @@ percentile_disc_multi_final(PG_FUNCTION_ARGS)
 				if (!tuplesort_skiptuples(osastate->sortstate, target_row - rownum - 1, true))
 					elog(ERROR, "missing row in percentile_disc");
 
-				if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull))
+				if (!tuplesort_getdatum(osastate->sortstate, true, &val, &isnull, NULL))
 					elog(ERROR, "missing row in percentile_disc");
 
 				rownum = target_row;
@@ -921,7 +921,8 @@ percentile_cont_multi_final_common(FunctionCallInfo fcinfo,
 				if (!tuplesort_skiptuples(osastate->sortstate, first_row - rownum - 1, true))
 					elog(ERROR, "missing row in percentile_cont");
 
-				if (!tuplesort_getdatum(osastate->sortstate, true, &first_val, &isnull) || isnull)
+				if (!tuplesort_getdatum(osastate->sortstate, true, &first_val,
+										&isnull, NULL) || isnull)
 					elog(ERROR, "missing row in percentile_cont");
 
 				rownum = first_row;
@@ -941,7 +942,8 @@ percentile_cont_multi_final_common(FunctionCallInfo fcinfo,
 			/* Fetch second_row if needed */
 			if (second_row > rownum)
 			{
-				if (!tuplesort_getdatum(osastate->sortstate, true, &second_val, &isnull) || isnull)
+				if (!tuplesort_getdatum(osastate->sortstate, true, &second_val,
+										&isnull, NULL) || isnull)
 					elog(ERROR, "missing row in percentile_cont");
 				rownum++;
 			}
@@ -1016,6 +1018,8 @@ mode_final(PG_FUNCTION_ARGS)
 	int64		last_val_freq = 0;
 	bool		last_val_is_mode = false;
 	FmgrInfo   *equalfn;
+	Datum		abbrev_val = (Datum) 0;
+	Datum		last_abbrev_val = (Datum) 0;
 	bool		shouldfree;
 
 	Assert(AggCheckCallContext(fcinfo, NULL) == AGG_CONTEXT_AGGREGATE);
@@ -1042,7 +1046,7 @@ mode_final(PG_FUNCTION_ARGS)
 	tuplesort_performsort(osastate->sortstate);
 
 	/* Scan tuples and count frequencies */
-	while (tuplesort_getdatum(osastate->sortstate, true, &val, &isnull))
+	while (tuplesort_getdatum(osastate->sortstate, true, &val, &isnull, &abbrev_val))
 	{
 		/* we don't expect any nulls, but ignore them if found */
 		if (isnull)
@@ -1054,8 +1058,10 @@ mode_final(PG_FUNCTION_ARGS)
 			mode_val = last_val = val;
 			mode_freq = last_val_freq = 1;
 			last_val_is_mode = true;
+			last_abbrev_val = abbrev_val;
 		}
-		else if (DatumGetBool(FunctionCall2(equalfn, val, last_val)))
+		else if (abbrev_val == last_abbrev_val &&
+				 DatumGetBool(FunctionCall2(equalfn, val, last_val)))
 		{
 			/* value equal to previous value, count it */
 			if (last_val_is_mode)
@@ -1078,6 +1084,8 @@ mode_final(PG_FUNCTION_ARGS)
 			if (shouldfree && !last_val_is_mode)
 				pfree(DatumGetPointer(last_val));
 			last_val = val;
+			/* avoid equality function calls by reusing abbreviated keys */
+			last_abbrev_val = abbrev_val;
 			last_val_freq = 1;
 			last_val_is_mode = false;
 		}
@@ -1181,7 +1189,7 @@ hypothetical_rank_common(FunctionCallInfo fcinfo, int flag,
 	tuplesort_performsort(osastate->sortstate);
 
 	/* iterate till we find the hypothetical row */
-	while (tuplesort_gettupleslot(osastate->sortstate, true, slot))
+	while (tuplesort_gettupleslot(osastate->sortstate, true, slot, NULL))
 	{
 		bool		isnull;
 		Datum		d = slot_getattr(slot, nargs + 1, &isnull);
@@ -1266,6 +1274,8 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 	int64		duplicate_count = 0;
 	OSAPerGroupState *osastate;
 	int			numDistinctCols;
+	Datum		abbrevVal = (Datum) 0;
+	Datum		abbrevOld = (Datum) 0;
 	AttrNumber *sortColIdx;
 	FmgrInfo   *equalfns;
 	TupleTableSlot *slot;
@@ -1342,7 +1352,7 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 	slot2 = extraslot;
 
 	/* iterate till we find the hypothetical row */
-	while (tuplesort_gettupleslot(osastate->sortstate, true, slot))
+	while (tuplesort_gettupleslot(osastate->sortstate, true, slot, &abbrevVal))
 	{
 		bool		isnull;
 		Datum		d = slot_getattr(slot, nargs + 1, &isnull);
@@ -1353,6 +1363,7 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 
 		/* count non-distinct tuples */
 		if (!TupIsNull(slot2) &&
+			abbrevVal == abbrevOld &&
 			execTuplesMatch(slot, slot2,
 							numDistinctCols,
 							sortColIdx,
@@ -1363,6 +1374,8 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
 		tmpslot = slot2;
 		slot2 = slot;
 		slot = tmpslot;
+		/* avoid execTuplesMatch() calls by reusing abbreviated keys */
+		abbrevOld = abbrevVal;
 
 		rank++;
 
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 435041a..0c1629d 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1265,7 +1265,8 @@ tuplesort_putindextuplevalues(Tuplesortstate *state, Relation rel,
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to "void" representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup.datum1 = original;
 	}
@@ -1327,7 +1328,9 @@ tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
 	 * stup.tuple and is treated as the canonical copy (e.g. to return via
 	 * tuplesort_getdatum or when writing to tape); stup.datum1 gets the
 	 * abbreviated value if abbreviation is happening, otherwise it's
-	 * identical to stup.tuple.
+	 * identical to stup.tuple. stup.datum1 has a "void" representation for
+	 * NULL values, which abbreviated keys require for cheap inequality
+	 * checks.
 	 */
 
 	if (isNull || state->datumTypeByVal)
@@ -1847,10 +1850,17 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
  * Fetch the next tuple in either forward or back direction.
  * If successful, put tuple in slot and return TRUE; else, clear the slot
  * and return FALSE.
+ *
+ * Caller may optionally be passed back abbreviated value (on TRUE return
+ * value) when abbreviation was used, which can be used to cheaply avoid
+ * equality checks that might otherwise be required.  Caller can safely make a
+ * determination of "non-equal tuple" based on simple binary inequality.  NULL
+ * value in leading attribute will set abbreviated value to "void" abbreviated
+ * representation, which caller may rely on in abbreviated inequality check.
  */
 bool
 tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
-					   TupleTableSlot *slot)
+					   TupleTableSlot *slot, Datum *abbrev)
 {
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
 	SortTuple	stup;
@@ -1863,6 +1873,10 @@ tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
 
 	if (stup.tuple)
 	{
+		/* Record abbreviated key for caller */
+		if (state->sortKeys->abbrev_converter && abbrev)
+			*abbrev = stup.datum1;
+
 		ExecStoreMinimalTuple((MinimalTuple) stup.tuple, slot, should_free);
 		return true;
 	}
@@ -1918,10 +1932,17 @@ tuplesort_getindextuple(Tuplesortstate *state, bool forward,
  *
  * If the Datum is pass-by-ref type, the returned value is freshly palloc'd
  * and is now owned by the caller.
+ *
+ * Caller may optionally be passed back abbreviated value (on TRUE return
+ * value) when abbreviation was used, which can be used to cheaply avoid
+ * equality checks that might otherwise be required.  Caller can safely make a
+ * determination of "non-equal tuple" based on simple binary inequality.  NULL
+ * value in leading attribute will set abbreviated value to "void" abbreviated
+ * representation, which caller may rely on in abbreviated inequality check.
  */
 bool
 tuplesort_getdatum(Tuplesortstate *state, bool forward,
-				   Datum *val, bool *isNull)
+				   Datum *val, bool *isNull, Datum *abbrev)
 {
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
 	SortTuple	stup;
@@ -1933,6 +1954,10 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
 		return false;
 	}
 
+	/* Record abbreviated key for caller */
+	if (state->sortKeys->abbrev_converter && abbrev)
+		*abbrev = stup.datum1;
+
 	if (stup.isnull1 || state->datumTypeByVal)
 	{
 		*val = stup.datum1;
@@ -3145,7 +3170,8 @@ copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to "void" representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup->datum1 = original;
 	}
@@ -3385,7 +3411,8 @@ copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to "void" representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup->datum1 = original;
 	}
@@ -3687,7 +3714,8 @@ copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
 		 * converter and just set datum1 to "void" representation (to be
-		 * consistent).
+		 * consistent, and to support cheap inequality tests for NULL
+		 * abbreviated keys).
 		 */
 		stup->datum1 = original;
 	}
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index de6fc56..9af1159 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -93,13 +93,13 @@ extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 extern void tuplesort_performsort(Tuplesortstate *state);
 
 extern bool tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
-					   TupleTableSlot *slot);
+					   TupleTableSlot *slot, Datum *abbrev);
 extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward,
 					   bool *should_free);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward,
 						bool *should_free);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward,
-				   Datum *val, bool *isNull);
+				   Datum *val, bool *isNull, Datum *abbrev);
 
 extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 					 bool forward);
-- 
1.9.1

From 8b1343903a68a1315a3b3334bcebc42cb722a470 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Thu, 9 Jul 2015 15:59:07 -0700
Subject: [PATCH 1/2] Tighten SortSupport abbreviated key contract

Revise the contract that SortSupport has with its clients to forbid an
abbreviated comparator ever returning 0 ("indeterminate") in the event
of two abbreviated keys not being bitwise equal.  This gives certain
cases the leeway to inexpensively determine tuple inequality from afar,
by comparing abbreviated keys using simple unsigned integer (Datum)
comparisons, and trusting that bitwise inequality amounts to leading
attribute inequality (and, therefore, tuple inequality).  Direct access
to SortSupport state is consequently not required, which will be
convenient for tuplesort clients (that only have an opaque state
pointer), while also saving us a few additional cycles.

Cheap inequality checks of this nature might otherwise give wrong
answers due (for example) to some opclass abbreviated comparator
interpreting non-identical abbreviated keys with respect to some state
built during memtuple copying (although that is probably still safe,
depending on the exact details).  Pass-by-reference abbreviated keys
could similarly cause problems, and so are now forbidden.  No existing
opclass with abbreviation support does anything like this, so no change
in any opclass SortSupport routine.

Backpatch to 9.5, where abbreviated keys were introduced.
---
 src/include/utils/sortsupport.h | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/src/include/utils/sortsupport.h b/src/include/utils/sortsupport.h
index 787404e..3d78c3a 100644
--- a/src/include/utils/sortsupport.h
+++ b/src/include/utils/sortsupport.h
@@ -116,7 +116,7 @@ typedef struct SortSupportData
 	 *
 	 * This allows opclass authors to supply a conversion routine, used to
 	 * create an alternative representation of the underlying type (an
-	 * "abbreviated key").  Typically, this representation is an ad-hoc,
+	 * "abbreviated key").  Invariably, this representation is an ad-hoc,
 	 * pass-by-value Datum format that only the opclass has knowledge of.  An
 	 * alternative comparator, used only with this alternative representation
 	 * must also be provided (which is assigned to "comparator").  This
@@ -135,6 +135,18 @@ typedef struct SortSupportData
 	 * cache miss penalties are expensive; to get good overall performance,
 	 * sort infrastructure must heavily weigh cache performance.
 	 *
+	 * There is one limited sense in which the core code has knowledge of the
+	 * abbreviated key format:  inequality.  The core system makes a generic
+	 * assumption that two non-bitwise-identical abbreviated keys could never
+	 * have the abbreviated comparator return a 0 or "inconclusive" result for
+	 * two non-identical abbreviated keys (this enables cheap inequality tests
+	 * made from a distance, without opclass involvement.  Furthermore, it
+	 * implies that abbreviated keys must always be pass-by-value).  Note that
+	 * it is still permitted for the abbreviated comparator to use state
+	 * generated during calls to abbrev_converter -- for example, a lookup
+	 * table could be used, provided any identifier baked into the abbreviated
+	 * representation is canonical.
+	 *
 	 * Opclass authors must consider the final cardinality of abbreviated keys
 	 * when devising an encoding scheme.  It's possible for a strategy to work
 	 * better than an alternative strategy with one usage pattern, while the
-- 
1.9.1

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Reusing abbreviated keys during second pass of ordered [set] aggregates

Reply via email to