New patch series attached.

  * a few cleanup patches, but 0001 and 0004 can affect initial hash
table sizing
  * also pack AggStatePerGroupData (to 10 bytes)
  * put additional data in the same chunk as firstTuple to avoid an
extra pointer and an extra allocation

On Tue, 2024-11-19 at 01:30 +0200, Heikki Linnakangas wrote:
> Sounds like a good idea. Needs some changes to the TupleTableSlotOps 
> interface to avoid the memcpy I presume.

Neither firstTuple nor the additional data have a size known at compile
time, so it can't be represented by a struct. It seems best to keep
firstTuple at the beginning so that it matches the SH_KEY_TYPE, and
then just repalloc() to make room at the end for the additional data,
which avoids the memcpy unless it crosses a power-of-two boundary.

> > 
> Queries that have only a small number of groups might benefit
> from storing a plain Datum/isnull array instead of a MinimalTuple.
> That 
> would take more memory when you have a lot of groups though.

The current MinimalTuple representation is pretty wasteful for small
grouping keys, so I don't think it would be hard to beat.

> > Alternatively, MinimalTuple is not very "minimal", and perhaps we
> > can
> > just make it better.
> 
> Yeah. It tries to be compatible with HeapTuple, but perhaps we should
> give up on that and pack it more tightly instead.

From the perspective of HashAgg, that seems worthwhile.

In my current patch set, depending on the grouping key and aggregates,
the chunk size for the combination of the firstTuple with the
additional data can be just above 32 bytes, which pushes the chunk size
to 64 bytes. If we cut down the mintuple overhead a bit, more cases
would fit in 32 bytes. And I think we can: 32 bytes seems reasonable to
hold a lot of common cases where there's a small grouping key and a
small pergroup state.

Regards,
        Jeff Davis

From 95a2becb251415af6f82d6df871b66e8a429768b Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Wed, 20 Nov 2024 12:33:11 -0800
Subject: [PATCH v2 1/8] ExecInitAgg: update aggstate->numaggs and ->numtrans
 earlier.

Functions hash_agg_entry_size() and build_hash_tables() are
relying on those to make accurate estimates.
---
 src/backend/executor/nodeAgg.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 53ead77ece..300163538a 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -3379,8 +3379,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 		max_aggno = Max(max_aggno, aggref->aggno);
 		max_transno = Max(max_transno, aggref->aggtransno);
 	}
-	numaggs = max_aggno + 1;
-	numtrans = max_transno + 1;
+	aggstate->numaggs = numaggs = max_aggno + 1;
+	aggstate->numtrans = numtrans = max_transno + 1;
 
 	/*
 	 * For each phase, prepare grouping set data and fmgr lookup data for
@@ -3943,13 +3943,6 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 		ReleaseSysCache(aggTuple);
 	}
 
-	/*
-	 * Update aggstate->numaggs to be the number of unique aggregates found.
-	 * Also set numstates to the number of unique transition states found.
-	 */
-	aggstate->numaggs = numaggs;
-	aggstate->numtrans = numtrans;
-
 	/*
 	 * Last, check whether any more aggregates got added onto the node while
 	 * we processed the expressions for the aggregate arguments (including not
-- 
2.34.1

From 8e31098ffd338ad9e3d4797f11baed386b1bd714 Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Thu, 21 Nov 2024 12:06:53 -0800
Subject: [PATCH v2 2/8] Add missing typedefs.list entry for
 AggStatePerGroupData.

---
 src/include/executor/nodeAgg.h   | 2 +-
 src/tools/pgindent/typedefs.list | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 684779a6a3..db633fd672 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -264,7 +264,7 @@ typedef struct AggStatePerGroupData
 	 * NULL and not auto-replace it with a later input value. Only the first
 	 * non-NULL input will be auto-substituted.
 	 */
-}			AggStatePerGroupData;
+} AggStatePerGroupData;
 
 /*
  * AggStatePerPhaseData - per-grouping-set-phase state
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 08521d51a9..d52bdd2ff1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -47,6 +47,7 @@ AggSplit
 AggState
 AggStatePerAgg
 AggStatePerGroup
+AggStatePerGroupData
 AggStatePerHash
 AggStatePerPhase
 AggStatePerTrans
-- 
2.34.1

From b372446f046736c40f7f10762623144d8a06aa9f Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Wed, 20 Nov 2024 12:59:16 -0800
Subject: [PATCH v2 3/8] Remove unused TupleHashTableData->entrysize.

---
 src/backend/executor/execGrouping.c | 1 -
 src/include/nodes/execnodes.h       | 1 -
 2 files changed, 2 deletions(-)

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 774e4de882..467dbdc4cc 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -187,7 +187,6 @@ BuildTupleHashTableExt(PlanState *parent,
 	hashtable->tab_collations = collations;
 	hashtable->tablecxt = tablecxt;
 	hashtable->tempcxt = tempcxt;
-	hashtable->entrysize = entrysize;
 	hashtable->tableslot = NULL;	/* will be made on first lookup */
 	hashtable->inputslot = NULL;
 	hashtable->in_hash_funcs = NULL;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 182a6956bb..0e6ec6d3a6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -835,7 +835,6 @@ typedef struct TupleHashTableData
 	Oid		   *tab_collations; /* collations for hash and comparison */
 	MemoryContext tablecxt;		/* memory context containing table */
 	MemoryContext tempcxt;		/* context for function evaluations */
-	Size		entrysize;		/* actual size to make each hash entry */
 	TupleTableSlot *tableslot;	/* slot for referencing table entries */
 	/* The following fields are set transiently for each table search: */
 	TupleTableSlot *inputslot;	/* current input tuple's slot */
-- 
2.34.1

From c61d82728dd1454ffa62487e41d1c2bc9ee11564 Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Wed, 20 Nov 2024 12:50:53 -0800
Subject: [PATCH v2 4/8] nodeSetOp.c: missing additionalsize for
 BuildTupleHashTable().

This can affect the calculations for 'nbuckets'.

Also, future work for HashAgg will rely on the correct
'additionalsize'.
---
 src/backend/executor/nodeSetOp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index a8ac68b482..4e58a571b7 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -134,7 +134,7 @@ build_hash_table(SetOpState *setopstate)
 												   setopstate->hashfunctions,
 												   node->dupCollations,
 												   node->numGroups,
-												   0,
+												   sizeof(SetOpStatePerGroupData),
 												   setopstate->ps.state->es_query_cxt,
 												   setopstate->tableContext,
 												   econtext->ecxt_per_tuple_memory,
-- 
2.34.1

From 6a764dfb883a4153d22fc73901fc2ddbf80e2083 Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Thu, 14 Nov 2024 11:47:51 -0800
Subject: [PATCH v2 5/8] TupleHashTable: store additional data along with
 tuple.

Previously, the caller needed to allocate the memory and the
TupleHashTable would store a pointer to it. That wastes space for the
palloc overhead as well as the size of the pointer itself.

Now, the TupleHashTable relies on the caller to correctly specify the
additionalsize, and allocates that amount of space. The caller can
then request a pointer into that space.
---
 src/backend/executor/execGrouping.c | 59 ++++++++++++++++++++++++++++-
 src/backend/executor/nodeAgg.c      | 20 ++++------
 src/backend/executor/nodeSetOp.c    | 19 ++++------
 src/backend/executor/nodeSubplan.c  |  2 +-
 src/include/executor/executor.h     |  3 ++
 src/include/nodes/execnodes.h       | 10 +----
 6 files changed, 78 insertions(+), 35 deletions(-)

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 467dbdc4cc..d4a3b3280e 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -20,6 +20,13 @@
 #include "miscadmin.h"
 #include "utils/lsyscache.h"
 
+typedef struct TupleHashEntryData
+{
+	MinimalTuple firstTuple;	/* copy of first tuple in this group */
+	uint32		status;			/* hash status */
+	uint32		hash;			/* hash value (cached) */
+} TupleHashEntryData;
+
 static int	TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
 static inline uint32 TupleHashTableHash_internal(struct tuplehash_hash *tb,
 												 const MinimalTuple tuple);
@@ -187,6 +194,7 @@ BuildTupleHashTableExt(PlanState *parent,
 	hashtable->tab_collations = collations;
 	hashtable->tablecxt = tablecxt;
 	hashtable->tempcxt = tempcxt;
+	hashtable->additionalsize = additionalsize;
 	hashtable->tableslot = NULL;	/* will be made on first lookup */
 	hashtable->inputslot = NULL;
 	hashtable->in_hash_funcs = NULL;
@@ -287,6 +295,15 @@ ResetTupleHashTable(TupleHashTable hashtable)
 	tuplehash_reset(hashtable->hashtab);
 }
 
+/*
+ * Return size of the hash bucket. Useful for estimating memory usage.
+ */
+size_t
+TupleHashEntrySize(void)
+{
+	return sizeof(TupleHashEntryData);
+}
+
 /*
  * Find or create a hashtable entry for the tuple group containing the
  * given tuple.  The tuple must be the same type as the hashtable entries.
@@ -353,6 +370,26 @@ TupleHashTableHash(TupleHashTable hashtable, TupleTableSlot *slot)
 	return hash;
 }
 
+MinimalTuple
+TupleHashEntryGetTuple(TupleHashEntry entry)
+{
+	return entry->firstTuple;
+}
+
+/*
+ * Get a pointer into the additional space allocated for this entry. The
+ * amount of space available is the additionalsize specified to
+ * BuildTupleHashTable(). If additionalsize was specified as zero, no
+ * additional space is available and this function should not be called.
+ */
+void *
+TupleHashEntryGetAdditional(TupleHashEntry entry)
+{
+	Assert(GetMemoryChunkSpace(entry->firstTuple) > MAXALIGN(entry->firstTuple->t_len));
+
+	return (char *) entry->firstTuple + MAXALIGN(entry->firstTuple->t_len);
+}
+
 /*
  * A variant of LookupTupleHashEntry for callers that have already computed
  * the hash value.
@@ -513,13 +550,31 @@ LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot,
 		}
 		else
 		{
+			MinimalTuple firstTuple;
+			size_t totalsize; /* including alignment and additionalsize */
+
 			/* created new entry */
 			*isnew = true;
 			/* zero caller data */
-			entry->additional = NULL;
 			MemoryContextSwitchTo(hashtable->tablecxt);
+
 			/* Copy the first tuple into the table context */
-			entry->firstTuple = ExecCopySlotMinimalTuple(slot);
+			firstTuple = ExecCopySlotMinimalTuple(slot);
+
+			/*
+			 * Allocate additional space right after the MinimalTuple of size
+			 * additionalsize. The caller can get a pointer to this data with
+			 * TupleHashEntryGetAdditional(), and store arbitrary data there.
+			 *
+			 * This avoids the need to store an extra pointer or allocate an
+			 * additional chunk, which would waste memory.
+			 */
+			totalsize = MAXALIGN(firstTuple->t_len) + hashtable->additionalsize;
+			firstTuple = repalloc(firstTuple, totalsize);
+			memset((char *) firstTuple + firstTuple->t_len, 0,
+				   totalsize - firstTuple->t_len);
+
+			entry->firstTuple = firstTuple;
 		}
 	}
 	else
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 300163538a..bb20006e9e 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1713,7 +1713,7 @@ hash_agg_entry_size(int numTrans, Size tupleWidth, Size transitionSpace)
 		transitionChunkSize = 0;
 
 	return
-		sizeof(TupleHashEntryData) +
+		TupleHashEntrySize() +
 		tupleChunkSize +
 		pergroupChunkSize +
 		transitionChunkSize;
@@ -1954,7 +1954,7 @@ hash_agg_update_metrics(AggState *aggstate, bool from_tape, int npartitions)
 	if (aggstate->hash_ngroups_current > 0)
 	{
 		aggstate->hashentrysize =
-			sizeof(TupleHashEntryData) +
+			TupleHashEntrySize() +
 			(hashkey_mem / (double) aggstate->hash_ngroups_current);
 	}
 }
@@ -2055,11 +2055,7 @@ initialize_hash_entry(AggState *aggstate, TupleHashTable hashtable,
 	if (aggstate->numtrans == 0)
 		return;
 
-	pergroup = (AggStatePerGroup)
-		MemoryContextAlloc(hashtable->tablecxt,
-						   sizeof(AggStatePerGroupData) * aggstate->numtrans);
-
-	entry->additional = pergroup;
+	pergroup = (AggStatePerGroup) TupleHashEntryGetAdditional(entry);
 
 	/*
 	 * Initialize aggregates for new tuple group, lookup_hash_entries()
@@ -2123,7 +2119,7 @@ lookup_hash_entries(AggState *aggstate)
 		{
 			if (isnew)
 				initialize_hash_entry(aggstate, hashtable, entry);
-			pergroup[setno] = entry->additional;
+			pergroup[setno] = TupleHashEntryGetAdditional(entry);
 		}
 		else
 		{
@@ -2681,7 +2677,7 @@ agg_refill_hash_table(AggState *aggstate)
 		{
 			if (isnew)
 				initialize_hash_entry(aggstate, perhash->hashtable, entry);
-			aggstate->hash_pergroup[batch->setno] = entry->additional;
+			aggstate->hash_pergroup[batch->setno] = TupleHashEntryGetAdditional(entry);
 			advance_aggregates(aggstate);
 		}
 		else
@@ -2773,7 +2769,7 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate)
 	ExprContext *econtext;
 	AggStatePerAgg peragg;
 	AggStatePerGroup pergroup;
-	TupleHashEntryData *entry;
+	TupleHashEntry entry;
 	TupleTableSlot *firstSlot;
 	TupleTableSlot *result;
 	AggStatePerHash perhash;
@@ -2845,7 +2841,7 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate)
 		 * Transform representative tuple back into one with the right
 		 * columns.
 		 */
-		ExecStoreMinimalTuple(entry->firstTuple, hashslot, false);
+		ExecStoreMinimalTuple(TupleHashEntryGetTuple(entry), hashslot, false);
 		slot_getallattrs(hashslot);
 
 		ExecClearTuple(firstSlot);
@@ -2861,7 +2857,7 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate)
 		}
 		ExecStoreVirtualTuple(firstSlot);
 
-		pergroup = (AggStatePerGroup) entry->additional;
+		pergroup = (AggStatePerGroup) TupleHashEntryGetAdditional(entry);
 
 		/*
 		 * Use the representative input tuple for any references to
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 4e58a571b7..efbc455cee 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -364,7 +364,7 @@ setop_fill_hash_table(SetOpState *setopstate)
 	{
 		TupleTableSlot *outerslot;
 		int			flag;
-		TupleHashEntryData *entry;
+		TupleHashEntry entry;
 		bool		isnew;
 
 		outerslot = ExecProcNode(outerPlan);
@@ -385,15 +385,10 @@ setop_fill_hash_table(SetOpState *setopstate)
 
 			/* If new tuple group, initialize counts */
 			if (isnew)
-			{
-				entry->additional = (SetOpStatePerGroup)
-					MemoryContextAlloc(setopstate->hashtable->tablecxt,
-									   sizeof(SetOpStatePerGroupData));
-				initialize_counts((SetOpStatePerGroup) entry->additional);
-			}
+				initialize_counts((SetOpStatePerGroup) TupleHashEntryGetAdditional(entry));
 
 			/* Advance the counts */
-			advance_counts((SetOpStatePerGroup) entry->additional, flag);
+			advance_counts((SetOpStatePerGroup) TupleHashEntryGetAdditional(entry), flag);
 		}
 		else
 		{
@@ -406,7 +401,7 @@ setop_fill_hash_table(SetOpState *setopstate)
 
 			/* Advance the counts if entry is already present */
 			if (entry)
-				advance_counts((SetOpStatePerGroup) entry->additional, flag);
+				advance_counts((SetOpStatePerGroup) TupleHashEntryGetAdditional(entry), flag);
 		}
 
 		/* Must reset expression context after each hashtable lookup */
@@ -424,7 +419,7 @@ setop_fill_hash_table(SetOpState *setopstate)
 static TupleTableSlot *
 setop_retrieve_hash_table(SetOpState *setopstate)
 {
-	TupleHashEntryData *entry;
+	TupleHashEntry entry;
 	TupleTableSlot *resultTupleSlot;
 
 	/*
@@ -454,12 +449,12 @@ setop_retrieve_hash_table(SetOpState *setopstate)
 		 * See if we should emit any copies of this tuple, and if so return
 		 * the first copy.
 		 */
-		set_output_count(setopstate, (SetOpStatePerGroup) entry->additional);
+		set_output_count(setopstate, (SetOpStatePerGroup) TupleHashEntryGetAdditional(entry));
 
 		if (setopstate->numOutput > 0)
 		{
 			setopstate->numOutput--;
-			return ExecStoreMinimalTuple(entry->firstTuple,
+			return ExecStoreMinimalTuple(TupleHashEntryGetTuple(entry),
 										 resultTupleSlot,
 										 false);
 		}
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 236222d72a..63aff24251 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -746,7 +746,7 @@ findPartialMatch(TupleHashTable hashtable, TupleTableSlot *slot,
 	{
 		CHECK_FOR_INTERRUPTS();
 
-		ExecStoreMinimalTuple(entry->firstTuple, hashtable->tableslot, false);
+		ExecStoreMinimalTuple(TupleHashEntryGetTuple(entry), hashtable->tableslot, false);
 		if (!execTuplesUnequal(slot, hashtable->tableslot,
 							   numCols, keyColIdx,
 							   eqfunctions,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 69c3ebff00..d26968ecf7 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -154,9 +154,12 @@ extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
 										   bool *isnew, uint32 *hash);
 extern uint32 TupleHashTableHash(TupleHashTable hashtable,
 								 TupleTableSlot *slot);
+extern size_t TupleHashEntrySize(void);
 extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
 											   TupleTableSlot *slot,
 											   bool *isnew, uint32 hash);
+extern MinimalTuple TupleHashEntryGetTuple(TupleHashEntry entry);
+extern void *TupleHashEntryGetAdditional(TupleHashEntry entry);
 extern TupleHashEntry FindTupleHashEntry(TupleHashTable hashtable,
 										 TupleTableSlot *slot,
 										 ExprState *eqcomp,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0e6ec6d3a6..68b8b28803 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -806,17 +806,10 @@ typedef struct ExecAuxRowMark
  * tab_eq_func respectively.
  * ----------------------------------------------------------------
  */
+typedef struct TupleHashEntryData TupleHashEntryData;
 typedef struct TupleHashEntryData *TupleHashEntry;
 typedef struct TupleHashTableData *TupleHashTable;
 
-typedef struct TupleHashEntryData
-{
-	MinimalTuple firstTuple;	/* copy of first tuple in this group */
-	void	   *additional;		/* user data */
-	uint32		status;			/* hash status */
-	uint32		hash;			/* hash value (cached) */
-} TupleHashEntryData;
-
 /* define parameters necessary to generate the tuple hash table interface */
 #define SH_PREFIX tuplehash
 #define SH_ELEMENT_TYPE TupleHashEntryData
@@ -835,6 +828,7 @@ typedef struct TupleHashTableData
 	Oid		   *tab_collations; /* collations for hash and comparison */
 	MemoryContext tablecxt;		/* memory context containing table */
 	MemoryContext tempcxt;		/* context for function evaluations */
+	Size		additionalsize;	/* size of additional data */
 	TupleTableSlot *tableslot;	/* slot for referencing table entries */
 	/* The following fields are set transiently for each table search: */
 	TupleTableSlot *inputslot;	/* current input tuple's slot */
-- 
2.34.1

From 5c67d3943a9e02b933d8e18771b0fee797a8856f Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Fri, 15 Nov 2024 12:21:04 -0800
Subject: [PATCH v2 6/8] simplehash: don't require a "status" field.

Callers may define SH_ENTRY_IS_EMPTY(), SH_ENTRY_SET_EMPTY(), and
SH_ENTRY_SET_IN_USE() as an alternative, which can allow for a more
compact entry size. That reduces the memory overhead, especially when
the hash table is sparse.
---
 src/include/lib/simplehash.h | 68 ++++++++++++++++++++++++------------
 1 file changed, 45 insertions(+), 23 deletions(-)

diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index 3e1b1f9461..213d8c372e 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -50,9 +50,18 @@
  *	  - SH_HASH_KEY(table, key) - generate hash for the key
  *	  - SH_STORE_HASH - if defined the hash is stored in the elements
  *	  - SH_GET_HASH(tb, a) - return the field to store the hash in
+ *	  - SH_ENTRY_IS_EMPTY(tb, a) - return true if the entry is empty
+ *	  - SH_ENTRY_SET_EMPTY(tb, a) - set entry to empty
+ *	  - SH_ENTRY_SET_IN_USE(tb, a) - set entry to "in use"
  *
  *	  The element type is required to contain a "status" member that can store
- *	  the range of values defined in the SH_STATUS enum.
+ *	  the range of values defined in the SH_STATUS enum. Alternatively,
+ *	  callers may define all of SH_ENTRY_IS_EMPTY, SH_ENTRY_SET_EMPTY,
+ *	  SH_ENTRY_SET_IN_USE to allow for a more compact entry size. For example,
+ *	  if the SH_ELEMENT_TYPE contains a pointer, the caller may decide that
+ *	  the entry is empty if the pointer is NULL and "in use" if the pointer is
+ *	  non-NULL. NB: if the entire entry is zero, then SH_ENTRY_IS_EMPTY()
+ *	  *must* evaluate to true.
  *
  *	  While SH_STORE_HASH (and subsequently SH_GET_HASH) are optional, because
  *	  the hash table implementation needs to compare hashes to move elements
@@ -277,6 +286,18 @@ SH_SCOPE void SH_STAT(SH_TYPE * tb);
 #define SH_GROW_MIN_FILLFACTOR 0.1
 #endif
 
+#ifndef SH_ENTRY_IS_EMPTY
+#define SH_ENTRY_IS_EMPTY(tb, a) ((a)->status == SH_STATUS_EMPTY)
+#endif
+
+#ifndef SH_ENTRY_SET_EMPTY
+#define SH_ENTRY_SET_EMPTY(tb, a) do { (a)->status = SH_STATUS_EMPTY; } while (0)
+#endif
+
+#ifndef SH_ENTRY_SET_IN_USE
+#define SH_ENTRY_SET_IN_USE(tb, a) do { (a)->status = SH_STATUS_IN_USE; } while (0)
+#endif
+
 #ifdef SH_STORE_HASH
 #define SH_COMPARE_KEYS(tb, ahash, akey, b) (ahash == SH_GET_HASH(tb, b) && SH_EQUAL(tb, b->SH_KEY, akey))
 #else
@@ -540,7 +561,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
 		uint32		hash;
 		uint32		optimal;
 
-		if (oldentry->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_IS_EMPTY(tb, oldentry))
 		{
 			startelem = i;
 			break;
@@ -562,7 +583,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
 	{
 		SH_ELEMENT_TYPE *oldentry = &olddata[copyelem];
 
-		if (oldentry->status == SH_STATUS_IN_USE)
+		if (!SH_ENTRY_IS_EMPTY(tb, oldentry))
 		{
 			uint32		hash;
 			uint32		startelem2;
@@ -578,7 +599,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
 			{
 				newentry = &newdata[curelem];
 
-				if (newentry->status == SH_STATUS_EMPTY)
+				if (SH_ENTRY_IS_EMPTY(tb, newentry))
 				{
 					break;
 				}
@@ -649,14 +670,14 @@ restart:
 		SH_ELEMENT_TYPE *entry = &data[curelem];
 
 		/* any empty bucket can directly be used */
-		if (entry->status == SH_STATUS_EMPTY)
+		if (SH_ENTRY_IS_EMPTY(tb, entry))
 		{
 			tb->members++;
 			entry->SH_KEY = key;
 #ifdef SH_STORE_HASH
 			SH_GET_HASH(tb, entry) = hash;
 #endif
-			entry->status = SH_STATUS_IN_USE;
+			SH_ENTRY_SET_IN_USE(tb, entry);
 			*found = false;
 			return entry;
 		}
@@ -671,7 +692,7 @@ restart:
 
 		if (SH_COMPARE_KEYS(tb, hash, key, entry))
 		{
-			Assert(entry->status == SH_STATUS_IN_USE);
+			Assert(!SH_ENTRY_IS_EMPTY(tb, entry));
 			*found = true;
 			return entry;
 		}
@@ -695,7 +716,7 @@ restart:
 				emptyelem = SH_NEXT(tb, emptyelem, startelem);
 				emptyentry = &data[emptyelem];
 
-				if (emptyentry->status == SH_STATUS_EMPTY)
+				if (SH_ENTRY_IS_EMPTY(tb, emptyentry))
 				{
 					lastentry = emptyentry;
 					break;
@@ -744,7 +765,7 @@ restart:
 #ifdef SH_STORE_HASH
 			SH_GET_HASH(tb, entry) = hash;
 #endif
-			entry->status = SH_STATUS_IN_USE;
+			SH_ENTRY_SET_IN_USE(tb, entry);
 			*found = false;
 			return entry;
 		}
@@ -806,13 +827,11 @@ SH_LOOKUP_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash)
 	{
 		SH_ELEMENT_TYPE *entry = &tb->data[curelem];
 
-		if (entry->status == SH_STATUS_EMPTY)
+		if (SH_ENTRY_IS_EMPTY(tb, entry))
 		{
 			return NULL;
 		}
 
-		Assert(entry->status == SH_STATUS_IN_USE);
-
 		if (SH_COMPARE_KEYS(tb, hash, key, entry))
 			return entry;
 
@@ -864,10 +883,10 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
 	{
 		SH_ELEMENT_TYPE *entry = &tb->data[curelem];
 
-		if (entry->status == SH_STATUS_EMPTY)
+		if (SH_ENTRY_IS_EMPTY(tb, entry))
 			return false;
 
-		if (entry->status == SH_STATUS_IN_USE &&
+		if (!SH_ENTRY_IS_EMPTY(tb, entry) &&
 			SH_COMPARE_KEYS(tb, hash, key, entry))
 		{
 			SH_ELEMENT_TYPE *lastentry = entry;
@@ -890,9 +909,9 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
 				curelem = SH_NEXT(tb, curelem, startelem);
 				curentry = &tb->data[curelem];
 
-				if (curentry->status != SH_STATUS_IN_USE)
+				if (SH_ENTRY_IS_EMPTY(tb, curentry))
 				{
-					lastentry->status = SH_STATUS_EMPTY;
+					SH_ENTRY_SET_EMPTY(tb, lastentry);
 					break;
 				}
 
@@ -902,7 +921,7 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
 				/* current is at optimal position, done */
 				if (curoptimal == curelem)
 				{
-					lastentry->status = SH_STATUS_EMPTY;
+					SH_ENTRY_SET_EMPTY(tb, lastentry);
 					break;
 				}
 
@@ -953,9 +972,9 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry)
 		curelem = SH_NEXT(tb, curelem, startelem);
 		curentry = &tb->data[curelem];
 
-		if (curentry->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_IS_EMPTY(tb, curentry))
 		{
-			lastentry->status = SH_STATUS_EMPTY;
+			SH_ENTRY_SET_EMPTY(tb, lastentry);
 			break;
 		}
 
@@ -965,7 +984,7 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry)
 		/* current is at optimal position, done */
 		if (curoptimal == curelem)
 		{
-			lastentry->status = SH_STATUS_EMPTY;
+			SH_ENTRY_SET_EMPTY(tb, lastentry);
 			break;
 		}
 
@@ -993,7 +1012,7 @@ SH_START_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter)
 	{
 		SH_ELEMENT_TYPE *entry = &tb->data[i];
 
-		if (entry->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_IS_EMPTY(tb, entry))
 		{
 			startelem = i;
 			break;
@@ -1055,7 +1074,7 @@ SH_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter)
 
 		if ((iter->cur & tb->sizemask) == (iter->end & tb->sizemask))
 			iter->done = true;
-		if (elem->status == SH_STATUS_IN_USE)
+		if (!SH_ENTRY_IS_EMPTY(tb, elem))
 		{
 			return elem;
 		}
@@ -1091,7 +1110,7 @@ SH_STAT(SH_TYPE * tb)
 
 		elem = &tb->data[i];
 
-		if (elem->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_IS_EMPTY(tb, elem))
 			continue;
 
 		hash = SH_ENTRY_HASH(tb, elem);
@@ -1149,6 +1168,9 @@ SH_STAT(SH_TYPE * tb)
 #undef SH_KEY
 #undef SH_ELEMENT_TYPE
 #undef SH_HASH_KEY
+#undef SH_ENTRY_IS_EMPTY
+#undef SH_ENTRY_SET_EMPTY
+#undef SH_ENTRY_SET_IN_USE
 #undef SH_SCOPE
 #undef SH_DECLARE
 #undef SH_DEFINE
-- 
2.34.1

From cdd20115968128571134eebe94e00a8b0adb18e0 Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Fri, 15 Nov 2024 13:03:32 -0800
Subject: [PATCH v2 7/8] TupleHashTable: reduce overhead by removing "status"
 field.

Make use of new simplehash API introduced in commit 1234567890.

Because NULL now means that the entry is empty, use a unique dummy
pointer to refer to the inputslot instead. It also makes the code more
readable.
---
 src/backend/executor/execGrouping.c | 33 +++++++++++++++++++----------
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index d4a3b3280e..81ccd0e6d0 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -23,7 +23,6 @@
 typedef struct TupleHashEntryData
 {
 	MinimalTuple firstTuple;	/* copy of first tuple in this group */
-	uint32		status;			/* hash status */
 	uint32		hash;			/* hash value (cached) */
 } TupleHashEntryData;
 
@@ -34,6 +33,14 @@ static inline TupleHashEntry LookupTupleHashEntry_internal(TupleHashTable hashta
 														   TupleTableSlot *slot,
 														   bool *isnew, uint32 hash);
 
+/*
+ * Create a dummy pointer used to mean the input slot. The pointer will only
+ * be used for comparison and never dereferenced, and must not be equal to a
+ * valid pointer. Distinct from NULL, because NULL means that the hash entry
+ * itself is empty.
+ */
+const static MinimalTuple FIRSTTUPLE_INPUTSLOT = (MinimalTuple) 0x1;
+
 /*
  * Define parameters for tuple hash table code generation. The interface is
  * *also* declared in execnodes.h (to generate the types, which are externally
@@ -48,6 +55,9 @@ static inline TupleHashEntry LookupTupleHashEntry_internal(TupleHashTable hashta
 #define SH_SCOPE extern
 #define SH_STORE_HASH
 #define SH_GET_HASH(tb, a) a->hash
+#define SH_ENTRY_IS_EMPTY(tb, a) ((a)->firstTuple == NULL)
+#define SH_ENTRY_SET_EMPTY(tb, a) do { (a)->firstTuple = NULL; } while (0)
+#define SH_ENTRY_SET_IN_USE(tb, a) Assert((a)->firstTuple != NULL)
 #define SH_DEFINE
 #include "lib/simplehash.h"
 
@@ -335,7 +345,7 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
 	hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
 	hashtable->cur_eq_func = hashtable->tab_eq_func;
 
-	local_hash = TupleHashTableHash_internal(hashtable->hashtab, NULL);
+	local_hash = TupleHashTableHash_internal(hashtable->hashtab, FIRSTTUPLE_INPUTSLOT);
 	entry = LookupTupleHashEntry_internal(hashtable, slot, isnew, local_hash);
 
 	if (hash != NULL)
@@ -363,7 +373,8 @@ TupleHashTableHash(TupleHashTable hashtable, TupleTableSlot *slot)
 	/* Need to run the hash functions in short-lived context */
 	oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
 
-	hash = TupleHashTableHash_internal(hashtable->hashtab, NULL);
+	hash = TupleHashTableHash_internal(hashtable->hashtab,
+									   FIRSTTUPLE_INPUTSLOT);
 
 	MemoryContextSwitchTo(oldContext);
 
@@ -444,7 +455,7 @@ FindTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
 	hashtable->cur_eq_func = eqcomp;
 
 	/* Search the hash table */
-	key = NULL;					/* flag to reference inputslot */
+	key = FIRSTTUPLE_INPUTSLOT;
 	entry = tuplehash_lookup(hashtable->hashtab, key);
 	MemoryContextSwitchTo(oldContext);
 
@@ -452,9 +463,9 @@ FindTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
 }
 
 /*
- * If tuple is NULL, use the input slot instead. This convention avoids the
- * need to materialize virtual input tuples unless they actually need to get
- * copied into the table.
+ * If tuple is FIRSTTUPLE_INPUTSLOT, use the input slot instead. This
+ * convention avoids the need to materialize virtual input tuples unless they
+ * actually need to get copied into the table.
  *
  * Also, the caller must select an appropriate memory context for running
  * the hash functions. (dynahash.c doesn't change CurrentMemoryContext.)
@@ -471,7 +482,7 @@ TupleHashTableHash_internal(struct tuplehash_hash *tb,
 	FmgrInfo   *hashfunctions;
 	int			i;
 
-	if (tuple == NULL)
+	if (tuple == FIRSTTUPLE_INPUTSLOT)
 	{
 		/* Process the current input tuple for the table */
 		slot = hashtable->inputslot;
@@ -537,7 +548,7 @@ LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot,
 	bool		found;
 	MinimalTuple key;
 
-	key = NULL;					/* flag to reference inputslot */
+	key = FIRSTTUPLE_INPUTSLOT;
 
 	if (isnew)
 	{
@@ -602,10 +613,10 @@ TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const
 	 * LookupTupleHashEntry's dummy TupleHashEntryData.  The other direction
 	 * could be supported too, but is not currently required.
 	 */
-	Assert(tuple1 != NULL);
+	Assert(tuple1 && tuple1 != FIRSTTUPLE_INPUTSLOT);
 	slot1 = hashtable->tableslot;
 	ExecStoreMinimalTuple(tuple1, slot1, false);
-	Assert(tuple2 == NULL);
+	Assert(tuple2 == FIRSTTUPLE_INPUTSLOT);
 	slot2 = hashtable->inputslot;
 
 	/* For crosstype comparisons, the inputslot must be first */
-- 
2.34.1

From 6a1ef8eb846d76c77958050e8aafc50d528064b1 Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Fri, 15 Nov 2024 18:52:06 -0800
Subject: [PATCH v2 8/8] Pack TupleHashEntryData and AggStatePerGroupData.

---
 src/backend/executor/execGrouping.c | 6 +++++-
 src/include/executor/nodeAgg.h      | 6 +++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 81ccd0e6d0..d28d420fd8 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -24,7 +24,11 @@ typedef struct TupleHashEntryData
 {
 	MinimalTuple firstTuple;	/* copy of first tuple in this group */
 	uint32		hash;			/* hash value (cached) */
-} TupleHashEntryData;
+}
+#ifdef pg_attribute_packed
+	pg_attribute_packed()
+#endif
+TupleHashEntryData;
 
 static int	TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
 static inline uint32 TupleHashTableHash_internal(struct tuplehash_hash *tb,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index db633fd672..c11b3a31fb 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -264,7 +264,11 @@ typedef struct AggStatePerGroupData
 	 * NULL and not auto-replace it with a later input value. Only the first
 	 * non-NULL input will be auto-substituted.
 	 */
-} AggStatePerGroupData;
+}
+#ifdef pg_attribute_packed
+	pg_attribute_packed()
+#endif
+AggStatePerGroupData;
 
 /*
  * AggStatePerPhaseData - per-grouping-set-phase state
-- 
2.34.1

Reply via email to