On Fri, 2025-02-28 at 17:09 +1300, David Rowley wrote:
> * my_log2() takes a "long" parameter type but transitionSpace is a
> "Size". These types aren't the same width on all platforms we
> support.
> Maybe pg_nextpower2_size_t() is a better option.

Done.

> * Should the following have MAXALIGN(tupleSize) on the right side of
> the expression?

Done.

> Maybe you could just replace the while loop and the subsequent "if"
> check with:

Done.

> * In hash_create_memory(), can you get rid of the minContextSize and
> initBlockSize variables?

Done.

> * Is it worth an Assert() to check additionalsize > 0?

One caller happens to call it unconditionally. It seems better to
return NULL if additionalsize == 0.

> create table c (a int not null);
> insert into c select a from generate_Series(1,1000) a;
> vacuum freeze analyze c;

The schema you posted is the narrower table, but your numbers better
match the wide table you posted before. Was there a mixup?

> master: 3653.9853988
> v7-0001: 3741.815979
> v7-0001+0002: 3737.4313064
> v7-0001+0002+0003: 3834.6271706
> v7-0001+0002+0004+0004: 3912.1158887
>
> I also tried out changing hash_agg_check_limits() so that it no longer
> calls MemoryContextMemAllocated and instead uses ->mem_allocated
> directly and with all the other patches got:
>
> v7-0001+0002+0004+0004+extra: 4049.0732381

Great!

> We probably shouldn't do exactly that as it'd be better not to access
> that internal field from outside the memory context code. A static
> inline function in memutils.h that handles the non-recursive callers
> might be nice.

Both the metacxt and the context used for byref transition values can
have child contexts, so we should keep the recursion. I just inlined
MemoryContextMemAllocated() and MemoryContextTraverseNext().

> I've attached my results of running your test in graph form.

Thank you! My results (with wide tables):

                       GROUP BY    EXCEPT
  master:                  2151      1732
  entire v8 series:        2054      1740

In some of the patches in the middle of the series, I ran into some
hard-to-explain regressions, so consider my results preliminary. I may
need to profile and figure out what's going on. I didn't see any
overall regression; the series overall seems about even, while memory
consumption is ~35% lower for the example I posted in the first
message of the thread.

> How about hacking something up that adds an additionalsize field to
> TupleDesc and then set that field to your additional size and have
> heap_form_minimal_tuple() allocate that much extra memory?

I assume we wouldn't want to actually add a field to TupleDescData,
right?

When I reworked the ExecCopySlotMinimalTupleExtra() API to place the
extra memory before the tuple, it worked out to be a bit cleaner with
fewer special cases, so I'm fine with that API now; a short sketch of
the resulting layout is below.

Regards,
	Jeff Davis
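For reference, the layout in 0005's heap_copy_minimal_tuple(), condensed
(the variable names here are only illustrative):

    char        *mem;
    MinimalTuple copy;
    void        *additional;

    /* 'extra' must be maxaligned; it lands immediately before the tuple */
    mem = palloc(mtup->t_len + extra);
    memset(mem, 0, extra);                /* zero the leading extra bytes */
    copy = (MinimalTuple) (mem + extra);  /* tuple starts right after them */
    memcpy(copy, mtup, mtup->t_len);

    /* consumers get back to the extra space with pointer arithmetic */
    additional = (void *) ((char *) copy - extra);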
From 4eace20edcf4108e36deb4369f73327147b436fc Mon Sep 17 00:00:00 2001 From: Jeff Davis <j...@j-davis.com> Date: Wed, 8 Jan 2025 11:46:35 -0800 Subject: [PATCH v8 1/7] HashAgg: use Bump allocator for hash TupleHashTable entries. The entries aren't freed until the entire hash table is destroyed, so use the Bump allocator to improve allocation speed, avoid wasting space on the chunk header, and avoid wasting space due to the power-of-two allocations. Discussion: https://postgr.es/m/caaphdvqv1anb4cm36fzrwivxrevbo_lsg_eq3nqdxtjecaa...@mail.gmail.com Reviewed-by: David Rowley --- src/backend/executor/execUtils.c | 17 +++-- src/backend/executor/nodeAgg.c | 111 ++++++++++++++++++++++++++----- src/include/nodes/execnodes.h | 5 +- 3 files changed, 104 insertions(+), 29 deletions(-) diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c index 39d6f4d819e..2e0062a8f27 100644 --- a/src/backend/executor/execUtils.c +++ b/src/backend/executor/execUtils.c @@ -322,19 +322,18 @@ CreateExprContext(EState *estate) ExprContext * CreateWorkExprContext(EState *estate) { - Size minContextSize = ALLOCSET_DEFAULT_MINSIZE; - Size initBlockSize = ALLOCSET_DEFAULT_INITSIZE; Size maxBlockSize = ALLOCSET_DEFAULT_MAXSIZE; - /* choose the maxBlockSize to be no larger than 1/16 of work_mem */ - while (maxBlockSize > work_mem * (Size) 1024 / 16) - maxBlockSize >>= 1; + maxBlockSize = pg_prevpower2_size_t(work_mem * (Size) 1024 / 16); - if (maxBlockSize < ALLOCSET_DEFAULT_INITSIZE) - maxBlockSize = ALLOCSET_DEFAULT_INITSIZE; + /* But no bigger than ALLOCSET_DEFAULT_MAXSIZE */ + maxBlockSize = Min(maxBlockSize, ALLOCSET_DEFAULT_MAXSIZE); - return CreateExprContextInternal(estate, minContextSize, - initBlockSize, maxBlockSize); + /* and no smaller than ALLOCSET_DEFAULT_INITSIZE */ + maxBlockSize = Max(maxBlockSize, ALLOCSET_DEFAULT_INITSIZE); + + return CreateExprContextInternal(estate, ALLOCSET_DEFAULT_MINSIZE, + ALLOCSET_DEFAULT_INITSIZE, maxBlockSize); } /* ---------------- diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c index ceb8c8a8039..785fc5c00f3 100644 --- a/src/backend/executor/nodeAgg.c +++ b/src/backend/executor/nodeAgg.c @@ -406,6 +406,7 @@ static void build_hash_tables(AggState *aggstate); static void build_hash_table(AggState *aggstate, int setno, long nbuckets); static void hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck); +static void hash_create_memory(AggState *aggstate); static long hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory); static int hash_choose_num_partitions(double input_groups, @@ -1512,7 +1513,7 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets) { AggStatePerHash perhash = &aggstate->perhash[setno]; MemoryContext metacxt = aggstate->hash_metacxt; - MemoryContext hashcxt = aggstate->hashcontext->ecxt_per_tuple_memory; + MemoryContext tablecxt = aggstate->hash_tablecxt; MemoryContext tmpcxt = aggstate->tmpcontext->ecxt_per_tuple_memory; Size additionalsize; @@ -1538,7 +1539,7 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets) nbuckets, additionalsize, metacxt, - hashcxt, + tablecxt, tmpcxt, DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit)); } @@ -1709,15 +1710,19 @@ hash_agg_entry_size(int numTrans, Size tupleWidth, Size transitionSpace) tupleWidth); Size pergroupSize = numTrans * sizeof(AggStatePerGroupData); - tupleChunkSize = CHUNKHDRSZ + tupleSize; - - if (pergroupSize > 0) - pergroupChunkSize = CHUNKHDRSZ + pergroupSize; - else - pergroupChunkSize = 0; + /* + 
* Entries use the Bump allocator, so the chunk sizes are the same as the + * requested sizes. + */ + tupleChunkSize = MAXALIGN(tupleSize); + pergroupChunkSize = pergroupSize; + /* + * Transition values use AllocSet, which has a chunk header and also uses + * power-of-two allocations. + */ if (transitionSpace > 0) - transitionChunkSize = CHUNKHDRSZ + transitionSpace; + transitionChunkSize = pg_nextpower2_size_t(transitionSpace); else transitionChunkSize = 0; @@ -1867,8 +1872,11 @@ hash_agg_check_limits(AggState *aggstate) uint64 ngroups = aggstate->hash_ngroups_current; Size meta_mem = MemoryContextMemAllocated(aggstate->hash_metacxt, true); - Size hashkey_mem = MemoryContextMemAllocated(aggstate->hashcontext->ecxt_per_tuple_memory, - true); + Size entry_mem = MemoryContextMemAllocated(aggstate->hash_tablecxt, + true); + Size tval_mem = MemoryContextMemAllocated(aggstate->hashcontext->ecxt_per_tuple_memory, + true); + Size total_mem = meta_mem + entry_mem + tval_mem; bool do_spill = false; #ifdef USE_INJECTION_POINTS @@ -1887,7 +1895,7 @@ hash_agg_check_limits(AggState *aggstate) * can be sure to make progress even in edge cases. */ if (aggstate->hash_ngroups_current > 0 && - (meta_mem + hashkey_mem > aggstate->hash_mem_limit || + (total_mem > aggstate->hash_mem_limit || ngroups > aggstate->hash_ngroups_limit)) { do_spill = true; @@ -1942,6 +1950,7 @@ static void hash_agg_update_metrics(AggState *aggstate, bool from_tape, int npartitions) { Size meta_mem; + Size entry_mem; Size hashkey_mem; Size buffer_mem; Size total_mem; @@ -1953,7 +1962,10 @@ hash_agg_update_metrics(AggState *aggstate, bool from_tape, int npartitions) /* memory for the hash table itself */ meta_mem = MemoryContextMemAllocated(aggstate->hash_metacxt, true); - /* memory for the group keys and transition states */ + /* memory for hash entries */ + entry_mem = MemoryContextMemAllocated(aggstate->hash_tablecxt, true); + + /* memory for byref transition states */ hashkey_mem = MemoryContextMemAllocated(aggstate->hashcontext->ecxt_per_tuple_memory, true); /* memory for read/write tape buffers, if spilled */ @@ -1962,7 +1974,7 @@ hash_agg_update_metrics(AggState *aggstate, bool from_tape, int npartitions) buffer_mem += HASHAGG_READ_BUFFER_SIZE; /* update peak mem */ - total_mem = meta_mem + hashkey_mem + buffer_mem; + total_mem = meta_mem + entry_mem + hashkey_mem + buffer_mem; if (total_mem > aggstate->hash_mem_peak) aggstate->hash_mem_peak = total_mem; @@ -1984,6 +1996,64 @@ hash_agg_update_metrics(AggState *aggstate, bool from_tape, int npartitions) } } +/* + * Create memory contexts used for hash aggregation. + */ +static void +hash_create_memory(AggState *aggstate) +{ + Size maxBlockSize = ALLOCSET_DEFAULT_MAXSIZE; + + /* + * The hashcontext's per-tuple memory will be used for byref transition + * values and returned by AggCheckCallContext(). + */ + aggstate->hashcontext = CreateWorkExprContext(aggstate->ss.ps.state); + + /* + * The meta context will be used for the bucket array of + * TupleHashEntryData (or arrays, in the case of grouping sets). As the + * hash table grows, the bucket array will double in size and the old one + * will be freed, so an AllocSet is appropriate. For large bucket arrays, + * the large allocation path will be used, so it's not worth worrying + * about wasting space due to power-of-two allocations. 
+ */ + aggstate->hash_metacxt = AllocSetContextCreate(aggstate->ss.ps.state->es_query_cxt, + "HashAgg meta context", + ALLOCSET_DEFAULT_SIZES); + + /* + * The hash entries themselves, which include the grouping key + * (firstTuple) and pergroup data, are stored in the table context. The + * bump allocator can be used because the entries are not freed until the + * entire hash table is reset. The bump allocator is faster for + * allocations and avoids wasting space on the chunk header or + * power-of-two allocations. + * + * Like CreateWorkExprContext(), use smaller sizings for smaller work_mem, + * to avoid large jumps in memory usage. + */ + + /* + * Like CreateWorkExprContext(), use smaller sizings for smaller work_mem, + * to avoid large jumps in memory usage. + */ + maxBlockSize = pg_prevpower2_size_t(work_mem * (Size) 1024 / 16); + + /* But no bigger than ALLOCSET_DEFAULT_MAXSIZE */ + maxBlockSize = Min(maxBlockSize, ALLOCSET_DEFAULT_MAXSIZE); + + /* and no smaller than ALLOCSET_DEFAULT_INITSIZE */ + maxBlockSize = Max(maxBlockSize, ALLOCSET_DEFAULT_INITSIZE); + + aggstate->hash_tablecxt = BumpContextCreate(aggstate->ss.ps.state->es_query_cxt, + "HashAgg table context", + ALLOCSET_DEFAULT_MINSIZE, + ALLOCSET_DEFAULT_INITSIZE, + maxBlockSize); + +} + /* * Choose a reasonable number of buckets for the initial hash table size. */ @@ -2645,6 +2715,7 @@ agg_refill_hash_table(AggState *aggstate) /* free memory and reset hash tables */ ReScanExprContext(aggstate->hashcontext); + MemoryContextReset(aggstate->hash_tablecxt); for (int setno = 0; setno < aggstate->num_hashes; setno++) ResetTupleHashTable(aggstate->perhash[setno].hashtable); @@ -3329,7 +3400,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags) } if (use_hashing) - aggstate->hashcontext = CreateWorkExprContext(estate); + hash_create_memory(aggstate); ExecAssignExprContext(estate, &aggstate->ss.ps); @@ -3624,9 +3695,6 @@ ExecInitAgg(Agg *node, EState *estate, int eflags) Plan *outerplan = outerPlan(node); uint64 totalGroups = 0; - aggstate->hash_metacxt = AllocSetContextCreate(aggstate->ss.ps.state->es_query_cxt, - "HashAgg meta context", - ALLOCSET_DEFAULT_SIZES); aggstate->hash_spill_rslot = ExecInitExtraTupleSlot(estate, scanDesc, &TTSOpsMinimalTuple); aggstate->hash_spill_wslot = ExecInitExtraTupleSlot(estate, scanDesc, @@ -4371,6 +4439,12 @@ ExecEndAgg(AggState *node) MemoryContextDelete(node->hash_metacxt); node->hash_metacxt = NULL; } + if (node->hash_tablecxt != NULL) + { + MemoryContextDelete(node->hash_tablecxt); + node->hash_tablecxt = NULL; + } + for (transno = 0; transno < node->numtrans; transno++) { @@ -4487,6 +4561,7 @@ ExecReScanAgg(AggState *node) node->hash_ngroups_current = 0; ReScanExprContext(node->hashcontext); + MemoryContextReset(node->hash_tablecxt); /* Rebuild an empty hash table */ build_hash_tables(node); node->table_filled = false; diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index a323fa98bbb..d9ad6ed63a6 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -2569,7 +2569,8 @@ typedef struct AggState /* these fields are used in AGG_HASHED and AGG_MIXED modes: */ bool table_filled; /* hash table filled yet? 
*/ int num_hashes; - MemoryContext hash_metacxt; /* memory for hash table itself */ + MemoryContext hash_metacxt; /* memory for hash table bucket array */ + MemoryContext hash_tablecxt; /* memory for hash table entries */ struct LogicalTapeSet *hash_tapeset; /* tape set for hash spill tapes */ struct HashAggSpill *hash_spills; /* HashAggSpill for each grouping set, * exists only during first pass */ @@ -2595,7 +2596,7 @@ typedef struct AggState * per-group pointers */ /* support for evaluation of agg input expressions: */ -#define FIELDNO_AGGSTATE_ALL_PERGROUPS 53 +#define FIELDNO_AGGSTATE_ALL_PERGROUPS 54 AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than * ->hash_pergroup */ SharedAggInfo *shared_info; /* one entry per worker */ -- 2.34.1
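For reference, a sketch of how the new maxBlockSize clamp in 0001 behaves
at a few work_mem settings, assuming the stock ALLOCSET_DEFAULT_INITSIZE /
ALLOCSET_DEFAULT_MAXSIZE values of 8 kB and 8 MB:

    maxBlockSize = pg_prevpower2_size_t(work_mem * (Size) 1024 / 16);
    maxBlockSize = Min(maxBlockSize, ALLOCSET_DEFAULT_MAXSIZE);
    maxBlockSize = Max(maxBlockSize, ALLOCSET_DEFAULT_INITSIZE);

    /*
     *   work_mem =   64 kB:  64 * 1024 / 16 =   4 kB -> raised to 8 kB
     *   work_mem = 4096 kB:                    256 kB -> used as-is
     *   work_mem >= 128 MB:                   >= 8 MB -> capped at 8 MB
     */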
From 48686d14db7a899aaa21ca303f7de35a8e8bc94c Mon Sep 17 00:00:00 2001 From: Jeff Davis <j...@j-davis.com> Date: Mon, 13 Jan 2025 14:37:02 -0800 Subject: [PATCH v8 2/7] Create accessor functions for TupleHashEntry. --- src/backend/executor/execGrouping.c | 54 ++++++++++++++++++++++++++--- src/backend/executor/nodeAgg.c | 32 ++++++++--------- src/backend/executor/nodeSetOp.c | 33 ++++++++++++------ src/backend/executor/nodeSubplan.c | 2 +- src/include/executor/executor.h | 4 +++ src/include/nodes/execnodes.h | 11 ++---- 6 files changed, 95 insertions(+), 41 deletions(-) diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c index 33b124fbb0a..4a90db94f8d 100644 --- a/src/backend/executor/execGrouping.c +++ b/src/backend/executor/execGrouping.c @@ -20,6 +20,14 @@ #include "miscadmin.h" #include "utils/lsyscache.h" +struct TupleHashEntryData +{ + MinimalTuple firstTuple; /* copy of first tuple in this group */ + void *additional; /* user data */ + uint32 status; /* hash status */ + uint32 hash; /* hash value (cached) */ +}; + static int TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2); static inline uint32 TupleHashTableHash_internal(struct tuplehash_hash *tb, const MinimalTuple tuple); @@ -174,13 +182,15 @@ BuildTupleHashTable(PlanState *parent, bool use_variable_hash_iv) { TupleHashTable hashtable; - Size entrysize = sizeof(TupleHashEntryData) + additionalsize; + Size entrysize; Size hash_mem_limit; MemoryContext oldcontext; bool allow_jit; uint32 hash_iv = 0; Assert(nbuckets > 0); + additionalsize = MAXALIGN(additionalsize); + entrysize = sizeof(TupleHashEntryData) + additionalsize; /* Limit initial table size request to not more than hash_mem */ hash_mem_limit = get_hash_memory_limit() / entrysize; @@ -196,6 +206,7 @@ BuildTupleHashTable(PlanState *parent, hashtable->tab_collations = collations; hashtable->tablecxt = tablecxt; hashtable->tempcxt = tempcxt; + hashtable->additionalsize = additionalsize; hashtable->tableslot = NULL; /* will be made on first lookup */ hashtable->inputslot = NULL; hashtable->in_hash_expr = NULL; @@ -273,6 +284,38 @@ ResetTupleHashTable(TupleHashTable hashtable) tuplehash_reset(hashtable->hashtab); } +/* + * Return size of the hash bucket. Useful for estimating memory usage. + */ +size_t +TupleHashEntrySize(void) +{ + return sizeof(TupleHashEntryData); +} + +/* + * Return tuple from hash entry. + */ +MinimalTuple +TupleHashEntryGetTuple(TupleHashEntry entry) +{ + return entry->firstTuple; +} + +/* + * Get a pointer into the additional space allocated for this entry. The + * memory will be maxaligned and zeroed. + * + * The amount of space available is the additionalsize requested in the call + * to BuildTupleHashTable(). If additionalsize was specified as zero, return + * NULL. + */ +void * +TupleHashEntryGetAdditional(TupleHashTable hashtable, TupleHashEntry entry) +{ + return entry->additional; +} + /* * Find or create a hashtable entry for the tuple group containing the * given tuple. The tuple must be the same type as the hashtable entries. 
@@ -479,11 +522,14 @@ LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot, { /* created new entry */ *isnew = true; - /* zero caller data */ - entry->additional = NULL; + MemoryContextSwitchTo(hashtable->tablecxt); - /* Copy the first tuple into the table context */ + entry->firstTuple = ExecCopySlotMinimalTuple(slot); + if (hashtable->additionalsize > 0) + entry->additional = palloc0(hashtable->additionalsize); + else + entry->additional = NULL; } } else diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c index 785fc5c00f3..62c7d585c8a 100644 --- a/src/backend/executor/nodeAgg.c +++ b/src/backend/executor/nodeAgg.c @@ -1494,7 +1494,7 @@ build_hash_tables(AggState *aggstate) #ifdef USE_INJECTION_POINTS if (IS_INJECTION_POINT_ATTACHED("hash-aggregate-oversize-table")) { - nbuckets = memory / sizeof(TupleHashEntryData); + nbuckets = memory / TupleHashEntrySize(); INJECTION_POINT_CACHED("hash-aggregate-oversize-table"); } #endif @@ -1727,7 +1727,7 @@ hash_agg_entry_size(int numTrans, Size tupleWidth, Size transitionSpace) transitionChunkSize = 0; return - sizeof(TupleHashEntryData) + + TupleHashEntrySize() + tupleChunkSize + pergroupChunkSize + transitionChunkSize; @@ -1991,7 +1991,7 @@ hash_agg_update_metrics(AggState *aggstate, bool from_tape, int npartitions) if (aggstate->hash_ngroups_current > 0) { aggstate->hashentrysize = - sizeof(TupleHashEntryData) + + TupleHashEntrySize() + (hashkey_mem / (double) aggstate->hash_ngroups_current); } } @@ -2150,11 +2150,7 @@ initialize_hash_entry(AggState *aggstate, TupleHashTable hashtable, if (aggstate->numtrans == 0) return; - pergroup = (AggStatePerGroup) - MemoryContextAlloc(hashtable->tablecxt, - sizeof(AggStatePerGroupData) * aggstate->numtrans); - - entry->additional = pergroup; + pergroup = (AggStatePerGroup) TupleHashEntryGetAdditional(hashtable, entry); /* * Initialize aggregates for new tuple group, lookup_hash_entries() @@ -2216,7 +2212,7 @@ lookup_hash_entries(AggState *aggstate) { if (isnew) initialize_hash_entry(aggstate, hashtable, entry); - pergroup[setno] = entry->additional; + pergroup[setno] = TupleHashEntryGetAdditional(hashtable, entry); } else { @@ -2751,6 +2747,7 @@ agg_refill_hash_table(AggState *aggstate) { TupleTableSlot *spillslot = aggstate->hash_spill_rslot; TupleTableSlot *hashslot = perhash->hashslot; + TupleHashTable hashtable = perhash->hashtable; TupleHashEntry entry; MinimalTuple tuple; uint32 hash; @@ -2769,14 +2766,14 @@ agg_refill_hash_table(AggState *aggstate) prepare_hash_slot(perhash, aggstate->tmpcontext->ecxt_outertuple, hashslot); - entry = LookupTupleHashEntryHash(perhash->hashtable, hashslot, + entry = LookupTupleHashEntryHash(hashtable, hashslot, p_isnew, hash); if (entry != NULL) { if (isnew) - initialize_hash_entry(aggstate, perhash->hashtable, entry); - aggstate->hash_pergroup[batch->setno] = entry->additional; + initialize_hash_entry(aggstate, hashtable, entry); + aggstate->hash_pergroup[batch->setno] = TupleHashEntryGetAdditional(hashtable, entry); advance_aggregates(aggstate); } else @@ -2868,7 +2865,7 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate) ExprContext *econtext; AggStatePerAgg peragg; AggStatePerGroup pergroup; - TupleHashEntryData *entry; + TupleHashEntry entry; TupleTableSlot *firstSlot; TupleTableSlot *result; AggStatePerHash perhash; @@ -2895,6 +2892,7 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate) for (;;) { TupleTableSlot *hashslot = perhash->hashslot; + TupleHashTable hashtable = perhash->hashtable; int i; 
CHECK_FOR_INTERRUPTS(); @@ -2902,7 +2900,7 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate) /* * Find the next entry in the hash table */ - entry = ScanTupleHashTable(perhash->hashtable, &perhash->hashiter); + entry = ScanTupleHashTable(hashtable, &perhash->hashiter); if (entry == NULL) { int nextset = aggstate->current_set + 1; @@ -2917,7 +2915,7 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate) perhash = &aggstate->perhash[aggstate->current_set]; - ResetTupleHashIterator(perhash->hashtable, &perhash->hashiter); + ResetTupleHashIterator(hashtable, &perhash->hashiter); continue; } @@ -2940,7 +2938,7 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate) * Transform representative tuple back into one with the right * columns. */ - ExecStoreMinimalTuple(entry->firstTuple, hashslot, false); + ExecStoreMinimalTuple(TupleHashEntryGetTuple(entry), hashslot, false); slot_getallattrs(hashslot); ExecClearTuple(firstSlot); @@ -2956,7 +2954,7 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate) } ExecStoreVirtualTuple(firstSlot); - pergroup = (AggStatePerGroup) entry->additional; + pergroup = (AggStatePerGroup) TupleHashEntryGetAdditional(hashtable, entry); /* * Use the representative input tuple for any references to diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c index 5b7ff9c3748..4068481a523 100644 --- a/src/backend/executor/nodeSetOp.c +++ b/src/backend/executor/nodeSetOp.c @@ -424,7 +424,9 @@ setop_fill_hash_table(SetOpState *setopstate) for (;;) { TupleTableSlot *outerslot; + TupleHashTable hashtable = setopstate->hashtable; TupleHashEntryData *entry; + SetOpStatePerGroup pergroup; bool isnew; outerslot = ExecProcNode(outerPlan); @@ -433,20 +435,20 @@ setop_fill_hash_table(SetOpState *setopstate) have_tuples = true; /* Find or build hashtable entry for this tuple's group */ - entry = LookupTupleHashEntry(setopstate->hashtable, + entry = LookupTupleHashEntry(hashtable, outerslot, &isnew, NULL); + pergroup = TupleHashEntryGetAdditional(hashtable, entry); /* If new tuple group, initialize counts to zero */ if (isnew) { - entry->additional = (SetOpStatePerGroup) - MemoryContextAllocZero(setopstate->hashtable->tablecxt, - sizeof(SetOpStatePerGroupData)); + pergroup->numLeft = 0; + pergroup->numRight = 0; } /* Advance the counts */ - ((SetOpStatePerGroup) entry->additional)->numLeft++; + pergroup->numLeft++; /* Must reset expression context after each hashtable lookup */ ResetExprContext(econtext); @@ -465,6 +467,7 @@ setop_fill_hash_table(SetOpState *setopstate) for (;;) { TupleTableSlot *innerslot; + TupleHashTable hashtable = setopstate->hashtable; TupleHashEntryData *entry; innerslot = ExecProcNode(innerPlan); @@ -472,13 +475,17 @@ setop_fill_hash_table(SetOpState *setopstate) break; /* For tuples not seen previously, do not make hashtable entry */ - entry = LookupTupleHashEntry(setopstate->hashtable, + entry = LookupTupleHashEntry(hashtable, innerslot, NULL, NULL); /* Advance the counts if entry is already present */ if (entry) - ((SetOpStatePerGroup) entry->additional)->numRight++; + { + SetOpStatePerGroup pergroup = TupleHashEntryGetAdditional(hashtable, entry); + + pergroup->numRight++; + } /* Must reset expression context after each hashtable lookup */ ResetExprContext(econtext); @@ -496,7 +503,7 @@ setop_fill_hash_table(SetOpState *setopstate) static TupleTableSlot * setop_retrieve_hash_table(SetOpState *setopstate) { - TupleHashEntryData *entry; + TupleHashEntry entry; TupleTableSlot *resultTupleSlot; /* @@ -509,12 +516,15 @@ 
setop_retrieve_hash_table(SetOpState *setopstate) */ while (!setopstate->setop_done) { + TupleHashTable hashtable = setopstate->hashtable; + SetOpStatePerGroup pergroup; + CHECK_FOR_INTERRUPTS(); /* * Find the next entry in the hash table */ - entry = ScanTupleHashTable(setopstate->hashtable, &setopstate->hashiter); + entry = ScanTupleHashTable(hashtable, &setopstate->hashiter); if (entry == NULL) { /* No more entries in hashtable, so done */ @@ -526,12 +536,13 @@ setop_retrieve_hash_table(SetOpState *setopstate) * See if we should emit any copies of this tuple, and if so return * the first copy. */ - set_output_count(setopstate, (SetOpStatePerGroup) entry->additional); + pergroup = TupleHashEntryGetAdditional(hashtable, entry); + set_output_count(setopstate, pergroup); if (setopstate->numOutput > 0) { setopstate->numOutput--; - return ExecStoreMinimalTuple(entry->firstTuple, + return ExecStoreMinimalTuple(TupleHashEntryGetTuple(entry), resultTupleSlot, false); } diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c index 49767ed6a52..f7f6fc2da0b 100644 --- a/src/backend/executor/nodeSubplan.c +++ b/src/backend/executor/nodeSubplan.c @@ -753,7 +753,7 @@ findPartialMatch(TupleHashTable hashtable, TupleTableSlot *slot, { CHECK_FOR_INTERRUPTS(); - ExecStoreMinimalTuple(entry->firstTuple, hashtable->tableslot, false); + ExecStoreMinimalTuple(TupleHashEntryGetTuple(entry), hashtable->tableslot, false); if (!execTuplesUnequal(slot, hashtable->tableslot, numCols, keyColIdx, eqfunctions, diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h index d12e3f451d2..b856afdf563 100644 --- a/src/include/executor/executor.h +++ b/src/include/executor/executor.h @@ -149,6 +149,10 @@ extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable, bool *isnew, uint32 *hash); extern uint32 TupleHashTableHash(TupleHashTable hashtable, TupleTableSlot *slot); +extern size_t TupleHashEntrySize(void); +extern MinimalTuple TupleHashEntryGetTuple(TupleHashEntry entry); +extern void *TupleHashEntryGetAdditional(TupleHashTable hashtable, + TupleHashEntry entry); extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot, bool *isnew, uint32 hash); diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index d9ad6ed63a6..7adfca86dfb 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -834,17 +834,11 @@ typedef struct ExecAuxRowMark * point to tab_hash_expr and tab_eq_func respectively. 
* ---------------------------------------------------------------- */ +struct TupleHashEntryData; +typedef struct TupleHashEntryData TupleHashEntryData; typedef struct TupleHashEntryData *TupleHashEntry; typedef struct TupleHashTableData *TupleHashTable; -typedef struct TupleHashEntryData -{ - MinimalTuple firstTuple; /* copy of first tuple in this group */ - void *additional; /* user data */ - uint32 status; /* hash status */ - uint32 hash; /* hash value (cached) */ -} TupleHashEntryData; - /* define parameters necessary to generate the tuple hash table interface */ #define SH_PREFIX tuplehash #define SH_ELEMENT_TYPE TupleHashEntryData @@ -863,6 +857,7 @@ typedef struct TupleHashTableData Oid *tab_collations; /* collations for hash and comparison */ MemoryContext tablecxt; /* memory context containing table */ MemoryContext tempcxt; /* context for function evaluations */ + Size additionalsize; /* size of additional data */ TupleTableSlot *tableslot; /* slot for referencing table entries */ /* The following fields are set transiently for each table search: */ TupleTableSlot *inputslot; /* current input tuple's slot */ -- 2.34.1
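Abridged, a caller's view of the new accessors (condensed from the
nodeSetOp.c hunks in this patch):

    entry = LookupTupleHashEntry(hashtable, outerslot, &isnew, NULL);
    pergroup = TupleHashEntryGetAdditional(hashtable, entry);

    if (isnew)
    {
        /* additional space was zeroed when the entry was created */
        pergroup->numLeft = 0;
        pergroup->numRight = 0;
    }
    pergroup->numLeft++;

    /* ... and when emitting results: */
    return ExecStoreMinimalTuple(TupleHashEntryGetTuple(entry),
                                 resultTupleSlot, false);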
From 03e63a2c0769522fb20a394db7333692def8cc88 Mon Sep 17 00:00:00 2001 From: Jeff Davis <j...@j-davis.com> Date: Tue, 4 Mar 2025 15:51:19 -0800 Subject: [PATCH v8 3/7] Inline TupleHashEntryGetTuple() and ...GetAdditional(). --- src/backend/executor/execGrouping.c | 40 ----------------------------- src/include/executor/executor.h | 38 ++++++++++++++++++++++++--- src/include/nodes/execnodes.h | 10 ++++++-- 3 files changed, 42 insertions(+), 46 deletions(-) diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c index 4a90db94f8d..a9d212aaec6 100644 --- a/src/backend/executor/execGrouping.c +++ b/src/backend/executor/execGrouping.c @@ -20,14 +20,6 @@ #include "miscadmin.h" #include "utils/lsyscache.h" -struct TupleHashEntryData -{ - MinimalTuple firstTuple; /* copy of first tuple in this group */ - void *additional; /* user data */ - uint32 status; /* hash status */ - uint32 hash; /* hash value (cached) */ -}; - static int TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2); static inline uint32 TupleHashTableHash_internal(struct tuplehash_hash *tb, const MinimalTuple tuple); @@ -284,38 +276,6 @@ ResetTupleHashTable(TupleHashTable hashtable) tuplehash_reset(hashtable->hashtab); } -/* - * Return size of the hash bucket. Useful for estimating memory usage. - */ -size_t -TupleHashEntrySize(void) -{ - return sizeof(TupleHashEntryData); -} - -/* - * Return tuple from hash entry. - */ -MinimalTuple -TupleHashEntryGetTuple(TupleHashEntry entry) -{ - return entry->firstTuple; -} - -/* - * Get a pointer into the additional space allocated for this entry. The - * memory will be maxaligned and zeroed. - * - * The amount of space available is the additionalsize requested in the call - * to BuildTupleHashTable(). If additionalsize was specified as zero, return - * NULL. - */ -void * -TupleHashEntryGetAdditional(TupleHashTable hashtable, TupleHashEntry entry) -{ - return entry->additional; -} - /* * Find or create a hashtable entry for the tuple group containing the * given tuple. The tuple must be the same type as the hashtable entries. diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h index b856afdf563..d0aa0e94241 100644 --- a/src/include/executor/executor.h +++ b/src/include/executor/executor.h @@ -149,10 +149,6 @@ extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable, bool *isnew, uint32 *hash); extern uint32 TupleHashTableHash(TupleHashTable hashtable, TupleTableSlot *slot); -extern size_t TupleHashEntrySize(void); -extern MinimalTuple TupleHashEntryGetTuple(TupleHashEntry entry); -extern void *TupleHashEntryGetAdditional(TupleHashTable hashtable, - TupleHashEntry entry); extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot, bool *isnew, uint32 hash); @@ -162,6 +158,40 @@ extern TupleHashEntry FindTupleHashEntry(TupleHashTable hashtable, ExprState *hashexpr); extern void ResetTupleHashTable(TupleHashTable hashtable); +#ifndef FRONTEND +/* + * Return size of the hash bucket. Useful for estimating memory usage. + */ +static inline size_t +TupleHashEntrySize(void) +{ + return sizeof(TupleHashEntryData); +} + +/* + * Return tuple from hash entry. + */ +static inline MinimalTuple +TupleHashEntryGetTuple(TupleHashEntry entry) +{ + return entry->firstTuple; +} + +/* + * Get a pointer into the additional space allocated for this entry. The + * memory will be maxaligned and zeroed. 
+ * + * The amount of space available is the additionalsize requested in the call + * to BuildTupleHashTable(). If additionalsize was specified as zero, return + * NULL. + */ +static inline void * +TupleHashEntryGetAdditional(TupleHashTable hashtable, TupleHashEntry entry) +{ + return entry->additional; +} +#endif + /* * prototypes from functions in execJunk.c */ diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index 7adfca86dfb..b9ca5f7f5fa 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -834,11 +834,17 @@ typedef struct ExecAuxRowMark * point to tab_hash_expr and tab_eq_func respectively. * ---------------------------------------------------------------- */ -struct TupleHashEntryData; -typedef struct TupleHashEntryData TupleHashEntryData; typedef struct TupleHashEntryData *TupleHashEntry; typedef struct TupleHashTableData *TupleHashTable; +typedef struct TupleHashEntryData +{ + MinimalTuple firstTuple; /* copy of first tuple in this group */ + void *additional; /* user data */ + uint32 status; /* hash status */ + uint32 hash; /* hash value (cached) */ +} TupleHashEntryData; + /* define parameters necessary to generate the tuple hash table interface */ #define SH_PREFIX tuplehash #define SH_ELEMENT_TYPE TupleHashEntryData -- 2.34.1
From 7470eb62b45dd5ee5280d8f834a910a64c8bcc85 Mon Sep 17 00:00:00 2001 From: Jeff Davis <j...@j-davis.com> Date: Tue, 4 Mar 2025 15:13:22 -0800 Subject: [PATCH v8 4/7] Remove 'additional' pointer from TupleHashEntryData. Reduces memory required for hash aggregation by avoiding an allocation and a pointer in the TupleHashEntryData structure. That structure is used for all buckets, whether occupied or not, so the savings is substantial. Discussion: https://postgr.es/m/AApHDvpN4v3t_sdz4dvrv1Fx_ZPw=twsnxuteytryp7lfz5...@mail.gmail.com Reviewed-by: David Rowley <dgrowle...@gmail.com> --- src/backend/executor/execGrouping.c | 32 ++++++++++++++++++++++------- src/include/executor/executor.h | 5 ++++- src/include/nodes/execnodes.h | 1 - 3 files changed, 29 insertions(+), 9 deletions(-) diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c index a9d212aaec6..5c06cc6da40 100644 --- a/src/backend/executor/execGrouping.c +++ b/src/backend/executor/execGrouping.c @@ -481,15 +481,33 @@ LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot, else { /* created new entry */ - *isnew = true; + bool shouldFree; + char *mem; + MinimalTuple mtup; - MemoryContextSwitchTo(hashtable->tablecxt); + *isnew = true; - entry->firstTuple = ExecCopySlotMinimalTuple(slot); - if (hashtable->additionalsize > 0) - entry->additional = palloc0(hashtable->additionalsize); - else - entry->additional = NULL; + MemoryContextSwitchTo(hashtable->tempcxt); + + /* ignore shouldFree, because we're in the temp context anyway */ + mtup = ExecFetchSlotMinimalTuple(slot, &shouldFree); + + /* + * Allocate enough memory in the tablecxt for the additionalsize + * followed by the tuple. Clear the additional data, and copy the + * tuple. + * + * The caller can get a pointer to the additional data with + * TupleHashEntryGetAdditional(), and store arbitrary data there. + * Placing both the tuple and additional data in the same + * allocation avoids the need to store an extra pointer in + * TupleHashEntryData or allocate an additional chunk. + */ + mem = MemoryContextAlloc(hashtable->tablecxt, + mtup->t_len + hashtable->additionalsize); + memset(mem, 0, hashtable->additionalsize); + entry->firstTuple = (MinimalTuple) (mem + hashtable->additionalsize); + memcpy(entry->firstTuple, mtup, mtup->t_len); } } else diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h index d0aa0e94241..90fa466dea4 100644 --- a/src/include/executor/executor.h +++ b/src/include/executor/executor.h @@ -188,7 +188,10 @@ TupleHashEntryGetTuple(TupleHashEntry entry) static inline void * TupleHashEntryGetAdditional(TupleHashTable hashtable, TupleHashEntry entry) { - return entry->additional; + if (hashtable->additionalsize > 0) + return (char *) entry->firstTuple - hashtable->additionalsize; + else + return NULL; } #endif diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index b9ca5f7f5fa..e93533dbbc5 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -840,7 +840,6 @@ typedef struct TupleHashTableData *TupleHashTable; typedef struct TupleHashEntryData { MinimalTuple firstTuple; /* copy of first tuple in this group */ - void *additional; /* user data */ uint32 status; /* hash status */ uint32 hash; /* hash value (cached) */ } TupleHashEntryData; -- 2.34.1
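Back-of-the-envelope numbers for the savings claimed above, assuming a
typical 64-bit build with 8-byte pointers (not part of the patch):

    /*
     * sizeof(TupleHashEntryData):
     *
     *   before: firstTuple (8) + additional (8) + status (4) + hash (4) = 24
     *   after:  firstTuple (8) +                  status (4) + hash (4) = 16
     *
     * That saves 8 bytes per bucket, occupied or not, plus one palloc chunk
     * (and its header) per occupied entry, since the per-group data now
     * shares the firstTuple allocation.
     */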
From a8c5666c7021c7efed11ab08169320618fc20c91 Mon Sep 17 00:00:00 2001 From: Jeff Davis <j...@j-davis.com> Date: Fri, 7 Feb 2025 13:14:50 -0800 Subject: [PATCH v8 5/7] Add ExecCopySlotMinimalTupleExtra(). Allows an "extra" argument that allocates extra memory at the end of the MinimalTuple. This is important for callers that need to store additional data, but do not want to perform an additional allocation. --- src/backend/access/common/heaptuple.c | 27 +++++++++++++++++----- src/backend/executor/execTuples.c | 24 ++++++++++--------- src/backend/executor/nodeGatherMerge.c | 2 +- src/backend/utils/sort/tuplesortvariants.c | 2 +- src/backend/utils/sort/tuplestore.c | 6 ++--- src/include/access/htup_details.h | 7 +++--- src/include/executor/tuptable.h | 20 ++++++++++++++-- 7 files changed, 61 insertions(+), 27 deletions(-) diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c index acd5da4ccf8..969d1028cae 100644 --- a/src/backend/access/common/heaptuple.c +++ b/src/backend/access/common/heaptuple.c @@ -1452,9 +1452,11 @@ heap_freetuple(HeapTuple htup) MinimalTuple heap_form_minimal_tuple(TupleDesc tupleDescriptor, const Datum *values, - const bool *isnull) + const bool *isnull, + Size extra) { MinimalTuple tuple; /* return tuple */ + char *mem; Size len, data_len; int hoff; @@ -1462,6 +1464,8 @@ heap_form_minimal_tuple(TupleDesc tupleDescriptor, int numberOfAttributes = tupleDescriptor->natts; int i; + Assert(extra == MAXALIGN(extra)); + if (numberOfAttributes > MaxTupleAttributeNumber) ereport(ERROR, (errcode(ERRCODE_TOO_MANY_COLUMNS), @@ -1497,7 +1501,9 @@ heap_form_minimal_tuple(TupleDesc tupleDescriptor, /* * Allocate and zero the space needed. */ - tuple = (MinimalTuple) palloc0(len); + mem = palloc0(len + extra); + memset(mem, 0, extra); + tuple = (MinimalTuple) (mem + extra); /* * And fill in the information. @@ -1533,11 +1539,15 @@ heap_free_minimal_tuple(MinimalTuple mtup) * The result is allocated in the current memory context. */ MinimalTuple -heap_copy_minimal_tuple(MinimalTuple mtup) +heap_copy_minimal_tuple(MinimalTuple mtup, Size extra) { MinimalTuple result; + char *mem; - result = (MinimalTuple) palloc(mtup->t_len); + Assert(extra == MAXALIGN(extra)); + mem = palloc(mtup->t_len + extra); + memset(mem, 0, extra); + result = (MinimalTuple) (mem + extra); memcpy(result, mtup, mtup->t_len); return result; } @@ -1574,15 +1584,20 @@ heap_tuple_from_minimal_tuple(MinimalTuple mtup) * The result is allocated in the current memory context. 
*/ MinimalTuple -minimal_tuple_from_heap_tuple(HeapTuple htup) +minimal_tuple_from_heap_tuple(HeapTuple htup, Size extra) { MinimalTuple result; + char *mem; uint32 len; + Assert(extra == MAXALIGN(extra)); Assert(htup->t_len > MINIMAL_TUPLE_OFFSET); len = htup->t_len - MINIMAL_TUPLE_OFFSET; - result = (MinimalTuple) palloc(len); + mem = palloc(len + extra); + memset(mem, 0, extra); + result = (MinimalTuple) (mem + extra); memcpy(result, (char *) htup->t_data + MINIMAL_TUPLE_OFFSET, len); + result->t_len = len; return result; } diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c index 7de490462d4..8e02d68824f 100644 --- a/src/backend/executor/execTuples.c +++ b/src/backend/executor/execTuples.c @@ -298,13 +298,14 @@ tts_virtual_copy_heap_tuple(TupleTableSlot *slot) } static MinimalTuple -tts_virtual_copy_minimal_tuple(TupleTableSlot *slot) +tts_virtual_copy_minimal_tuple(TupleTableSlot *slot, Size extra) { Assert(!TTS_EMPTY(slot)); return heap_form_minimal_tuple(slot->tts_tupleDescriptor, slot->tts_values, - slot->tts_isnull); + slot->tts_isnull, + extra); } @@ -472,14 +473,14 @@ tts_heap_copy_heap_tuple(TupleTableSlot *slot) } static MinimalTuple -tts_heap_copy_minimal_tuple(TupleTableSlot *slot) +tts_heap_copy_minimal_tuple(TupleTableSlot *slot, Size extra) { HeapTupleTableSlot *hslot = (HeapTupleTableSlot *) slot; if (!hslot->tuple) tts_heap_materialize(slot); - return minimal_tuple_from_heap_tuple(hslot->tuple); + return minimal_tuple_from_heap_tuple(hslot->tuple, extra); } static void @@ -607,7 +608,8 @@ tts_minimal_materialize(TupleTableSlot *slot) { mslot->mintuple = heap_form_minimal_tuple(slot->tts_tupleDescriptor, slot->tts_values, - slot->tts_isnull); + slot->tts_isnull, + 0); } else { @@ -617,7 +619,7 @@ tts_minimal_materialize(TupleTableSlot *slot) * TTS_FLAG_SHOULDFREE set). Copy the minimal tuple into the given * slot's memory context. 
*/ - mslot->mintuple = heap_copy_minimal_tuple(mslot->mintuple); + mslot->mintuple = heap_copy_minimal_tuple(mslot->mintuple, 0); } slot->tts_flags |= TTS_FLAG_SHOULDFREE; @@ -666,14 +668,14 @@ tts_minimal_copy_heap_tuple(TupleTableSlot *slot) } static MinimalTuple -tts_minimal_copy_minimal_tuple(TupleTableSlot *slot) +tts_minimal_copy_minimal_tuple(TupleTableSlot *slot, Size extra) { MinimalTupleTableSlot *mslot = (MinimalTupleTableSlot *) slot; if (!mslot->mintuple) tts_minimal_materialize(slot); - return heap_copy_minimal_tuple(mslot->mintuple); + return heap_copy_minimal_tuple(mslot->mintuple, extra); } static void @@ -926,7 +928,7 @@ tts_buffer_heap_copy_heap_tuple(TupleTableSlot *slot) } static MinimalTuple -tts_buffer_heap_copy_minimal_tuple(TupleTableSlot *slot) +tts_buffer_heap_copy_minimal_tuple(TupleTableSlot *slot, Size extra) { BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot; @@ -935,7 +937,7 @@ tts_buffer_heap_copy_minimal_tuple(TupleTableSlot *slot) if (!bslot->base.tuple) tts_buffer_heap_materialize(slot); - return minimal_tuple_from_heap_tuple(bslot->base.tuple); + return minimal_tuple_from_heap_tuple(bslot->base.tuple, extra); } static inline void @@ -1895,7 +1897,7 @@ ExecFetchSlotMinimalTuple(TupleTableSlot *slot, { if (shouldFree) *shouldFree = true; - return slot->tts_ops->copy_minimal_tuple(slot); + return slot->tts_ops->copy_minimal_tuple(slot, 0); } } diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c index 01a6e3a8553..15f84597067 100644 --- a/src/backend/executor/nodeGatherMerge.c +++ b/src/backend/executor/nodeGatherMerge.c @@ -735,7 +735,7 @@ gm_readnext_tuple(GatherMergeState *gm_state, int nreader, bool nowait, * Since we'll be buffering these across multiple calls, we need to make a * copy. */ - return tup ? heap_copy_minimal_tuple(tup) : NULL; + return tup ? 
heap_copy_minimal_tuple(tup, 0) : NULL; } /* diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c index eb8601e2257..10aa51d7323 100644 --- a/src/backend/utils/sort/tuplesortvariants.c +++ b/src/backend/utils/sort/tuplesortvariants.c @@ -1004,7 +1004,7 @@ tuplesort_gettupleslot(Tuplesortstate *state, bool forward, bool copy, *abbrev = stup.datum1; if (copy) - stup.tuple = heap_copy_minimal_tuple((MinimalTuple) stup.tuple); + stup.tuple = heap_copy_minimal_tuple((MinimalTuple) stup.tuple, 0); ExecStoreMinimalTuple((MinimalTuple) stup.tuple, slot, copy); return true; diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c index d61b601053c..c9aecab8d66 100644 --- a/src/backend/utils/sort/tuplestore.c +++ b/src/backend/utils/sort/tuplestore.c @@ -787,7 +787,7 @@ tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc, MinimalTuple tuple; MemoryContext oldcxt = MemoryContextSwitchTo(state->context); - tuple = heap_form_minimal_tuple(tdesc, values, isnull); + tuple = heap_form_minimal_tuple(tdesc, values, isnull, 0); USEMEM(state, GetMemoryChunkSpace(tuple)); tuplestore_puttuple_common(state, tuple); @@ -1139,7 +1139,7 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward, { if (copy && !should_free) { - tuple = heap_copy_minimal_tuple(tuple); + tuple = heap_copy_minimal_tuple(tuple, 0); should_free = true; } ExecStoreMinimalTuple(tuple, slot, should_free); @@ -1590,7 +1590,7 @@ copytup_heap(Tuplestorestate *state, void *tup) { MinimalTuple tuple; - tuple = minimal_tuple_from_heap_tuple((HeapTuple) tup); + tuple = minimal_tuple_from_heap_tuple((HeapTuple) tup, 0); USEMEM(state, GetMemoryChunkSpace(tuple)); return tuple; } diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h index 6cd4b95bfdb..aa957cf3b01 100644 --- a/src/include/access/htup_details.h +++ b/src/include/access/htup_details.h @@ -839,11 +839,12 @@ extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc, Datum *values, bool *isnull); extern void heap_freetuple(HeapTuple htup); extern MinimalTuple heap_form_minimal_tuple(TupleDesc tupleDescriptor, - const Datum *values, const bool *isnull); + const Datum *values, const bool *isnull, + Size extra); extern void heap_free_minimal_tuple(MinimalTuple mtup); -extern MinimalTuple heap_copy_minimal_tuple(MinimalTuple mtup); +extern MinimalTuple heap_copy_minimal_tuple(MinimalTuple mtup, Size extra); extern HeapTuple heap_tuple_from_minimal_tuple(MinimalTuple mtup); -extern MinimalTuple minimal_tuple_from_heap_tuple(HeapTuple htup); +extern MinimalTuple minimal_tuple_from_heap_tuple(HeapTuple htup, Size extra); extern size_t varsize_any(void *p); extern HeapTuple heap_expand_tuple(HeapTuple sourceTuple, TupleDesc tupleDesc); extern MinimalTuple minimal_expand_tuple(HeapTuple sourceTuple, TupleDesc tupleDesc); diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h index a044d78e4d0..095e4cc82e3 100644 --- a/src/include/executor/tuptable.h +++ b/src/include/executor/tuptable.h @@ -218,8 +218,12 @@ struct TupleTableSlotOps * meaningful "system columns" in the copy. The copy is not be "owned" by * the slot i.e. the caller has to take responsibility to free memory * consumed by the slot. + * + * The copy has "extra" bytes (maxaligned and zeroed) available before the + * tuple, which is useful so that some callers may store extra data along + * with the minimal tuple without the need for an additional allocation. 
*/ - MinimalTuple (*copy_minimal_tuple) (TupleTableSlot *slot); + MinimalTuple (*copy_minimal_tuple) (TupleTableSlot *slot, Size extra); }; /* @@ -491,7 +495,19 @@ ExecCopySlotHeapTuple(TupleTableSlot *slot) static inline MinimalTuple ExecCopySlotMinimalTuple(TupleTableSlot *slot) { - return slot->tts_ops->copy_minimal_tuple(slot); + return slot->tts_ops->copy_minimal_tuple(slot, 0); +} + +/* + * ExecCopySlotMinimalTupleExtra - return MinimalTuple allocated in caller's + * context, with extra bytes (maxaligned and zeroed) before the tuple for data + * the caller wishes to store along with the tuple (without requiring the + * caller to make an additional allocation). + */ +static inline MinimalTuple +ExecCopySlotMinimalTupleExtra(TupleTableSlot *slot, Size extra) +{ + return slot->tts_ops->copy_minimal_tuple(slot, extra); } /* -- 2.34.1
From d05492e9078f00f1b4585312108f9f2fe0b48976 Mon Sep 17 00:00:00 2001 From: Jeff Davis <j...@j-davis.com> Date: Tue, 4 Mar 2025 12:46:48 -0800 Subject: [PATCH v8 6/7] Use ExecCopySlotMinimalTupleExtra() to avoid a memcpy. Previously, ExecFetchSlotMinimalTuple() could cause an unnecessary copy for hash-based set operations. Use the new API ExecCopySlotMinimalTupleExtra() to perform the copy while also allocating the extra space needed for the 'additional' data. Discussion: https://postgr.es/m/AApHDvpN4v3t_sdz4dvrv1Fx_ZPw=twsnxuteytryp7lfz5...@mail.gmail.com Reviewed-by: David Rowley <dgrowle...@gmail.com> --- src/backend/executor/execGrouping.c | 21 +++++---------------- 1 file changed, 5 insertions(+), 16 deletions(-) diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c index 5c06cc6da40..255bd795361 100644 --- a/src/backend/executor/execGrouping.c +++ b/src/backend/executor/execGrouping.c @@ -481,21 +481,13 @@ LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot, else { /* created new entry */ - bool shouldFree; - char *mem; - MinimalTuple mtup; - *isnew = true; - MemoryContextSwitchTo(hashtable->tempcxt); - - /* ignore shouldFree, because we're in the temp context anyway */ - mtup = ExecFetchSlotMinimalTuple(slot, &shouldFree); + MemoryContextSwitchTo(hashtable->tablecxt); /* - * Allocate enough memory in the tablecxt for the additionalsize - * followed by the tuple. Clear the additional data, and copy the - * tuple. + * Copy the first tuple into the table context, and request + * additionalsize extra bytes before the allocation. * * The caller can get a pointer to the additional data with * TupleHashEntryGetAdditional(), and store arbitrary data there. @@ -503,11 +495,8 @@ LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot, * allocation avoids the need to store an extra pointer in * TupleHashEntryData or allocate an additional chunk. */ - mem = MemoryContextAlloc(hashtable->tablecxt, - mtup->t_len + hashtable->additionalsize); - memset(mem, 0, hashtable->additionalsize); - entry->firstTuple = (MinimalTuple) (mem + hashtable->additionalsize); - memcpy(entry->firstTuple, mtup, mtup->t_len); + entry->firstTuple = ExecCopySlotMinimalTupleExtra(slot, + hashtable->additionalsize); } } else -- 2.34.1
From b5144000d171bcfa534fba4fbadff95bb7e446ac Mon Sep 17 00:00:00 2001 From: Jeff Davis <j...@j-davis.com> Date: Tue, 4 Mar 2025 13:21:35 -0800 Subject: [PATCH v8 7/7] Inline MemoryContextMemAllocated(). Speeds up the repeated calls during hash aggregation. Discussion: https://postgr.es/m/CAApHDvpN4v3t_sdz4dvrv1Fx_ZPw=twsnxuteytryp7lfz5...@mail.gmail.com --- src/backend/utils/mmgr/mcxt.c | 68 ----------------------------------- src/include/nodes/memnodes.h | 68 +++++++++++++++++++++++++++++++++++ src/include/utils/memutils.h | 1 - 3 files changed, 68 insertions(+), 69 deletions(-) diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c index 91060de0ab7..ea08bcf8ed2 100644 --- a/src/backend/utils/mmgr/mcxt.c +++ b/src/backend/utils/mmgr/mcxt.c @@ -232,50 +232,6 @@ GetMemoryChunkHeader(const void *pointer) return header; } -/* - * MemoryContextTraverseNext - * Helper function to traverse all descendants of a memory context - * without recursion. - * - * Recursion could lead to out-of-stack errors with deep context hierarchies, - * which would be unpleasant in error cleanup code paths. - * - * To process 'context' and all its descendants, use a loop like this: - * - * <process 'context'> - * for (MemoryContext curr = context->firstchild; - * curr != NULL; - * curr = MemoryContextTraverseNext(curr, context)) - * { - * <process 'curr'> - * } - * - * This visits all the contexts in pre-order, that is a node is visited - * before its children. - */ -static MemoryContext -MemoryContextTraverseNext(MemoryContext curr, MemoryContext top) -{ - /* After processing a node, traverse to its first child if any */ - if (curr->firstchild != NULL) - return curr->firstchild; - - /* - * After processing a childless node, traverse to its next sibling if - * there is one. If there isn't, traverse back up to the parent (which - * has already been visited, and now so have all its descendants). We're - * done if that is "top", otherwise traverse to its next sibling if any, - * otherwise repeat moving up. - */ - while (curr->nextchild == NULL) - { - curr = curr->parent; - if (curr == top) - return NULL; - } - return curr->nextchild; -} - /* * Support routines to trap use of invalid memory context method IDs * (from calling pfree or the like on a bogus pointer). As a possible @@ -754,30 +710,6 @@ MemoryContextIsEmpty(MemoryContext context) return context->methods->is_empty(context); } -/* - * Find the memory allocated to blocks for this memory context. If recurse is - * true, also include children. - */ -Size -MemoryContextMemAllocated(MemoryContext context, bool recurse) -{ - Size total = context->mem_allocated; - - Assert(MemoryContextIsValid(context)); - - if (recurse) - { - for (MemoryContext curr = context->firstchild; - curr != NULL; - curr = MemoryContextTraverseNext(curr, context)) - { - total += curr->mem_allocated; - } - } - - return total; -} - /* * Return the memory consumption statistics about the given context and its * children. diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h index 5807ef805bd..654c52448b5 100644 --- a/src/include/nodes/memnodes.h +++ b/src/include/nodes/memnodes.h @@ -149,4 +149,72 @@ typedef struct MemoryContextData IsA((context), GenerationContext) || \ IsA((context), BumpContext))) +/* + * MemoryContextTraverseNext + * Helper function to traverse all descendants of a memory context + * without recursion. 
+ * + * Recursion could lead to out-of-stack errors with deep context hierarchies, + * which would be unpleasant in error cleanup code paths. + * + * To process 'context' and all its descendants, use a loop like this: + * + * <process 'context'> + * for (MemoryContext curr = context->firstchild; + * curr != NULL; + * curr = MemoryContextTraverseNext(curr, context)) + * { + * <process 'curr'> + * } + * + * This visits all the contexts in pre-order, that is a node is visited + * before its children. + */ +static MemoryContext +MemoryContextTraverseNext(MemoryContext curr, MemoryContext top) +{ + /* After processing a node, traverse to its first child if any */ + if (curr->firstchild != NULL) + return curr->firstchild; + + /* + * After processing a childless node, traverse to its next sibling if + * there is one. If there isn't, traverse back up to the parent (which + * has already been visited, and now so have all its descendants). We're + * done if that is "top", otherwise traverse to its next sibling if any, + * otherwise repeat moving up. + */ + while (curr->nextchild == NULL) + { + curr = curr->parent; + if (curr == top) + return NULL; + } + return curr->nextchild; +} + +/* + * Find the memory allocated to blocks for this memory context. If recurse is + * true, also include children. + */ +static inline Size +MemoryContextMemAllocated(MemoryContext context, bool recurse) +{ + Size total = context->mem_allocated; + + Assert(MemoryContextIsValid(context)); + + if (recurse) + { + for (MemoryContext curr = context->firstchild; + curr != NULL; + curr = MemoryContextTraverseNext(curr, context)) + { + total += curr->mem_allocated; + } + } + + return total; +} + #endif /* MEMNODES_H */ diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h index 8abc26abce2..4f153d0e067 100644 --- a/src/include/utils/memutils.h +++ b/src/include/utils/memutils.h @@ -83,7 +83,6 @@ extern MemoryContext GetMemoryChunkContext(void *pointer); extern Size GetMemoryChunkSpace(void *pointer); extern MemoryContext MemoryContextGetParent(MemoryContext context); extern bool MemoryContextIsEmpty(MemoryContext context); -extern Size MemoryContextMemAllocated(MemoryContext context, bool recurse); extern void MemoryContextMemConsumed(MemoryContext context, MemoryContextCounters *consumed); extern void MemoryContextStats(MemoryContext context); -- 2.34.1