Hi,
On 8/21/24 03:49, bucoo wrote:
> Avoid unnecessary form and deform tuple.
>
Thanks for the patch. Generally speaking, it's usually a good idea to
briefly explain what the patch does, why it's beneficial, in what
circumstances, etc. It's sometimes not quite obvious from the patch
itself (even if the patch is simple). Plus it really helps new
contributors who want to review the patch, and even for experienced
people it's a huge time saver ...
Anyway, I took a look and the basic idea is simple - when shuffling
tuples between batches in a hash join, we're currently deforming the
tuple (->slot) we just read from a batch, only to immediately form it
(slot->) again and write it to the "correct" batch.
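For the inner side in ExecHashJoinNewBatch, the before/after call
sequence looks roughly like this (a simplified sketch based on the
attached diff, with the loop and error handling omitted):

    /* master: the saved tuple takes a detour through hj_HashTupleSlot */
    slot = ExecHashJoinGetSavedTuple(hjstate, innerFile, &hashvalue,
                                     hjstate->hj_HashTupleSlot);
    ExecHashTableInsert(hashtable, slot, hashvalue);
    /* ... which immediately fetches the minimal tuple back out of the slot */

    /* patched: the MinimalTuple is handed straight to the insert routine */
    tuple = ExecHashJoinGetSavedTuple(hjstate, innerFile, &hashvalue, &buf);
    ExecHashTableInsertTuple(hashtable, tuple, hashvalue);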
I think the idea to skip this unnecessary deform/form step is sound, and
I don't see any particular argument against doing that. The code looks
reasonable too, I think.
A couple minor comments:
0) The patch does not apply anymore, thanks to David committing a patch
yesterday. Attached is a patch rebased on top of current master.
1) Wouldn't it be easier (and just as efficient) to use slots with
TTSOpsMinimalTuple, i.e. backed by a minimal tuple? (There's a short
sketch of what I mean below these comments.)
2) I find the naming of the new functions a bit confusing. We now have
the "original" functions working with slots, and then functions with a
"Tuple" suffix working with minimal tuples directly. Seems asymmetric.
3) The code in ExecHashJoinOuterGetTuple is hard to understand; it'd
very much benefit from some comments. I'm also a bit unsure whether the
likely() and unlikely() hints really help.
4) Is the hj_outerTupleBuffer buffer really needed / effective? I'd bet
just using palloc() will work just as well, thanks to the AllocSet
caching etc.
5) You might want to add the patch to the 2024-09 CF, to keep track of
it: https://commitfest.postgresql.org/49/
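To sketch what I mean in (1) - this is purely illustrative, the slot
and descriptor names are not the actual executor fields - a slot
created with the minimal-tuple ops stores and returns a MinimalTuple
without any deform/form or copying:

    TupleTableSlot *slot;
    MinimalTuple    mtup;       /* the tuple just read from the batch file */
    bool            shouldFree;

    /* slot backed by a minimal tuple */
    slot = ExecInitExtraTupleSlot(estate, tupDesc, &TTSOpsMinimalTuple);

    /* storing just keeps the pointer, no deform (memory owned by caller) */
    ExecForceStoreMinimalTuple(mtup, slot, false);

    /* fetching returns the same pointer, no form, shouldFree comes back false */
    mtup = ExecFetchSlotMinimalTuple(slot, &shouldFree);

That way the batch-file code could keep going through the regular slot
APIs and still avoid the extra deform/form work.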
> In the TPCH test, HashJoin speed up to ~2x.
>
Can you provide more information about the benchmark you did? What
hardware, what scale, PostgreSQL configuration, which of the 22 queries
are improved, etc.
I ran TPC-H with 1GB and 10GB scales on two machines, and I saw pretty
much no difference compared to master. However, it occurred to me the
patch only ever helps if we increase the number of batches during
execution, in which case we need to move tuples to the right batch.
Perhaps that's not happening in my testing, but it was happening in
your runs? Either way, we really need to know more about how you did
the testing.
regards
--
Tomas Vondra
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 570a90ebe15..6f5cd16aa2b 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1611,6 +1611,23 @@ ExecHashTableInsert(HashJoinTable hashtable,
{
bool shouldFree;
MinimalTuple tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+ ExecHashTableInsertTuple(hashtable, tuple, hashvalue);
+
+ if (shouldFree)
+ heap_free_minimal_tuple(tuple);
+}
+
+/*
+ * ExecHashTableInsertTuple
+ * insert a tuple into the hash table depending on the hash value
+ * it may just go to a temp file for later batches
+ */
+void
+ExecHashTableInsertTuple(HashJoinTable hashtable,
+ MinimalTuple tuple,
+ uint32 hashvalue)
+{
int bucketno;
int batchno;
@@ -1685,9 +1702,6 @@ ExecHashTableInsert(HashJoinTable hashtable,
&hashtable->innerBatchFile[batchno],
hashtable);
}
-
- if (shouldFree)
- heap_free_minimal_tuple(tuple);
}
/*
@@ -1761,12 +1775,10 @@ retry:
* tuples that belong in the current batch once growth has been disabled.
*/
void
-ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable,
- TupleTableSlot *slot,
- uint32 hashvalue)
+ExecParallelHashTableInsertCurrentBatchTuple(HashJoinTable hashtable,
+ MinimalTuple tuple,
+ uint32 hashvalue)
{
- bool shouldFree;
- MinimalTuple tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
HashJoinTuple hashTuple;
dsa_pointer shared;
int batchno;
@@ -1782,6 +1794,21 @@ ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable,
HeapTupleHeaderClearMatch(HJTUPLE_MINTUPLE(hashTuple));
ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],
hashTuple, shared);
+}
+
+/*
+ * Like ExecParallelHashTableInsertCurrentBatchTuple, but this function
+ * accepts a TupleTableSlot instead of a MinimalTuple.
+ */
+void
+ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable,
+ TupleTableSlot *slot,
+ uint32 hashvalue)
+{
+ bool shouldFree;
+ MinimalTuple tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+ ExecParallelHashTableInsertCurrentBatchTuple(hashtable, tuple, hashvalue);
if (shouldFree)
heap_free_minimal_tuple(tuple);
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 2f7170604d6..d52e4e14366 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -195,10 +195,10 @@ static TupleTableSlot *ExecHashJoinOuterGetTuple(PlanState *outerNode,
static TupleTableSlot *ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
HashJoinState *hjstate,
uint32 *hashvalue);
-static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
- BufFile *file,
- uint32 *hashvalue,
- TupleTableSlot *tupleSlot);
+static MinimalTuple ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
+ BufFile *file,
+ uint32 *hashvalue,
+ StringInfo buf);
static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
static bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
static void ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate);
@@ -925,6 +925,7 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
*/
hjstate->hj_HashTable = NULL;
hjstate->hj_FirstOuterTupleSlot = NULL;
+ hjstate->hj_outerTupleBuffer = NULL;
hjstate->hj_CurHashValue = 0;
hjstate->hj_CurBucketNo = 0;
@@ -1030,6 +1031,7 @@ ExecHashJoinOuterGetTuple(PlanState *outerNode,
}
else if (curbatch < hashtable->nbatch)
{
+ MinimalTuple mtup;
BufFile *file = hashtable->outerBatchFile[curbatch];
/*
@@ -1039,12 +1041,23 @@ ExecHashJoinOuterGetTuple(PlanState *outerNode,
if (file == NULL)
return NULL;
- slot = ExecHashJoinGetSavedTuple(hjstate,
+ if (unlikely(hjstate->hj_outerTupleBuffer == NULL))
+ {
+ MemoryContext oldcontext = MemoryContextSwitchTo(GetMemoryChunkContext(hjstate));
+ hjstate->hj_outerTupleBuffer = makeStringInfo();
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ mtup = ExecHashJoinGetSavedTuple(hjstate,
file,
hashvalue,
- hjstate->hj_OuterTupleSlot);
- if (!TupIsNull(slot))
+ hjstate->hj_outerTupleBuffer);
+ if (likely(mtup != NULL))
+ {
+ slot = hjstate->hj_OuterTupleSlot;
+ ExecForceStoreMinimalTuple(mtup, slot, false);
return slot;
+ }
}
/* End of this batch */
@@ -1133,7 +1146,6 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
int nbatch;
int curbatch;
BufFile *innerFile;
- TupleTableSlot *slot;
uint32 hashvalue;
nbatch = hashtable->nbatch;
@@ -1224,21 +1236,25 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
if (innerFile != NULL)
{
+ StringInfoData buf;
+ MinimalTuple tuple;
+
if (BufFileSeek(innerFile, 0, 0, SEEK_SET))
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not rewind hash-join temporary file")));
- while ((slot = ExecHashJoinGetSavedTuple(hjstate,
- innerFile,
- &hashvalue,
- hjstate->hj_HashTupleSlot)))
+ initStringInfo(&buf);
+ while ((tuple = ExecHashJoinGetSavedTuple(hjstate,
+ innerFile,
+ &hashvalue,
+ &buf)))
{
/*
* NOTE: some tuples may be sent to future batches. Also, it is
* possible for hashtable->nbatch to be increased here!
*/
- ExecHashTableInsert(hashtable, slot, hashvalue);
+ ExecHashTableInsertTuple(hashtable, tuple, hashvalue);
}
/*
@@ -1247,6 +1263,7 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
*/
BufFileClose(innerFile);
hashtable->innerBatchFile[curbatch] = NULL;
+ pfree(buf.data);
}
/*
@@ -1297,7 +1314,6 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
{
uint32 hashvalue;
MinimalTuple tuple;
- TupleTableSlot *slot;
if (!hashtable->batches[batchno].done)
{
@@ -1329,12 +1345,9 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
while ((tuple = sts_parallel_scan_next(inner_tuples,
&hashvalue)))
{
- ExecForceStoreMinimalTuple(tuple,
- hjstate->hj_HashTupleSlot,
- false);
- slot = hjstate->hj_HashTupleSlot;
- ExecParallelHashTableInsertCurrentBatch(hashtable, slot,
- hashvalue);
+ ExecParallelHashTableInsertCurrentBatchTuple(hashtable,
+ tuple,
+ hashvalue);
}
sts_end_parallel_scan(inner_tuples);
BarrierArriveAndWait(batch_barrier,
@@ -1448,14 +1461,14 @@ ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
* ExecHashJoinGetSavedTuple
* read the next tuple from a batch file. Return NULL if no more.
*
- * On success, *hashvalue is set to the tuple's hash value, and the tuple
- * itself is stored in the given slot.
+ * On success, *hashvalue is set to the tuple's hash value, and the
+ * tuple itself (stored in the given buf) is returned.
*/
-static TupleTableSlot *
+static MinimalTuple
ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
BufFile *file,
uint32 *hashvalue,
- TupleTableSlot *tupleSlot)
+ StringInfo buf)
{
uint32 header[2];
size_t nread;
@@ -1474,19 +1487,19 @@ ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
* cheating.
*/
nread = BufFileReadMaybeEOF(file, header, sizeof(header), true);
- if (nread == 0) /* end of file */
- {
- ExecClearTuple(tupleSlot);
+ if (unlikely(nread == 0)) /* end of file */
return NULL;
- }
+
+ enlargeStringInfo(buf, header[1]);
*hashvalue = header[0];
- tuple = (MinimalTuple) palloc(header[1]);
+ buf->len = header[1];
+ tuple = (MinimalTuple) buf->data;
tuple->t_len = header[1];
BufFileReadExact(file,
(char *) tuple + sizeof(uint32),
header[1] - sizeof(uint32));
- ExecForceStoreMinimalTuple(tuple, tupleSlot, true);
- return tupleSlot;
+
+ return tuple;
}
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index e4eb7bc6359..37585035ade 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -36,12 +36,18 @@ extern void ExecParallelHashTableSetCurrentBatch(HashJoinTable hashtable,
extern void ExecHashTableInsert(HashJoinTable hashtable,
TupleTableSlot *slot,
uint32 hashvalue);
+extern void ExecHashTableInsertTuple(HashJoinTable hashtable,
+ MinimalTuple tuple,
+ uint32 hashvalue);
extern void ExecParallelHashTableInsert(HashJoinTable hashtable,
TupleTableSlot *slot,
uint32 hashvalue);
extern void ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable,
TupleTableSlot *slot,
uint32 hashvalue);
+extern void ExecParallelHashTableInsertCurrentBatchTuple(HashJoinTable hashtable,
+ MinimalTuple tuple,
+ uint32 hashvalue);
extern void ExecHashGetBucketAndBatch(HashJoinTable hashtable,
uint32 hashvalue,
int *bucketno,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index af7d8fd1e72..299f66af135 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2225,6 +2225,7 @@ typedef struct HashJoinState
TupleTableSlot *hj_NullOuterTupleSlot;
TupleTableSlot *hj_NullInnerTupleSlot;
TupleTableSlot *hj_FirstOuterTupleSlot;
+ StringInfo hj_outerTupleBuffer;
int hj_JoinState;
bool hj_MatchedOuter;
bool hj_OuterNotEmpty;
diff --git a/src/test/modules/unsafe_tests/Makefile b/src/test/modules/unsafe_tests/Makefile
index 90d19791871..a868a6ff34f 100644
--- a/src/test/modules/unsafe_tests/Makefile
+++ b/src/test/modules/unsafe_tests/Makefile
@@ -1,6 +1,6 @@
# src/test/modules/unsafe_tests/Makefile
-REGRESS = rolenames alter_system_table guc_privs
+REGRESS = alter_system_table guc_privs
# the whole point of these tests is to not run installcheck
NO_INSTALLCHECK = 1