Hi,
A few days ago I came up with the idea of applying the multi-insert
optimization wherever possible. I prepared a rough patch,
and it showed a great performance gain (up to 4 times faster for INSERT
... INTO ... in the best case).
Then I was very happy to find this thread. You did a great job, and I
want to help bring this work to completion.

On Thu, Oct 31, 2024 at 11:17 AM Jingtang Zhang <mrdrivingd...@gmail.com> wrote:
> I did some performance test these days, and I have some findings.
> HEAD:
>   12.29%  postgres            [.] pg_checksum_block
>    6.33%  postgres            [.] GetPrivateRefCountEntry
>    5.40%  postgres            [.] pg_comp_crc32c_sse42
>    4.54%  [kernel]            [k] copy_user_enhanced_fast_string
>    2.69%  postgres            [.] BufferIsValid
>    1.52%  postgres            [.] XLogRecordAssemble
>
> Patched:
>   11.75%  postgres            [.] tts_virtual_materialize
>    8.87%  postgres            [.] pg_checksum_block
>    8.17%  postgres            [.] slot_deform_heap_tuple
>    8.09%  postgres            [.] heap_compute_data_size
>    6.17%  postgres            [.] fill_val
>    3.81%  postgres            [.] heap_fill_tuple
>    3.37%  postgres            [.] tts_virtual_copyslot
>    2.62%  [kernel]            [k] copy_user_enhanced_fast_string

I applied the v25 patches on the master branch and took some measurements
to find the bottleneck in this case. The 'time' utility
showed that without the patch, this query runs 1.5 times slower. I
also made a few flamegraphs for this test. Most of the time is spent
in these two functions: tts_virtual_copyslot and heap_form_tuple.
All tests were run in a virtual machine with the following CPU characteristics:
Architecture:             x86_64
CPU(s):                   2
  On-line CPU(s) list:    0,1
Virtualization features:
  Virtualization:         AMD-V
  Hypervisor vendor:      KVM
  Virtualization type:    full
Caches (sum of all):
  L1d:                    128 KiB (2 instances)
  L1i:                    128 KiB (2 instances)
  L2:                     1 MiB (2 instances)
  L3:                     32 MiB (2 instances)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0,1

In my implementation, I used a Tuplestore to buffer the tuples.
To avoid the time spent in the functions mentioned above,
I combined it with the current implementation (v25 patches) and got a
10% increase in performance (for the test above). I also applied the v22
patches to compare performance with and without the tuplestore for
INSERT ... INTO ... queries (running pgbench with -j 4 -c 10), and TPS
also increased there, by about 3-4%.
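
To show the buffering pattern in isolation, here is a minimal standalone
sketch. It is not PostgreSQL code: BatchState, batch_insert, batch_flush and
MAX_BUFFERED are hypothetical names, the int array only stands in for the
tuplestore, and the counter updates stand in for the single
heap_multi_insert() call made per batch.

```c
#include <assert.h>

/* Hypothetical batch size; plays the role of HEAP_MAX_BUFFERED_SLOTS. */
#define MAX_BUFFERED 4

typedef struct BatchState
{
	int			buffered[MAX_BUFFERED];	/* stand-in for the tuplestore */
	int			nused;		/* tuples currently buffered */
	int			flushes;	/* number of batch writes issued */
	int			written;	/* total tuples written so far */
} BatchState;

/* Write out everything buffered so far as one batch, then reset the buffer. */
static void
batch_flush(BatchState *st)
{
	/* Quick exit if there is nothing to flush */
	if (st->nused == 0)
		return;

	/* Stand-in for one multi-insert call covering st->nused tuples */
	st->written += st->nused;
	st->flushes++;

	st->nused = 0;
}

/* Buffer one tuple; flush automatically once the buffer is full. */
static void
batch_insert(BatchState *st, int tuple)
{
	assert(st->nused < MAX_BUFFERED);
	st->buffered[st->nused++] = tuple;

	if (st->nused >= MAX_BUFFERED)
		batch_flush(st);
}
```

With a zero-initialized BatchState, inserting 10 tuples and then calling
batch_flush() once more at the end (as the insert-end callback does) writes
them in 3 batches of 4, 4 and 2 tuples.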

I am attaching a patch that adds a Tuplestore to v25. What do you think about this idea?

--
Best regards,
Daniil Davydov
From a59cfcbb05bb07c94a4c0ad6531baa5e531629ae Mon Sep 17 00:00:00 2001
From: Daniil Davidov <d.davy...@postgrespro.ru>
Date: Sun, 9 Mar 2025 16:37:44 +0700
Subject: [PATCH] Replace holding tuples in virtual slots with tuplestorage

Performance testing showed that the current implementation spends a lot
of its time in two functions: tts_virtual_copyslot and heap_fill_tuple.
Calls to these functions are related
to the fact that tuples are stored in virtual_tts, so I propose to replace this
logic with Tuplestore functionality.

Discussion: https://www.postgresql.org/message-id/9F9326B4-8AD9-4858-B1C1-559FC64E6E93%40gmail.com
---
 src/backend/access/heap/heapam.c | 67 +++++++++++++++-----------------
 src/include/access/heapam.h      |  9 ++++-
 2 files changed, 38 insertions(+), 38 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index acdce1a4b4..276480213a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2665,7 +2665,6 @@ void
 heap_modify_buffer_insert(TableModifyState *state,
 						  TupleTableSlot *slot)
 {
-	TupleTableSlot *dstslot;
 	HeapInsertState *istate;
 	HeapMultiInsertState *mistate;
 	MemoryContext oldcontext;
@@ -2682,8 +2681,10 @@ heap_modify_buffer_insert(TableModifyState *state,
 		mistate =
 			(HeapMultiInsertState *) palloc(sizeof(HeapMultiInsertState));
 		mistate->slots =
-			(TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * HEAP_MAX_BUFFERED_SLOTS);
-		mistate->cur_slots = 0;
+			(TupleTableSlot **) palloc0(sizeof(void *) * HEAP_MAX_BUFFERED_SLOTS);
+		mistate->tstore = tuplestore_begin_heap(false, false, work_mem);
+		mistate->nused = 0;
+
 		istate->mistate = mistate;
 
 		/*
@@ -2702,36 +2703,11 @@ heap_modify_buffer_insert(TableModifyState *state,
 	istate = (HeapInsertState *) state->data;
 	Assert(istate->mistate != NULL);
 	mistate = istate->mistate;
-	dstslot = mistate->slots[mistate->cur_slots];
-
-	if (dstslot == NULL)
-	{
-		/*
-		 * We use virtual tuple slots buffered slots for leveraging the
-		 * optimization it provides to minimize physical data copying. The
-		 * virtual slot gets materialized when we copy (via below
-		 * ExecCopySlot) the tuples from the source slot which can be of any
-		 * type. This way, it is ensured that the tuple storage doesn't depend
-		 * on external memory, because all the datums that aren't passed by
-		 * value are copied into the slot's memory context.
-		 */
-		dstslot = MakeTupleTableSlot(RelationGetDescr(state->rel),
-									 &TTSOpsVirtual);
-
-		mistate->slots[mistate->cur_slots] = dstslot;
-	}
-
-	Assert(TTS_IS_VIRTUAL(dstslot));
-
-	/*
-	 * Note that the copy clears the previous destination slot contents, so no
-	 * need to explicitly ExecClearTuple() here.
-	 */
-	ExecCopySlot(dstslot, slot);
 
-	mistate->cur_slots++;
+	tuplestore_puttupleslot(mistate->tstore, slot);
+	mistate->nused += 1;
 
-	if (mistate->cur_slots >= HEAP_MAX_BUFFERED_SLOTS)
+	if (mistate->nused >= HEAP_MAX_BUFFERED_SLOTS)
 		heap_modify_buffer_flush(state);
 
 	MemoryContextSwitchTo(oldcontext);
@@ -2746,19 +2722,35 @@ heap_modify_buffer_flush(TableModifyState *state)
 	HeapInsertState *istate;
 	HeapMultiInsertState *mistate;
 	MemoryContext oldcontext;
+	TupleDesc tupdesc;
 
 	/* Quick exit if we haven't inserted anything yet */
 	if (state->data == NULL)
 		return;
 
+	tupdesc = RelationGetDescr(state->rel);
 	istate = (HeapInsertState *) state->data;
 	Assert(istate->mistate != NULL);
 	mistate = istate->mistate;
 
 	/* Quick exit if we have flushed already */
-	if (mistate->cur_slots == 0)
+	if (mistate->nused == 0)
 		return;
 
+	for (int i = 0; i < mistate->nused; i++)
+	{
+		bool ok;
+
+		if (istate->mistate->slots[i] == NULL)
+		{
+			istate->mistate->slots[i] =
+				MakeSingleTupleTableSlot(tupdesc, &TTSOpsMinimalTuple);
+		}
+		ok = tuplestore_gettupleslot(mistate->tstore, true, false,
+									 istate->mistate->slots[i]);
+		Assert(ok);
+	}
+
 	/*
 	 * heap_multi_insert() can leak memory, so switch to short-lived memory
 	 * context before calling it.
@@ -2766,7 +2758,7 @@ heap_modify_buffer_flush(TableModifyState *state)
 	oldcontext = MemoryContextSwitchTo(mistate->mem_ctx);
 	heap_multi_insert(state->rel,
 					  mistate->slots,
-					  mistate->cur_slots,
+					  mistate->nused,
 					  state->cid,
 					  state->options,
 					  istate->bistate);
@@ -2779,14 +2771,15 @@ heap_modify_buffer_flush(TableModifyState *state)
 	 */
 	if (state->buffer_flush_cb != NULL)
 	{
-		for (int i = 0; i < mistate->cur_slots; i++)
+		for (int i = 0; i < mistate->nused; i++)
 		{
 			state->buffer_flush_cb(state->buffer_flush_ctx,
 								   mistate->slots[i]);
 		}
 	}
 
-	mistate->cur_slots = 0;
+	tuplestore_clear(mistate->tstore);
+	mistate->nused = 0;
 }
 
 /*
@@ -2811,11 +2804,13 @@ heap_modify_insert_end(TableModifyState *state)
 
 		heap_modify_buffer_flush(state);
 
-		Assert(mistate->cur_slots == 0);
+		Assert(mistate->nused == 0);
 
 		for (int i = 0; i < HEAP_MAX_BUFFERED_SLOTS && mistate->slots[i] != NULL; i++)
 			ExecDropSingleTupleTableSlot(mistate->slots[i]);
 
+		tuplestore_end(mistate->tstore);
+
 		MemoryContextDelete(mistate->mem_ctx);
 	}
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index fdbbf9b8e8..5d8e672059 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -27,8 +27,10 @@
 #include "storage/lockdefs.h"
 #include "storage/read_stream.h"
 #include "storage/shm_toc.h"
+#include "tcop/dest.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
+#include "utils/tuplestore.h"
 
 
 /* "options" flag bits for heap_insert */
@@ -285,8 +287,11 @@ typedef struct HeapMultiInsertState
 	/* Array of buffered slots */
 	TupleTableSlot **slots;
 
-	/* Number of buffered slots currently held */
-	int			cur_slots;
+	/* Holds the tuple set */
+	Tuplestorestate *tstore;
+
+	/* Number of buffered tuples currently held */
+	int				nused;
 
 	/* Memory context for dealing with multi inserts */
 	MemoryContext mem_ctx;
-- 
2.43.0
