Hi,

A few days ago I came up with the idea of implementing a multi-insert optimization wherever possible. I prepared a rough patch, and it showed a large performance gain (up to 4 times for INSERT ... INTO ... in the best case). Then I was very happy to find this thread. You did a great job, and I want to help bring this work to completion.
On Thu, Oct 31, 2024 at 11:17 AM Jingtang Zhang <mrdrivingd...@gmail.com> wrote:
> I did some performance test these days, and I have some findings.
> HEAD:
> 12.29% postgres [.] pg_checksum_block
> 6.33% postgres [.] GetPrivateRefCountEntry
> 5.40% postgres [.] pg_comp_crc32c_sse42
> 4.54% [kernel] [k] copy_user_enhanced_fast_string
> 2.69% postgres [.] BufferIsValid
> 1.52% postgres [.] XLogRecordAssemble
>
> Patched:
> 11.75% postgres [.] tts_virtual_materialize
> 8.87% postgres [.] pg_checksum_block
> 8.17% postgres [.] slot_deform_heap_tuple
> 8.09% postgres [.] heap_compute_data_size
> 6.17% postgres [.] fill_val
> 3.81% postgres [.] heap_fill_tuple
> 3.37% postgres [.] tts_virtual_copyslot
> 2.62% [kernel] [k] copy_user_enhanced_fast_string

I applied the v25 patches on the master branch and made some measurements to find the bottleneck in this case. The 'time' utility showed that without the patch, this query runs 1.5 times slower. I also made a few flame graphs for this test. Most of the time is spent in these two functions: tts_virtual_copyslot and heap_form_tuple.

All tests were run in a virtual machine with these CPU characteristics:

Architecture: x86_64
CPU(s): 2
On-line CPU(s) list: 0,1
Virtualization features:
  Virtualization: AMD-V
  Hypervisor vendor: KVM
  Virtualization type: full
Caches (sum of all):
  L1d: 128 KiB (2 instances)
  L1i: 128 KiB (2 instances)
  L2: 1 MiB (2 instances)
  L3: 32 MiB (2 instances)
NUMA:
  NUMA node(s): 1
  NUMA node0 CPU(s): 0,1

In my implementation, I used the Tuplestore functionality to store tuples. To avoid the time spent in the functions mentioned above, I combined it with the current implementation (v25 patches) and got a 10% increase in performance (for the test above). I also set up the v22 patches to compare performance (with/without tuplestore) for INSERT ... INTO ... queries (with the -j 4 -c 10 parameters for pgbench), and there was also an increase in TPS (about 3-4%).
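To give a concrete picture of the buffering scheme, here is a minimal, self-contained C sketch of the buffer-then-flush pattern: incoming tuples are copied once into a single append-only store, a counter tracks how many are held, and the whole batch is handed over when a threshold is reached. It deliberately does not use any PostgreSQL API. ToyStore, the toy_store_* functions and TOY_MAX_BUFFERED are hypothetical names made up for illustration, and the "sink" only counts records where the real code would call heap_multi_insert().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flush threshold; plays the role of HEAP_MAX_BUFFERED_SLOTS. */
#define TOY_MAX_BUFFERED 4

/* Toy append-only store: length-prefixed records in one growable buffer. */
typedef struct ToyStore
{
	char	   *data;		/* serialized records */
	size_t		used;		/* bytes in use */
	size_t		cap;		/* bytes allocated */
	int			nused;		/* records currently buffered */
	int			nflushes;	/* batches handed to the sink so far */
	size_t		nflushed;	/* records handed to the sink so far */
} ToyStore;

static void
toy_store_init(ToyStore *st)
{
	memset(st, 0, sizeof(*st));
}

/* Read back every buffered record, pass it to the sink, then reset. */
static void
toy_store_flush(ToyStore *st)
{
	size_t		off = 0;

	/* Quick exit if nothing is buffered */
	if (st->nused == 0)
		return;

	for (int i = 0; i < st->nused; i++)
	{
		size_t		len;

		memcpy(&len, st->data + off, sizeof(len));
		off += sizeof(len) + len;
		st->nflushed++;		/* stand-in for inserting the record */
	}
	assert(off == st->used);	/* every buffered byte was consumed */

	st->nflushes++;
	st->used = 0;			/* keep the allocation, drop the contents */
	st->nused = 0;
}

/* Copy one record into the store; flush when the batch is full. */
static void
toy_store_put(ToyStore *st, const void *rec, size_t len)
{
	size_t		need = st->used + sizeof(len) + len;

	if (need > st->cap)
	{
		size_t		newcap = st->cap ? st->cap * 2 : 64;
		char	   *p;

		while (newcap < need)
			newcap *= 2;
		p = realloc(st->data, newcap);
		if (p == NULL)
			abort();
		st->data = p;
		st->cap = newcap;
	}

	memcpy(st->data + st->used, &len, sizeof(len));
	memcpy(st->data + st->used + sizeof(len), rec, len);
	st->used = need;
	st->nused++;

	if (st->nused >= TOY_MAX_BUFFERED)
		toy_store_flush(st);
}

/* Flush whatever is left and release the buffer. */
static void
toy_store_end(ToyStore *st)
{
	toy_store_flush(st);
	free(st->data);
	st->data = NULL;
	st->cap = 0;
}
```

A caller would run toy_store_init() once, toy_store_put() per incoming row, and toy_store_end() at the end. The point of the sketch is only the shape of the control flow (one copy in, one sequential read out per batch); it says nothing about the actual cost of tuplestore versus virtual slots, which is what the measurements above are about.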
I attach a patch that adds Tuplestore to v25. What do you think about this idea?

--
Best regards,
Daniil Davydov
From a59cfcbb05bb07c94a4c0ad6531baa5e531629ae Mon Sep 17 00:00:00 2001
From: Daniil Davidov <d.davy...@postgrespro.ru>
Date: Sun, 9 Mar 2025 16:37:44 +0700
Subject: [PATCH] Replace holding tuples in virtual slots with tuplestorage

During performance testing, it was found out that in the current
implementation a lot of the program's time is spent calling two functions :
tts_virtual_copyslot and heap_fill_tuple. Calls to these functions are related
to the fact that tuples are stored in virtual_tts, so I propose to replace this
logic with Tuplestore functionality.

Discussion: https://www.postgresql.org/message-id/9F9326B4-8AD9-4858-B1C1-559FC64E6E93%40gmail.com
---
 src/backend/access/heap/heapam.c | 67 +++++++++++++++-----------------
 src/include/access/heapam.h      |  9 ++++-
 2 files changed, 38 insertions(+), 38 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index acdce1a4b4..276480213a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2665,7 +2665,6 @@ void
 heap_modify_buffer_insert(TableModifyState *state,
 						  TupleTableSlot *slot)
 {
-	TupleTableSlot *dstslot;
 	HeapInsertState *istate;
 	HeapMultiInsertState *mistate;
 	MemoryContext oldcontext;
@@ -2682,8 +2681,10 @@ heap_modify_buffer_insert(TableModifyState *state,
 
 		mistate = (HeapMultiInsertState *) palloc(sizeof(HeapMultiInsertState));
 		mistate->slots =
-			(TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * HEAP_MAX_BUFFERED_SLOTS);
-		mistate->cur_slots = 0;
+			(TupleTableSlot **) palloc0(sizeof(void *) * HEAP_MAX_BUFFERED_SLOTS);
+		mistate->tstore = tuplestore_begin_heap(false, false, work_mem);
+		mistate->nused = 0;
+
 		istate->mistate = mistate;
 
 		/*
@@ -2702,36 +2703,11 @@ heap_modify_buffer_insert(TableModifyState *state,
 	istate = (HeapInsertState *) state->data;
 	Assert(istate->mistate != NULL);
 	mistate = istate->mistate;
-	dstslot = mistate->slots[mistate->cur_slots];
-
-	if (dstslot == NULL)
-	{
-		/*
-		 * We use virtual tuple slots buffered slots for leveraging the
-		 * optimization it provides to minimize physical data copying. The
-		 * virtual slot gets materialized when we copy (via below
-		 * ExecCopySlot) the tuples from the source slot which can be of any
-		 * type. This way, it is ensured that the tuple storage doesn't depend
-		 * on external memory, because all the datums that aren't passed by
-		 * value are copied into the slot's memory context.
-		 */
-		dstslot = MakeTupleTableSlot(RelationGetDescr(state->rel),
-									 &TTSOpsVirtual);
-
-		mistate->slots[mistate->cur_slots] = dstslot;
-	}
-
-	Assert(TTS_IS_VIRTUAL(dstslot));
-
-	/*
-	 * Note that the copy clears the previous destination slot contents, so no
-	 * need to explicitly ExecClearTuple() here.
-	 */
-	ExecCopySlot(dstslot, slot);
 
-	mistate->cur_slots++;
+	tuplestore_puttupleslot(mistate->tstore, slot);
+	mistate->nused += 1;
 
-	if (mistate->cur_slots >= HEAP_MAX_BUFFERED_SLOTS)
+	if (mistate->nused >= HEAP_MAX_BUFFERED_SLOTS)
 		heap_modify_buffer_flush(state);
 
 	MemoryContextSwitchTo(oldcontext);
@@ -2746,19 +2722,35 @@ heap_modify_buffer_flush(TableModifyState *state)
 	HeapInsertState *istate;
 	HeapMultiInsertState *mistate;
 	MemoryContext oldcontext;
+	TupleDesc tupdesc;
 
 	/* Quick exit if we haven't inserted anything yet */
 	if (state->data == NULL)
 		return;
 
+	tupdesc = RelationGetDescr(state->rel);
 	istate = (HeapInsertState *) state->data;
 	Assert(istate->mistate != NULL);
 	mistate = istate->mistate;
 
 	/* Quick exit if we have flushed already */
-	if (mistate->cur_slots == 0)
+	if (mistate->nused == 0)
 		return;
 
+	for (int i = 0; i < mistate->nused; i++)
+	{
+		bool ok;
+
+		if (istate->mistate->slots[i] == NULL)
+		{
+			istate->mistate->slots[i] =
+				MakeSingleTupleTableSlot(tupdesc, &TTSOpsMinimalTuple);
+		}
+		ok = tuplestore_gettupleslot(mistate->tstore, true, false,
+									 istate->mistate->slots[i]);
+		Assert(ok);
+	}
+
 	/*
 	 * heap_multi_insert() can leak memory, so switch to short-lived memory
 	 * context before calling it.
@@ -2766,7 +2758,7 @@ heap_modify_buffer_flush(TableModifyState *state)
 	oldcontext = MemoryContextSwitchTo(mistate->mem_ctx);
 	heap_multi_insert(state->rel,
 					  mistate->slots,
-					  mistate->cur_slots,
+					  mistate->nused,
 					  state->cid,
 					  state->options,
 					  istate->bistate);
@@ -2779,14 +2771,15 @@ heap_modify_buffer_flush(TableModifyState *state)
 	 */
 	if (state->buffer_flush_cb != NULL)
 	{
-		for (int i = 0; i < mistate->cur_slots; i++)
+		for (int i = 0; i < mistate->nused; i++)
 		{
 			state->buffer_flush_cb(state->buffer_flush_ctx,
 								   mistate->slots[i]);
 		}
 	}
 
-	mistate->cur_slots = 0;
+	tuplestore_clear(mistate->tstore);
+	mistate->nused = 0;
 }
 
 /*
@@ -2811,11 +2804,13 @@ heap_modify_insert_end(TableModifyState *state)
 
 	heap_modify_buffer_flush(state);
 
-	Assert(mistate->cur_slots == 0);
+	Assert(mistate->nused== 0);
 
 	for (int i = 0; i < HEAP_MAX_BUFFERED_SLOTS && mistate->slots[i] != NULL; i++)
 		ExecDropSingleTupleTableSlot(mistate->slots[i]);
 
+	tuplestore_end(mistate->tstore);
+
 	MemoryContextDelete(mistate->mem_ctx);
 }
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index fdbbf9b8e8..5d8e672059 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -27,8 +27,10 @@
 #include "storage/lockdefs.h"
 #include "storage/read_stream.h"
 #include "storage/shm_toc.h"
+#include "tcop/dest.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
+#include "utils/tuplestore.h"
 
 
 /* "options" flag bits for heap_insert */
@@ -285,8 +287,11 @@ typedef struct HeapMultiInsertState
 	/* Array of buffered slots */
 	TupleTableSlot **slots;
 
-	/* Number of buffered slots currently held */
-	int cur_slots;
+	/* Holds the tuple set */
+	Tuplestorestate *tstore;
+
+	/* Number of buffered tuples currently held */
+	int nused;
 
 	/* Memory context for dealing with multi inserts */
 	MemoryContext mem_ctx;
-- 
2.43.0