> * mtup is held in hjstate->hj_outerTupleBuffer, so we can pass
> * shouldFree as false when calling ExecForceStoreMinimalTuple().
> *
> * When the slot is TTSOpsMinimalTuple, we can avoid reallocating memory
> * for a new MinimalTuple (we reuse the StringInfo across calls to
> * ExecHashJoinGetSavedTuple).
>
> But my point was that I don't think the palloc/repalloc should be very
> expensive, once the AllocSet warms up a bit.
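To make the comment above concrete, here is a minimal sketch of the
buffered read path it describes. The function name
ExecHashJoinGetSavedTupleBuffered is illustrative, and I'm assuming
hj_outerTupleBuffer is a StringInfoData in HashJoinState that is
initStringInfo()'d once at node init; master's ExecHashJoinGetSavedTuple
instead pallocs a fresh MinimalTuple and passes shouldFree = true.

#include "postgres.h"

#include "access/htup_details.h"
#include "executor/tuptable.h"
#include "lib/stringinfo.h"
#include "miscadmin.h"
#include "nodes/execnodes.h"
#include "storage/buffile.h"

/*
 * Illustrative buffered variant of ExecHashJoinGetSavedTuple().  The
 * tuple is read into a buffer that survives across calls, so there is
 * no per-tuple palloc/pfree.
 */
static TupleTableSlot *
ExecHashJoinGetSavedTupleBuffered(HashJoinState *hjstate,
                                  BufFile *file,
                                  uint32 *hashvalue,
                                  TupleTableSlot *tupleSlot)
{
    StringInfo  buf = &hjstate->hj_outerTupleBuffer;    /* patch's field */
    uint32      header[2];
    size_t      nread;
    MinimalTuple tuple;

    CHECK_FOR_INTERRUPTS();

    /* on-disk format: hash value, then the MinimalTuple (t_len first) */
    nread = BufFileReadMaybeEOF(file, header, sizeof(header), true);
    if (nread == 0)             /* end of file */
    {
        ExecClearTuple(tupleSlot);
        return NULL;
    }
    *hashvalue = header[0];

    /* grow the reused buffer only when this tuple doesn't fit */
    resetStringInfo(buf);
    enlargeStringInfo(buf, header[1]);

    tuple = (MinimalTuple) buf->data;
    tuple->t_len = header[1];
    BufFileReadExact(file,
                     (char *) tuple + sizeof(uint32),
                     header[1] - sizeof(uint32));

    /* shouldFree = false: the slot must not pfree the reused buffer */
    ExecForceStoreMinimalTuple(tuple, tupleSlot, false);
    return tupleSlot;
}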
Avoiding the memory palloc/repalloc is just a side effect of avoiding
re-forming the tuple.

> * More importantly, in non-TTSOpsMinimalTuple scenarios, it can avoid
> * re-forming (materializing) the tuple (see ExecForceStoreMinimalTuple).
>
> Yeah, but doesn't that conflate two things - materialization and freeing
> the memory? Only because materialization is expensive, is that a good
> reason to abandon the memory management too?

Currently, I haven't thought of a better way to avoid the re-form; the
slow path in question is sketched after this exchange.

>> Can you provide more information about the benchmark you did? What
>> hardware, what scale, PostgreSQL configuration, which of the 22
>> queries are improved, etc.
>>
>> I ran TPC-H with 1GB and 10GB scales on two machines, and I see
>> pretty much no difference compared to master. However, it occurred to
>> me the patch only ever helps if we increase the number of batches
>> during execution, in which case we need to move tuples to the right
>> batch.
>
> > Only the parallel HashJoin sped up, to ~2x (with all data cached in
> > memory), not the full query; the same goes for the non-parallel
> > HashJoin. The non-parallel HashJoin speeds up only when there is more
> > than one batch, because this patch only optimizes reading batch
> > tuples back into memory.
>
> I'm sorry, but this does not answer *any* of the questions I asked.
> Please provide enough info to reproduce the benefit - benchmark scale,
> which query, which parameters, etc. Show explain / explain analyze of
> the query without / with the patch, stuff like that.
>
> I ran a number of TPC-H benchmarks with the patch and I never saw a
> benefit of this scale.

After further testing, it turns out that the parallel hashjoin did not
improve performance; I might have compared it with a debug build at the
time. I apologize for that. However, the non-parallel hashjoin indeed
showed about a 10% performance improvement.
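For reference, here is roughly what that slow path looks like. This is a
simplified paraphrase of the non-TTSOpsMinimalTuple branch of
ExecForceStoreMinimalTuple(), not the exact source, and the function name
is made up.

#include "postgres.h"

#include "access/htup_details.h"
#include "executor/tuptable.h"

/*
 * Simplified sketch of the two paths when storing a MinimalTuple into a
 * slot.  A non-TTSOpsMinimalTuple slot cannot adopt the MinimalTuple
 * as-is, so every attribute gets deformed and the tuple is effectively
 * re-formed (materialized) - the per-tuple cost discussed above.
 */
static void
ForceStoreMinimalTupleSketch(MinimalTuple mtup,
                             TupleTableSlot *slot,
                             bool shouldFree)
{
    if (TTS_IS_MINIMALTUPLE(slot))
    {
        /* fast path: the slot stores the tuple directly */
        ExecStoreMinimalTuple(mtup, slot, shouldFree);
    }
    else
    {
        /* slow path: deform every column into the slot's virtual form */
        HeapTupleData htup;

        ExecClearTuple(slot);
        htup.t_len = mtup->t_len + MINIMAL_TUPLE_OFFSET;
        htup.t_data = (HeapTupleHeader) ((char *) mtup - MINIMAL_TUPLE_OFFSET);
        heap_deform_tuple(&htup, slot->tts_tupleDescriptor,
                          slot->tts_values, slot->tts_isnull);
        ExecStoreVirtualTuple(slot);

        if (shouldFree)
            pfree(mtup);
    }
}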
Here is the testing information:

CPU: 13th Gen Intel(R) Core(TM) i7-13700
Memory: 32GB
SSD: UMIS REPEYJ512MKN1QWQ
Windows version: win11 23H2 22631.4037
WSL version: 2.2.4.0
Kernel version: 5.15.153.1-2
OS version: rocky linux 9.4
TPCH: SF=8

SQL:
set max_parallel_workers_per_gather = 0;
set enable_mergejoin = off;
explain (verbose, analyze)
select count(*)
from lineitem, orders
where lineitem.l_orderkey = orders.o_orderkey;

Before the patch:

 Aggregate  (cost=2422401.83..2422401.84 rows=1 width=8) (actual time=10591.679..10591.681 rows=1 loops=1)
   Output: count(*)
   ->  Hash Join  (cost=508496.00..2302429.31 rows=47989008 width=0) (actual time=1075.213..9503.727 rows=47989007 loops=1)
         Inner Unique: true
         Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
         ->  Index Only Scan using lineitem_pkey on public.lineitem  (cost=0.56..1246171.69 rows=47989008 width=4) (actual time=0.023..1974.365 rows=47989007 loops=1)
               Output: lineitem.l_orderkey
               Heap Fetches: 0
         ->  Hash  (cost=311620.43..311620.43 rows=12000000 width=4) (actual time=1074.155..1074.156 rows=12000000 loops=1)
               Output: orders.o_orderkey
               Buckets: 262144  Batches: 128  Memory Usage: 5335kB
               ->  Index Only Scan using orders_pkey on public.orders  (cost=0.43..311620.43 rows=12000000 width=4) (actual time=0.014..464.346 rows=12000000 loops=1)
                     Output: orders.o_orderkey
                     Heap Fetches: 0
 Planning Time: 0.141 ms
 Execution Time: 10591.730 ms
(16 rows)

After the patch:

 Aggregate  (cost=2422401.83..2422401.84 rows=1 width=8) (actual time=9826.105..9826.106 rows=1 loops=1)
   Output: count(*)
   ->  Hash Join  (cost=508496.00..2302429.31 rows=47989008 width=0) (actual time=1087.588..8726.441 rows=47989007 loops=1)
         Inner Unique: true
         Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
         ->  Index Only Scan using lineitem_pkey on public.lineitem  (cost=0.56..1246171.69 rows=47989008 width=4) (actual time=0.015..1989.389 rows=47989007 loops=1)
               Output: lineitem.l_orderkey
               Heap Fetches: 0
         ->  Hash  (cost=311620.43..311620.43 rows=12000000 width=4) (actual time=1086.357..1086.358 rows=12000000 loops=1)
               Output: orders.o_orderkey
               Buckets: 262144  Batches: 128  Memory Usage: 5335kB
               ->  Index Only Scan using orders_pkey on public.orders  (cost=0.43..311620.43 rows=12000000 width=4) (actual time=0.011..470.225 rows=12000000 loops=1)
                     Output: orders.o_orderkey
                     Heap Fetches: 0
 Planning Time: 0.065 ms
 Execution Time: 9826.135 ms