> * mtup is held in hjstate->hj_outerTupleBuffer, so we can pass
> * shouldFree as false when calling ExecForceStoreMinimalTuple().
> *
> * When the slot is TTSOpsMinimalTuple, we can avoid reallocating memory
> * for a new MinimalTuple (we reuse the StringInfo across calls to
> * ExecHashJoinGetSavedTuple).
>
> But my point was that I don't think the palloc/repalloc should be very
> expensive, once the AllocSet warms up a bit.
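To make the comment above concrete, here is a minimal sketch of the
buffered read path it describes. The function name
ExecHashJoinGetSavedTupleBuffered is illustrative, and I'm assuming
hj_outerTupleBuffer is a StringInfoData in HashJoinState that is
initStringInfo()'d once at node init; master's ExecHashJoinGetSavedTuple
instead pallocs a fresh MinimalTuple and passes shouldFree = true.

#include "postgres.h"

#include "access/htup_details.h"
#include "executor/tuptable.h"
#include "lib/stringinfo.h"
#include "miscadmin.h"
#include "nodes/execnodes.h"
#include "storage/buffile.h"

/*
 * Illustrative buffered variant of ExecHashJoinGetSavedTuple().  The
 * tuple is read into a buffer that survives across calls, so there is
 * no per-tuple palloc/pfree.
 */
static TupleTableSlot *
ExecHashJoinGetSavedTupleBuffered(HashJoinState *hjstate,
                                  BufFile *file,
                                  uint32 *hashvalue,
                                  TupleTableSlot *tupleSlot)
{
    StringInfo  buf = &hjstate->hj_outerTupleBuffer;    /* patch's field */
    uint32      header[2];
    size_t      nread;
    MinimalTuple tuple;

    CHECK_FOR_INTERRUPTS();

    /* on-disk format: hash value, then the MinimalTuple (t_len first) */
    nread = BufFileReadMaybeEOF(file, header, sizeof(header), true);
    if (nread == 0)             /* end of file */
    {
        ExecClearTuple(tupleSlot);
        return NULL;
    }
    *hashvalue = header[0];

    /* grow the reused buffer only when this tuple doesn't fit */
    resetStringInfo(buf);
    enlargeStringInfo(buf, header[1]);

    tuple = (MinimalTuple) buf->data;
    tuple->t_len = header[1];
    BufFileReadExact(file,
                     (char *) tuple + sizeof(uint32),
                     header[1] - sizeof(uint32));

    /* shouldFree = false: the slot must not pfree the reused buffer */
    ExecForceStoreMinimalTuple(tuple, tupleSlot, false);
    return tupleSlot;
}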
Avoiding the memory palloc/repalloc is just a side effect of avoiding
re-forming the tuple.

> * More importantly, in non-TTSOpsMinimalTuple scenarios, it can avoid
> * re-forming (materializing) the tuple (see ExecForceStoreMinimalTuple).
>
> Yeah, but doesn't that conflate two things - materialization and freeing
> the memory? Only because materialization is expensive, is that a good
> reason to abandon the memory management too?

Currently, I haven't thought of a better way to avoid the re-form; the
slow path in question is sketched after this exchange.

>> Can you provide more information about the benchmark you did? What
>> hardware, what scale, PostgreSQL configuration, which of the 22
>> queries are improved, etc.
>>
>> I ran TPC-H with 1GB and 10GB scales on two machines, and I see
>> pretty much no difference compared to master. However, it occurred to
>> me the patch only ever helps if we increase the number of batches
>> during execution, in which case we need to move tuples to the right
>> batch.
>
> > Only the parallel HashJoin sped up, to ~2x (with all data cached in
> > memory), not the full query; the same goes for the non-parallel
> > HashJoin. The non-parallel HashJoin speeds up only when there is more
> > than one batch, because this patch only optimizes reading batch
> > tuples back into memory.
>
> I'm sorry, but this does not answer *any* of the questions I asked.
> Please provide enough info to reproduce the benefit - benchmark scale,
> which query, which parameters, etc. Show explain / explain analyze of
> the query without / with the patch, stuff like that.
>
> I ran a number of TPC-H benchmarks with the patch and I never saw a
> benefit of this scale.

After further testing, it turns out that the parallel hashjoin did not
improve performance; I might have compared it with a debug build at the
time. I apologize for that. However, the non-parallel hashjoin indeed
showed about a 10% performance improvement.
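For reference, here is roughly what that slow path looks like. This is a
simplified paraphrase of the non-TTSOpsMinimalTuple branch of
ExecForceStoreMinimalTuple(), not the exact source, and the function name
is made up.

#include "postgres.h"

#include "access/htup_details.h"
#include "executor/tuptable.h"

/*
 * Simplified sketch of the two paths when storing a MinimalTuple into a
 * slot.  A non-TTSOpsMinimalTuple slot cannot adopt the MinimalTuple
 * as-is, so every attribute gets deformed and the tuple is effectively
 * re-formed (materialized) - the per-tuple cost discussed above.
 */
static void
ForceStoreMinimalTupleSketch(MinimalTuple mtup,
                             TupleTableSlot *slot,
                             bool shouldFree)
{
    if (TTS_IS_MINIMALTUPLE(slot))
    {
        /* fast path: the slot stores the tuple directly */
        ExecStoreMinimalTuple(mtup, slot, shouldFree);
    }
    else
    {
        /* slow path: deform every column into the slot's virtual form */
        HeapTupleData htup;

        ExecClearTuple(slot);
        htup.t_len = mtup->t_len + MINIMAL_TUPLE_OFFSET;
        htup.t_data = (HeapTupleHeader) ((char *) mtup - MINIMAL_TUPLE_OFFSET);
        heap_deform_tuple(&htup, slot->tts_tupleDescriptor,
                          slot->tts_values, slot->tts_isnull);
        ExecStoreVirtualTuple(slot);

        if (shouldFree)
            pfree(mtup);
    }
}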
Here is the testing information:

CPU: 13th Gen Intel(R) Core(TM) i7-13700
Memory: 32GB
SSD: UMIS REPEYJ512MKN1QWQ
Windows version: win11 23H2 22631.4037
WSL version: 2.2.4.0
Kernel version: 5.15.153.1-2
OS version: rocky linux 9.4
TPCH: SF=8

SQL:
set max_parallel_workers_per_gather = 0;
set enable_mergejoin = off;
explain (verbose, analyze)
select count(*)
from lineitem, orders
where lineitem.l_orderkey = orders.o_orderkey;

Before the patch:

 Aggregate  (cost=2422401.83..2422401.84 rows=1 width=8) (actual time=10591.679..10591.681 rows=1 loops=1)
   Output: count(*)
   ->  Hash Join  (cost=508496.00..2302429.31 rows=47989008 width=0) (actual time=1075.213..9503.727 rows=47989007 loops=1)
         Inner Unique: true
         Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
         ->  Index Only Scan using lineitem_pkey on public.lineitem  (cost=0.56..1246171.69 rows=47989008 width=4) (actual time=0.023..1974.365 rows=47989007 loops=1)
               Output: lineitem.l_orderkey
               Heap Fetches: 0
         ->  Hash  (cost=311620.43..311620.43 rows=12000000 width=4) (actual time=1074.155..1074.156 rows=12000000 loops=1)
               Output: orders.o_orderkey
               Buckets: 262144  Batches: 128  Memory Usage: 5335kB
               ->  Index Only Scan using orders_pkey on public.orders  (cost=0.43..311620.43 rows=12000000 width=4) (actual time=0.014..464.346 rows=12000000 loops=1)
                     Output: orders.o_orderkey
                     Heap Fetches: 0
 Planning Time: 0.141 ms
 Execution Time: 10591.730 ms
(16 rows)

After the patch:

 Aggregate  (cost=2422401.83..2422401.84 rows=1 width=8) (actual time=9826.105..9826.106 rows=1 loops=1)
   Output: count(*)
   ->  Hash Join  (cost=508496.00..2302429.31 rows=47989008 width=0) (actual time=1087.588..8726.441 rows=47989007 loops=1)
         Inner Unique: true
         Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
         ->  Index Only Scan using lineitem_pkey on public.lineitem  (cost=0.56..1246171.69 rows=47989008 width=4) (actual time=0.015..1989.389 rows=47989007 loops=1)
               Output: lineitem.l_orderkey
               Heap Fetches: 0
         ->  Hash  (cost=311620.43..311620.43 rows=12000000 width=4) (actual time=1086.357..1086.358 rows=12000000 loops=1)
               Output: orders.o_orderkey
               Buckets: 262144  Batches: 128  Memory Usage: 5335kB
               ->  Index Only Scan using orders_pkey on public.orders  (cost=0.43..311620.43 rows=12000000 width=4) (actual time=0.011..470.225 rows=12000000 loops=1)
                     Output: orders.o_orderkey
                     Heap Fetches: 0
 Planning Time: 0.065 ms
 Execution Time: 9826.135 ms