Re: fix cost subqueryscan wrong parallel cost
> > > Suppose parallelism is not in use and that param_info is NULL. Then,
> > > is path->subpath->rows guaranteed to be equal to baserel->rows? If
> > > yes, then we don't need a three-part if statement as you propose
> > > here and can just change the "else" clause to say path->path.rows =
> > > path->subpath->rows. If no, then your change gives the wrong answer.
> >
> > I checked some regression tests. Sometimes a subquery scan has a filter,
> > so path->subpath->rows is guaranteed *not* to be equal to baserel->rows.
> > If the first patch is wrong, I don't know how to fix this;
> > it looks like I need someone's help.
>
> Please fix your mailer so that it doesn't send me a bounce message
> every time I reply to one of your messages on list.

This message was sent using Outlook.

> I don't know how to fix this right now either, then; maybe I or
> someone else will have a good idea later.

I don't know either.
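For readers following along: the question above is about how cost_subqueryscan should set path->path.rows so that it stays consistent with the subpath under parallelism. A minimal sketch of one way to do that, deriving the estimate from the subpath's row count times the selectivity of the quals applied at the scan instead of copying baserel->rows, is below; this is my paraphrase for illustration, not the patch under discussion:

	/*
	 * Sketch: derive the subquery scan's row estimate from its subpath.
	 * baserel->rows is the total (non-parallel-divided) estimate, so
	 * copying it disagrees with a partial subpath; scaling subpath->rows
	 * by the selectivity of the quals applied at this scan keeps the
	 * estimate consistent with whichever subpath is underneath.
	 */
	List	   *qpquals;

	if (param_info)
		qpquals = list_concat_copy(param_info->ppi_clauses,
								   baserel->baserestrictinfo);
	else
		qpquals = baserel->baserestrictinfo;

	path->path.rows = clamp_row_est(path->subpath->rows *
									clauselist_selectivity(root, qpquals,
														   0, JOIN_INNER,
														   NULL));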
partition wise aggregate wrong rows cost
A normal aggregate and a partition-wise aggregate produce very different row estimates:

begin;
create table t1(id integer, name text) partition by hash(id);
create table t1_0 partition of t1 for values with(modulus 3, remainder 0);
create table t1_1 partition of t1 for values with(modulus 3, remainder 1);
create table t1_2 partition of t1 for values with(modulus 3, remainder 2);
commit;

The normal aggregate's row estimate is 200:

explain (verbose) select count(1) from t1 group by id;
 HashAggregate  (cost=106.20..108.20 rows=200 width=12)   -- here rows is 200
   Output: count(1), t1.id
   Group Key: t1.id
   ->  Append  (cost=0.00..87.15 rows=3810 width=4)
         ->  Seq Scan on public.t1_0 t1_1  (cost=0.00..22.70 rows=1270 width=4)
               Output: t1_1.id
         ->  Seq Scan on public.t1_1 t1_2  (cost=0.00..22.70 rows=1270 width=4)
               Output: t1_2.id
         ->  Seq Scan on public.t1_2 t1_3  (cost=0.00..22.70 rows=1270 width=4)
               Output: t1_3.id

And the partition-wise aggregate's row estimate is 600:

set enable_partitionwise_aggregate = on;
explain (verbose) select count(1) from t1 group by id;
 Append  (cost=29.05..96.15 rows=600 width=12)   -- here rows is 600
   ->  HashAggregate  (cost=29.05..31.05 rows=200 width=12)   -- this per-child estimate looks the same as the normal aggregate's
         Output: count(1), t1.id
         Group Key: t1.id
         ->  Seq Scan on public.t1_0 t1  (cost=0.00..22.70 rows=1270 width=4)
               Output: t1.id
   ->  HashAggregate  (cost=29.05..31.05 rows=200 width=12)
         Output: count(1), t1_1.id
         Group Key: t1_1.id
         ->  Seq Scan on public.t1_1  (cost=0.00..22.70 rows=1270 width=4)
               Output: t1_1.id
   ->  HashAggregate  (cost=29.05..31.05 rows=200 width=12)
         Output: count(1), t1_2.id
         Group Key: t1_2.id
         ->  Seq Scan on public.t1_2  (cost=0.00..22.70 rows=1270 width=4)
               Output: t1_2.id

The source is PostgreSQL 15beta1 (7fdbdf204920ac279f280d0a8e96946fdaf41aef).
Re: partition wise aggregate wrong rows cost
> We try fairly hard to ensure that the row count estimate for a given relation
> does not vary across paths, so I concur with the OP that this is a bug. Having
> said that, I'm not sure that the consequences are significant. As you say, the
> estimates seem to get a lot closer as soon as the table has some statistics.
> (But nonetheless, they are not identical, so it's still a bug.)

Yes, the estimates get a lot closer as soon as the table has some statistics.

> I'm not sure that the consequences are significant.

At least it doesn't make any difference to me for now. I noticed this problem
while testing aggregation. It looked a little odd, so I emailed the list.
Thanks, everyone.
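To see the convergence described above, one can load some data, gather statistics, and compare the two plans again. A quick sketch using the t1 table from the original report (the data distribution and the resulting estimates are my assumptions, not from the report):

insert into t1 select i % 100, 'name_' || i from generate_series(1, 10000) i;
analyze t1;

explain select count(1) from t1 group by id;   -- estimate near the ~100 actual groups

set enable_partitionwise_aggregate = on;
explain select count(1) from t1 group by id;   -- per-partition estimates now sum to roughly the same total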
optimize hashjoin
Avoid an unnecessary tuple form and deform. In a TPC-H test, HashJoin speeds up by ~2x.

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 61480733a1..2dad0c8a55 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1627,6 +1627,23 @@ ExecHashTableInsert(HashJoinTable hashtable,
 {
 	bool		shouldFree;
 	MinimalTuple tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+	ExecHashTableInsertTuple(hashtable, tuple, hashvalue);
+
+	if (shouldFree)
+		heap_free_minimal_tuple(tuple);
+}
+
+/*
+ * ExecHashTableInsertTuple
+ *		insert a tuple into the hash table depending on the hash value;
+ *		it may just go to a temp file for later batches
+ */
+void
+ExecHashTableInsertTuple(HashJoinTable hashtable,
+						 MinimalTuple tuple,
+						 uint32 hashvalue)
+{
 	int			bucketno;
 	int			batchno;
@@ -1701,9 +1718,6 @@ ExecHashTableInsert(HashJoinTable hashtable,
 							  &hashtable->innerBatchFile[batchno],
 							  hashtable);
 	}
-
-	if (shouldFree)
-		heap_free_minimal_tuple(tuple);
 }
 
 /*
@@ -1777,12 +1791,10 @@ retry:
  * tuples that belong in the current batch once growth has been disabled.
 */
 void
-ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable,
-										TupleTableSlot *slot,
-										uint32 hashvalue)
+ExecParallelHashTableInsertCurrentBatchTuple(HashJoinTable hashtable,
+											 MinimalTuple tuple,
+											 uint32 hashvalue)
 {
-	bool		shouldFree;
-	MinimalTuple tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
 	HashJoinTuple hashTuple;
 	dsa_pointer shared;
 	int			batchno;
@@ -1798,6 +1810,21 @@ ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable,
 	HeapTupleHeaderClearMatch(HJTUPLE_MINTUPLE(hashTuple));
 	ExecParallelHashPushTuple(&hashtable->buckets.shared[bucketno],
 							  hashTuple, shared);
+}
+
+/*
+ * like ExecParallelHashTableInsertCurrentBatchTuple,
+ * but this function accepts a TupleTableSlot
+ */
+void
+ExecParallelHashTableInsertCurrentBatch(HashJoinTable hashtable,
+										TupleTableSlot *slot,
+										uint32 hashvalue)
+{
+	bool		shouldFree;
+	MinimalTuple tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);
+
+	ExecParallelHashTableInsertCurrentBatchTuple(hashtable, tuple, hashvalue);
 
 	if (shouldFree)
 		heap_free_minimal_tuple(tuple);
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 5f4073eabd..002098f129 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -194,10 +194,10 @@ static TupleTableSlot *ExecHashJoinOuterGetTuple(PlanState *outerNode,
 static TupleTableSlot *ExecParallelHashJoinOuterGetTuple(PlanState *outerNode,
 														 HashJoinState *hjstate,
 														 uint32 *hashvalue);
-static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
-												 BufFile *file,
-												 uint32 *hashvalue,
-												 TupleTableSlot *tupleSlot);
+static MinimalTuple ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
+											  BufFile *file,
+											  uint32 *hashvalue,
+											  StringInfo buf);
 static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
 static bool ExecParallelHashJoinNewBatch(HashJoinState *hjstate);
 static void ExecParallelHashJoinPartitionOuter(HashJoinState *hjstate);
@@ -831,6 +831,7 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflag
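To make the intent of the new tuple-level entry points concrete: when a batch file is reloaded, the MinimalTuple returned by the reworked ExecHashJoinGetSavedTuple can be inserted directly, with no slot round-trip. A rough sketch of the intended call path when reloading an inner batch (loop shape and variable names are my illustration, not the patch itself):

	/* Reload an inner batch file without materializing through a slot. */
	while ((tuple = ExecHashJoinGetSavedTuple(hjstate, innerFile,
											  &hashvalue, &buf)) != NULL)
	{
		/* no ExecForceStoreMinimalTuple()/ExecFetchSlotMinimalTuple() pair */
		ExecHashTableInsertTuple(hashtable, tuple, hashvalue);
	}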
Re: optimize hashjoin
> 0) The patch does not apply anymore, thanks to David committing a patch
> yesterday. Attached is a patch rebased on top of current master.

That patch was based on PG17. I have now rewritten it based on the master
branch and added some comments.

> 1) Wouldn't it be easier (and just as efficient) to use slots with
> TTSOpsMinimalTuple, i.e. backed by a minimal tuple?

If we use a different kind of slot, the ExecEvalExpr function reports an
error.

> 2) I find the naming of the new functions a bit confusing. We now have
> the "original" functions working with slots, and then also functions
> with "Tuple" working with tuples. Seems asymmetric.

In the new patch the functions are renamed to ExecHashTableInsertSlot and
ExecHashTableInsertTuple, and likewise ExecParallelHashTableInsertSlotCurrentBatch
and ExecParallelHashTableInsertTupleCurrentBatch.

> 3) The code in ExecHashJoinOuterGetTuple is hard to understand, it'd
> very much benefit from some comments. I'm a bit unsure if the likely()
> and unlikely() hints really help.

The new patch adds some comments. likely() and unlikely() might offer only
marginal help on some CPUs and might not be beneficial at all on other
platforms (I think).

> 4) Is the hj_outerTupleBuffer buffer really needed / effective? I'd bet
> just using palloc() will work just as well, thanks to the AllocSet
> caching etc.

The hj_outerTupleBuffer avoids reforming (materializing) the tuple in
non-TTSOpsMinimalTuple scenarios; see ExecForceStoreMinimalTuple. I added
some comments about this in the new patch.

> Can you provide more information about the benchmark you did? What
> hardware, what scale, PostgreSQL configuration, which of the 22 queries
> are improved, etc.
>
> I ran TPC-H with 1GB and 10GB scales on two machines, and I see pretty
> much no difference compared to master. However, it occurred to me the
> patch only ever helps if we increase the number of batches during
> execution, in which case we need to move tuples to the right batch.

Only the parallel HashJoin node speeds up to ~2x (with all data cached in
memory); the speedup is for the join itself, not the full query, and the
same applies to non-parallel HashJoin. A non-parallel HashJoin speeds up
only when the number of batches is larger than one, because this patch only
optimizes reading batch tuples back into memory.

optimize-hashjoin-master.patch
Description: Binary data
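For context on point 4): ExecForceStoreMinimalTuple has a cheap path only when the destination slot is a TTSOpsMinimalTuple slot; for any other slot type the minimal tuple must be converted, which is the reform cost being discussed. A simplified paraphrase of that function's dispatch (my summary; consult execTuples.c for the real code):

	if (TTS_IS_MINIMALTUPLE(slot))
		tts_minimal_store_tuple(slot, mtup, shouldFree);	/* cheap: slot adopts the tuple */
	else
	{
		/*
		 * Other slot types cannot hold a MinimalTuple directly, so the
		 * tuple has to be converted (e.g. copied into a heap tuple) --
		 * this is the "reform" the hj_outerTupleBuffer path avoids.
		 */
	}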
Re: optimize hashjoin
> * mtup is held in hjstate->hj_outerTupleBuffer, so we can pass
> * shouldFree as false when calling ExecForceStoreMinimalTuple().
> *
> * When the slot is TTSOpsMinimalTuple we can avoid reallocating memory for
> * a new MinimalTuple (reuse the StringInfo across calls to
> * ExecHashJoinGetSavedTuple).
>
> But my point was that I don't think the palloc/repalloc should be very
> expensive, once the AllocSet warms up a bit.

Avoiding the memory palloc/repalloc is just a side effect of avoiding the
tuple reform. (On the StringInfo reuse point, see the sketch after the test
results below.)

> * More importantly, in non-TTSOpsMinimalTuple scenarios, it can avoid
> * reforming (materializing) the tuple (see ExecForceStoreMinimalTuple).
>
> Yeah, but doesn't that conflate two things - materialization and freeing the
> memory? Only because materialization is expensive, is that a good reason to
> abandon the memory management too?

Currently, I haven't thought of a better way to avoid the reform.

> >> Can you provide more information about the benchmark you did? What
> >> hardware, what scale, PostgreSQL configuration, which of the 22
> >> queries are improved, etc.
> >>
> >> I ran TPC-H with 1GB and 10GB scales on two machines, and I see
> >> pretty much no difference compared to master. However, it occurred to
> >> me the patch only ever helps if we increase the number of batches
> >> during execution, in which case we need to move tuples to the right batch.
> >
> > Only parallel HashJoin speeds up to ~2x (all data cached in memory),
> > not the full query, including non-parallel HashJoin.
> > Non-parallel HashJoin speeds up only when the number of batches is
> > larger than one, because this patch only optimizes reading batch
> > tuples back into memory.
>
> I'm sorry, but this does not answer *any* of the questions I asked.
>
> Please provide enough info to reproduce the benefit - benchmark scale,
> which query, which parameters, etc. Show explain / explain analyze of the
> query without / with the patch, stuff like that.
>
> I ran a number of TPC-H benchmarks with the patch and I never saw a benefit
> of this scale.

After further testing, it turns out that the parallel hashjoin did not
improve performance. I might have compared against a debug build at the
time. I apologize for that. However, the non-parallel hashjoin indeed
showed about a 10% performance improvement.
Here is the testing information:

CPU: 13th Gen Intel(R) Core(TM) i7-13700
Memory: 32GB
SSD: UMIS REPEYJ512MKN1QWQ
Windows version: win11 23H2 22631.4037
WSL version: 2.2.4.0
Kernel version: 5.15.153.1-2
OS version: rocky linux 9.4
TPCH: SF=8

SQL:
set max_parallel_workers_per_gather = 0;
set enable_mergejoin = off;
explain (verbose,analyze) select count(*) from lineitem, orders where lineitem.l_orderkey = orders.o_orderkey;

Before the patch:

 Aggregate  (cost=2422401.83..2422401.84 rows=1 width=8) (actual time=10591.679..10591.681 rows=1 loops=1)
   Output: count(*)
   ->  Hash Join  (cost=508496.00..2302429.31 rows=47989008 width=0) (actual time=1075.213..9503.727 rows=47989007 loops=1)
         Inner Unique: true
         Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
         ->  Index Only Scan using lineitem_pkey on public.lineitem  (cost=0.56..1246171.69 rows=47989008 width=4) (actual time=0.023..1974.365 rows=47989007 loops=1)
               Output: lineitem.l_orderkey
               Heap Fetches: 0
         ->  Hash  (cost=311620.43..311620.43 rows=1200 width=4) (actual time=1074.155..1074.156 rows=1200 loops=1)
               Output: orders.o_orderkey
               Buckets: 262144  Batches: 128  Memory Usage: 5335kB
               ->  Index Only Scan using orders_pkey on public.orders  (cost=0.43..311620.43 rows=1200 width=4) (actual time=0.014..464.346 rows=1200 loops=1)
                     Output: orders.o_orderkey
                     Heap Fetches: 0
 Planning Time: 0.141 ms
 Execution Time: 10591.730 ms
(16 rows)

After the patch:

 Aggregate  (cost=2422401.83..2422401.84 rows=1 width=8) (actual time=9826.105..9826.106 rows=1 loops=1)
   Output: count(*)
   ->  Hash Join  (cost=508496.00..2302429.31 rows=47989008 width=0) (actual time=1087.588..8726.441 rows=47989007 loops=1)
         Inner Unique: true
         Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
         ->  Index Only Scan using lineitem_pkey on public.lineitem  (cost=0.56..1246171.69 rows=47989008 width=4) (actual time=0.015..1989.389 rows=47989007 loops=1)
               Output: lineitem.l_orderkey
               Heap Fetches: 0
         ->  Hash  (cost=311620.43..311620.43 rows=1200 width=4) (actual time=1086.357..1086.358 rows=1200 loops=1)
               Output: orders.o_orderkey
               Buckets: 262144  Batches: 128  Memory Usage: 5335kB
               ->  Index Only Scan using orders_pkey on public.orders  (cost=0.43..311620.43 rows=1200 width=4) (actual time=0.011..470.225 rows=1200 loops=1)
                     Output: orders.o_orderkey
                     Heap Fetches: 0
 Planning Time: 0.065 ms
 Executi
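As promised above, here is a minimal sketch of my reading of the StringInfo reuse technique: the saved-tuple reader returns a MinimalTuple backed by a caller-owned, reusable buffer instead of palloc'ing a fresh tuple per call. Names and details are paraphrased from the patch, not the exact code:

	static MinimalTuple
	ExecHashJoinGetSavedTuple(HashJoinState *hjstate, BufFile *file,
							  uint32 *hashvalue, StringInfo buf)
	{
		uint32		header[2];	/* [0] = hash value, [1] = MinimalTuple t_len */

		if (BufFileRead(file, header, sizeof(header)) == 0)
			return NULL;		/* end of this batch file */

		*hashvalue = header[0];

		/* grow the caller's buffer if needed, then read the tuple body into it */
		resetStringInfo(buf);
		enlargeStringInfo(buf, header[1]);
		((MinimalTuple) buf->data)->t_len = header[1];
		BufFileReadExact(file, buf->data + sizeof(uint32),
						 header[1] - sizeof(uint32));

		return (MinimalTuple) buf->data;
	}

Because the same StringInfo is passed in on every call, steady-state reads of a batch file do no per-tuple allocation at all; the returned tuple is only valid until the next call, which is fine for the insert-into-hashtable loop shown earlier.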