Re: Removing unneeded self joins

Alena Rybakina Sat, 05 Apr 2025 14:02:19 -0700

On 05.04.2025 13:09, Alexander Korotkov wrote:

On Fri, Apr 4, 2025 at 11:35 AM Andrei Lepikhov<[email protected]>  wrote:

On 4/4/25 04:53, Richard Guo wrote:

On Fri, Apr 4, 2025 at 1:02 AM Alexander Korotkov<[email protected]>  wrote:

I've got an off-list bug report from Alexander Lakhin involving a
placeholder variable.  Alena and Andrei proposed a fix.  It is fairly
simple: we just shouldn't remove PHVs during self-join elimination, as
they might still be referenced from other parts of a query.  The patch
is attached.  I'm going to fix this if no objections.

Hmm, I'm not sure about the fix.  It seems to me that it simply
prevents removing any PHVs in the self-join removal case.  My concern
is that this might result in PHVs that could actually be removed not
being removed in many cases.

Let's play with use cases:
If a PHV is needed in the inner or outer only, it means we have a clause
in the baserestrictinfo that will be transferred to the keeping
relation, and we shouldn't remove the PHV.
Another case is when the PHV is needed in a join clause of the
self-join. I may imagine such a case:


toKeep.x+toRemove.y=PHV

This clause will be transformed to "toKeep.x+toKeep.y=PHV", pushed to
baserestrictinfo of keeping relation and should be saved.
I think it is possible to invent quite a narrow case of clause like the
following:

PHV_evaluated_at_inner = PHV_evaluated_at_outer

It needs to prove reproducibility. But even if it makes sense, it seems
to have no danger for further selectivity estimation compared to the
source clause and is a too-narrow case, isn't it?
In other cases, this PHV is needed something else, and we can't remove it.

I agree with that, and I also believe this case will be quite rare inpractice.

Should we add more regression tests covering these cases?

I experimented with some examples like this and noticed that it doesaffect cardinality estimation, though I'm not sure the impact issignificant.I used the tables from the regression tests, so if they’re appropriatefor reproducing this case, it should be straightforward to add them.


EXPLAIN (analyze, VERBOSE)
SELECT 1
FROM tbl_phv t1
LEFT JOIN (
    SELECT 1 AS extra, x, y FROM tbl_phv tl
) t3
JOIN (
    SELECT y, x FROM tbl_phv tr
) t4 ON t4.y = t3.y and (t4.x*0) = (t3.x *0)
ON true
WHERE t3.extra IS NOT NULL
  AND t4.x + t3.y= (t1.x * 0);


The query plan with transformation:
------------------------------------------------------------------------------------------------------------------------------

Hash Join (cost=3.27..7.16 rows=100 width=4) (actualtime=0.149..0.151 rows=0.00 loops=1)

   Output: 1
   Hash Cond: ((tr.x + tr.y) = (t1.x * 0))
   Buffers: shared hit=2

-> Seq Scan on public.tbl_phv tr (cost=0.00..2.26 rows=100width=8) (actual time=0.027..0.047 rows=100.00 loops=1)

         Output: tr.x, tr.y
         Filter: ((1 IS NOT NULL) AND ((tr.x * 0) IS NOT NULL))
         Rows Removed by Filter: 1
         Buffers: shared hit=1

-> Hash (cost=2.01..2.01 rows=101 width=4) (actualtime=0.078..0.079 rows=100.00 loops=1)

         Output: t1.x
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         Buffers: shared hit=1

-> Seq Scan on public.tbl_phv t1 (cost=0.00..2.01 rows=101width=4) (actual time=0.005..0.038 rows=101.00 loops=1)

               Output: t1.x
               Buffers: shared hit=1
 Planning Time: 0.607 ms
 Execution Time: 0.194 ms
(18 rows)

The query plan without transformation:

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------

Hash Join (cost=6.09..8.74 rows=1 width=4) (actual time=0.248..0.250rows=0.00 loops=1)

Output:1
Hash Cond:((t1.x * 0) = (tr.x + tl.y))
Buffers:shared hit=3

-> Seq Scan on public.tbl_phv t1 (cost=0.00..2.01 rows=101 width=4)(actual time=0.011..0.021 rows=101.00 loops=1)

Output:t1.x, t1.y
Buffers:shared hit=1

-> Hash (cost=6.08..6.08 rows=1 width=8) (actual time=0.206..0.207rows=100.00 loops=1)

Output:tl.y, tr.x
Buckets:1024 Batches:1 Memory Usage:12kB
Buffers:shared hit=2

-> Hash Join (cost=3.51..6.08 rows=1 width=8) (actual time=0.090..0.160rows=100.00 loops=1)

Output:tl.y, tr.x
Inner Unique:true
Hash Cond:((tr.y = tl.y) AND ((tr.x * 0) = (tl.x * 0)))
Buffers:shared hit=2

-> Seq Scan on public.tbl_phv tr (cost=0.00..2.01 rows=101 width=8)(actual time=0.005..0.016 rows=101.00 loops=1)

Output:tr.x, tr.y
Buffers:shared hit=1

-> Hash (cost=2.01..2.01 rows=100 width=8) (actual time=0.080..0.080rows=100.00 loops=1)

Output:tl.y, tl.x
Buckets:1024 Batches:1 Memory Usage:12kB
Buffers:shared hit=1

-> Seq Scan on public.tbl_phv tl (cost=0.00..2.01 rows=100 width=8)(actual time=0.008..0.035 rows=101.00 loops=1)

Output:tl.y, tl.x
Filter:(1 IS NOT NULL)
Buffers:shared hit=1
Planning:
Buffers:shared hit=25
Planning Time:0.609 ms
Execution Time:0.319 ms
(31 rows)

EXPLAIN (analyze, VERBOSE)
SELECT 1
FROM tbl_phv t1
LEFT JOIN (
    SELECT 1 AS extra, x, y FROM tbl_phv tl
) t3
JOIN (
    SELECT y, x FROM tbl_phv tr
) t4 ON t4.y = t3.y
ON true
WHERE t3.extra IS NOT NULL
  AND t4.x + t3.y= (t1.x * 0);


The query plan with transformation:
------------------------------------------------------------------------------------------------------------------------------

Hash Join (cost=3.26..6.90 rows=100 width=4) (actualtime=0.145..0.146 rows=0.00 loops=1)

   Output: 1
   Hash Cond: ((t1.x * 0) = (tr.x + tr.y))
   Buffers: shared hit=2

-> Seq Scan on public.tbl_phv t1 (cost=0.00..2.01 rows=101width=4) (actual time=0.030..0.040 rows=101.00 loops=1)

         Output: t1.x, t1.y
         Buffers: shared hit=1

-> Hash (cost=2.01..2.01 rows=100 width=8) (actualtime=0.081..0.081 rows=100.00 loops=1)

         Output: tr.y, tr.x
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         Buffers: shared hit=1

-> Seq Scan on public.tbl_phv tr (cost=0.00..2.01 rows=100width=8) (actual time=0.012..0.040 rows=101.00 loops=1)

               Output: tr.y, tr.x
               Filter: (1 IS NOT NULL)
               Buffers: shared hit=1
 Planning Time: 0.571 ms
 Execution Time: 0.195 ms
(17 rows)

The query plan without transformation:

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------

Hash Join (cost=6.53..9.94 rows=50 width=4) (actual time=0.236..0.238rows=0.00 loops=1)

Output:1
Hash Cond:((tr.x + tl.y) = (t1.x * 0))
Buffers:shared hit=3

-> Hash Join (cost=3.26..5.55 rows=100 width=8) (actualtime=0.075..0.137 rows=101.00 loops=1)

Output:tl.y, tr.x
Inner Unique:true
Hash Cond:(tr.y = tl.y)
Buffers:shared hit=2

-> Seq Scan on public.tbl_phv tr (cost=0.00..2.01 rows=101 width=8)(actual time=0.005..0.017 rows=101.00 loops=1)

Output:tr.x, tr.y
Buffers:shared hit=1

-> Hash (cost=2.01..2.01 rows=100 width=4) (actual time=0.064..0.065rows=101.00 loops=1)

Output:tl.y
Buckets:1024 Batches:1 Memory Usage:12kB
Buffers:shared hit=1

-> Seq Scan on public.tbl_phv tl (cost=0.00..2.01 rows=100 width=4)(actual time=0.006..0.033 rows=101.00 loops=1)

Output:tl.y
Filter:(1 IS NOT NULL)
Buffers:shared hit=1

-> Hash (cost=2.01..2.01 rows=101 width=4) (actual time=0.078..0.078rows=100.00 loops=1)

Output:t1.x
Buckets:1024 Batches:1 Memory Usage:12kB
Buffers:shared hit=1

-> Seq Scan on public.tbl_phv t1 (cost=0.00..2.01 rows=101 width=4)(actual time=0.014..0.039 rows=101.00 loops=1)

Output:t1.x
Buffers:shared hit=1
Planning:
Buffers:shared hit=12
Planning Time:0.478 ms
Execution Time:0.293 ms
(31 rows)

--
Regards,
Alena Rybakina
Postgres Professional

Re: Removing unneeded self joins

Reply via email to