Re: A new strategy for pull-up correlated ANY_SUBLINK

Alena Rybakina Thu, 12 Oct 2023 16:14:35 -0700

On 12.10.2023 10:52, Andy Fan wrote:

    Unfortunately, I found a request when sublink did not pull-up, as
    in the

    examples above. I couldn't quite figure out why.
I'm not sure what you mean with the "above", I guess it should be the"below"?


Yes, you are right)


    explain (analyze, costs off, buffers)
    select b.x, b.x, a.y
    from b
        left join a
            on b.x=a.x and
    *b.t in
                (select max(a0.t) *
                 from a a0
                 where a0.x = b.x and
                       a0.t = b.t);

...

       SubPlan 2


Here the sublink can't be pulled up because of its reference to

the LHS of left join, the original logic is that no matter the 'b.tin ..'

returns the true or false,  the rows in LHS will be returned.  If we
pull it up to LHS, some rows in LHS will be filtered out, which
breaks its original semantics.


Thanks for the explanation, it became more clear to me here.

    I thought it would be:

    explain (analyze, costs off, buffers)
    select b.x, b.x, a.y
    from b
        left join a on
            b.x=a.x and
    *b.t =
                (select max(a0.t) *
                 from a a0
                 where a0.x = b.x and
                       a0.t <= b.t);

    QUERY PLAN
    
---------------------------------------------------------------------------------------------------------------------
     Hash Right Join (actual time=1.181..67.927 rows=1000 loops=1)
       Hash Cond: (a.x = b.x)
    *Join Filter: (b.t = (SubPlan 2))*
       Buffers: shared hit=3546
       ->  Seq Scan on a (actual time=0.022..17.109 rows=100000 loops=1)
             Buffers: shared hit=541
       ->  Hash (actual time=1.065..1.068 rows=1000 loops=1)
             Buckets: 4096  Batches: 1  Memory Usage: 72kB
             Buffers: shared hit=5
             ->  Seq Scan on b (actual time=0.049..0.401 rows=1000
    loops=1)
                   Buffers: shared hit=5
       SubPlan 2
         ->  Result (actual time=0.025..0.025 rows=1 loops=1000)
               Buffers: shared hit=3000
               InitPlan 1 (returns $2)
                 ->  Limit (actual time=0.024..0.024 rows=1 loops=1000)
                       Buffers: shared hit=3000
                       ->  Index Only Scan Backward using a_t_x_idx on
    a a0 (actual time=0.023..0.023 rows=1 loops=1000)
                             Index Cond: ((t IS NOT NULL) AND (t <=
    b.t) AND (x = b.x))
                             Heap Fetches: 1000
                             Buffers: shared hit=3000
     Planning Time: 0.689 ms
     Execution Time: 68.220 ms
    (23 rows)

    If you noticed, it became possible after replacing the "in"
    operator with "=".

I didn't notice much difference between the 'in'  and '=',  maybe I
missed something?

It seems to me that the expressions "=" and "IN" are equivalent here dueto the fact that the aggregated subquery returns only one value, and theresult with the "IN" operation can be considered as the intersection ofelements on the left and right. In this query, we have some kind of seton the left, among which there will be found or not only one element onthe right. In general, this expression can be considered as b=const, sopush down will be applied to b and we can filter b during its scanningby the subquery's result.But I think your explanation is necessary here, that this is allpossible, because we can pull up the sublink here, since filtering isallowed on the right side (the nullable side) and does not break thesemantics of LHS. But in contrast, I also added two queries wherepull-up is impossible and it is not done here. Otherwise if filteringwas applied on the left it would be mistake.

To be honest, I'm not sure if this explanation is needed in the testanymore, so I didn't add it.


explain (costs off)
SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON A.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
                           QUERY PLAN
-----------------------------------------------------------------
 Nested Loop Left Join
   Join Filter: (SubPlan 2)
   ->  Seq Scan on tenk1 a
   ->  Materialize
         ->  Seq Scan on tenk2 b
   SubPlan 2
     ->  Result
           InitPlan 1 (returns $1)
             ->  Limit
                   ->  Index Scan using tenk2_hundred on tenk2 c
                         Index Cond: (hundred IS NOT NULL)
                         Filter: (odd = b.odd)
(12 rows)

explain (costs off)
SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON A.hundred in (SELECT count(c.hundred) FROM tenk2 C group by (c.odd));
            QUERY PLAN
-----------------------------------
 Nested Loop Left Join
   Join Filter: (hashed SubPlan 1)
   ->  Seq Scan on tenk1 a
   ->  Materialize
         ->  Seq Scan on tenk2 b
   SubPlan 1
     ->  HashAggregate
           Group Key: c.odd
           ->  Seq Scan on tenk2 c
(9 rows)

    I took the liberty of adding this to your patch and added myself
    as reviewer, if you don't mind.

Sure, the patch after your modification looks better than the original.
I'm not sure how the test case around "because of got one row" is
relevant to the current changes.  After we reach to some agreement
on the above discussion, I think v4 is good for committer to review!


Thank you!) I am ready to discuss it.

--
Regards,
Alena Rybakina

diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index b70b346696d..0b8f28b157f 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -2028,7 +2028,7 @@ ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
          ->  Seq Scan on tenk2 b
 (11 rows)
 
--- we can pull up the aggregate sublink into the subquery scan because of got one row.
+-- we can pull up the aggregate sublink into RHS of a left join.
 explain (costs off)
 SELECT * FROM tenk1 A LEFT JOIN tenk2 B
 ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
@@ -2052,7 +2052,6 @@ ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
                                                Filter: (odd = b.odd)
 (16 rows)
 
--- we can pull up the aggregate sublink into the JoinExpr.
 explain (costs off)
 SELECT * FROM tenk1 A LEFT JOIN tenk2 B
 ON B.hundred in (SELECT count(c.hundred) FROM tenk2 C group by (c.odd));
@@ -2071,3 +2070,39 @@ ON B.hundred in (SELECT count(c.hundred) FROM tenk2 C group by (c.odd));
                                  ->  Seq Scan on tenk2 c
 (11 rows)
 
+-- we can't pull up the aggregate sublink into LHS of a left join.
+explain (costs off)
+SELECT * FROM tenk1 A LEFT JOIN tenk2 B
+ON A.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
+                           QUERY PLAN                            
+-----------------------------------------------------------------
+ Nested Loop Left Join
+   Join Filter: (SubPlan 2)
+   ->  Seq Scan on tenk1 a
+   ->  Materialize
+         ->  Seq Scan on tenk2 b
+   SubPlan 2
+     ->  Result
+           InitPlan 1 (returns $1)
+             ->  Limit
+                   ->  Index Scan using tenk2_hundred on tenk2 c
+                         Index Cond: (hundred IS NOT NULL)
+                         Filter: (odd = b.odd)
+(12 rows)
+
+explain (costs off)
+SELECT * FROM tenk1 A LEFT JOIN tenk2 B
+ON A.hundred in (SELECT count(c.hundred) FROM tenk2 C group by (c.odd));
+            QUERY PLAN             
+-----------------------------------
+ Nested Loop Left Join
+   Join Filter: (hashed SubPlan 1)
+   ->  Seq Scan on tenk1 a
+   ->  Materialize
+         ->  Seq Scan on tenk2 b
+   SubPlan 1
+     ->  HashAggregate
+           Group Key: c.odd
+           ->  Seq Scan on tenk2 c
+(9 rows)
+
diff --git a/src/test/regress/sql/subselect.sql b/src/test/regress/sql/subselect.sql
index 5d33eb39baa..c02513c4989 100644
--- a/src/test/regress/sql/subselect.sql
+++ b/src/test/regress/sql/subselect.sql
@@ -1001,12 +1001,20 @@ explain (costs off)
 SELECT * FROM tenk1 A INNER JOIN tenk2 B
 ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
 
--- we can pull up the aggregate sublink into the subquery scan because of got one row.
+-- we can pull up the aggregate sublink into RHS of a left join.
 explain (costs off)
 SELECT * FROM tenk1 A LEFT JOIN tenk2 B
 ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
 
--- we can pull up the aggregate sublink into the JoinExpr.
 explain (costs off)
 SELECT * FROM tenk1 A LEFT JOIN tenk2 B
-ON B.hundred in (SELECT count(c.hundred) FROM tenk2 C group by (c.odd));
\ No newline at end of file
+ON B.hundred in (SELECT count(c.hundred) FROM tenk2 C group by (c.odd));
+
+-- we can't pull up the aggregate sublink into LHS of a left join.
+explain (costs off)
+SELECT * FROM tenk1 A LEFT JOIN tenk2 B
+ON A.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
+
+explain (costs off)
+SELECT * FROM tenk1 A LEFT JOIN tenk2 B
+ON A.hundred in (SELECT count(c.hundred) FROM tenk2 C group by (c.odd));
\ No newline at end of file

Re: A new strategy for pull-up correlated ANY_SUBLINK

Reply via email to