sgrebnov opened a new pull request, #13132:
URL: https://github.com/apache/datafusion/pull/13132

   ## Which issue does this PR close?
   
   With filters pushdown optimization enabled,  `TableScan`  filters and 
`LogicalPlan::Filter`  w/o dedicated Projection incorrectly apply after join 
statement which leads to incorrect result, for example
   
   Original query (TPC-H Q13)
   ```sql
   select
               c_custkey,
               count(o_orderkey)
           from
               customer left join orders on
                           c_custkey = o_custkey
                       and o_comment not like '%special%requests%'
           group by
               c_custkey
   ```
   is unparsed as below. You can see that original filter from Join is unparsed 
after join as `where`. The difference is that original query will contain 
customer records where there is no matching order and the second version does 
not (filter is applied after join).
   
   ```sql
   select
       "customer"."c_custkey",
       count("orders"."o_orderkey")
     from
       "customer"
     left join "orders" on
       ("customer"."c_custkey" = "orders"."o_custkey")
     where
       "orders"."o_comment" not like '%special%requests%'
     group by
       "customer"."c_custkey"
     order by "customer"."c_custkey" asc
   ```
   
   ## What changes are included in this PR?
   
   The PR improves unparsing by determining if the join is a simple table scan 
(not a subquery that will be translated to a full subquery (`SELECT .. FROM 
..`)) so that filters must be applied directly to the join. In the case of a 
full subquery, filters will be applied within the subquery, and this works 
correctly. This fixes TPC-H Q12 (filters on the left join were missing/not 
applied at all) and TPC-H Q13 (the filter was applied after the join instead of 
during/before, as shown above).
   
   ```console
   SELECT "customer"."c_custkey", count("orders"."o_orderkey") FROM "customer" 
LEFT JOIN "orders" ON (("customer"."c_custkey" = "orders"."o_custkey") AND 
"orders"."o_comment" NOT LIKE '%special%requests%') GROUP BY 
"customer"."c_custkey" 
   
   
   sql> explain select
               c_custkey,
               count(o_orderkey)
           from
               customer left join orders on
                           c_custkey = o_custkey
                       and o_comment not like '%special%requests%'
           group by
               c_custkey;
   
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                   |
   
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | BytesProcessedNode                                         
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                   |
   |               |   Federated                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                   |
   |               |  Projection: customer.c_custkey, count(orders.o_orderkey)  
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                   |
   |               |   Aggregate: groupBy=[[customer.c_custkey]], 
aggr=[[count(orders.o_orderkey)]]                                               
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                 |
   |               |     Left Join:  Filter: customer.c_custkey = 
orders.o_custkey                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                 |
   |               |       TableScan: customer                                  
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                   |
   |               |       TableScan: orders, full_filters=[orders.o_comment 
NOT LIKE Utf8("%special%requests%")]                                            
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                      |
   | physical_plan | BytesProcessedExec                                         
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                   |
   |               |   SchemaCastScanExec                                       
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                   |
   |               |     VirtualExecutionPlan name=postgres 
compute_context=host=Tcp("localhost"),port=6432,db=tpch_sf1,user=postgres, 
sql=SELECT customer.c_custkey, count(orders.o_orderkey) FROM customer LEFT JOIN 
orders ON ((customer.c_custkey = orders.o_custkey) AND orders.o_comment NOT 
LIKE '%special%requests%') GROUP BY customer.c_custkey rewritten_sql=SELECT 
"customer"."c_custkey", count("orders"."o_orderkey") FROM "customer" LEFT JOIN 
"orders" ON (("customer"."c_custkey" = "orders"."o_custkey") AND 
"orders"."o_comment" NOT LIKE '%special%requests%') GROUP BY 
"customer"."c_custkey" |
   |               |                                                            
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                   |
   
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   ```
   
   **Note:** The alternative approach considered was always unparsing join 
sub-nodes by adding a wildcard projection if needed. However, it requires 
introducing additional alias for a subquery (and updating corresponding upper 
nodes to match alias added) as not all engines support adding subquery with the 
same name as original table, for example `JOIN (SELECT * FROM "orders" WHERE 
"orders"."o_comment" not like '%special%requests%') as orders ..` - `as orders` 
needs to be replaced with new alias for most engines
   
   ## Are these changes tested?
   
   Added unit test, tested as part of TPC-H and TPC-DS queries unparsing by 
https://github.com/spiceai/spiceai (running benchmarks with some filters 
pushdown optimizations enabled)
   
   ## Are there any user-facing changes?
   
   Fixes unparsing issues related to incorrectly generated 'JOIN' filters 
(incorrect query result).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to