Dandandan commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740739642

   > Note that late materialization (the join / semi join rewrite) needs join 
operator support that DataFusion doesn't yet have (we could add it but it will 
take non trivial effort)
   
   I did not fully get this part. DF has semi join support and some rewrites to 
utilize it in similar cases?
   The query transformation in SQL as given by @xudong963 is optimized to a 
SEMI join + TopK, so I think it could be implemented as logical optimization 
rule (i.e. adding a filter with subquery on the ids).
   
   ```
   > CREATE TABLE t (a int, b int, row_id int);
   0 row(s) fetched. 
   Elapsed 0.004 seconds.
   
   > EXPLAIN (WITH ids AS (SELECT row_id, a FROM t ORDER BY a LIMIT 10)
   SELECT t.* FROM t JOIN ids WHERE t.row_id IN (SELECT row_id FROM ids));
   
+---------------+--------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                |
   
+---------------+--------------------------------------------------------------------------------------------+
   | logical_plan  | LeftSemi Join: t.row_id = __correlated_sq_1.row_id         
                                |
   |               |   Cross Join:                                              
                                |
   |               |     TableScan: t projection=[a, b, row_id]                 
                                |
   |               |     SubqueryAlias: ids                                     
                                |
   |               |       Projection:                                          
                                |
   |               |         Sort: t.a ASC NULLS LAST, fetch=10                 
                                |
   |               |           TableScan: t projection=[a]                      
                                |
   |               |   SubqueryAlias: __correlated_sq_1                         
                                |
   |               |     SubqueryAlias: ids                                     
                                |
   |               |       Projection: t.row_id                                 
                                |
   |               |         Sort: t.a ASC NULLS LAST, fetch=10                 
                                |
   |               |           Projection: t.row_id, t.a                        
                                |
   |               |             TableScan: t projection=[a, row_id]            
                                |
   | physical_plan | CoalesceBatchesExec: target_batch_size=8192                
                                |
   |               |   HashJoinExec: mode=Partitioned, join_type=LeftSemi, 
on=[(row_id@2, row_id@0)]            |
   |               |     CrossJoinExec                                          
                                |
   |               |       DataSourceExec: partitions=1, partition_sizes=[0]    
                                |
   |               |       ProjectionExec: expr=[]                              
                                |
   |               |         SortExec: TopK(fetch=10), expr=[a@0 ASC NULLS 
LAST], preserve_partitioning=[false] |
   |               |           DataSourceExec: partitions=1, 
partition_sizes=[0]                                |
   |               |     ProjectionExec: expr=[row_id@0 as row_id]              
                                |
   |               |       SortExec: TopK(fetch=10), expr=[a@1 ASC NULLS LAST], 
preserve_partitioning=[false]   |
   |               |         DataSourceExec: partitions=1, partition_sizes=[0]  
                                |
   |               |                                                            
                                |
   
+---------------+--------------------------------------------------------------------------------------------+
   2 row(s) fetched. 
   Elapsed 0.005 seconds.
   
   > 
   ````


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to