Re: [PR] Implement `DynamicFileSchemaProvider` in the core [datafusion]

via GitHub Fri, 16 Aug 2024 10:04:40 -0700


goldmedal commented on PR #11035:
URL: https://github.com/apache/datafusion/pull/11035#issuecomment-2293855497


   > Yes, I saw something like that in the code: using tmp_table as the default 
alias. But I'm not sure if it is the right way, because it might cause problems 
when resolving column names?
   
   @holicc 
   After some experimentation, I found that it's not straightforward. I tried 
implementing a `TableProvider` with a custom `get_logical_plan` method to set 
a default alias for the table. However, the plan returned by `get_logical_plan` 
is only applied during the analysis phase, which is too late to change column 
names because all projections have already been planned.
   
   The plan will look like this:
   ```sql
   > EXPLAIN SELECT sum(a) FROM '/Users/jax/git/datafusion/datafusion/core/tests/data/2.json'
   +---------------+-------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                                                    |
   +---------------+-------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Aggregate: groupBy=[[]], aggr=[[sum(/Users/jax/git/datafusion/datafusion/core/tests/data/2.json.a)]]                    |
   |               |   SubqueryAlias: /Users/jax/git/datafusion/datafusion/core/tests/data/2.json                                            |
   |               |     TableScan: ?url? projection=[a]                                                                                     |
   | physical_plan | AggregateExec: mode=Final, gby=[], aggr=[sum(/Users/jax/git/datafusion/datafusion/core/tests/data/2.json.a)]            |
   |               |   CoalescePartitionsExec                                                                                                |
   |               |     AggregateExec: mode=Partial, gby=[], aggr=[sum(/Users/jax/git/datafusion/datafusion/core/tests/data/2.json.a)]      |
   |               |       RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1                                              |
   |               |         JsonExec: file_groups={1 group: [[Users/jax/git/datafusion/datafusion/core/tests/data/2.json]]}, projection=[a] |
   |               |                                                                                                                         |
   +---------------+-------------------------------------------------------------------------------------------------------------------------+
   ```
   If we want to improve readability, we might need to create an `AnalyzerRule` 
for it. However, this is not easy due to the complexity of column resolution, 
as you mentioned. I think that we could address this issue in a separate pull 
request if needed.
   
   A simpler solution is to manually add an alias when querying:
   ```sql
   > EXPLAIN SELECT sum(a) FROM '/Users/jax/git/datafusion/datafusion/core/tests/data/2.json' as t
   +---------------+-------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                                                    |
   +---------------+-------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Aggregate: groupBy=[[]], aggr=[[sum(t.a)]]                                                                              |
   |               |   SubqueryAlias: t                                                                                                      |
   |               |     TableScan: /Users/jax/git/datafusion/datafusion/core/tests/data/2.json projection=[a]                               |
   | physical_plan | AggregateExec: mode=Final, gby=[], aggr=[sum(t.a)]                                                                      |
   |               |   CoalescePartitionsExec                                                                                                |
   |               |     AggregateExec: mode=Partial, gby=[], aggr=[sum(t.a)]                                                                |
   |               |       RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1                                              |
   |               |         JsonExec: file_groups={1 group: [[Users/jax/git/datafusion/datafusion/core/tests/data/2.json]]}, projection=[a] |
   |               |                                                                                                                         |
   +---------------+-------------------------------------------------------------------------------------------------------------------------+
   ```
   This is a straightforward way to produce a more readable plan without 
complicating the code.
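   
   If the query text is built programmatically, the alias can be generated on 
the caller's side. The following is only a rough sketch using the Rust 
standard library: `alias_for_path` and `aliased_from_clause` are hypothetical 
helper names, not DataFusion APIs, and the `t_` prefix is an assumption chosen 
so that the alias never starts with a digit.
   ```rust
   use std::path::Path;

   /// Hypothetical helper (not a DataFusion API): derive a short alias
   /// from the file stem of a path, e.g. ".../2.json" becomes "t_2".
   fn alias_for_path(path: &str) -> String {
       let stem = Path::new(path)
           .file_stem()
           .and_then(|s| s.to_str())
           .unwrap_or("t");
       // Lowercase ASCII letters and digits are kept; anything else becomes '_'.
       let cleaned: String = stem
           .chars()
           .map(|c| {
               if c.is_ascii_alphanumeric() || c == '_' {
                   c.to_ascii_lowercase()
               } else {
                   '_'
               }
           })
           .collect();
       // Assumed "t_" prefix so the alias never begins with a digit.
       format!("t_{}", cleaned)
   }

   /// Hypothetical helper: build the `'<path>' AS <alias>` fragment.
   fn aliased_from_clause(path: &str) -> String {
       // Double any single quote so the path stays a valid SQL string literal.
       let quoted = path.replace('\'', "''");
       format!("'{}' AS {}", quoted, alias_for_path(path))
   }

   fn main() {
       let path = "/Users/jax/git/datafusion/datafusion/core/tests/data/2.json";
       let sql = format!("EXPLAIN SELECT sum(a) FROM {}", aliased_from_clause(path));
       println!("{}", sql);
   }
   ```
   This keeps the planner untouched: the alias is part of the SQL text itself, 
so the plan should refer to the short alias rather than the full file path, as 
in the second plan above.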
   
   cc @alamb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


