milenkovicm opened a new issue, #1164:
URL: https://github.com/apache/datafusion-ballista/issues/1164

   ### Is your feature request related to a problem or challenge?
   
   With apache/datafusion#14079 merged, we're a step closer to supporting 
`INSERT INTO` in Ballista.
   The latest issue is that the scheduler can't find the table reference specified in 
`DML.table_name` (of type `TableReference`).
   
   This happens because Ballista has two separate, unsynchronized session 
contexts: the client context and the corresponding scheduler context. A table 
registered on the client is therefore unknown to the scheduler.
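
To make the failure mode concrete, here is a minimal sketch using a toy model. `ToyContext` and its methods are hypothetical stand-ins, not the DataFusion or Ballista API; the point is only that a table registered in one context cannot be resolved in the other.

```rust
use std::collections::HashMap;

/// Toy stand-in for a session context: a map from table name to location.
/// This is NOT the DataFusion `SessionContext`, just an illustration.
#[derive(Default)]
struct ToyContext {
    tables: HashMap<String, String>,
}

impl ToyContext {
    /// Register a table in this context only.
    fn register_table(&mut self, name: &str, location: &str) {
        self.tables.insert(name.to_string(), location.to_string());
    }

    /// Resolve a table reference, as a planner would need to do for the
    /// target of an `INSERT INTO`.
    fn resolve(&self, name: &str) -> Result<&str, String> {
        self.tables
            .get(name)
            .map(String::as_str)
            .ok_or_else(|| format!("table '{}' not found", name))
    }
}

/// Registers a table on the client only, then tries to resolve it on both
/// sides. Returns (client_ok, scheduler_ok).
fn demo_unsynchronized() -> (bool, bool) {
    let mut client = ToyContext::default();
    let scheduler = ToyContext::default();
    client.register_table("t", "file:///tmp/t.parquet");
    (client.resolve("t").is_ok(), scheduler.resolve("t").is_ok())
}
```

In this model the client resolves `t` but the scheduler does not, which mirrors the lookup failure described above.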
   
   ### Describe the solution you'd like
   
   I do not have a good or preferred solution at this point, so I'm asking for 
opinions.
   Ideally, the solution should be flexible.
   
   ### Describe alternatives you've considered
   
   I have a few alternatives, none of which is ideal. Have I missed something?
   
   #### Replace TableReference with actual table in the `LogicalPlan::DML`
   
   The initial idea was to replace `TableReference` with the actual table in the plan, 
but that would not work because of the table provider lookup needed to create the 
`insert into` exec: 
<https://github.com/milenkovicm/arrow-datafusion-fork/blob/dc22b3fc846c23f69325be6e11c8ef204c3dc6be/datafusion/core/src/physical_planner.rs#L550>
   
   I'm not convinced it will work.
   
   #### Propagate DDL Statements to QueryPlanner
   
   `BallistaQueryPlanner` is in charge of client-scheduler communication. At 
the moment it does not propagate DDL statements from client to scheduler. It 
could be modified to handle `DDL` statements. The problem is that 
`SessionContext` executes `DDL` statements immediately and swaps 
`LogicalPlan::DDL` for `LogicalPlan::Empty`, so no `DDL` 
information reaches the planner.
   
   Looking at the DataFusion code, I'm not sure this could be changed in 
`SessionContext` without major disruption.
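
The eager-execution behaviour described in this section can be sketched with a toy model. `ToyPlan` and `ToyClientContext` are hypothetical names invented for illustration and the variants only loosely mirror the plan kinds mentioned above; this is not how DataFusion is implemented.

```rust
use std::collections::HashSet;

/// Toy plan type with one DDL-like variant, one DML-like variant,
/// and an empty placeholder. Hypothetical, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum ToyPlan {
    CreateTable(String),
    InsertInto(String),
    Empty,
}

/// Toy client-side context that applies DDL eagerly.
#[derive(Default)]
struct ToyClientContext {
    tables: HashSet<String>,
}

impl ToyClientContext {
    /// Returns the plan that would be handed on to the query planner.
    /// DDL is applied to the local catalog right away and replaced with
    /// `Empty`, so the planner never sees it.
    fn execute(&mut self, plan: ToyPlan) -> ToyPlan {
        match plan {
            ToyPlan::CreateTable(name) => {
                self.tables.insert(name);
                ToyPlan::Empty
            }
            other => other,
        }
    }
}
```

In this sketch, `execute(ToyPlan::CreateTable(..))` updates the client catalog and returns `ToyPlan::Empty`, while an `InsertInto` plan passes through unchanged. A planner sitting behind `execute` therefore has nothing to forward to the scheduler for the DDL case.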
   
   #### Synchronize Catalogs Between Client and Scheduler
   
   `INSERT INTO` will work if the scheduler catalog has the table information, so some 
kind of remote catalog would help. As it would hurt the user experience if a remote 
catalog had to be set up, this option is not the first choice.
   
   We could come up with a Ballista catalog (schema registry) that 
synchronizes catalog state between the client and the scheduler, though 
it could take a bit of work given the non-async methods exposed by 
`SchemaCatalog`.
   
   Finally, as `SchemaProvider.table` is async, a table could be lazily 
registered the first time it is needed by a query plan. This would require 
a custom `SchemaProvider` on the client side.
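
One possible reading of the lazy-registration idea is sketched below with a toy, synchronous model. `LazySyncSchema` and its fields are hypothetical; the real `SchemaProvider::table` is async and has a different signature, and the `pushed` log merely stands in for whatever call would register the table with the scheduler.

```rust
use std::collections::{HashMap, HashSet};

/// Toy client-side schema that pushes a table to the (hypothetical)
/// scheduler the first time a query plan looks it up.
struct LazySyncSchema {
    /// Tables registered locally: name -> definition.
    local: HashMap<String, String>,
    /// Names already pushed to the scheduler.
    synced: HashSet<String>,
    /// Log of pushes, standing in for a remote registration call.
    pushed: Vec<String>,
}

impl LazySyncSchema {
    fn new() -> Self {
        Self {
            local: HashMap::new(),
            synced: HashSet::new(),
            pushed: Vec::new(),
        }
    }

    /// Local registration only; nothing is sent to the scheduler yet.
    fn register_table(&mut self, name: &str, definition: &str) {
        self.local.insert(name.to_string(), definition.to_string());
    }

    /// Lookup used during planning. On the first hit for a given name,
    /// the table is also pushed to the scheduler; later hits are local.
    fn table(&mut self, name: &str) -> Option<String> {
        let definition = self.local.get(name)?.clone();
        if self.synced.insert(name.to_string()) {
            self.pushed.push(name.to_string());
        }
        Some(definition)
    }
}
```

With this shape, registering a table costs nothing remotely, and only tables that a plan actually touches are ever sent across. The sketch ignores table drops and redefinitions, which a real implementation would also have to propagate.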
   
   #### Synchronize Contexts on ExecuteQuery
   
   Implement some kind of tracking logic, triggered on 
`ExecuteQuery`, that synchronizes the SchemaRegistry between client and 
scheduler.
   
   I'm not really keen on this solution as I believe it will get very 
complicated very quickly.
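
As a rough sketch of what the simplest form of such tracking might compute, the function below diffs the table names known to each side. `CatalogDelta` and `catalog_delta` are hypothetical names, and comparing names alone is an assumption made for brevity.

```rust
use std::collections::BTreeSet;

/// What would have to be sent to the scheduler before a query runs.
#[derive(Debug, PartialEq)]
struct CatalogDelta {
    /// Tables the client has but the scheduler does not.
    to_register: Vec<String>,
    /// Tables the scheduler still has but the client dropped.
    to_drop: Vec<String>,
}

/// Compare table names on both sides. `BTreeSet` keeps the output sorted.
fn catalog_delta(client: &BTreeSet<String>, scheduler: &BTreeSet<String>) -> CatalogDelta {
    CatalogDelta {
        to_register: client.difference(scheduler).cloned().collect(),
        to_drop: scheduler.difference(client).cloned().collect(),
    }
}
```

Even this toy version hints at the complexity: a name-only diff misses tables that were dropped and re-created with a different definition, so real tracking would need versioning or change events on top of it.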
   
   #### Modify Ballista Protocol to send PhysicalPlans
   
   At the moment the client sends a LogicalPlan to the scheduler, which 
converts it to a physical plan. At this point we need the table 
reference. I was wondering whether we could resolve the physical plan on the client side, 
but split it into stages on the server side.
   
   This would be quite a big change, so I'm asking if anybody remembers why 
the logical plan was selected as the exchange format instead of the physical plan.
   
   ### Additional context
   


