milenkovicm commented on PR #1338: URL: https://github.com/apache/datafusion-ballista/pull/1338#issuecomment-3476863366
The long-term plan, and the best one, would be to plug Ballista's physical planner directly into datafusion-python. Ideally, we do not want to maintain many Python classes in Ballista; we should rely on pydf instead.

The main idea of this approach is to extend `DataFrame` and `SessionContext` to intercept the methods that create a `DataFrame` and the methods that actually execute the plan, such as `show`, `collect`, `write` ... When an 'execute' method is invoked, we would create a `BallistaSessionContext` and a `BallistaDataFrame` (which internally has a Ballista physical planner) and execute that method on the Ballista context.

Regarding your concern, @timsaucer, I'm not sure that I fully understand it. The current goal is for Ballista to expose only `BallistaSessionContext` and nothing else; hopefully all other classes can be reused from the "single node" work. So full portability of the code is the target (well, if we can get UDFs serialised).

Current risks I see:

- we have two session contexts (the pydf one, and the Ballista session context recreated on 'plan execution')
- we miss some of the `DataFrame` creation methods, and the plan executes on the single-node context
- too much duplicated code from `pydf`
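
To make the interception idea concrete, here is a minimal sketch. It does not use the real datafusion-python or Ballista APIs: `SessionContext` and `DataFrame` below are simplified stand-ins for the pydf classes, plans are plain strings, and `execute`, `execute_distributed`, and the scheduler URL are hypothetical names invented for illustration. It only shows the shape of the approach, under those assumptions.

```python
class SessionContext:
    """Hypothetical stand-in for the single-node (pydf) session context."""

    def sql(self, query):
        return DataFrame(self, f"Sql({query!r})")

    def read_csv(self, path):
        return DataFrame(self, f"CsvScan({path!r})")

    def execute(self, plan):
        # Stand-in for single-node execution.
        return f"single-node: {plan}"


class DataFrame:
    """Hypothetical stand-in for the single-node (pydf) DataFrame."""

    def __init__(self, ctx, plan):
        self.ctx = ctx
        self.plan = plan

    def filter(self, predicate):
        # Transformations only build a new plan; type(self) preserves subclasses.
        return type(self)(self.ctx, f"Filter({predicate}, {self.plan})")

    def collect(self):
        return self.ctx.execute(self.plan)


class BallistaDataFrame(DataFrame):
    """Reuses the plan-building methods; intercepts only 'execute' methods."""

    def collect(self):
        # Intercepted: hand the plan to the Ballista side instead.
        return self.ctx.execute_distributed(self.plan)


class BallistaSessionContext(SessionContext):
    """The only class a user would need to swap in (assumption of this sketch)."""

    def __init__(self, scheduler_url):
        self.scheduler_url = scheduler_url

    def sql(self, query):
        # Intercepted DataFrame-creating method: rewrap the single-node result.
        local = super().sql(query)
        return BallistaDataFrame(self, local.plan)

    # read_csv is deliberately NOT intercepted, to illustrate the second risk.

    def execute_distributed(self, plan):
        # Stand-in for planning with Ballista's physical planner and executing.
        return f"ballista[{self.scheduler_url}]: {plan}"


ctx = BallistaSessionContext("df://localhost:50050")
distributed = ctx.sql("select 1").filter("a > 1").collect()
missed = ctx.read_csv("data.csv").collect()
print(distributed)
print(missed)
```

In this sketch the `sql` path runs on the Ballista side, while the `read_csv` path silently falls back to single-node execution because its creation method was not intercepted. That is the second risk listed above.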
