andygrove commented on issue #69:
URL: https://github.com/apache/datafusion-ray/issues/69#issuecomment-2692500974

> I think, ideally, this new mode of execution diverges as little as possible from the existing mode of execution and we can use the exact same physical plans in both cases.

Yes.

> For this to be true, once all stages are created by the `DFRayDataFrame`, we should arrange them into a DAG and execute them leaves up.

Yes.

> Assumptions, to make sure I understand what you're thinking here:
>
> * we would execute each stage to completion before moving up the DAG

Yes, although we can execute all leaf plans in parallel if resources are available.

> * we would change (new mode) the `ProcessorService` to, instead of making the stage partitions available for streaming via ArrowFlight, it would be asked to execute and materialize the results fully.

Yes.

> * these results would be in Arrow IPC format

Yes.

> * The `DFRayStageReaderExec` would be changed (new mode) to, instead of streaming input from the remote stage via Arrow Flight, stream input from the previous materialized values

Yes.

> Being that this is on Ray we have some choices:
>
> * Use the Ray object store for intermediate results:

I was thinking that each stage would just stream results directly to disk shuffle files, but we can try some different approaches.

> Priority wise, I'd like to release `0.1.0` first and then add this functionality. WDYT

Absolutely.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: github-h...@datafusion.apache.org