robtandy commented on issue #69: URL: https://github.com/apache/datafusion-ray/issues/69#issuecomment-2692393525
This makes sense to me. DFRay as written tries to lean into low-latency execution at all costs, though it does look like it will provide a lot of utility for longer-running distributed execution as well.

One of the things DFRay is trying to do is not rewrite or replace any of the `ExecutionPlan`s in DataFusion, including `RepartitionExec`s. I am not sure if this is the same approach that Ballista uses.

I think, ideally, this new mode of execution diverges as little as possible from the existing mode of execution and we can use the exact same physical plans in both cases. For this to be true, once all stages are created by the `DFRayDataFrame`, we should arrange them into a DAG and execute them leaves up.

Assumptions, to make sure I understand what you're thinking here:
- we would execute each stage to completion before moving up the DAG
- we would change (new mode) the `ProcessorService` so that, instead of making the stage partitions available for streaming via Arrow Flight, it is asked to execute the stage and materialize the results fully
- these results would be in Arrow IPC format
- the `DFRayStageReaderExec` would be changed (new mode) so that, instead of streaming input from the remote stage via Arrow Flight, it streams input from the previously materialized results

Since this is on Ray, we have some choices:
- Use the Ray object store for intermediate results:
  - Pros:
    - results are in memory only
    - easy to do from within the Python `DFRayProcessor` code
    - conforms to Ray cluster resource allocations and likely won't surprise cluster operators
  - Cons:
    - finite size compared to object storage
  - Notable:
    - we would have `DFRayStageReaderExec` call back into Python to get the results from the Ray object store
- Use the object store interface only:
  - Pros:
    - larger available size
    - consumes fewer cluster resources
  - Cons:
    - likely slower, perhaps costlier

Since I used the Ray object store in a previous iteration of DFRay that didn't make it into this repo, I have experience with the first option and would want to try that first. The second option could be a future user-facing choice.

Priority-wise, I'd like to release `0.1.0` first and then add this functionality.

WDYT?
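To make the "arrange stages into a DAG and execute them leaves up" idea concrete, here is a rough sketch of the control flow only. Everything in it is an assumption for illustration: the `stage_inputs` mapping, the `run_stage` stand-in, and the plain dict used in place of an intermediate-result store are hypothetical and are not DFRay's actual data model or API. A real version would execute each stage's physical plan and keep Arrow IPC bytes in the Ray object store rather than strings in a dict.

```python
from graphlib import TopologicalSorter

# Hypothetical stage DAG: stage id -> ids of the child stages it reads from.
# The ids and shape are illustrative only.
stage_inputs = {
    0: [],      # leaf stage (e.g. a scan)
    1: [],      # leaf stage (e.g. a scan)
    2: [0, 1],  # reads the materialized output of stages 0 and 1
    3: [2],     # final stage, reads stage 2
}


def run_stage(stage_id, inputs):
    """Stand-in for running one stage to completion.

    A real implementation would execute the stage's plan and return fully
    materialized results; here we return a label that records which
    inputs were consumed.
    """
    return f"stage{stage_id}({','.join(inputs)})"


def execute_leaves_up(stage_inputs):
    # Plain dict standing in for the intermediate-result store.
    store = {}
    order = []
    # static_order() yields each stage only after all of its inputs.
    for stage_id in TopologicalSorter(stage_inputs).static_order():
        inputs = [store[child] for child in stage_inputs[stage_id]]
        # The stage finishes before anything above it in the DAG starts.
        store[stage_id] = run_stage(stage_id, inputs)
        order.append(stage_id)
    return order, store


order, store = execute_leaves_up(stage_inputs)
print(order)
print(store[3])
```

Because a stage only starts once every input is already in the store, the reader side never needs a live connection to a running upstream stage. That is the main difference from the current streaming mode.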