Re: [I] Add Spark-like scheduling mode with shuffle files between stages [datafusion-ray]

via GitHub Sat, 01 Mar 2025 17:16:41 -0800


andygrove commented on issue #69:
URL: https://github.com/apache/datafusion-ray/issues/69#issuecomment-2692500974


   > I think, ideally, this new mode of execution diverges as little as 
possible from the existing mode of execution and we can use the exact same 
physical plans in both cases.
   
   Yes.
   
   > For this to be true, once all stages are created by the `DFRayDataFrame`, 
we should arrange them into a DAG and execute them leaves up.
   
   Yes.
   
   > Assumptions, to make sure I understand what you're thinking here:
   > 
   > * we would execute each stage to completion before moving up the DAG
   
   Yes, allow we can execute all leaf plans in parallel, if resource is 
available.
   
   > * we would change (new mode) the `ProcessorService` to, instead of making 
the stage partitions available for streaming via ArrowFlight, it would be asked 
to execute and materialize the results fully.
   
   Yes.
   
   > * these results would be in Arrow IPC format
   
   Yes.
   *>  
   > * The `DFRayStageReaderExec` would be changed (new mode) to, instead of 
streaming input from the remote stage via Arrow Flight, stream input from the 
previous materialized values
   
   Yes.
   
   > Being that this is on Ray we have some choices:
   > 
   > * Use the Ray object store for intermediate results:
   
   I was thinking that each stage would just stream results directly to disk 
shuffle files, but we can try some different approaches.
   
   > Priority wise, I'd like to release `0.1.0` first and then add this 
functionality. WDYT
   
   Absolutely. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Add Spark-like scheduling mode with shuffle files between stages [datafusion-ray]

Reply via email to