robtandy opened a new pull request, #62:
URL: https://github.com/apache/datafusion-ray/pull/62

   Significant time is spent allocating `StageService`s (as the Python actor 
`RayStage`) and waiting for them to bind to a listening port.  
   
   This changes the semantics of `RayContext` and `RayDataFrame` such that the 
`RayQuerySupervisor` is now created when the `RayContext` is created, and a pool 
of `RayStage`s is preallocated.   When a `RayDataFrame` is created by the 
context's `sql()` method, the stages are calculated and the required number of 
`RayStage` actors is requested from the pool.   When the query finishes, these 
actors are returned to the pool instead of being torn down.  
   
   The pool size is parameterized by min size and max size values.   The pool 
will preallocate at the minimum size and can grow up to the maximum size.  
Requesting workers beyond the maximum size will raise an exception.   The pool 
is released and Ray resources are torn down when the `RayContext` goes out of 
scope. 
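
   As a rough illustration of the min/max behaviour described above, here is a 
minimal sketch in plain Python. It is not the code in this PR: `WorkerPool`, 
`acquire`, `release`, and the `factory` callable are hypothetical names, and 
ordinary objects stand in for `RayStage` actors.

```python
from collections import deque


class WorkerPool:
    """Toy pool: preallocates min_size workers and grows up to max_size."""

    def __init__(self, min_size, max_size, factory):
        if not 0 <= min_size <= max_size:
            raise ValueError("require 0 <= min_size <= max_size")
        self.max_size = max_size
        self._factory = factory  # hypothetical callable that builds one worker
        # Preallocate the minimum number of workers up front.
        self._idle = deque(factory() for _ in range(min_size))
        self._total = min_size  # idle workers plus checked-out workers

    def acquire(self, n):
        """Hand out n workers, creating new ones only when the idle set is short."""
        in_use = self._total - len(self._idle)
        if in_use + n > self.max_size:
            raise RuntimeError(
                f"requested {n} workers, but only "
                f"{self.max_size - in_use} more fit under max_size"
            )
        # Grow the pool just enough to satisfy this request.
        while len(self._idle) < n:
            self._idle.append(self._factory())
            self._total += 1
        return [self._idle.popleft() for _ in range(n)]

    def release(self, workers):
        """Return workers to the pool instead of tearing them down."""
        self._idle.extend(workers)
```

   In this sketch a query would call `acquire()` with its stage count and 
`release()` when it finishes, so later queries reuse workers that already 
exist. Shrinking back to the minimum size is deliberately left out.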
   
   This change makes a significant difference on TPC-H benchmarks.   Tested on 
SF100, it improved the result by 25% on a machine with a very fast disk, where 
the overhead of creating and tearing down Ray resources was a large share of 
execution time.
   
   This PR only handles growing the pool; it does not shrink the pool back to 
the minimum size.   Let's handle that in a subsequent change.
   
   The `tpcbench.py` benchmark script and the `tpc.py` script accept 
`--worker-pool-min`
   
   As `RayStage` actors are now longer lived, they were updated to accept 
updated `ExecutionPlan`s to serve.   This makes debugging issues with 
`RayStage`s a little more difficult, as it no longer makes sense to name them 
after the stage they are hosting, because that can change.   As such, they now 
receive friendly, human-readable, unique names, which make debug and info 
output much easier to read.
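
   One possible way to produce such names is sketched below. This is only an 
illustration, not the naming scheme used in this PR: the `friendly_names` 
generator and both word lists are made up for the example.

```python
import itertools
import random

# Hypothetical word lists; any short, readable words would do.
ADJECTIVES = ["brave", "calm", "eager", "jolly"]
NOUNS = ["otter", "heron", "lynx", "finch"]


def friendly_names(seed=None):
    """Yield unique names like 'calm-lynx'; add a numeric suffix once exhausted."""
    rng = random.Random(seed)
    combos = [f"{a}-{n}" for a, n in itertools.product(ADJECTIVES, NOUNS)]
    rng.shuffle(combos)
    # The first pass yields bare names; later passes append the pass number,
    # so a name is never repeated.
    for pass_no in itertools.count(1):
        for name in combos:
            yield name if pass_no == 1 else f"{name}-{pass_no}"
```

   Each long-lived actor would take one name from the generator at creation 
and keep it for its lifetime, regardless of which stage it currently serves.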
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

