robtandy opened a new pull request, #62: URL: https://github.com/apache/datafusion-ray/pull/62
Significant time is spent allocating `StageService`s as the Python actor `RayStage` and waiting for them to bind to a listening port.

This changes the semantics of `RayContext` and `RayDataFrame`: the `RayQuerySupervisor` is now created when the `RayContext` is created, and a pool of `RayStage`s is preallocated. When a `RayDataFrame` is created by the context's `sql()` method, the stages are calculated and the required number of `RayStage` actors is requested from the pool. When the query finishes, these actors are returned to the pool instead of being torn down.

The pool size is parameterized by minimum and maximum size values. The pool preallocates at the minimum size and can grow up to the maximum size. Requesting workers beyond the maximum size raises an exception. The pool is released and Ray resources are torn down when the `RayContext` goes out of scope.

This change makes a significant difference on TPCH benchmarks. Tested on SF100, it improved the result by 25% on a machine with very fast disk, where the overhead of creating and tearing down Ray resources was a large chunk of execution time.

This PR does not handle shrinking the pool back to its minimum size; the pool only grows. Let's handle that in a subsequent change.

The `tpcbench.py` benchmark script and the `tpc.py` script accept `--worker-pool-min`.

As `RayStage` actors are now longer lived, they were updated to accept updated `ExecutionPlan`s to serve. This makes debugging issues with `RayStage`s a little more difficult, as it no longer makes sense to name them after the stage they are hosting, because that can change. As such, they now receive friendly, human-readable unique names, which make reading debug and info output much easier.
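For readers who want a concrete picture of the pool semantics described above (preallocate at the minimum, grow on demand up to the maximum, raise beyond it, return workers instead of destroying them), here is a minimal sketch in plain Python. It is not the code in this PR: `WorkerPool`, `PoolExhaustedError`, `acquire`, and `release` are hypothetical names, and the `factory` callable stands in for whatever creates a `RayStage` actor.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")


class PoolExhaustedError(RuntimeError):
    """Raised when a request would push the pool past its maximum size."""


class WorkerPool(Generic[T]):
    """Hypothetical grow-only pool; illustrative, not the PR's implementation."""

    def __init__(self, factory: Callable[[], T], min_size: int, max_size: int):
        if not 0 <= min_size <= max_size:
            raise ValueError("require 0 <= min_size <= max_size")
        self._factory = factory
        self._max_size = max_size
        # Preallocate the minimum number of workers up front.
        self._idle: List[T] = [factory() for _ in range(min_size)]
        self._total = min_size  # idle workers plus checked-out workers

    @property
    def size(self) -> int:
        return self._total

    def acquire(self, n: int) -> List[T]:
        """Hand out n workers, creating new ones only if the idle set is short."""
        in_use = self._total - len(self._idle)
        if in_use + n > self._max_size:
            raise PoolExhaustedError(
                f"requested {n} workers with {in_use} in use; max is {self._max_size}"
            )
        # Grow by the shortfall only; the pool never shrinks in this sketch.
        for _ in range(max(0, n - len(self._idle))):
            self._idle.append(self._factory())
            self._total += 1
        taken, self._idle = self._idle[:n], self._idle[n:]
        return taken

    def release(self, workers: List[T]) -> None:
        """Return workers to the pool for reuse instead of tearing them down."""
        self._idle.extend(workers)
```

A query needing three stages would call `acquire(3)` and later `release(...)` with the same workers; a following query then reuses them without paying the allocation cost again.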