andygrove commented on code in PR #53:
URL: https://github.com/apache/datafusion-ray/pull/53#discussion_r1885291564
##########
src/query_stage.rs:
##########
@@ -99,10 +99,14 @@ impl QueryStage {
/// Get the input partition count. This is the same as the number of concurrent tasks
/// when we schedule this query stage for execution
pub fn get_input_partition_count(&self) -> usize {
Review Comment:
in `context.py` we have this logic:
```
# if the query stage has a single output partition then we need to execute for the output
# partition, otherwise we need to execute in parallel for each input partition
concurrency = stage.get_input_partition_count()
output_partitions_count = stage.get_output_partition_count()
```
This is based on the assumption that the query stage is a shuffle write, which perhaps was always true when running TPC-H, so the existing code worked.

With the new simple `SELECT * FROM table` test that you added, we had a query stage whose plan was a `CsvExec` with no children, so we had to handle this as a special case. There is no input partitioning in this case; we use the output partitioning because DataFusion will have already decided that based on the files that are available.

This code is all confusing and I would like to make it less so.
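For what it's worth, here is a rough sketch of how that decision could be made explicit on the Python side. The `choose_concurrency` helper and the zero-input check are just illustrations of the idea, not existing code; only `get_input_partition_count` and `get_output_partition_count` come from the snippet above:
```python
def choose_concurrency(stage) -> int:
    """Illustrative sketch: pick the number of concurrent tasks for a query stage."""
    input_partitions = stage.get_input_partition_count()
    output_partitions = stage.get_output_partition_count()
    if input_partitions == 0:
        # Leaf stage such as a bare CsvExec: there is no input partitioning,
        # so fall back to the output partitioning that DataFusion already
        # derived from the available files.
        return output_partitions
    # Shuffle-write stage: run one task per input partition.
    return input_partitions
```
Spelling the fallback out like this (whether here or inside `get_input_partition_count` itself) would at least make the shuffle-write assumption visible at the call site.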