korowa commented on PR #10095:
URL: https://github.com/apache/datafusion/pull/10095#issuecomment-2073078432
> Maybe we can insert RepartitionExec on top of UnionExecs if their output
partition number > config.target_partitions. This way, we can guarantee this
violation wouldn't propagate to other operators.
A better solution would probably be to plan the execution of the union inputs
according to the total number of available partitions, for example:
```sql
select l_linenumber as f
from lineitem
union all
select l_orderkey as f
from lineitem
```
With target_partitions = 4, this could be planned with 2 threads for each
ParquetExec (ideally we could also use byte/row statistics and plan according
to them: not only 2-2, but perhaps 1-3 if there is significant data skew
across inputs/files).
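
To make the 2-2 vs 1-3 split concrete, here is a minimal sketch of one way it
could be computed. It is not existing DataFusion code: `split_partitions` is a
hypothetical helper, the row counts stand in for whatever byte/row statistics
are available, and the largest-remainder rounding is just one possible choice.

```rust
/// Hypothetical helper (not a DataFusion API): decide how many partitions
/// each UNION input gets, given estimated row counts per input.
fn split_partitions(row_counts: &[u64], target_partitions: usize) -> Vec<usize> {
    let n = row_counts.len();
    // Every input needs at least one partition to make progress.
    let mut plan = vec![1usize; n];
    // Partitions left to hand out once each input has its minimum of one.
    let extra = target_partitions.saturating_sub(n);
    if n == 0 || extra == 0 {
        return plan;
    }
    // Without usable statistics (all zero), fall back to equal weights.
    let weights: Vec<u128> = if row_counts.iter().all(|&r| r == 0) {
        vec![1; n]
    } else {
        row_counts.iter().map(|&r| r as u128).collect()
    };
    let total: u128 = weights.iter().sum();
    let mut remainders: Vec<(u128, usize)> = Vec::with_capacity(n);
    let mut assigned = 0usize;
    for (i, &w) in weights.iter().enumerate() {
        // Proportional share of the extra partitions, rounded down.
        let quota = extra as u128 * w;
        let whole = (quota / total) as usize;
        plan[i] += whole;
        assigned += whole;
        remainders.push((quota % total, i));
    }
    // Hand out what rounding left over: largest remainder first,
    // ties going to the earlier input.
    remainders.sort_by(|a, b| b.0.cmp(&a.0).then(a.1.cmp(&b.1)));
    for &(_, i) in remainders.iter().take(extra - assigned) {
        plan[i] += 1;
    }
    plan
}
```

Under these assumptions, two equally sized inputs with target_partitions = 4
get 2 partitions each, while inputs of 100 and 900 rows get 1 and 3.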
On top of that, when target_partitions is less than the number of UNION inputs
(e.g. UNION has 10 inputs, target_partitions = 4, and we need at least 1 thread
for each input), a RepartitionExec could still be inserted.
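
The fallback check can be sketched as follows. This is illustrative only:
`UnionPlan` and `plan_union` are hypothetical names, and the sketch assumes the
RepartitionExec would sit on top of the UnionExec, as in the quoted suggestion.
It relies on the fact that a UnionExec exposes the sum of its inputs' partitions.

```rust
/// Hypothetical summary of a planned UNION (not a DataFusion type).
#[derive(Debug, PartialEq)]
struct UnionPlan {
    /// Partitions the UnionExec would expose: the sum over its inputs.
    union_output_partitions: usize,
    /// `Some(n)` if a RepartitionExec down to `n` partitions is needed on top.
    repartition_to: Option<usize>,
}

/// Hypothetical helper: given the partition count already planned for each
/// input, decide whether the union output must be repartitioned.
fn plan_union(partitions_per_input: &[usize], target_partitions: usize) -> UnionPlan {
    let union_output_partitions: usize = partitions_per_input.iter().sum();
    // Repartition only when the union would exceed the configured target.
    let repartition_to = if union_output_partitions > target_partitions {
        Some(target_partitions)
    } else {
        None
    };
    UnionPlan {
        union_output_partitions,
        repartition_to,
    }
}
```

With 10 inputs at 1 partition each and target_partitions = 4, the union exposes
10 partitions and the sketch asks for a repartition to 4; with a 2-2 split it
asks for none.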
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]