korowa commented on PR #10095:
URL: https://github.com/apache/datafusion/pull/10095#issuecomment-2073078432

   > Maybe we can insert RepartitionExec on top UnionExecs if their output 
partition number > config.target_partitions. By this way, we can guarantee this 
violation wouldn't propagate to other operators.
   
   Probably better solution would be planning union inputs execution according 
to total available partitions -- e.g
   ```sql
       select l_linenumber as f
       from lineitem
       union all
       select l_orderkey as f
       from lineitem
   ```
   with target_partitions = 4, could plan 2 threads for each ParquetExec 
(ideally we could also use byte/row statistics and plan according to them -- 
not only 2-2, but probably 1-3 if there is significant data skew across 
inputs/files).
   
   And on top of it, when target_partitions is less then number of UNION inputs 
(e.g. UNION has 10 inputs, target_partitions = 4, and we need at least 1 thread 
for each input) there could be RepartitionExec.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to