Re: [PR] Always add round robin repartitioning to leaves (data sources), benefitting unbalanced / small datasets [datafusion]

via GitHub Wed, 11 Dec 2024 03:59:07 -0800


Dandandan commented on PR #13707:
URL: https://github.com/apache/datafusion/pull/13707#issuecomment-2535795740


   > The performance improvement is impressive 🚀
   > 
   > I feel this change benefits some workload, but might harm locality in 
others: for example input is almost sorted with few inversions, if it's simply 
split evenly, locality can be exploited (when doing aggregation each partition 
can have lower cardinality), but round-robin shuffling would do the opposite
   > 
   > However it's fine to proceed since we can add a configuration for it in 
the future
   
   Yes I agree it might hurt performance in those cases (lower cardinality / 
data correlation). Even without it, we might hurt performance a bit by 
repartitioning (memory / cpu locality), but couldn't find yet a benchmark 
showing this. Round robin can be disabled (e.g. before executing a query) 
already if someone would like to tweak it, and we can avoid it if 
sorting/partitioning is known.
   
   I think it might be interesting as well to look in improving round-robin 
repartitionexec in the future: if all partitions still are yielding batches, we 
can consider keeping them to the current partition, changing to round-robin (or 
similar) only when one of the partitions is exhausted, so we can benefit from 
both better execution as benefitting from lower cardinality in case of 
correlated data.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Always add round robin repartitioning to leaves (data sources), benefitting unbalanced / small datasets [datafusion]

Reply via email to