Dandandan commented on PR #13707: URL: https://github.com/apache/datafusion/pull/13707#issuecomment-2535795740
> The performance improvement is impressive 🚀 > > I feel this change benefits some workload, but might harm locality in others: for example input is almost sorted with few inversions, if it's simply split evenly, locality can be exploited (when doing aggregation each partition can have lower cardinality), but round-robin shuffling would do the opposite > > However it's fine to proceed since we can add a configuration for it in the future Yes I agree it might hurt performance in those cases (lower cardinality / data correlation). Even without it, we might hurt performance a bit by repartitioning (memory / cpu locality), but couldn't find yet a benchmark showing this. Round robin can be disabled (e.g. before executing a query) already if someone would like to tweak it, and we can avoid it if sorting/partitioning is known. I think it might be interesting as well to look in improving round-robin repartitionexec in the future: if all partitions still are yielding batches, we can consider keeping them to the current partition, changing to round-robin (or similar) only when one of the partitions is exhausted, so we can benefit from both better execution as benefitting from lower cardinality in case of correlated data. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
