ctsk opened a new pull request, #15981: URL: https://github.com/apache/datafusion/pull/15981
## Which issue does this PR close?

Helps https://github.com/apache/datafusion/issues/6822 a bit.

## Rationale for this change

Before this PR, hash partitioning worked roughly like this:

```
for each partition {
    for each column {
        take(column, indices of partition)
    }
}
```

This PR changes it to

```
for each column {
    for each partition {
        take(column, indices of partition)
    }
}
```

The reasoning is that this might play nicer with the CPU's cache. In the old order, every other column is processed between two visits to the same column, so, especially when the number of columns is large, each column may need to be loaded into cache `number_of_partition` times. In the new order, all partitions of a column are taken back to back while the column is still likely to be cached.

## What changes are included in this PR?

Layered on the change above are a bunch of micro-optimizations:

- Pack the indices for each partition into a single vector (hopefully nice prefetching behaviour)
- Reuse the allocation for the indices
- Avoid modulo if the number of partitions is a power of 2 (quite common).

Mostly throwing out ideas and seeing what sticks.

## Are these changes tested?

Covered by existing tests - I hope.

## Are there any user-facing changes?

No.

cc: @Dandandan @goldmedal
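
For illustration, the loop order and the three micro-optimizations described above can be sketched in plain Rust. This is only a rough sketch and not the code in this PR: it uses `Vec<i64>` columns instead of Arrow arrays, takes precomputed row hashes as input, and the names `partition_of`, `pack_indices` and `repartition` are hypothetical. It assumes `num_partitions >= 1`.

```rust
/// Map a row hash to a partition (hypothetical helper, assumes num_partitions >= 1).
fn partition_of(hash: u64, num_partitions: usize) -> usize {
    if num_partitions.is_power_of_two() {
        // For a power of two n, `hash % n` equals `hash & (n - 1)`.
        (hash & (num_partitions as u64 - 1)) as usize
    } else {
        (hash % num_partitions as u64) as usize
    }
}

/// Pack the row indices of all partitions into one vector, grouped by partition.
/// Afterwards partition `p` owns `indices[offsets[p]..offsets[p + 1]]`.
/// `indices` and `offsets` are passed in so their allocations can be reused per batch.
fn pack_indices(
    hashes: &[u64],
    num_partitions: usize,
    indices: &mut Vec<u32>,
    offsets: &mut Vec<usize>,
) {
    // Count the rows per partition (shifted by one slot).
    offsets.clear();
    offsets.resize(num_partitions + 1, 0);
    for &h in hashes {
        offsets[partition_of(h, num_partitions) + 1] += 1;
    }
    // Prefix sum turns the counts into start offsets.
    for p in 0..num_partitions {
        offsets[p + 1] += offsets[p];
    }
    // Scatter each row index into its partition's slice.
    indices.clear();
    indices.resize(hashes.len(), 0);
    let mut cursor = offsets[..num_partitions].to_vec();
    for (row, &h) in hashes.iter().enumerate() {
        let p = partition_of(h, num_partitions);
        indices[cursor[p]] = row as u32;
        cursor[p] += 1;
    }
}

/// Column-major order: finish all partitions of one column before moving on.
/// Returns `out[partition][column]`.
fn repartition(columns: &[Vec<i64>], indices: &[u32], offsets: &[usize]) -> Vec<Vec<Vec<i64>>> {
    let num_partitions = offsets.len() - 1;
    let mut out: Vec<Vec<Vec<i64>>> = vec![Vec::new(); num_partitions];
    for column in columns {
        for p in 0..num_partitions {
            // Stand-in for `take(column, indices of partition)`.
            let taken: Vec<i64> = indices[offsets[p]..offsets[p + 1]]
                .iter()
                .map(|&i| column[i as usize])
                .collect();
            out[p].push(taken);
        }
    }
    out
}
```

A caller would keep `indices` and `offsets` alive across batches, call `pack_indices` once per batch, and then call `repartition` on that batch's columns.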