ctsk opened a new pull request, #15981:
URL: https://github.com/apache/datafusion/pull/15981

   ## Which issue does this PR close?
   
   Helps https://github.com/apache/datafusion/issues/6822 a bit.
   
   ## Rationale for this change
   
   Before this PR, hash partitioning worked roughly like this:
   
   ```
   for each partition {
     for each column {
        take(column, indices of partition)
     }
   }
   ```
   
   This PR changes it to
   
   ```
   for each column {
     for each partition {
        take(column, indices of partition)
     }
   }
   ```
   
   The reasoning is that this might play nicer with the CPU's cache. Especially 
when the number of columns is large, the old approach would need to load each 
column into cache `number_of_partition` times, once for every partition.
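
To make the new loop order concrete, here is a minimal sketch in plain Rust. It is not the code from this PR: columns are modelled as `Vec<i64>` rather than Arrow arrays, and `take` and `repartition_column_major` are hypothetical helpers that only stand in for the real `take` kernel and the repartitioning logic.

```rust
/// Hypothetical stand-in for a columnar `take`: gathers `column[i]`
/// for every `i` in `indices`.
fn take(column: &[i64], indices: &[u32]) -> Vec<i64> {
    indices.iter().map(|&i| column[i as usize]).collect()
}

/// Column-outer, partition-inner repartitioning.
/// Returns `out[partition][column]`.
fn repartition_column_major(
    columns: &[Vec<i64>],
    partition_indices: &[Vec<u32>],
) -> Vec<Vec<Vec<i64>>> {
    // One (initially empty) list of output columns per partition.
    let mut out: Vec<Vec<Vec<i64>>> = partition_indices
        .iter()
        .map(|_| Vec::with_capacity(columns.len()))
        .collect();

    // Each source column is visited once; all partitions read from it
    // before the loop moves on to the next column.
    for column in columns {
        for (p, indices) in partition_indices.iter().enumerate() {
            out[p].push(take(column, indices));
        }
    }
    out
}
```

The result is the same as with the partition-outer loop; only the order in which the source columns are touched differs.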
   
   ## What changes are included in this PR?
   
   Layered on the change above are a bunch of micro-optimizations:
   
   - Pack the indices for all partitions into a single vector (hopefully 
giving nice prefetching behaviour)
   - Reuse the allocation for the indices
   - Avoid modulo if the number of partitions is a power of 2 (quite common).
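
The three ideas above could be combined roughly as follows. This is a sketch under assumptions, not the PR's implementation: `PartitionScratch`, `partition_of`, `fill_partition_indices` and `partition_slice` are hypothetical names, and the two-pass counting-sort layout is just one way to pack the indices of all partitions into a single vector.

```rust
/// Hypothetical scratch space, kept across batches so the vectors
/// below are reused instead of reallocated.
#[derive(Default)]
struct PartitionScratch {
    /// Row indices of all partitions, packed back to back.
    indices: Vec<u32>,
    /// `offsets[p]..offsets[p + 1]` is the range of `indices` for partition `p`.
    offsets: Vec<usize>,
    /// Next write position per partition while scattering rows.
    cursor: Vec<usize>,
}

/// Maps a hash to a partition; uses a bit mask instead of `%`
/// when `n` is a power of two.
fn partition_of(hash: u64, n: usize) -> usize {
    if n.is_power_of_two() {
        (hash & (n as u64 - 1)) as usize
    } else {
        (hash % n as u64) as usize
    }
}

/// Fills `scratch` with the packed row indices for `n` partitions.
fn fill_partition_indices(hashes: &[u64], n: usize, scratch: &mut PartitionScratch) {
    assert!(n > 0, "need at least one partition");

    // Pass 1: count the rows that fall into each partition.
    scratch.offsets.clear();
    scratch.offsets.resize(n + 1, 0);
    for &h in hashes {
        scratch.offsets[partition_of(h, n) + 1] += 1;
    }
    // Prefix sum turns the counts into start offsets.
    for p in 0..n {
        scratch.offsets[p + 1] += scratch.offsets[p];
    }

    // Pass 2: scatter each row index into its partition's range.
    scratch.cursor.clear();
    scratch.cursor.extend_from_slice(&scratch.offsets[..n]);
    scratch.indices.clear();
    scratch.indices.resize(hashes.len(), 0);
    for (row, &h) in hashes.iter().enumerate() {
        let p = partition_of(h, n);
        scratch.indices[scratch.cursor[p]] = row as u32;
        scratch.cursor[p] += 1;
    }
}

/// Returns the row indices that belong to partition `p`.
fn partition_slice(scratch: &PartitionScratch, p: usize) -> &[u32] {
    &scratch.indices[scratch.offsets[p]..scratch.offsets[p + 1]]
}
```

Calling `fill_partition_indices` again with the same `PartitionScratch` overwrites the previous contents while keeping the existing capacity, which is the allocation reuse mentioned in the list.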
   
   Mostly throwing out ideas and seeing what sticks.
   
   ## Are these changes tested?
   
   Covered by existing tests - I hope.
   
   ## Are there any user-facing changes?
   
   No.
   
   cc: @Dandandan @goldmedal 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

