[I] Support zero copy hash repartitioning inside Hash Aggregate [datafusion]

via GitHub Mon, 24 Mar 2025 03:22:41 -0700


Dandandan opened a new issue, #15383:
URL: https://github.com/apache/datafusion/issues/15383


   ### Is your feature request related to a problem or challenge?
   
   Currently, `RepartitionExec: partitioning=Hash` is added for aggregates in 
`FinalPartitioned` and `SinglePartitioned` mode.
    
   The benefit is increased parallelism, but it comes at the cost of copying 
the entire table (in a not very efficient way).
   
   We should consider lowering the cost of repartitioning by not having to copy 
the input.
   
   ### Describe the solution you'd like
   
   Instead of repartitioning the input in `RepartitionExec`, support 
repartitioning the inputs based on a selection vector.
   
   Instead of `taking` the `RecordBatch`, we can consider doing the following:
   
   * Add a (boolean) selection vector as output column for each output 
partition. I.e. `true` means the row is selected for the partition.
   * The rest of the `RecordBatch` remains unchanged (i.e. no copy).
   * `CoalesceBatchesExec` is no longer needed for the output (removing another 
copy).
   * The hash aggregate code handles the selection vector.
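
The steps above could be sketched roughly as follows. This is a minimal, standard-library-only illustration, not DataFusion code: plain slices stand in for the columns of a `RecordBatch`, a `Vec<bool>` stands in for an Arrow boolean array, and the names `selection_vectors` and `sum_selected` are hypothetical. The hash function used here is also an assumption and is not the one DataFusion uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: build one boolean mask per output partition.
/// `masks[p][i] == true` means row `i` is selected for partition `p`.
/// The key column itself is only read, never copied or reordered.
fn selection_vectors(keys: &[u64], num_partitions: usize) -> Vec<Vec<bool>> {
    let mut masks = vec![vec![false; keys.len()]; num_partitions];
    for (row, key) in keys.iter().enumerate() {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let partition = (hasher.finish() % num_partitions as u64) as usize;
        masks[partition][row] = true;
    }
    masks
}

/// Hypothetical aggregate step: SUM(value) GROUP BY key for one partition.
/// It reads the shared columns in place and skips rows the mask rejects.
fn sum_selected(keys: &[u64], values: &[i64], mask: &[bool]) -> HashMap<u64, i64> {
    let mut sums = HashMap::new();
    for ((key, value), selected) in keys.iter().zip(values).zip(mask) {
        if *selected {
            *sums.entry(*key).or_insert(0) += *value;
        }
    }
    sums
}

fn main() {
    let keys: Vec<u64> = vec![1, 2, 1, 3, 2, 1];
    let values: Vec<i64> = vec![10, 20, 30, 40, 50, 60];

    // Every partition sees the same, uncopied columns plus its own mask.
    let masks = selection_vectors(&keys, 2);
    for (p, mask) in masks.iter().enumerate() {
        println!("partition {p}: {:?}", sum_selected(&keys, &values, mask));
    }
}
```

Because all rows with the same key hash to the same partition, each group is aggregated by exactly one partition, and the per-partition results can simply be concatenated. The trade-off in this sketch is that every partition scans the full batch and its mask instead of a smaller, copied batch.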
   
   ### Describe alternatives you've considered
   
   The partitioning could be done inside the hash aggregate (at the cost of 
more complexity inside it).
    
   ### Additional context
   
   _No response_
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


