Re: [I] Support zero copy hash repartitioning for Hash Join [datafusion]

via GitHub Tue, 25 Mar 2025 02:17:31 -0700


Dandandan commented on issue #15382:
URL: https://github.com/apache/datafusion/issues/15382#issuecomment-2750596093


   Yes, I agree with @alamb: I don't consider this super easy. However, feel 
free to try and implement some steps, and let us know when you need help, 
@zebsme.
   
   It probably makes sense to split this issue into some subtasks. These are 
the ones I can think of, and I think individually they might be doable for an 
"advanced beginner":
   
   * Add a mode that outputs selection vectors (for now let's use dense boolean 
arrays so it can be added to `RecordBatch`) in `RepartitionExec`. The array 
outputs `true` for each row that has `hash % partition == 0` (and false if not).
   * Support using this selection vector in hash join (only match the rows 
whose entry in the selection vector is `true`)
   * Support planning with the repartition mode (probably behind an option), 
run some benchmarks to decide on the default.
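
   To make the first two steps concrete, here is a minimal std-only sketch, not 
DataFusion code. It reads the condition above as 
`hash % num_partitions == partition_index` for each output partition, which is 
my interpretation. `hash_key`, `selection_vector` and `selected_indices` are 
hypothetical names, `Vec<bool>` stands in for an Arrow boolean array, and 
`DefaultHasher` stands in for whatever hash function the real operator uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a single join key.
/// Assumption: `DefaultHasher` is only a stand-in for the engine's hash function.
fn hash_key<K: Hash>(key: &K) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

/// Build a dense boolean selection vector for one output partition.
/// Entry `i` is `true` when row `i` hashes to `partition`; no row data is copied.
fn selection_vector<K: Hash>(keys: &[K], num_partitions: u64, partition: u64) -> Vec<bool> {
    assert!(num_partitions > 0 && partition < num_partitions);
    keys.iter()
        .map(|key| hash_key(key) % num_partitions == partition)
        .collect()
}

/// Row indices a consumer (for example a hash join build or probe side)
/// would look at: only the rows whose selection entry is `true`.
fn selected_indices(selection: &[bool]) -> Vec<usize> {
    selection
        .iter()
        .enumerate()
        .filter_map(|(row, &keep)| if keep { Some(row) } else { None })
        .collect()
}
```

   Each output partition would get the same unmodified batch plus its own 
selection vector, so every row is selected by exactly one partition and the 
join only has to skip the unselected rows.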
   
   We probably also want to think about a general way of representing and 
using selection vectors in `RecordBatch` (or in a wrapped `RecordBatch`), but 
this requires some more design and discussion.

