Dandandan commented on issue #15382: URL: https://github.com/apache/datafusion/issues/15382#issuecomment-2750596093
Yes, I agree with @alamb; I don't consider this super easy. However, feel free to try and implement some steps, and let us know when you need help, @zebsme.

It probably makes sense to split this issue into some subtasks. These are the ones I can think of, and I think individually they might be doable for an "advanced beginner":

* Add a mode to `RepartitionExec` that outputs selection vectors (for now let's use dense boolean arrays so they can be added to `RecordBatch`). The array outputs `true` for each row that has `hash % partition == 0` (and `false` if not).
* Support using this selection vector in hash join (only matching indices in the selection vector).
* Support planning with the repartition mode (probably behind an option), and run some benchmarks to decide on the default.

We probably also want to think about some general way of representing / using selection vectors in `RecordBatch` (or as a wrapped `RecordBatch`), but this requires some more design / discussion.
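To make the first two subtasks concrete, here is a minimal, std-only sketch of the idea. It is not DataFusion code: `hash_key`, `selection_mask`, and `selected_indices` are hypothetical names, a plain `Vec<bool>` stands in for an Arrow boolean array, and the standard library hasher stands in for whatever hashing `RepartitionExec` actually uses. It also reads the condition `hash % partition == 0` as "the row's hash modulo the number of partitions equals this partition's index", which is an interpretation rather than something the comment states.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a single key value. `DefaultHasher::new()` uses fixed keys, so the
/// same key always yields the same hash within one build of the program.
fn hash_key<K: Hash>(key: &K) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

/// Producer side (hypothetical): instead of copying rows into per-partition
/// batches, emit a dense boolean mask that marks the rows belonging to
/// `partition_index`. Assumes `num_partitions > 0`.
fn selection_mask<K: Hash>(keys: &[K], num_partitions: u64, partition_index: u64) -> Vec<bool> {
    keys.iter()
        .map(|key| hash_key(key) % num_partitions == partition_index)
        .collect()
}

/// Consumer side (hypothetical): a hash join would only look at the row
/// indices whose mask entry is `true` and skip everything else.
fn selected_indices(mask: &[bool]) -> Vec<usize> {
    mask.iter()
        .enumerate()
        .filter_map(|(row, &keep)| if keep { Some(row) } else { None })
        .collect()
}
```

Calling `selection_mask(&keys, 4, p)` for each `p` in `0..4` gives four masks over the same unmodified batch, and every row is `true` in exactly one of them. That is the property that would let the join side share the input batch across partitions instead of receiving physically repartitioned copies.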