I don't have any concrete data to test this against, but using 64-bit offsets sounds like an obvious improvement to me.

Regards

Antoine.


On 01/08/2024 at 13:05, Ruoxi Sun wrote:
Hello everyone,

We've identified an issue with Acero's hash join/aggregation: it is
currently limited to processing at most 4GB of data because row offsets are
stored as `uint32_t`. This limitation not only prevents us from handling
large datasets, but also makes the usual workaround of splitting the data
into smaller batches ineffective, since the joined/aggregated rows still
have to be accumulated into a single internal row table regardless of batch
size.
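
To make the limitation concrete, here is a minimal, purely hypothetical
sketch of an offset-based row table (this is illustrative only, not Acero's
actual row-table code); once the accumulated row data grows past UINT32_MAX
bytes, a 32-bit offset can no longer address the next row:

// Hypothetical illustration only, not Acero's actual implementation.
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

struct RowTable32 {
  std::vector<uint8_t> buffer;       // concatenated variable-length row data
  std::vector<uint32_t> offsets{0};  // start of each row within `buffer`

  void AppendRow(const uint8_t* data, size_t length) {
    size_t new_end = buffer.size() + length;
    // Once total row data exceeds UINT32_MAX (~4GB), the next offset
    // no longer fits in 32 bits; the only options are to fail or to wrap.
    if (new_end > std::numeric_limits<uint32_t>::max()) {
      throw std::length_error("row data exceeds 4GB with 32-bit offsets");
    }
    buffer.insert(buffer.end(), data, data + length);
    offsets.push_back(static_cast<uint32_t>(new_end));
  }
};

// Widening `offsets` to uint64_t removes the 4GB ceiling, at the cost of
// 8 bytes per row for offsets instead of 4.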

* Proposed solution
We are considering widening the row offsets from 32-bit to 64-bit. This
change would allow us to process larger datasets and broaden the range of
workloads Arrow can handle.

* Trade-offs to consider
** Pros: Allows handling of larger datasets, breaking the current 4GB limit.
** Cons: Each row would consume an additional 4 bytes of memory, and
processing might involve slightly more CPU instructions.

Preliminary benchmarks indicate that the impact on CPU performance is
minimal, so the main consideration is the increased memory consumption.
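
For a rough sense of scale (purely illustrative numbers, not measurements):
if a join accumulates 100 million rows, widening each offset from 4 to 8
bytes adds about 400 MB on top of the row data itself; at 1 billion rows
the overhead would be roughly 4 GB.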

* We need your feedback
** How would this change affect your current usage of Arrow, especially in
terms of memory consumption?
** Do you have any concerns or thoughts about this proposal?

Please review the detailed information in [1] and [2] and share your
feedback. Your input is crucial as we gather community insights to decide
whether or not to proceed with this change.

Looking forward to your feedback and working together to enhance Arrow.
Thank you!

Regards,
Rossi SUN
