Thanks for driving this forward. The links didn't come through in my email
client, so I'm adding them below in case it helps others:

Issue: https://github.com/apache/arrow/issues/43495
PR: https://github.com/apache/arrow/pull/43389

On Thu, Aug 1, 2024 at 4:06 AM Ruoxi Sun <zanmato1...@gmail.com> wrote:

> Hello everyone,
>
> We've identified an issue with Acero's hash join/aggregation: it is
> currently limited to processing at most 4GB of data because it uses
> `uint32_t` for row offsets. This limitation not only prevents us from
> handling large datasets, it also makes typical workarounds such as
> splitting the data into smaller batches ineffective, since the rows are
> accumulated into a single internal row table regardless of batch size.
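>
> To make the limitation concrete, here is a minimal sketch (not the actual
> Acero code; the names and row width are made up) of how a 32-bit offset
> silently wraps once the accumulated row data passes 4GB:
>
>   #include <cstdint>
>   #include <iostream>
>
>   int main() {
>     uint32_t offset = 0;                 // 32-bit row offset, as today
>     const uint32_t row_width = 64;       // hypothetical bytes per row
>     const uint64_t num_rows = 70000000;  // 70M rows * 64B = ~4.48GB
>     for (uint64_t i = 0; i < num_rows; ++i) {
>       offset += row_width;               // wraps around past UINT32_MAX
>     }
>     // 4'480'000'000 mod 2^32 = 185'032'704: the final offset no longer
>     // points at the row it is supposed to address.
>     std::cout << offset << "\n";
>   }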
>
> * Proposed solution
> We are considering widening the row offsets from 32-bit to 64-bit. This
> change would allow us to process datasets larger than 4GB and expand the
> range of workloads Arrow can handle.
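>
> In code terms, the change amounts to widening the offset type (a
> simplified sketch, not the actual Acero declarations):
>
>   // before: offsets into the row-encoded buffer, capped at 4GB
>   std::vector<uint32_t> offsets_;
>   // after: can address buffers larger than 4GB, at +4 bytes per row
>   std::vector<uint64_t> offsets_;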
>
> * Trade-offs to consider
> ** Pros: Allows handling of datasets larger than the current 4GB limit.
> ** Cons: Each row consumes an additional 4 bytes of memory for its
> offset, and processing may involve slightly more CPU instructions.
>
> Preliminary benchmarks indicate that the impact on CPU performance is
> minimal, so the main consideration is the increased memory consumption.
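> For a rough sense of scale (assuming one offset per row): a build side of
> 100 million rows would spend 100'000'000 * 8B = 800MB on 64-bit offsets
> instead of 400MB on 32-bit ones, i.e. roughly 400MB of extra memory.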
>
> * We need your feedback
> ** How would this change affect your current usage of Arrow, especially in
> terms of memory consumption?
> ** Do you have any concerns or thoughts about this proposal?
>
> Please review the detailed information in [1] and [2] and share your
> feedback. Your input is crucial as we gather community insights to decide
> whether or not to proceed with this change.
>
> Looking forward to your feedback and working together to enhance Arrow.
> Thank you!
>
> Regards,
> Rossi SUN
>
