+1 as well. 32-bit keys were chosen because the expectation was that
hashtable spilling would come along soon. Since it hasn't, I think it's a
good idea to use 64-bit keys until spilling is added. A minimal sketch of
the failure mode is below.
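
To make the constraint concrete, here is a rough sketch (hypothetical types
and names, not Acero's actual code) of how a 32-bit row offset caps the
accumulated row data at 4GB, and what widening to 64 bits trades for it:

    // Hypothetical illustration only -- not Acero's actual row table.
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    // Rows are appended into one contiguous buffer; offsets[i] records
    // where row i starts.
    struct RowTable32 {
      std::vector<uint8_t> buffer;
      std::vector<uint32_t> offsets;  // 4 bytes per row

      void AppendRow(const uint8_t* data, size_t length) {
        uint64_t start = buffer.size();
        if (start + length > UINT32_MAX) {
          // The 4GB wall: the next row's starting offset is no longer
          // representable, no matter how small the incoming batches are.
          throw std::overflow_error("accumulated row data exceeds 4GB");
        }
        offsets.push_back(static_cast<uint32_t>(start));
        buffer.insert(buffer.end(), data, data + length);
      }
    };

    // The proposed change: uint64_t offsets. The overflow check goes
    // away, at the cost of 4 extra bytes per row (e.g. ~400MB for 100
    // million rows), matching the trade-off described in the quoted
    // proposal below.
    struct RowTable64 {
      std::vector<uint8_t> buffer;
      std::vector<uint64_t> offsets;  // 8 bytes per row
    };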

On Mon, Aug 5, 2024 at 6:05 AM Antoine Pitrou <anto...@python.org> wrote:

>
> I don't have any concrete data to test this against, but using 64-bit
> offsets sounds like an obvious improvement to me.
>
> Regards
>
> Antoine.
>
>
> On 01/08/2024 at 13:05, Ruoxi Sun wrote:
> > Hello everyone,
> >
> > We've identified an issue with Acero's hash join/aggregation, which is
> > currently limited to processing only up to 4GB of data due to the use of
> > `uint32_t` for row offsets. This limitation not only impacts our ability
> > to handle large datasets but also makes typical workarounds, such as
> > splitting the data into smaller batches, ineffective.
> >
> > * Proposed solution
> > We are considering widening the row offsets from 32-bit to 64-bit. This
> > change would allow us to process larger datasets and broaden the range
> > of workloads Arrow can handle.
> >
> > * Trade-offs to consider
> > ** Pros: Allows handling of larger datasets, breaking the current 4GB
> > limit.
> > ** Cons: Each row would consume an additional 4 bytes of memory, and
> > processing might involve slightly more CPU instructions.
> >
> > Preliminary benchmarks indicate that the impact on CPU performance is
> > minimal, so the main consideration is the increased memory consumption.
> >
> > * We need your feedback
> > ** How would this change affect your current usage of Arrow, especially
> > in terms of memory consumption?
> > ** Do you have any concerns or thoughts about this proposal?
> >
> > Please review the detailed information in [1] and [2] and share your
> > feedback. Your input is crucial as we gather community insights to decide
> > whether to proceed with this change.
> >
> > Looking forward to your feedback and working together to enhance Arrow.
> > Thank you!
> >
> > Regards,
> > Rossi SUN
> >
>
