I appreciate all the kind feedback. And so far we haven't seen any concerns
on the user mailing list, so I think we can proceed with PR review/merge.

Thanks all!

*Regards,*
*Rossi SUN*


On Wed, Aug 7, 2024 at 8:08 AM Vibhatha Abeykoon <vibha...@gmail.com> wrote:

> +1. Thanks for the proposal; it is a good move.
>
>
> On Tue, Aug 6, 2024 at 6:35 AM Jacob Wujciak <assignu...@apache.org> wrote:
>
> > Thanks for the clear write up in the issue and PR!
> >
> > +1 it is clear that this is something that users want and the downsides
> > seem minimal.
> >
> > Velox also switched to 64-bit about a year ago.
> >
> > On Mon, Aug 5, 2024 at 4:48 PM, Weston Pace <weston.p...@gmail.com> wrote:
> >
> > > +1 as well.  32-bit keys were chosen because the expectation was that
> > > hash table spilling would come along soon.  Since it didn't, I think
> > > it's a good idea to use 64-bit keys until spilling is added.
> > >
> > > On Mon, Aug 5, 2024 at 6:05 AM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > >
> > > > I don't have any concrete data to test this against, but using 64-bit
> > > > offsets sounds like an obvious improvement to me.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > On 01/08/2024 at 13:05, Ruoxi Sun wrote:
> > > > > Hello everyone,
> > > > >
> > > > > We've identified an issue with Acero's hash join/aggregation, which
> > > > > is currently limited to processing only up to 4 GB of data due to
> > > > > the use of `uint32_t` for row offsets. This limitation not only
> > > > > impacts our ability to handle large datasets but also makes typical
> > > > > workarounds, such as splitting the data into smaller batches,
> > > > > ineffective.
> > > > >
> > > > > * Proposed solution
> > > > > We are considering upgrading the row offsets from 32-bit to 64-bit.
> > > > > This change would allow us to process larger datasets and expand
> > > > > Arrow's application possibilities.
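> > > > >
> > > > > To illustrate the change (a hypothetical sketch, not Acero's actual
> > > > > data structures): rows are addressed by byte offsets into a
> > > > > contiguous buffer, and a 32-bit offset can address at most 2^32
> > > > > bytes, i.e. 4 GB.
> > > > >
> > > > >     #include <cstdint>
> > > > >     #include <vector>
> > > > >
> > > > >     // Current: 32-bit row offsets. Any row starting beyond
> > > > >     // 2^32 bytes (4 GB) into the row buffer is unaddressable.
> > > > >     std::vector<uint32_t> row_offsets_32bit;
> > > > >
> > > > >     // Proposed: 64-bit row offsets. Removes the 4 GB bound at the
> > > > >     // cost of 4 extra bytes of offset storage per row.
> > > > >     std::vector<uint64_t> row_offsets_64bit;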
> > > > >
> > > > > * Trade-offs to consider
> > > > > ** Pros: Allows handling of larger datasets, breaking the current
> > > > > 4 GB limit.
> > > > > ** Cons: Each row would consume an additional 4 bytes of memory
> > > > > (for example, roughly 400 MB more for a 100-million-row input), and
> > > > > there might be slightly more CPU instructions involved in
> > > > > processing.
> > > > >
> > > > > Preliminary benchmarks indicate that the impact on CPU performance
> > > > > is minimal, so the main consideration is the increased memory
> > > > > consumption.
> > > > >
> > > > > * We need your feedback
> > > > > ** How would this change affect your current usage of Arrow,
> > > > > especially in terms of memory consumption?
> > > > > ** Do you have any concerns or thoughts about this proposal?
> > > > >
> > > > > Please review the detailed information in [1] and [2] and share
> > > > > your feedback. Your input is crucial as we gather community
> > > > > insights to decide whether or not to proceed with this change.
> > > > >
> > > > > Looking forward to your feedback and working together to enhance
> > > > > Arrow.
> > > > > Thank you!
> > > > >
> > > > > *Regards,*
> > > > > *Rossi SUN*
> > > > >
> > > >
> > >
> >
>
