I appreciate all the kind feedback. And so far we haven't seen any concerns from the user ML, so I think we can proceed with PR review/merge.
Thanks all!

*Regards,*
*Rossi SUN*

On Wed, Aug 7, 2024 at 8:08 AM Vibhatha Abeykoon <vibha...@gmail.com> wrote:

> +1 Thanks for the proposal; it is a good move.
>
> On Tue, Aug 6, 2024 at 6:35 AM Jacob Wujciak <assignu...@apache.org> wrote:
>
> > Thanks for the clear write-up in the issue and PR!
> >
> > +1 It is clear that this is something that users want and the downsides
> > seem minimal.
> >
> > Velox also switched to 64-bit about a year ago.
> >
> > On Mon, Aug 5, 2024 at 16:48, Weston Pace <weston.p...@gmail.com> wrote:
> >
> > > +1 as well. 32-bit keys were chosen because the expectation was that
> > > hash table spilling would come along soon. Since it didn't, I think
> > > it's a good idea to use 64-bit keys until spilling is added.
> > >
> > > On Mon, Aug 5, 2024 at 6:05 AM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > > I don't have any concrete data to test this against, but using 64-bit
> > > > offsets sounds like an obvious improvement to me.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > > On 01/08/2024 at 13:05, Ruoxi Sun wrote:
> > > > > Hello everyone,
> > > > >
> > > > > We've identified an issue with Acero's hash join/aggregation: it is
> > > > > currently limited to processing at most 4GB of data due to the use
> > > > > of `uint32_t` for row offsets. This limitation not only impacts our
> > > > > ability to handle large datasets but also makes typical workarounds,
> > > > > such as splitting the data into smaller batches, ineffective.
> > > > >
> > > > > * Proposed solution
> > > > > We are considering upgrading the row offsets from 32-bit to 64-bit.
> > > > > This change would allow us to process larger datasets and expand
> > > > > Arrow's application possibilities.
> > > > >
> > > > > * Trade-offs to consider
> > > > > ** Pros: Allows handling of larger datasets, breaking the current
> > > > > 4GB limit.
> > > > > ** Cons: Each row would consume an additional 4 bytes of memory, and
> > > > > there might be slightly more CPU instructions involved in processing.
> > > > >
> > > > > Preliminary benchmarks indicate that the impact on CPU performance
> > > > > is minimal, so the main consideration is the increased memory
> > > > > consumption.
> > > > >
> > > > > * We need your feedback
> > > > > ** How would this change affect your current usage of Arrow,
> > > > > especially in terms of memory consumption?
> > > > > ** Do you have any concerns or thoughts about this proposal?
> > > > >
> > > > > Please review the detailed information in [1] and [2] and share your
> > > > > feedback. Your input is crucial as we gather community insights to
> > > > > decide whether or not to proceed with this change.
> > > > >
> > > > > Looking forward to your feedback and working together to enhance
> > > > > Arrow. Thank you!
> > > > >
> > > > > *Regards,*
> > > > > *Rossi SUN*
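To make the trade-off in the quoted proposal concrete, here is a minimal, self-contained C++ sketch. It is not Arrow's actual RowTable code and all names and row counts in it are illustrative; it only demonstrates why `uint32_t` offsets cap the encoded row buffer at ~4 GiB, and what widening each offset to `uint64_t` costs in memory:

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // With uint32_t offsets, the largest addressable byte position is
  // UINT32_MAX, so the concatenated row buffer cannot exceed ~4 GiB.
  const std::uint64_t max_bytes_32 =
      static_cast<std::uint64_t>(UINT32_MAX) + 1;
  std::cout << "32-bit offset limit: " << max_bytes_32
            << " bytes (~4 GiB)\n";

  // Widening each offset from 4 to 8 bytes removes the cap but stores
  // 4 extra bytes per row: with one offset per row, the overhead for a
  // table of N rows is roughly 4 * N bytes.
  const std::uint64_t num_rows = 100'000'000;  // hypothetical join input
  const std::uint64_t extra_bytes = 4 * num_rows;
  std::cout << "extra memory for " << num_rows << " rows: "
            << extra_bytes / (1024.0 * 1024.0) << " MiB\n";
  return 0;
}
```

For the hypothetical 100M-row input above, the overhead comes to roughly 381 MiB, which matches the proposal's framing: the CPU cost is minimal and the real consideration is memory.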