+1 as well. 32-bit keys were chosen because the expectation was that hash table spilling would come along soon. Since it didn't, I think it's a good idea to use 64-bit keys until spilling is added.
On Mon, Aug 5, 2024 at 6:05 AM Antoine Pitrou <anto...@python.org> wrote:
>
> I don't have any concrete data to test this against, but using 64-bit
> offsets sounds like an obvious improvement to me.
>
> Regards
>
> Antoine.
>
>
> On 01/08/2024 at 13:05, Ruoxi Sun wrote:
> > Hello everyone,
> >
> > We've identified an issue with Acero's hash join/aggregation, which is
> > currently limited to processing only up to 4GB of data due to the use
> > of `uint32_t` for row offsets. This limitation not only impacts our
> > ability to handle large datasets but also makes typical workarounds,
> > such as splitting the data into smaller batches, ineffective.
> >
> > * Proposed solution
> > We are considering upgrading the row offsets from 32-bit to 64-bit.
> > This change would allow us to process larger datasets and expand
> > Arrow's application possibilities.
> >
> > * Trade-offs to consider
> > ** Pros: Allows handling of larger datasets, breaking the current
> > 4GB limit.
> > ** Cons: Each row would consume an additional 4 bytes of memory, and
> > there might be slightly more CPU instructions involved in processing.
> >
> > Preliminary benchmarks indicate that the impact on CPU performance is
> > minimal, so the main consideration is the increased memory consumption.
> >
> > * We need your feedback
> > ** How would this change affect your current usage of Arrow,
> > especially in terms of memory consumption?
> > ** Do you have any concerns or thoughts about this proposal?
> >
> > Please review the detailed information in [1] and [2] and share your
> > feedback. Your input is crucial as we gather community insights to
> > decide whether or not to proceed with this change.
> >
> > Looking forward to your feedback and working together to enhance
> > Arrow. Thank you!
> >
> > *Regards,*
> > *Rossi SUN*
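
To make the 4GB limit concrete, here is a minimal, self-contained C++ sketch of the idea (the `RowTableSketch` name and its `AppendRow` method are invented for illustration; this is not Acero's actual row table code). It shows where the wall comes from when rows are stored back to back and each row's start is recorded as a `uint32_t` offset, and what widening to `uint64_t` costs:

```cpp
// Sketch only: a row table that stores variable-length rows contiguously
// and keeps one offset per row. The offset type bounds how much row data
// is addressable: uint32_t tops out at 4 GiB; uint64_t removes the limit
// at the cost of 4 extra bytes of offset storage per row.
#include <cstdint>
#include <iostream>
#include <limits>
#include <vector>

template <typename OffsetType>
struct RowTableSketch {  // hypothetical name, for illustration only
  std::vector<OffsetType> offsets{0};  // offsets[i] = start of row i
  uint64_t total_bytes = 0;            // true size, tracked in 64 bits

  // Append a row of row_size bytes; fails if the new row's end offset
  // no longer fits in OffsetType.
  bool AppendRow(uint64_t row_size) {
    uint64_t new_total = total_bytes + row_size;
    if (new_total > std::numeric_limits<OffsetType>::max()) {
      return false;  // offset overflow: the 4GB wall for uint32_t
    }
    total_bytes = new_total;
    offsets.push_back(static_cast<OffsetType>(new_total));
    return true;
  }
};

int main() {
  RowTableSketch<uint32_t> table32;
  RowTableSketch<uint64_t> table64;
  const uint64_t kRowSize = 64ull << 20;  // 64 MiB of row data per append

  // Append until the 32-bit variant refuses; mirror into the 64-bit one.
  uint64_t rows = 0;
  while (table32.AppendRow(kRowSize)) {
    table64.AppendRow(kRowSize);
    ++rows;
  }
  std::cout << "32-bit offsets stop at " << rows << " rows = "
            << table32.total_bytes << " bytes (just under 4 GiB)\n";

  // The 64-bit variant happily accepts the row that overflowed above.
  table64.AppendRow(kRowSize);
  std::cout << "64-bit offsets continue past " << table64.total_bytes
            << " bytes\n";
  return 0;
}
```

Note how the extra cost matches the con listed in the proposal: the row data itself is unchanged, and only the per-row offset grows from 4 to 8 bytes.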