Thanks for driving this forward. I didn't see the links in my email client, so I'm adding them below in case it helps others:
Issue: https://github.com/apache/arrow/issues/43495
PR: https://github.com/apache/arrow/pull/43389

On Thu, Aug 1, 2024 at 4:06 AM Ruoxi Sun <zanmato1...@gmail.com> wrote:

> Hello everyone,
>
> We've identified an issue with Acero's hash join/aggregation, which is
> currently limited to processing up to 4GB of data due to the use of
> `uint32_t` for row offsets. This limitation not only prevents us from
> handling large datasets but also makes typical workarounds, such as
> splitting the data into smaller batches, ineffective.
>
> * Proposed solution
> We are considering widening the row offsets from 32-bit to 64-bit. This
> change would allow us to process larger datasets and expand the range of
> workloads Arrow can handle.
>
> * Trade-offs to consider
> ** Pros: Allows handling of datasets beyond the current 4GB limit.
> ** Cons: Each row would consume an additional 4 bytes of memory, and
> processing might involve slightly more CPU instructions.
>
> Preliminary benchmarks indicate that the impact on CPU performance is
> minimal, so the main consideration is the increased memory consumption.
>
> * We need your feedback
> ** How would this change affect your current usage of Arrow, especially
> in terms of memory consumption?
> ** Do you have any concerns or thoughts about this proposal?
>
> Please review the detailed information in [1] and [2] and share your
> feedback. Your input is crucial as we gather community insights to decide
> whether or not to proceed with this change.
>
> Looking forward to your feedback and working together to enhance Arrow.
> Thank you!
>
> *Regards,*
> *Rossi SUN*
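
For anyone who wants to see the wraparound concretely, here is a minimal standalone C++ sketch. The row width, row count, and names are illustrative assumptions, not Acero's actual row-table layout; it only shows how a cumulative byte offset stored in a `uint32_t` silently truncates once the accumulated row data crosses 4GB, while a `uint64_t` offset stays exact:

```cpp
// Minimal sketch of the 32-bit offset overflow. Illustrative only: the row
// width and row count below are assumptions, not Acero's actual layout.
#include <cstdint>
#include <iostream>

int main() {
  constexpr uint64_t kRowWidth = 64;  // assumed fixed row width in bytes
  // Enough rows that the total row data just exceeds 4GB (2^32 bytes).
  constexpr uint64_t kNumRows = ((1ull << 32) / kRowWidth) + 1;

  const uint64_t total_bytes = kNumRows * kRowWidth;  // 4GB + 64 bytes

  // A 32-bit offset silently keeps only total_bytes modulo 2^32.
  const uint32_t offset32 = static_cast<uint32_t>(total_bytes);
  // A 64-bit offset represents the position exactly.
  const uint64_t offset64 = total_bytes;

  std::cout << "total bytes:   " << total_bytes << "\n";  // 4294967360
  std::cout << "32-bit offset: " << offset32 << "\n";     // wraps to 64
  std::cout << "64-bit offset: " << offset64 << "\n";     // 4294967360
}
```

On the memory side, the cost is straightforward to estimate from the 4 bytes per row quoted above: a hash table of 100 million rows would grow by roughly 400MB, which is the trade-off the proposal asks us to weigh.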