My bad for forgetting to add the links. Much appreciated!
*Regards,*
*Rossi SUN*

On Fri, Aug 2, 2024 at 01:06, Bryce Mecum <bryceme...@gmail.com> wrote:

> Thanks for driving this forward. I didn't see the links in my email client
> so I'm adding them below in case it helps others:
>
> Issue: https://github.com/apache/arrow/issues/43495
> PR: https://github.com/apache/arrow/pull/43389
>
> On Thu, Aug 1, 2024 at 4:06 AM Ruoxi Sun <zanmato1...@gmail.com> wrote:
>
>> Hello everyone,
>>
>> We've identified an issue with Acero's hash join/aggregation: it is
>> currently limited to processing at most 4GB of data because it uses
>> `uint32_t` for row offsets. This limitation not only prevents us from
>> handling large datasets but also makes typical workarounds, such as
>> splitting the data into smaller batches, ineffective.
>>
>> * Proposed solution
>> We are considering upgrading the row offsets from 32-bit to 64-bit. This
>> change would allow us to process larger datasets and expand Arrow's
>> application possibilities.
>>
>> * Trade-offs to consider
>> ** Pros: Allows handling of larger datasets, breaking the current 4GB
>> limit.
>> ** Cons: Each row would consume an additional 4 bytes of memory, and
>> processing may involve slightly more CPU instructions.
>>
>> Preliminary benchmarks indicate that the impact on CPU performance is
>> minimal, so the main consideration is the increased memory consumption.
>>
>> * We need your feedback
>> ** How would this change affect your current usage of Arrow, especially
>> in terms of memory consumption?
>> ** Do you have any concerns or thoughts about this proposal?
>>
>> Please review the detailed information in [1] and [2] and share your
>> feedback. Your input is crucial as we gather community insights to decide
>> whether or not to proceed with this change.
>>
>> Looking forward to your feedback and working together to enhance Arrow.
>> Thank you!
>>
>> *Regards,*
>> *Rossi SUN*
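For anyone curious where the 4GB limit comes from: with 32-bit offsets, any
row that starts at or beyond the 2^32-byte mark in the packed row buffer
cannot be addressed, because the stored offset silently wraps. Below is a
minimal C++ sketch of that arithmetic. The `RowTable` struct and its field
names are hypothetical stand-ins for illustration, not Acero's actual
row-table layout:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Hypothetical layout for illustration only: variable-length rows packed
    // into one contiguous buffer, each row located by its starting offset.
    struct RowTable {
      std::vector<uint8_t> buffer;    // packed row data
      std::vector<uint32_t> offsets;  // 32-bit offsets cap the buffer at 4GiB
      // std::vector<uint64_t> offsets;  // proposed: 64-bit, +4 bytes per row
    };

    int main() {
      // A row that starts just past the 4GiB mark is unaddressable with
      // uint32_t: the stored offset wraps and points into earlier data.
      uint64_t actual_start = (1ULL << 32) + 128;
      uint32_t stored = static_cast<uint32_t>(actual_start);  // wraps to 128
      std::printf("actual start: %llu, stored 32-bit offset: %u\n",
                  (unsigned long long)actual_start, stored);
      return 0;
    }

The memory cost of the proposal follows directly from this sketch: widening
each offset from 4 to 8 bytes adds 4 bytes per row, e.g. roughly 400MB of
extra memory for a hash-join input of 100 million rows.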