I don't have any concrete data to test this against, but using 64-bit offsets sounds like an obvious improvement to me.
Regards,
Antoine

On 01/08/2024 at 13:05, Ruoxi Sun wrote:
Hello everyone,

We've identified an issue with Acero's hash join/aggregation, which is currently limited to processing at most 4GB of data because `uint32_t` is used for row offsets. This limitation not only prevents us from handling large datasets, it also makes typical workarounds such as splitting the data into smaller batches ineffective.

* Proposed solution

We are considering widening the row offsets from 32-bit to 64-bit. This change would allow us to process larger datasets and expand Arrow's range of applications.

* Trade-offs to consider

** Pros: Allows handling of larger datasets, breaking the current 4GB limit.
** Cons: Each row consumes an additional 4 bytes of memory, and slightly more CPU instructions may be involved in processing.

Preliminary benchmarks indicate that the impact on CPU performance is minimal, so the main consideration is the increased memory consumption.

* We need your feedback

** How would this change affect your current usage of Arrow, especially in terms of memory consumption?
** Do you have any concerns or thoughts about this proposal?

Please review the detailed information in [1] and [2] and share your feedback. Your input is crucial as we gather community insights to decide whether or not to proceed with this change.

Looking forward to your feedback and working together to enhance Arrow. Thank you!

Regards,
Rossi SUN
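For context, here is a minimal sketch of the kind of change being discussed. The struct and field names are hypothetical and are not the actual Acero row-table code; the point is only that widening the per-row byte offsets from uint32_t to uint64_t removes the 4GB addressing ceiling at the cost of 4 extra bytes per row.

    #include <cstdint>
    #include <vector>

    // Hypothetical illustration only; names do not come from Acero.
    struct RowTableOffsets32 {
      // Cumulative byte offset of each row's start; cannot address past 4GB.
      std::vector<uint32_t> offsets;
    };

    struct RowTableOffsets64 {
      // 64-bit offsets remove the 4GB ceiling, costing 4 extra bytes per row.
      std::vector<uint64_t> offsets;
    };

    // Rough extra memory for N rows when switching to 64-bit offsets:
    //   extra_bytes = N * (sizeof(uint64_t) - sizeof(uint32_t)) = N * 4
    // e.g. 100 million rows -> roughly 400 MB of additional offset storage.

So for workloads with very many small rows, the memory overhead is the dominant cost to weigh against the ability to exceed the 4GB limit.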