My bad for forgetting to add the links.

Much appreciated!

*Regards,*
*Rossi SUN*


Bryce Mecum <bryceme...@gmail.com> wrote on Fri, Aug 2, 2024 at 01:06:

> Thanks for driving this forward. I didn't see the links in my email client,
> so I'm adding them below in case it helps others:
>
> Issue: https://github.com/apache/arrow/issues/43495
> PR: https://github.com/apache/arrow/pull/43389
>
> On Thu, Aug 1, 2024 at 4:06 AM Ruoxi Sun <zanmato1...@gmail.com> wrote:
>
>> Hello everyone,
>>
>> We've identified an issue with Acero's hash join/aggregation: it is
>> currently limited to processing at most 4 GB of data because it uses
>> `uint32_t` for row offsets. This limitation not only impacts our ability
>> to handle large datasets but also makes typical workarounds, such as
>> splitting the data into smaller batches, ineffective.
>>
>> * Proposed solution
>> We are considering upgrading the row offsets from 32-bit to 64-bit. This
>> change would allow us to process larger datasets and expand Arrow's
>> application possibilities.
>>
>> * Trade-offs to consider
>> ** Pros: allows handling of larger datasets, breaking the current 4 GB
>> limit.
>> ** Cons: each row would consume an additional 4 bytes of memory, and
>> processing may involve slightly more CPU instructions.
>>
>> Preliminary benchmarks indicate that the impact on CPU performance is
>> minimal, so the main consideration is the increased memory consumption.
>>
>> * We need your feedback
>> ** How would this change affect your current usage of Arrow, especially in
>> terms of memory consumption?
>> ** Do you have any concerns or thoughts about this proposal?
>>
>> Please review the detailed information in [1] and [2] and share your
>> feedback. Your input is crucial as we gather community insights to decide
>> whether to proceed with this change.
>>
>> Looking forward to your feedback and working together to enhance Arrow.
>> Thank you!
>>
>> *Regards,*
>> *Rossi SUN*
>>
>
