[DISCUSS][Acero] Upgrading to 64-bit row offsets in row table

2024-08-01 Thread Ruoxi Sun
Hello everyone. We've identified an issue with Acero's hash join/aggregation, which is currently limited to processing only up to 4GB of data due to the use of `uint32_t` for row offsets. This limitation not only impacts our ability to handle large datasets but also makes typical solutions like split
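
For a concrete picture of the limitation, here is a minimal Python/numpy sketch (not Acero's actual C++ code; the row width and row count are invented for illustration) showing how 32-bit row offsets wrap around once the accumulated row data passes 4 GiB:

    # Minimal sketch: offsets kept as 32-bit unsigned integers wrap once the
    # accumulated row data exceeds 2^32 bytes (4 GiB).
    # The row width and row count below are hypothetical.
    import numpy as np

    row_width = 256                    # hypothetical fixed-length row, in bytes
    num_rows = 20_000_000              # ~4.8 GiB of row data in total

    offsets64 = np.arange(num_rows, dtype=np.uint64) * row_width
    offsets32 = offsets64.astype(np.uint32)   # what uint32_t offsets would store

    first_bad = int(np.argmax(offsets64 >= 2**32))
    print(offsets64[first_bad])        # correct offset, just past 4294967295
    print(offsets32[first_bad])        # wrapped value: points at the wrong row

Widening the offsets to 64 bits avoids the wrap-around, at the cost of larger per-row offset storage.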

Re: [DISCUSS][Acero] Upgrading to 64-bit row offsets in row table

2024-08-01 Thread Bryce Mecum
Thanks for driving this forward. I didn't see the links in my email client so I'm adding those below in case it helps others: Issue: https://github.com/apache/arrow/issues/43495 PR: https://github.com/apache/arrow/pull/43389 On Thu, Aug 1, 2024 at 4:06 AM Ruoxi Sun wrote: > Hello everyone, > > We'

Being a dictionary column is not preserved

2024-08-01 Thread Jacek Pliszka
Hi! Could someone tell me if this is a feature or a bug (pyarrow 17.0.0)? When I have a dictionary column: In [15]: a1 Out[15]: pyarrow.Table a: dictionary a: [ -- dictionary: [1] -- indices: [0]] I store it in a parquet file: In [16]: pq.write_table(a1, 'dict.parquet') And read it back: a
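
For anyone who wants to reproduce this, here is the round trip from the message above written out as a runnable snippet (the column name and file name follow the email; the schema reprs in the comments are approximate, and whether the dictionary type survives may depend on the value type and pyarrow version):

    # Round trip of a dictionary-typed column through Parquet.
    import pyarrow as pa
    import pyarrow.parquet as pq

    a1 = pa.table({"a": pa.array([1]).dictionary_encode()})
    print(a1.schema)        # a: dictionary<values=int64, indices=int32, ...>

    pq.write_table(a1, "dict.parquet")
    a2 = pq.read_table("dict.parquet")
    print(a2.schema)        # the column may come back as plain int64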

Re: [DISCUSS][Acero] Upgrading to 64-bit row offsets in row table

2024-08-01 Thread Ruoxi Sun
My bad for forgetting to add the links. Much appreciated! Regards, Rossi SUN Bryce Mecum wrote on Fri, Aug 2, 2024 at 01:06: > Thanks for driving this forward. I didn't see the links in my email client > so I'm adding those below in case it helps others: > > Issue: https://github.com/apache/arrow/iss