Re: Chunk Table into RecordBatches of at most 512MB each

2024-02-26 Thread Bryce Mecum
I filed a minor PR [1] to improve the documentation so it's clear what units are involved, as I think the current language is vague.

[1] https://github.com/apache/arrow/pull/40251

On Sun, Feb 25, 2024 at 9:08 PM Kevin Liu wrote:
>
> Hey folks,
>
> I'm working with the PyArrow API for Tables and RecordBatches …
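One practical workaround consistent with what's discussed in this thread is to convert the byte budget into a row count, since `to_batches` only caps rows. A minimal sketch of that idea (my own illustration, not code from the thread; `batches_by_byte_budget` is a hypothetical helper name), using only documented PyArrow properties:

```python
import pyarrow as pa

# Hypothetical helper: approximate a ~512MB byte cap by turning it into a
# row cap, since Table.to_batches(max_chunksize=...) counts rows, not bytes.
def batches_by_byte_budget(table: pa.Table, max_bytes: int = 512 * 1024 * 1024):
    if table.num_rows == 0:
        return table.to_batches()
    # Average in-memory bytes per row, from the table's total buffer size.
    avg_row_bytes = max(1, table.nbytes // table.num_rows)
    rows_per_batch = max(1, max_bytes // avg_row_bytes)
    return table.to_batches(max_chunksize=rows_per_batch)
```

This is only approximate: variable-width columns (strings, lists) can make individual rows much larger or smaller than the average, so a hard 512MB guarantee would still need per-batch measurement.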

Re: Chunk Table into RecordBatches of at most 512MB each

2024-02-26 Thread Aldrin
Hi Kevin,

Shoumyo is correct that the chunk size of to_batches is row-based (logical) and not byte-based (physical); see the example in the documentation [1]. And for more clarity on the "...depending on the chunk layout of individual columns" portion, a Table column is a `ChunkedArray`, which …
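To make the "chunk layout" point concrete, here is a small sketch (my own, assuming standard PyArrow behavior): `to_batches` produces zero-copy slices, so batch boundaries never cross the chunk boundaries of the underlying `ChunkedArray`s:

```python
import pyarrow as pa

# concat_tables leaves each column as a ChunkedArray with one chunk per
# input table (it does not combine chunks by default).
t = pa.concat_tables([
    pa.table({"x": [1, 2, 3]}),
    pa.table({"x": [4, 5, 6]}),
])
print(t.column("x").num_chunks)  # 2

# Even with max_chunksize=5, the split follows the 3-row chunks instead of
# producing a 5-row batch followed by a 1-row batch.
print([b.num_rows for b in t.to_batches(max_chunksize=5)])  # [3, 3]

# Call combine_chunks() first if batch sizes should depend only on
# max_chunksize.
print([b.num_rows for b in t.combine_chunks().to_batches(max_chunksize=5)])  # [5, 1]
```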

Re: Chunk Table into RecordBatches of at most 512MB each

2024-02-26 Thread Shoumyo Chakravorti (BLOOMBERG/ 731 LEX)
Hi Kevin,

I'm not an Arrow dev, so take everything I say with a grain of salt. I just wanted to point out that `max_chunksize` appears to refer to the max number of *rows* per batch rather than the number of *bytes* per batch:
https://github.com/apache/arrow/blob/b8fff043c6cb351b1fad87fa0eeaf
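A quick demonstration of the rows-vs-bytes distinction (my own sketch, assuming current PyArrow semantics, not code from the linked source):

```python
import pyarrow as pa

# max_chunksize caps *rows* per batch; the byte size of each batch then
# depends entirely on how wide those rows happen to be.
t = pa.table({
    "small": list(range(10)),                  # 8 bytes per row
    "big": ["x" * 10_000 for _ in range(10)],  # roughly 10 KB per row
})
for b in t.to_batches(max_chunksize=4):
    print(b.num_rows, b.nbytes)  # rows: 4, 4, 2; nbytes tracks row width
```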