I filed a minor PR [1] to improve the documentation so it's clear what units `max_chunksize` is measured in, since I think the current wording is vague.
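
For the archives: as far as I can tell, `max_chunksize` is a maximum number of rows per RecordBatch, not bytes, and `to_batches` only slices the table's existing chunks; it never concatenates them. That matches both edge cases quoted below: the table built from 2048 tiny batches stays 2048 batches because nothing is merged, and the single 10M-row chunk is never split because 512 * 1024 * 1024 is treated as a row limit far larger than 10,000,000 rows.

If a byte-based target is what you want, one rough workaround is to convert the byte budget into a row count from the table's average row size, merge the existing chunks, and then split by rows. Below is a minimal sketch; the helper name and the assumption that rows are roughly uniform in size are mine, not part of the PyArrow API:

```
import pyarrow as pa

def chunk_table_by_bytes(table: pa.Table, target_bytes: int) -> list[pa.RecordBatch]:
    """Split `table` into RecordBatches of roughly `target_bytes` each.

    Approximation only: the byte budget is converted into a row count
    using the table-wide average row size, so skewed rows can miss the
    target by a fair margin.
    """
    if table.num_rows == 0:
        return table.to_batches()
    avg_row_bytes = table.nbytes / table.num_rows
    rows_per_batch = max(1, int(target_bytes / avg_row_bytes))
    # combine_chunks() concatenates the underlying chunks first, so that
    # to_batches() can slice them into evenly sized pieces instead of just
    # re-exposing the original (possibly tiny) chunks.
    return table.combine_chunks().to_batches(max_chunksize=rows_per_batch)
```

Note that combine_chunks() copies data whenever a column has more than one chunk, so this isn't free for very large tables.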
[1] https://github.com/apache/arrow/pull/40251

On Sun, Feb 25, 2024 at 9:08 PM Kevin Liu <kevin.jq....@gmail.com> wrote:
>
> Hey folks,
>
> I'm working with the PyArrow API for Tables and RecordBatches, and I'm
> trying to chunk a Table into a list of RecordBatches, each with a target
> chunk size. For example, 10 GB into several 512 MB chunks.
>
> I'm having a hard time doing this with the existing API. The
> Table.to_batches method has an optional parameter `max_chunksize`, which is
> documented as "Maximum size for RecordBatch chunks. Individual chunks may
> be smaller depending on the chunk layout of individual columns." That seems
> like exactly what I want, but I've run into a couple of edge cases.
>
> Edge case 1, Table created from many RecordBatches:
> ```
> import pyarrow as pa
>
> pylist = [{'n_legs': 2, 'animals': 'Flamingo'},
>           {'n_legs': 4, 'animals': 'Dog'}]
> pylist_tbl = pa.Table.from_pylist(pylist)
> pylist_tbl.nbytes
> # 35
>
> multiplier = 2048
> bigger_pylist_tbl = pa.Table.from_batches(pylist_tbl.to_batches() * multiplier)
> bigger_pylist_tbl.nbytes
> # 591872 / 578.00 KB
>
> target_batch_size = 512 * 1024 * 1024  # 512 MB
> len(bigger_pylist_tbl.to_batches(target_batch_size))
> # 2048
> # expected 1 RecordBatch
> ```
>
> Edge case 2, really big Table with 1 RecordBatch:
> ```
> # file already saved on disk
> with pa.memory_map('table_10000000.arrow', 'r') as source:
>     huge_arrow_tbl = pa.ipc.open_file(source).read_all()
>
> huge_arrow_tbl.nbytes
> # 7188263146 / 6.69 GB
> len(huge_arrow_tbl)
> # 10_000_000
>
> target_batch_size = 512 * 1024 * 1024  # 512 MB
> len(huge_arrow_tbl.to_batches(target_batch_size))
> # 1
> # expected (6.69 GB // 512 MB) + 1 RecordBatches
> ```
>
> I'm currently exploring the underlying implementation of to_batches and
> TableBatchReader::ReadNext.
> Please let me know if anyone knows a canonical way to get the chunking
> behavior described above.
>
> Thanks,
> Kevin