I filed a minor PR [1] to improve the documentation so that it's clear what
units are involved, since I think the current language is vague.

[1] https://github.com/apache/arrow/pull/40251
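
For context, `max_chunksize` is a cap on the number of rows per RecordBatch,
not a number of bytes, which is what the examples below run into. A minimal
sketch of the row-based behavior (the table here is made up purely for
illustration):

```
import pyarrow as pa

# A single-chunk table with 6 rows; max_chunksize caps the number of rows
# per batch, so asking for at most 4 rows yields batches of 4 and 2 rows.
tbl = pa.table({'n_legs': [2, 4, 5, 100, 2, 4],
                'animals': ['Flamingo', 'Dog', 'Brittle stars',
                            'Centipede', 'Parrot', 'Horse']})
print([b.num_rows for b in tbl.to_batches(max_chunksize=4)])
# [4, 2]
```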

On Sun, Feb 25, 2024 at 9:08 PM Kevin Liu <kevin.jq....@gmail.com> wrote:
>
> Hey folks,
>
> I'm working with the PyArrow API for Tables and RecordBatches, and I'm trying
> to chunk a Table into a list of RecordBatches, each with a target chunk size.
> For example, splitting a 10 GB Table into several 512 MB chunks.
>
> I'm having a hard time doing this with the existing API. The
> Table.to_batches method has an optional parameter `max_chunksize`, which is
> documented as "Maximum size for RecordBatch chunks. Individual chunks may be
> smaller depending on the chunk layout of individual columns." That seems like
> exactly what I want, but I've run into a couple of edge cases.
>
> Edge case 1: a Table created from many RecordBatches
> ```
> import pyarrow as pa
>
> pylist = [{'n_legs': 2, 'animals': 'Flamingo'},
>           {'n_legs': 4, 'animals': 'Dog'}]
> pylist_tbl = pa.Table.from_pylist(pylist)
> pylist_tbl.nbytes
> # 35
>
> multiplier = 2048
> bigger_pylist_tbl = pa.Table.from_batches(pylist_tbl.to_batches() * multiplier)
> bigger_pylist_tbl.nbytes
> # 591872 (578.00 KB)
>
> target_batch_size = 512 * 1024 * 1024  # 512 MB
> len(bigger_pylist_tbl.to_batches(target_batch_size))
> # 2048
> # expected: 1 RecordBatch
> ```
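
This result is expected once `max_chunksize` is read as a maximum row count:
to_batches splits existing chunks but never concatenates them, so a table
assembled from 2048 two-row batches comes back as 2048 batches no matter how
large the limit is. One possible workaround (a sketch, and it copies the data)
is to consolidate the chunks first:

```
# Consolidate the 2048 two-row chunks into a single chunk, then re-split by
# rows. combine_chunks() copies the underlying buffers, so it costs memory.
combined = bigger_pylist_tbl.combine_chunks()
len(combined.to_batches(max_chunksize=4096))
# 1  (a single 4096-row batch)
```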
>
> Edge case 2: a really big Table with 1 RecordBatch
> ```
> import pyarrow as pa
>
> # file already saved on disk
> with pa.memory_map('table_10000000.arrow', 'r') as source:
>     huge_arrow_tbl = pa.ipc.open_file(source).read_all()
>
> huge_arrow_tbl.nbytes
> # 7188263146 (6.69 GB)
> len(huge_arrow_tbl)
> # 10_000_000
>
> target_batch_size = 512 * 1024 * 1024  # 512 MB
> len(huge_arrow_tbl.to_batches(target_batch_size))
> # 1
> # expected: (6.69 GB // 512 MB) + 1 = 14 RecordBatches
> ```
>
> I'm currently exploring the underlying implementation of to_batches and
> TableBatchReader::ReadNext.
> Please let me know if anyone knows a canonical way to achieve the chunking
> behavior described above.
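
One way to approximate byte-based chunking today (a sketch with a made-up
helper name, not an official API) is to translate the byte target into a row
count using the table's average row size, since `max_chunksize` is row-based:

```
import pyarrow as pa

def to_batches_by_bytes(table: pa.Table, target_bytes: int):
    """Hypothetical helper: split `table` into batches of roughly
    `target_bytes` each, using the average encoded row size as a guide.
    Actual batch sizes will vary with the data; combine_chunks() copies."""
    avg_row_bytes = max(table.nbytes / max(table.num_rows, 1), 1)
    rows_per_batch = max(int(target_bytes // avg_row_bytes), 1)
    # Consolidate first so pre-existing small chunks don't dictate the split.
    return table.combine_chunks().to_batches(max_chunksize=rows_per_batch)

# For the 6.69 GB, 10M-row table above and a 512 MB target this works out to
# roughly 747k rows per batch, i.e. about 14 batches.
# batches = to_batches_by_bytes(huge_arrow_tbl, 512 * 1024 * 1024)
```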
>
> Thanks,
> Kevin