Hi all,

I found my issue: I was not actually passing down the
ArrowReaderProperties. I can now see that lowering batch_size meaningfully
reduces memory usage [1]. I still see higher memory usage when reading
files with larger row groups, even with the batch size held constant.

Overall, I found that users who want to keep memory usage down when reading
Parquet should: (1) turn off pre-buffering, (2) read data in batches, and
(3) turn on buffered_stream.
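
For reference, here's a minimal sketch of how those three settings can be
applied together through parquet::arrow::FileReaderBuilder. This is an
illustration, not the exact benchmark code; the batch size value and file
path are placeholders, and signatures are from the Parquet C++ API as of
roughly Arrow 9, so check against your version.

```cpp
#include <memory>
#include <numeric>
#include <string>
#include <vector>

#include <arrow/io/file.h>
#include <arrow/record_batch.h>
#include <arrow/status.h>
#include <parquet/arrow/reader.h>
#include <parquet/properties.h>

arrow::Status ReadLowMemory(const std::string& path) {
  // (3) Low-level reader properties: enable buffered_stream so column
  // chunks are streamed rather than loaded whole.
  parquet::ReaderProperties reader_props = parquet::default_reader_properties();
  reader_props.enable_buffered_stream();

  // (1) Arrow-level properties: turn off pre-buffering, and pick a small
  // batch size (the value here is just an example).
  parquet::ArrowReaderProperties arrow_props;
  arrow_props.set_pre_buffer(false);
  arrow_props.set_batch_size(64 * 1024);

  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));

  // Make sure both sets of properties actually reach the reader.
  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(infile, reader_props));
  builder.properties(arrow_props);
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(builder.Build(&reader));

  // (2) Read in batches via a RecordBatchReader instead of reading the
  // whole table at once.
  std::vector<int> row_groups(reader->num_row_groups());
  std::iota(row_groups.begin(), row_groups.end(), 0);
  std::unique_ptr<arrow::RecordBatchReader> batch_reader;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(row_groups, &batch_reader));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // ... process batch ...
  }
  return arrow::Status::OK();
}
```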

If there's no further input, I may add these suggestions to our docs.

[1]
https://github.com/wjones127/arrow-parquet-memory-bench/blob/7e0d740a09c8042da647a0de1f285b6bb8a7f4db/readme_files/figure-gfm/group-size-1.png

On Tue, Aug 9, 2022 at 4:11 PM Will Jones <will.jones...@gmail.com> wrote:

> I did some experiments to try to understand what controls a user has to
> constrain how much memory our Parquet readers use, at least as measured by
> the memory pools max_memory() method.
>
> I was surprised to find that parquet::ArrowReaderProperties.batch_size
> didn't have much of an effect at all on the peak memory usage [1]. The code
> I ran was [2].
>
> Two questions:
>
> 1. Is this expected? Or does it sound like I did something wrong?
> 2. Is there a way we could make it so that setting a smaller batch size
> reduced the memory required to read into a record batch stream?
>
> I created a repo for these tests at [3].
>
> [1]
> https://github.com/wjones127/arrow-parquet-memory-bench/blob/5434f9f642c452470aa18ca872e9acd0d7462a1a/readme_files/figure-gfm/group-size-1.png
> [2]
> https://github.com/wjones127/arrow-parquet-memory-bench/blob/5434f9f642c452470aa18ca872e9acd0d7462a1a/src/main.cc#L51-L66
> [3] https://github.com/wjones127/arrow-parquet-memory-bench
>
